Aussie AI

Appendix 1: C++ Slug Catalog

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Life moves pretty fast.
If you don’t stop and look around
once in a while, you could miss it.”

Ferris Bueller’s Day Off, 1986.

Slug Hunting Advice

This appendix is about speeding up your C++ programs through general improvements to sequential or parallel coding. For AI-specific techniques for speeding up your Transformer's inference of your model (e.g. quantization, pruning), see the various chapters in Part V (Chapters 28-36).

Before we begin with anything that's actually useful, I have to introduce the obligatory wrist-slapping politically-correct deslugging advice for programmers. Hence, here are some general nuggets of advice when attempting to speed up your program:

  • Profile twice, code once. Performance profiling tools exist for a reason.
  • Don't micro-optimize. Unless you're into that kind of thing. But really, try to sit on your hands.
  • Do macro-optimize. Think about your data structures and algorithms.
  • Optimizing introduces new bugs. 100% guaranteed! Don't optimize the night before your release. Re-run your test suite.
  • Don't optimize exception handling. Tweaking rarely-executed code is a poor use of your genius.
  • Use open source third-party libraries that have already been optimized by others.

Or just ignore that advice and go crazy. It's just too much fun optimizing when the alternative is dreary debugging. Pro tip: it's even more fun writing a book on optimizing!

Where to hunt slugs? Some of the common large-scale issues with coding inefficiency in typical C++ programs include:

  • Function call hierarchies
  • Nested loops
  • Overuse of memory allocation
  • Constructor and destructor inefficiencies
  • Inefficient algorithms (e.g. linear search of arrays)
  • Unnecessary overhead or wrappers
  • Recursion. After you've coded up your university assignments (Tower of Hanoi, anyone?), please forget recursion exists.

C++ Speedup Techniques: Some of the general ways to speed up C++ programs at the design structure or algorithmic level include:

  • Faster data structures (e.g. hash tables).
  • Faster algorithms (e.g. fix linear search to something faster like, you know, hashing again).
  • Parallelize via multi-threading, multi-process, multi-core, multi-GPU, multi-something.
  • Vectorization (parallelize your important loops)
  • Precompute expensive functions into a lookup table at compile-time (e.g. activation functions).
  • Cache any complex calculations to trade extra space for time savings (e.g. KV caching).
  • Change floating-point to integer operations (quantization, anyone?)
  • Replace recursion with iteration. Subtract ten bonus points if you need to do this.
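
The precomputation bullet above can be sketched concretely. This is a minimal example of building a lookup table for an activation function at startup; the table size, the input mapping onto [-6, 6], and the function names are my assumptions, not anything from a particular AI library:

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical sketch: precompute sigmoid for all 256 possible 8-bit
// inputs, mapped linearly onto the range [-6.0, +6.0].
static float g_sigmoid_table[256];

void init_sigmoid_table()
{
    for (int i = 0; i < 256; i++) {
        float x = -6.0f + 12.0f * i / 255.0f; // map 0..255 -> [-6, 6]
        g_sigmoid_table[i] = (float)(1.0 / (1.0 + std::exp(-(double)x)));
    }
}

// The lookup replaces a transcendental exp() call per activation.
inline float fast_sigmoid(uint8_t i) { return g_sigmoid_table[i]; }
```

With constexpr functions (C++14 and later) the same table can often be built at compile time rather than at startup.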

Some of the high-level C++ coding optimizations include:

  • Flatten function call hierarchies (stop wrapping everything so much, and inline the small functions at the bottom).
  • Optimize loops, especially nested loops (e.g. move loop-invariant code out, loop unrolling, vectorization, etc.)
  • Templates are effectively a compile-time optimization that improves speed at the cost of code space.
  • Reduce memory allocation (use less memory overall or replace memory allocation with temporary stack buffers).
  • Operator strength reduction (e.g. replace “*” with “+”, a pipe dream of all AI engineers).
  • Declare variables as close as possible to where they are used. This avoids instantiating objects that aren't needed on some paths.
  • Use pointer arithmetic, especially for loops over arrays.
  • Bitwise operations are fast, but the basic C++ integer operations are also fast nowadays. Benchmark, don't assume.
  • Use short-circuiting of the && and || operators, and also the ternary ?: operator, to avoid expensive function calls.
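
Two of the bullets above, loop optimization and pointer arithmetic, combine nicely in one small sketch. The count_a functions are my own illustrative examples:

```cpp
#include <cstring>

// Count 'a' characters. The slow version calls strlen() on every
// loop iteration, making the whole loop O(n^2).
int count_a_slow(const char *s)
{
    int count = 0;
    for (size_t i = 0; i < strlen(s); i++)  // strlen re-runs each time
        if (s[i] == 'a') count++;
    return count;
}

// Hoist the loop-invariant length test out entirely by scanning with
// a pointer: one pass, no strlen, no index arithmetic.
int count_a_fast(const char *s)
{
    int count = 0;
    for (const char *p = s; *p != '\0'; p++)
        if (*p == 'a') count++;
    return count;
}
```

Modern optimizers may hoist the strlen call themselves, but only when they can prove the string is not modified in the loop; doing it by hand is the safe bet.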

And finally, some things you might forget (and some that are forgettable):

  • Benchmark any important changes (e.g. operator strength reductions).
  • Turn up your C++ optimizer. There are higher settings you could try.
  • Add compile-time optimization hints (e.g. constexpr and restrict).
  • Overclock your PC (like a gamer).
  • Sell your car to buy a better GPU.
  • Put every function in a header file and make them all inline.
  • Reorder your case labels. Surely it helps.
  • Change i++ to ++i in everyone else's code.

C++ Class Slugs

The C++ class features are designed to add encapsulation and modularity, while retaining speed, but there are still plenty of ways that slugs can crawl into your classes. C++ class optimizations include:

  • Ensure small member functions are inline, especially those that do “get” and “set”.
  • Add inline to other friend or non-class functions (esp. if small or commonly used).
  • Pass objects to functions using “const&” (pass-by-reference), rather than pass-by-value.
  • Watch out for temporary objects. These can occur in simple assignments or function call expressions or in weird ways like accidentally making your overloaded assignment operator have the wrong type.
  • Use reference variables instead of copying objects into temporary variables.
  • Take care templating class objects (e.g. when using the std::vector class for a vector of your class objects). Lots of hidden calls to constructors and destructors may arise in resizing.
  • Use the initializer list in the constructor for initializing data members.
  • Use friend functions for faster accesses to internal object data.
  • Block accidental calls to the copy constructor or class assignment operator (i.e., if you aren't defining them, make a dummy version that is “private” with a “void” function body).
  • Avoid returning objects if you can. Return a reference if it's safe to do so.
  • Take care with “wrapper” classes like “smart pointers”, “smart integers” or “smart buffers”. Usually, they're safer but slower. How smart is that?
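
The bullet above about blocking accidental copies can be sketched as follows. The class name NoCopy is mine; the private-dummy idiom is the pre-C++11 approach, with "= delete" as the modern equivalent:

```cpp
class NoCopy {
  public:
    NoCopy() {}
  private:
    // Private dummy versions, as suggested above: external code that
    // tries to copy or assign a NoCopy object fails to compile.
    NoCopy(const NoCopy &) {}
    void operator=(const NoCopy &) {}
};
// C++11 and later can instead write "= delete", which also catches
// accidental copies made inside member and friend functions:
//   NoCopy(const NoCopy &) = delete;
//   NoCopy & operator=(const NoCopy &) = delete;
```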

Bypass interfaces with friend functions

Using friend functions may be faster because they can bypass class getter and setter member functions. If a class declaration has a good deal of private data, it is common C++ style to declare an interface of public member functions to access private data. Although the class interface can be quite efficient if member functions are declared as inline, the need to call a function to access a data value can still make it inefficient in some cases. The use of friend functions and friend classes can be efficient because this bypasses the class interface. For example, a member function to set a data member may perform some range checking on the value, but if we can be sure that a particular function will not use incorrect data, a friend function can be used to bypass this checking.

friend functions (or friend classes) should not be considered unless the function needs very fast access to data members, and the member functions to access the data perform other computations. Note that a member function, with its special privileges, also bypasses the class interface (because it is part of it), and friend functions should not be used where member functions would be more appropriate. Programming style is the consideration here, as they would both have similar efficiency.

A good example of friend function efficiency occurs when an operator function operates on two different classes, such as when we need an operator that multiplies a Matrix object by a Vector object to yield a new Vector. Assume that both classes have member functions to access individual elements of the Vector or Matrix. Consider the declaration of the multiply function as neither a class member nor a friend function, as in:

    const int N = 10; // Number of elements in vector/matrix
    class Vector {
        double data[N];
    public:
        double get_element(int i) const { return data[i]; }
        void set_element(int i, double value) { data[i] = value; }
    };

    class Matrix {
        double data[N][N];
    public:
        double get_element(int i, int j) const { return data[i][j]; }
    };

    Vector operator * (const Matrix& m, const Vector& v)
    {
        Vector temp;
        // multiply matrix by vector
        for (int i = 0; i < N; i++) { // for each row
            double sum = 0.0; // sum of N multiplications
            for (int j = 0; j < N; j++) {
                sum += m.get_element(i, j) * v.get_element(j);
            }
            temp.set_element(i, sum); // store new vector element
        }
        return temp; // return new vector
    }

This version must go through both class interfaces to access elements. With the simple inline member functions shown, it isn't necessarily any less efficient, but if range checking of the array indices were present in the member functions that set or access the elements, that checking cost would be paid on every element access inside the nested loops.

Note that if the Vector class overloaded the [] operator instead of using a get_element member function, this would make no difference to efficiency—notational convenience is gained but the operator[] function has the same cost as any other function.
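
To make that point concrete, here is a hypothetical variant of the Vector class above (the Vector2 name and the const/non-const overload pair are my additions) where operator[] is just an ordinary inline member function:

```cpp
const int N2 = 10;

class Vector2 {
    double data[N2];
  public:
    // operator[] compiles to the same code as an inline get_element /
    // set_element pair; only the calling notation differs.
    double operator[](int i) const { return data[i]; }   // read access
    double & operator[](int i) { return data[i]; }       // write access
};
```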

One alternative to consider is to make the operator* function a member of the Vector class, but this will still mean using the interface for the Matrix class. A more efficient solution is to make the operator* function a friend of both Matrix and Vector classes, thus allowing it direct access to their individual data elements, bypassing any range checking on array indices. The more efficient version, using a friend function, is:

    const int N = 10; // Number of elements in vector/matrix
    class Matrix;
    class Vector {
        double data[N];
    public:
        friend Vector operator * (const Matrix& m, const Vector& v);
    };

    class Matrix {
        double data[N][N];
    public:
        friend Vector operator * (const Matrix& m, const Vector& v);
    };

    Vector operator * (const Matrix& m, const Vector& v)
    {
        Vector temp;
        // multiply matrix by vector
        for (int i = 0; i < N; i++) { // for each row
            double sum = 0.0; // sum of N multiplications
            for (int j = 0; j < N; j++) {
                sum += m.data[i][j] * v.data[j]; // access data directly
            }
            temp.data[i] = sum; // store new vector element
        }
        return temp; // return new vector
    }

The disadvantage of using friend functions is the same as their advantage: they pierce class encapsulation. Because a friend function directly makes use of hidden private data members, any change to the class may require a change to the definition of the friend function. By contrast, the first version of the operator* function used the “get_element” member functions of both Vector and Matrix, so it would have needed no changes, provided the “get_element” functions were correctly updated within the class.

Avoid Virtual Functions

Object-oriented programming purists will hate me for this section. C++ virtual functions are a wonderful incarnation of OOP and they can be beautiful and elegant. But you need to avoid them sometimes if speed is your goal.

They're also very fast function calls, even though done dynamically. Although virtual function calls seem like they're complicated and possibly slow, they're actually very carefully designed to be very fast to call in C++ class hierarchies. There's lots of painstaking work for compiler designers to get them to compile correctly, but their runtime efficiency is great for programmers. The implementation is effectively a small lookup table with function pointers. It's a couple more assembler statements before the function call, and the overhead of calling a function will dwarf that cost.

So, why do I say to review your use of virtual functions? Because they're an optimizer blocker. Since they're a dynamic runtime function call, there's much less opportunity for the C++ compile-time optimizations to remove these calls. Indeed, the compiler cannot always determine what function is being called and you can lose these speedups:

  • inline functions
  • constexpr function evaluation

Hence, I say you have to choose carefully in the use of virtual functions. Avoid them for speed-critical functions, and don't use them only for good OOP style if you don't really need them. But also, don't be afraid of using them in other instances because they're only marginally slower than a non-inlined function call. Kudos to the C++ language designers for that!

Avoid unnecessary virtual function calls

The use of virtual functions, when they are not needed, is obviously inefficient. virtual functions are needed only when dealing with pointers or references to objects of unknown type. If the program never uses pointers or references to objects, or if it does not have any derived classes, no function needs to be virtual and the use of virtual wastes space. In addition, because virtual functions relate only to the use of derived classes, declaring any functions as virtual in a class that has no derived classes is also unnecessarily inefficient.

One common situation where virtual may appear necessary, but need not be, occurs with redefining a member function in a derived class. This does not necessarily mean that the function must be defined as virtual in the base class (nor in the derived class — the virtual keyword is never needed in the derived class). Of course, if the program starts using pointers or references to these classes, the functions may need to be virtual, in which case it may be better style to declare the member function as virtual.

A call to a virtual function need not always be a “real” virtual call, but the compiler often cannot tell. For example, objects are commonly passed by reference or by pointer when converting functions to pass-by-reference for efficiency. Inside such a function, any call to a virtual member function must be compiled as a true virtual call, because the compiler cannot know that an ordinary function call to the member function would suffice. It does not perform any global analysis to determine that all arguments to the function are base objects, and not derived objects. For example, in the following code, the compiler cannot tell whether the call to the (virtual) print function could be replaced by an ordinary call:

    void print_base_object( Base & object)
    {
        object.print();
    }

The overhead of virtual function calls can be removed whenever the programmer can be sure that only one type of pointer/reference to an object is being used. In particular, whenever a programmer can be sure that a pointer/reference to a base class object points to a particular object, the qualified member function name can be used. For example, the virtual call uses:

    p->print();

And the more efficient code that avoids a virtual function call is:

    p->Base::print();

An example of extra information making this change possible occurs when a program uses a number of different (homogeneous) linked lists, with each linked list containing the same type of object (one with base objects, one with derived objects). When implementing a print_list function to print out a linked list, you can write it generally to call a virtual-declared print_object function:

    void LinkedList::print_list()
    {
        for (Base *temp = head; temp != NULL; temp = temp->next())
            temp->print_object();
    }

This means that each call to print_object has the run-time overhead of a virtual function call. A more efficient alternative is to make use of the knowledge that each list must contain the same type of object, and have two different print_list functions (i.e. use a virtual function to do the dirty work of printing the objects).

    void Base::print_list_hidden()
    {
        for (Base *temp = this; temp != NULL; temp = temp->next())
            temp->Base::print_object();
    }

    void Derived::print_list_hidden()
    {
        for (Derived *temp = this; temp != NULL; temp = (Derived*)temp->next())
            temp->Derived::print_object();
    }

    void LinkedList::print_list()
    {
        if (head != NULL)
            head->print_list_hidden(); // call virtual function
    }

With this approach, all of the lower-level calls to print_object can be bound at compile-time and the only virtual call is the call to print_list_hidden at the very top. Hence, by using our knowledge about the linked lists, we have reduced the number of run-time virtual function calls.

Specialize inherited member functions

In an inheritance hierarchy, the derived class is a specialized version of the base class. This means that member functions inherited from the base class can often be rewritten more efficiently to make use of the known special features of the derived class objects.

Example: Triangular Matrix Algebra. As an example, consider a class “UTMatrix” (upper triangular matrix) which is derived from class “Matrix” and represents matrices where all elements below the main diagonal are zero.

The general matrix “add” function of the Matrix class is inherited by the UTMatrix class, and it will work correctly. However, this inherited function is inefficient and it is more efficient to add a new member function to the UTMatrix class to add two upper triangular matrices avoiding all additions involving elements below the diagonal (because they are known to be zero).

In fact, it is also more efficient to write special functions to add ordinary matrices to upper triangular matrices. The computation of the determinant of a triangular matrix is also more efficient than that for a general square matrix, so this member function should also be rewritten in the UTMatrix class.
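
As a rough sketch of the specialized add, only the upper triangle needs to be touched; the UTMatrix layout, the free-function form, and the names here are my assumptions for illustration, not a definitive class design:

```cpp
const int M = 4;  // small illustrative dimension

struct UTMatrix {         // hypothetical upper-triangular matrix
    double data[M][M];    // entries below the diagonal are all zero
};

// Specialized add: skip the below-diagonal zeros (j < i), roughly
// halving the work of a general Matrix add.
UTMatrix add_ut(const UTMatrix &a, const UTMatrix &b)
{
    UTMatrix result = {};  // zero-init, so the lower triangle stays 0
    for (int i = 0; i < M; i++)
        for (int j = i; j < M; j++)   // upper triangle only
            result.data[i][j] = a.data[i][j] + b.data[i][j];
    return result;
}
```

The determinant specialization is even more dramatic: for a triangular matrix it is just the product of the diagonal elements.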

Example: Complex Numbers. As another example, consider a class “Imaginary” (imaginary numbers) derived from another class “Complex” (complex numbers). For all operations involving Imaginary objects, it is certain that the real part of the complex number is zero. Hence, it is more efficient to rewrite all inherited operations that use the real part of a Complex object, such as: addition, multiplication, norm, etc.

The main disadvantage of specializing member functions is that the code reuse advantage of inheritance is negated; more programmer time must be spent on recoding the specialized member functions. Other disadvantages are the increased probability of error, more special cases to test, and an increase in executable code size.

Assignment Operator Return Type

The return type of the overloaded assignment operator should usually be a reference type or void. A common mistake is to make it return a class object. Consider the following class declaration:

    class Integer {
        private: int val;
        public:
        Integer operator = (const Integer &x);
        // ...
    };

    Integer Integer::operator = (const Integer &x)
    {
        val = x.val; // copy data
        return *this; // return left operand
    }

This declaration of the assignment operator to return an object permits expressions using the result of assignment, such as:

    Integer x, y, z;
    x = x + (y = z); // embedded assignment
    x = y = z; // multiple assignment

However, it needlessly calls the constructor and destructor for a temporary object, leading to inefficiency, and occasionally to error. The correct declaration of the assignment operator is to return a const reference to Integer. This simply requires an & in the return type declaration, as follows:

    const Integer& Integer::operator = (const Integer &x)
    {
        // ... same as above
    }

Note that const is used because a non-const reference return type is slightly undesirable: it allows the very strange (and probably incorrect) multiple assignment:

    (x = y) = z;

Although the failure to declare the return type as a reference above was a slug rather than a bug, it can be more dangerous in other classes. For a MyString class with dynamic allocation, using a return type of MyString instead of MyString& will cause a temporary object to be created at the return statement, using the copy constructor with “*this” as the argument. If the copy constructor is defined correctly, this is often just an instance of inefficiency, but it may still lead to fatal errors related to temporary objects. If the copy constructor isn't defined correctly, the programmer has a bug whose complexity is increased by the involvement of temporary objects.

Return Type Void: Note that it may be far better simply to declare the return type of the assignment operator as void, rather than a reference type. Although this prohibits embedded assignments in expressions and also multiple assignments, these are poor style anyway and should probably be discouraged. Using return type void is also slightly more efficient because no value need be returned. However, returning the reference type is the more common C++ idiom.
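
Here is a minimal sketch of the void-return style; the Counter class is my hypothetical example:

```cpp
class Counter {
    int val;
  public:
    Counter(int v = 0) : val(v) {}
    // void return: no temporary, no reference to return, and any
    // embedded assignment like "a = b = c" becomes a compile error.
    void operator=(const Counter &x) { val = x.val; }
    int value() const { return val; }
};
```

A plain "a = b;" still compiles and copies normally; only the chained and embedded forms are rejected.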

Singleton Classes

In a one-instance class there will only ever be one object defined from it. These are called “singletons” in the “design patterns” parlance. In this situation the class can be defined very efficiently by making use of compile-time initialization with data members declared as “static” members.

An example is a hash table implementation of a symbol table (e.g. in a compiler keyword table or an AI vocabulary table used by the tokenizer), where only one symbol table will ever be used. The crucial fragment from this code is:

    class SymbolTable {
      private:
        Node * table[TABLE_SIZE]; // Hash table - array of pointers
      public:
        SymbolTable(); // constructor
    };

    //-----------------------------------------------------------
    // Constructor - initialize the hash table to empty
    //-----------------------------------------------------------
    SymbolTable::SymbolTable()
    {
        for (int i = 0; i < TABLE_SIZE; i++) // all pointers are NULL
            table[i] = NULL;
    }

If there will only be one hash table, the constructor is needlessly inefficient. A more efficient version declares the hash table as a static data member and the implicit initialization to zero will set all the pointers to NULL at compile-time. The efficient code for a one-instance hash table is:

    class SymbolTable { // ONE INSTANCE ONLY
      private:
        static Node *table[TABLE_SIZE]; // Compile-time initialization
      public:
        SymbolTable() { } // constructor does nothing
    };

    Node *SymbolTable::table[TABLE_SIZE]; // definition; zero-initialized

Temporary Objects and Destruction

Temporary objects are created automatically by the compiler in a number of situations. This is a similar idea to that of a C++ compiler generating temporary values for intermediate results of a computation. However, a temporary with class type will have its constructor and destructor activated, so temporary objects can be quite expensive.

For example, try the following class to demonstrate how a temporary object is defined for intermediate expression results, particularly that returned by the + operator:

    #include <iostream>
    using std::cout;
    class Integer {
    private: int val;
    public:
        Integer() { val = 0; cout << "Constructor\n"; }
        ~Integer() { cout << "Destructor\n"; }
        Integer(const Integer &x)
        { 
            val = x.val;
            cout << "Copy Constructor\n";
        }
        void operator=(int x) { val = x; }
        void operator=(const Integer &x) { val = x.val; }
        friend Integer operator+(Integer &x, Integer &y);
    };

    Integer operator+(Integer &x, Integer &y)
    {
        Integer temp; // user-defined temporary
        temp.val = x.val + y.val;
        return temp; // creates compiler temporary
    }

    int main()
    {
        Integer i, j, k;
        k = i + j;
    }

There are 4 calls to the ordinary constructor corresponding to i, j, k, and temp; there is a single call to the copy constructor that occurs when the return statement creates a temporary object for the object returned from operator +. This temporary object is the result of i+j and is then assigned to k.

In this case there is poor performance, but no errors related to temporary objects. In most cases, temporary objects are transparent to the programmer for a correctly defined class (i.e., one having both an assignment operator and a copy constructor). However, if the programmer unwittingly stores a reference or pointer to members of a temporary object, there may be errors in a later use of that reference or pointer. The problem is that temporary objects can be destroyed by the compiler as soon as they have been used in the computation, and so the reference or pointer is no longer valid. However, since the timing of the destruction of temporaries is implementation-dependent, some compilers will not exhibit an error for such code because they leave the destruction of temporaries till late; it depends on how aggressively a particular compiler performs its internal code optimization.

Overloaded Postfix Increment Operator

The postfix increment operator (x++) is a big slimy slug. I'm not talking about your for loop with “i++” versus “++i” for an integer, which is the same on any compiler since about the 1990s, despite the endless online arguments about it. I'm talking about overloaded increment and decrement operators for classes.

In C++ you can declare separate prefix and postfix increment overloaded operators for a class, by putting an extra dummy “int” parameter in the postfix version. In old pre-standard C++ you could also leave out the postfix version and the prefix version would be called for both usages; standard C++ instead makes “x++” a compile error if only the prefix form is declared. That fallback was not a slug, but a potential bug if you copy-paste code or use postfix ++ in template code. Also, returning the current object from the prefix increment operator is only a minor slug, because you're returning a reference to the current object (and a reference is really just a pointer).

Postfix operations are much worse. They are slower than airport queues at Thanksgiving. The semantics of the postfix increment operator (x++) in the C++ language are effectively:

    1. Create a temporary copy of your object.

    2. Increment the current object.

    3. Return the temporary object.

If you actually do this big shemozzle for a class object, you've got a whole lot of processing happening on a temporary object that's probably not even used. Maybe the optimizer will cut a lot of it as dead code, or maybe not. With the horrors of that echoing in your mind, here's my first suggestion:
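
The three steps above correspond to the canonical postfix implementation. Here's a sketch (the Tick class and its members are my invention) that makes the hidden copies explicit:

```cpp
class Tick {
    int val;
  public:
    Tick(int v = 0) : val(v) {}
    int value() const { return val; }

    Tick & operator++()          // prefix: cheap, returns a reference
    {
        val++;
        return *this;
    }

    Tick operator++(int)         // postfix: the expensive one
    {
        Tick old = *this;        // 1. copy-construct a temporary
        ++(*this);               // 2. increment the current object
        return old;              // 3. return the temporary (another copy)
    }
};
```

If the caller writes "t++;" as a statement, that temporary is created and copied just to be thrown away, unless the optimizer manages to prove it dead.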

Don't even declare postfix overloaded operators for your class.

Don't overload the postfix increment operator. In fact, you can stop it being used by declaring a dummy version that is “private” (stops external usage) with a “void” function body (stops internal usages).

    private:
        void operator++(int) { }   // Postfix denied!
        void operator--(int) { }   // ...and postfix decrement, too

Void Return Type: Note that attempts to call a postfix ++ operator on a class type may occur in template instantiation with your type. If it's your template, change the template code to use prefix operators. If you really must define an overloaded postfix increment or decrement operator, then here's my second suggestion:

Make the return type “void”

Hence, a basic usage of “x++” will compile and work correctly. Not only will it be efficient to not return anything, but the compiler will also ensure that nothing more fancy will run. A compilation error will block any use of postfix ++ that relies on the operator returning the old object. In other words, this will be fine:

    x++;

But this will get a compiler error alerting you to a problem:

    y = x++;   // Error

Standard Vector Object Resizing

The standard vector class is usually very efficient for basic data types, but you need to take care if you instantiate it with a class type. The risk is that you'll have hidden calls to this class type's constructors and destructors, potentially for every element of the vector, under various circumstances.

This slug is a type of “hidden copy constructor call” problem. If you don't manage the size of the standard C++ vector class objects in the initialization or via the “reserve” method, there can be a lot of hidden resizing happening behind the scenes whenever you are adding elements to the vector. This will at least be doing bitwise copies of the elements of each vector. But it's even worse if the vector contains complex objects with a defined copy constructor. When it's resizing the vector, it will call the copy constructor for each and every object that is an element of the vector because it needs to move them all.

Even for basic data types there can be some cost to copying the data when resizing. You can take control of this with the “reserve” function, so that the vector object doesn't need to keep resizing itself if you're adding to it.
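
A minimal sketch of taking control with reserve (the build_squares function is my own example):

```cpp
#include <vector>

// Reserving capacity up front means one allocation, instead of
// repeated reallocations (each of which copies or moves every
// existing element) as the vector grows.
std::vector<int> build_squares(int n)
{
    std::vector<int> v;
    v.reserve(n);                // no resizing during the pushes below
    for (int i = 0; i < n; i++)
        v.push_back(i * i);
    return v;
}
```

For element types with user-defined copy constructors, each avoided reallocation also avoids a constructor/destructor call per element.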

Skipping Destructor Cleanup

It's really good OOP coding style for your destructor to carefully clean up every resource your object needed, and you know, beautiful coding idioms are just so very important. I certainly wouldn't want to be the person to tell you to do some ugly hack, even if it made everything a whole boatload faster. Umm, really, I wouldn't want to, but if you promise not to tell anyone you heard it from me...

Typically, destructor cleanup means calling “delete” on allocated memory used by the data members, and for complex objects, it may also mean closing files. And I often find that the cost of the destructor starts becoming significant in its own right. And one destructor call can trigger lots more, like roaches, only without the social skills. If you call “delete” on any member objects or worse, arrays-of-objects, then those destructors get called, and this triggers a whole blam of code that cascades down the object hierarchy.

Here's a thought: don't cleanup!

This is an optimization worth considering in some cases:

  • Batch jobs
  • Re-launching server daemons
  • Program is shutting down anyway

If your program is a run-once batch job, and it's not going to be running again with a new request, or even if it's an AI inference server process that handles 1,000 user queries, after which another copy will launch in its place, then you can make like a teenager, and don't cleanup. Thumb your nose at Valgrind and comment out all those delete lines in your destructors.

Let the memory leak!

Program exit is a special case that you can detect. If your program is exiting “cleanly” then it does destructor calls to all of the global objects, and so on. And you usually know in the code when the program is shutting down, whether from a user choice, a timeout or limit exceeded, or something internal like an assertion failure. One idea is to use a global Boolean flag that says “I'm shutting down” and then check it inside all of the main destructors:

   MyClass::~MyClass()
   {
        if (g_aussie_im_shutting_down) return;  // Skip!
        ...
        // Lots of stylistically beautiful code
   }

Is it safe? What happens if you just skip all the cleanup? Well, nothing bad in many cases. The operating system reclaims all of the allocated memory as part of process exit. Files are a slightly more complicated story. The operating system also closes any open file handles, and files opened only for reading are harmless, but you might lose buffered output that was never flushed to a log file, so maybe you should still flush buffers or close any files opened for writing.

This idea of skipping destructors isn't always workable. It's not always guaranteed that ending the process will properly flush buffered output when closing files. As another, more complex example, if there's an abnormal disconnect from a database session or a remote network connection hangup (e.g. a socket session not ended properly), there might be other consequences, like error messages in the logs, locally or for the remote peer.

Initializer lists for member objects

When a class declaration contains a class object as one of its members it is important to use the correct method of initialization to retain efficiency. Consider the declaration of a class B containing a member object from class A:

    class A {
      private:
        int val;
      public:
        A() { val = 0; }
        A(int x) { val = x; }
        void operator = (int i) { val = i; }
    };

    class B {
      private:
        A a; // member is itself an object
      public:
        B() { a = 1; } // INEFFICIENT
    };

Declaring an object of type B causes the default constructor for the member object of type A to be invoked automatically, before the body of B's constructor executes. Then the = operator for class A is used to set the member object, a. Hence, the constructor for B involves a call to A's default constructor and a call to the assignment operator. The call to A's default constructor is redundant and should be avoided. Fortunately, C++ provides a convenient syntax for passing arguments to constructors of member objects. The default constructor for B should be recoded to use an initializer list:

    B() : a(1) { } // EFFICIENT

This initialization syntax causes the constant 1 to be passed to the constructor for the member object, a (the constructor accepting the int parameter is called, instead of the default constructor). Thus, instead of calling the default constructor and the assignment operator for A, only the int constructor for A is called.

This initialization method is efficient whenever calling the default constructor for a member object is not appropriate, for instance, when the member object is initialized by a call to the assignment operator within the main object’s constructor (as above, where B’s constructor assigned to its member of type A). This form of initialization can be used for any type of data member (i.e. not only class objects), although it will be neither more nor less efficient than assignment for built-in types. The special initialization syntax should be used wherever it is applicable, since it can never be less efficient than assignment to the data members within the constructor, and will often be more efficient.

Initializer lists for base objects

Base objects. Similar efficiency considerations apply to constructors in derived classes, since the data member(s) in the base class act like an object member. The constructor for the base class is always called when a derived class object is constructed. When the default constructor for the base class is of no use to a derived class object, it is more efficient to pass arguments directly to a non-default base class constructor, using the special initialization syntax. The same syntax applies as for data member initialization, except that the type name of the base class is used instead of the name of a data member. A contrived example of this form of initialization is:

    class Derived : public Base {
      public:
        Derived() : Base(0) { } // Call Base(int) constructor
    };

Avoid temporary objects

In the same way that temporary integer variables are used to compute an integer expression, so too are temporary objects used in non-trivial expressions involving class objects. For example, consider this code where the Complex class has defined the + and = operators:

    Complex c1,c2,c3;
    c1 = c2 + c3;

This is likely to create a temporary Complex object as the result of the addition, and this temporary object is then passed as an operand to the = operator. In other words, the expression is actually evaluated as:

    c1.operator=( operator+(c2, c3) );

A temporary object must be created to store the result of the “+” sub-expression, and this temporary is then passed to the “=” operator. Whether the operands of operator= are passed by reference or by value has no effect on whether a temporary is created in this situation (it only affects the creation of new objects inside the operator= function).

One (rather inelegant) method of avoiding this creation of temporaries is to create a specialized function to handle it:

    void AssignThree(Complex &c1, const Complex &c2, const Complex &c3);
    ...
    AssignThree(c1,c2,c3); // c1 = c2 + c3;

The function should probably be a friend function to allow efficient access to the data members of the three Complex objects.

The problems with this solution are its very poor style (because the neatness of the use of overloaded operators is lost), and also its non-general character. More complicated expressions will still generate temporaries, unless more special functions are added as friend functions, leading to even worse style. This “cure” is perhaps worse than the disease.
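For what it's worth, here is a minimal sketch of such a friend function. The re/im data members and accessors are my own assumptions for illustration, not the book's Complex class:

```cpp
#include <cassert>

class Complex {
    double re, im;
  public:
    Complex(double r = 0.0, double i = 0.0) : re(r), im(i) { }
    double real() const { return re; }
    double imag() const { return im; }
    // Friend: writes c2 + c3 directly into c1 with no temporary Complex
    friend void AssignThree(Complex &c1, const Complex &c2, const Complex &c3);
};

void AssignThree(Complex &c1, const Complex &c2, const Complex &c3)
{
    c1.re = c2.re + c3.re;  // add straight into the destination object
    c1.im = c2.im + c3.im;
}
```

Because AssignThree writes into its first argument in place, no temporary Complex object is ever constructed for the sum.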

Avoid temporaries via extra member functions

There are situations where the removal of temporaries does not lead to poor style. Consider the following definition of a minimal Complex class:

    class complex {
      private:
        double re; // real part
        double im; // imaginary part
      public:
        // Constructors
        complex() { re = 0.0; im = 0.0; }
        complex(double r) { re = r; im = 0.0; }
        complex(double r, double i) { re = r; im = i; }
        // Copy constructor (const reference: can also copy temporaries)
        complex(const complex &c) { re = c.re; im = c.im; }
        // Overloaded assignment operator
        void operator = (const complex &d) { re = d.re; im = d.im; }
        // Overloaded + operator
        friend complex operator + (const complex &c1, const complex &c2);
    };

    inline complex operator + (const complex &c1, const complex &c2)
    {
        return complex(c1.re + c2.re, c1.im + c2.im);
    }

Consider this class definition when used in the following code sequence:

    complex c1, c2;
    c1 = 2.0;
    c2 = c1 + 3.0;

The effect is identical to:

    c1 = complex(2.0); // invoke "double" constructor for 2.0
    c2 = c1 + complex(3.0); // invoke "double" constructor for 3.0

The C++ compiler automatically creates two temporary objects from the double constants, and calls the double constructor to do so. The inefficiency of the creation of a temporary object and the call to the constructor can be avoided by adding a few more functions to the class declaration:

    void operator = (double d) { re = d; im = 0.0; }
    friend complex operator + (double d, const complex &c2);
    friend complex operator + (const complex &c1, double d);

If these functions are present, then the double constants are passed directly to the double parameters of these functions. No temporary object is created, and hence the constructor is not called. Note that two symmetric versions of operator+ are required because the C++ compiler cannot assume that the commutativity of + holds for user-defined class objects.

By making the “interface” efficient for mixing complex and double variables, the creation of temporaries has been reduced. This can be generalized: it is better to provide member or friend functions to class X for a specific parameter type Y, than to provide only a constructor to create new X’s from Y’s.

Declare objects close to use

The C++ language allows variable declarations to appear almost anywhere within a program. Although the placement of variable declarations may seem unrelated to efficiency, it can have an effect when objects with non-trivial constructors are declared. For efficiency, an object should be declared as close to its first use as possible. In particular, the C style of declaring all variables at the top of a function is often inefficient. Consider the C++ code below:

    void dummy(...)
    {
        complex c; // create object
        if (... ) {
            .... // use c
        }
    }

The complex object is not used if the condition in the if statement is false, so the constructor and destructor for the unused object are called needlessly.
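The fix is simply to move the declaration into the conditional block, so the constructor and destructor run only on the path that actually uses the object. A small sketch with an instrumented constructor (the Widget class and global counter are my own illustration):

```cpp
#include <cassert>

static int g_ctor_calls = 0;  // counts constructor invocations (illustration only)

struct Widget {  // hypothetical class with a non-trivial constructor
    int value;
    Widget() : value(0) { ++g_ctor_calls; }
};

void dummy(bool condition)
{
    if (condition) {
        Widget w;      // constructed only on the path that uses it
        w.value = 42;  // ... use w ...
    }
}
```

Calling dummy(false) now runs no Widget constructor at all, whereas the top-of-function style would construct (and destruct) one needlessly.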

Declare Objects with Full Initialization

Another consideration is that objects should not be declared until there is enough information to construct them fully. For example, given a user-defined class “complex”, consider the following code:

    complex c; // construct c
    // ....
    c = 1.0; // initialize c

This is less efficient than calling the correct constructor directly by using:

    complex c(1.0); // construct and initialize c

The first code sequence involves a call to the default constructor and the overloaded operator=, whereas the second declaration calls only the (double) constructor for the complex class.

Unfortunately, there are practical limits to the extent to which objects can be declared near their first use. If the first use of an object is inside a compound statement, and the object must also be used outside the compound statement, the scope resolution rules prevent the declaration from being placed inside the compound statement. For example, consider the code below:

    double d;
    complex c;
    while(....) {
        cin >> d; // get double value from user
        c=d; // set complex number
    }
    cout << c; // print the complex number

In this sequence, it would be more efficient to declare “c” inside the loop block using the direct call to a double constructor:

    complex c(d);

However, this would prevent the use of c outside the scope of the braces. This limitation is an unfortunate consequence of the language design choice to make braces both the method of grouping statements and the scoping mechanism in C++ (although there are many more important advantages supporting this decision). Unfortunately, it is not even possible to remove the braces in the above example by using the comma operator, as in:

    while(....)
        cin >> d, complex c(d); // FAILS: compilation error

C++ syntax prevents a declaration from being an operand of the comma operator.

Nothing Constructors. What we really want is a way to declare a class-type variable without running its constructor. I'm not aware of a good way to do this. One workaround is to use pointers and dynamically allocated “complex” objects, which works and is fully standard, but adds extra memory management overhead.
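A sketch of that pointer workaround in modern C++ (the Cx type is a stand-in for the complex class): hold only a smart pointer, which delays construction until the value is known, at the cost of one heap allocation:

```cpp
#include <cassert>
#include <memory>

struct Cx {  // hypothetical stand-in for the complex class
    double re, im;
    Cx(double r) : re(r), im(0.0) { }
};

double read_and_use(double d)  // d stands in for a value known only at runtime
{
    std::unique_ptr<Cx> c;        // no Cx constructor has run yet
    // ... later, once the value d is known ...
    c = std::make_unique<Cx>(d);  // construct once, with the right constructor
    return c->re;
}
```

The default constructor and the assignment operator are both avoided; only the double constructor ever runs.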

Here's a thought. Maybe something like this works? Add a dummy constructor with a dummy parameter type to the complex class:

    class Banana { };
    complex(Banana b) { } // member of class complex: does nothing!

Then your call to the dummy constructor is hopefully optimized to nothing:

    Banana b;
    complex c(b);  // Nothing!

Data Member Optimizations

These optimizations apply to C++ objects or structures. There are various ways to speed up the data accesses and writes to a data member in an object.

Avoid bit-fields. Bit-fields are a special C++ feature designed to reduce space in an object or structure.

    struct node {
        unsigned int visited :1; // bit-field 
    };

Avoid bit-fields if you want runtime speedup. They are great at reducing memory size, but often at the cost of extra run-time overhead on any accesses to these fields. Hence, for improved efficiency, at the cost of space wastage, remove the “:1” qualification and change to a small data type such as bool, char, or unsigned char.
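For example, a de-slugged version of the node structure above simply widens the flag to a byte, trading one byte of space for plain load/store access code:

```cpp
#include <cassert>

struct node_fast {          // de-slugged version of the bit-field struct
    unsigned char visited;  // whole byte: no mask/shift on each access
};
```

Accesses to visited now compile to a simple byte load or store, with no masking arithmetic.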

Memory alignment: If there are mixed size data members, or there are some with “alignas” alignment settings, then memory alignment issues can needlessly create an oversize object. This is more of a problem in terms of unnecessary space usage, but adds inefficiencies in the need to initialize or copy the extra padding bytes for large arrays of objects. The general rules for minimizing size are to: (a) order members from large to small, and (b) group like size data types together.
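The padding cost is easy to see with sizeof. The exact sizes below assume a typical platform (8-byte-aligned doubles, 4-byte ints), but the large-to-small ordering rule holds generally:

```cpp
#include <cassert>

struct Padded {   // interleaved small and large members force padding
    char a;       // 1 byte, then padding to align the double
    double x;     // 8 bytes
    char b;       // 1 byte, then padding to align the int
    int y;        // 4 bytes
};                // typically 24 bytes on a 64-bit platform

struct Packed {   // same members, ordered large to small
    double x;     // 8 bytes
    int y;        // 4 bytes
    char a, b;    // 2 bytes (plus tail padding)
};                // typically 16 bytes
```

Same four members, roughly a third smaller, just from reordering the declarations.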

Most used data member first. The machine code for accessing a data field of a structure or object usually involves adding a field-specific offset to the base address of the object. References to the first field of a structure can often be more efficient because there is no offset to add (i.e., the offset is zero). Hence, the most used class data member or structure field should be placed first in the declarations.

Order data members by usage. It's not just the first data member whose order matters. Memory access issues such as data locality, predictive caching and memory access pipelining mean that all of the most-used data members should be close together in an object. In very large objects, there are some platforms where smaller offsets are more quickly calculated, such as data members with less than 128 or 256 as their offset. Hence, a simple optimization is to order the data member declarations according to their usage.

Function Slugs

Functions are an important building block of your code. Some ways to get the slugs out of functions include:

  • Declare small functions inline.
  • Avoid recursion.
  • Pass objects by reference.
  • Avoid function pointers.
  • Specialize functions with default arguments.

Avoid Function Pointers

C++ supports a data type called a “function pointer” or “pointer to a function” as part of the standard language. Function pointers are carefully type-checked, so they are reasonably efficient. However, being a fancy pointer construct doesn't make them any faster than regular function calls, and there's a simple reason they're not super-efficient: they're still function calls!

A call through a function pointer is still a function call, so it has the whole call sequence to execute. That's not much worse than a standard function call, but there's another problem: function pointers make it difficult for the C++ compiler to get rid of the function call. The use of a function pointer obscures much of the normal compile-time optimization logic. Hence, function pointers can be less efficient with:

  • inline functions
  • constexpr functions
  • Intrinsic functions

In summary, they're a neat feature of C++, but not an efficiency gain. Use function pointers if they are convenient, but not as a speedup.
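A small example of the pattern. Both functions compute the same result, but the compiler can substitute the body of square() at the direct call site, whereas the call through fp generally compiles to a genuine indirect call (unless the optimizer can prove what fp points to):

```cpp
#include <cassert>

inline int square(int x) { return x * x; }

int call_direct(int x) { return square(x); }  // easily inlined

int call_indirect(int x)
{
    int (*fp)(int) = square;  // function pointer obscures the target
    return fp(x);             // usually a genuine indirect call
}
```

Same answer either way; only the direct version gives the optimizer a clear shot at inlining.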

Change recursion to iteration

Recursion is an elegant method of problem solution, but often incurs unnecessary function call overhead. Where possible, recursion should be replaced with an iterative algorithm. For example, the famous example of a recursive “factorial” function would always be coded in a loop by professional programmers.
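For the record, here is the loop version of factorial that professionals would write: one loop, no per-step function call overhead:

```cpp
#include <cassert>

long factorial(int n)  // iterative factorial: no recursion needed
{
    long result = 1;
    for (int i = 2; i <= n; i++)
        result *= i;
    return result;
}
```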

Fibonacci numbers. With a little insight, many recursive algorithms can be coded without recursion. For example, the Fibonacci number sequence (1,1,2,3,5,8,13,...) is defined by having the next number as the sum of the previous two numbers, with the following recursive rules:

    Fib(0) = 1
    Fib(1) = 1
    Fib(n) = Fib(n−1) + Fib(n−2)

This has the obvious and very elegant recursive implementation:

    int fibonacci(int n)
    {
        if (n <= 1 )
            return 1;
        else
            return fibonacci(n - 1) + fibonacci(n - 2);
    }

However, there is no need to use recursion here, and a short loop is adequate. A non-recursive computation of the Fibonacci numbers is shown below:

    int fibonacci(int n)
    {
        int small = 1, large = 1;  // F0 = F1 = 1
        while (n > 1) {
            int temp = small + large; // Fn = Fn-1 + Fn-2
            small = large;
            large = temp;
            n--;
        }
        return large;
    }

Binary Trees. There are many examples of common algorithms that are unnecessarily coded using recursion. Almost all linked list algorithms can be coded without recursion, as can the most common binary search tree operations: search, insertion and deletion. For example, the recursive implementation of tree insertion is:

    // Note: "Tree" is assumed to be a pointer typedef (a node pointer type)
    void insert(Tree *root, Tree new_node)
    {
        if (*root == NULL) // Found bottom of tree 
            *root = new_node; // insert here 
        else {
            if (new_node->data <= (*root)->data)
                insert(&(*root)->left, new_node);
            else
                insert(&(*root)->right, new_node);
        }
    }

The non-recursive version of binary tree insertion is given below. It is somewhat less elegant, uses a few more variables, but should be more efficient.

    void insert(Tree *root, Tree new_node)
    {
        Tree temp = *root;
        if (temp == NULL) // empty tree special case
            *root = new_node;
        else {
            for (;;) {
                if (new_node->data <= temp->data) { // go left?
                    if (temp->left == NULL) { // leaf?
                        temp->left = new_node; // insert it
                        return; // finished
                    }
                    else
                        temp = temp->left; // go left
                }
                else { // going right
                    if (temp->right == NULL) { // leaf?
                        temp->right = new_node; // insert it
                        return; // finished
                    }
                    else
                        temp = temp->right; // go right
                }
            }
        }
    }

I'm sorry, Professor! Your recursive code is short and beautifully elegant, but mine is longer, uglier, and faster! Maybe I shouldn't tell my Professor that I've never coded a binary tree since finishing my degree? Hash tables are the name of the game.

Eliminating tail recursion

Recursion is rarely a good solution, but some types of recursive algorithms are not easy to change to loops, because they would require a stack data structure to do so. If a stack is needed, there may be little gain in removing recursion fully — it depends on how efficiently recursion is implemented by the compiler on the builtin C++ function call stack, versus your skill in hand-coding a stack data structure.

In these situations, a simpler optimization is still possible without a stack. Partial recursion elimination without the need for a stack is possible via the elimination of “tail recursion.” Tail recursion occurs when the last action of the recursive procedure is to call itself.

A simple modification changes this last recursive call to become a loop back to the top of the current invocation. For example, consider the preorder traversal of a binary tree. The simplest recursive algorithm is:

    void preorder(node_ptr root)
    {
        if (root != NULL) {
            visit(root);
            preorder(root->left);
            preorder(root->right); // Tail recursion here
        }
    }

Tail recursion can be eliminated by replacing the if statement with a while loop. The transformation effectively reduces recursion by half, as the second recursive call is eliminated. This reduction in recursion is achieved with virtually no extra overhead!

    void preorder(node_ptr root)
    {
        while (root != NULL) { // while loop replaces if
            visit(root);
            preorder(root->left);
            root = root->right; // Move to right subtree
        }
    }

Replacing recursion with a stack

Some recursive algorithms cannot be easily replaced by iterative loop equivalents. For example, in the preorder binary tree traversal above, we were unable to remove both of the recursive calls. In these situations, recursion can be replaced with an algorithm using a stack data structure.

All recursive algorithms can be replaced by a stack because recursive algorithms are actually using an implicit stack (the program stack of function calls). Whether use of a stack will be more efficient than a recursive algorithm depends on a number of factors. The choice of a stack over recursion is machine-dependent. In particular, it is quite likely that the program stack is supported by efficient low-level instructions and that (recursive) function calls are executed very efficiently. Can you do better?

On the other hand, recursion requires that much information be stored on the stack (i.e. parameters, automatic local variables, machine registers), whereas an algorithm making use of an explicit stack will usually only need to store a few items, making it potentially faster than the function call stack. If the maximum size of the required stack is known beforehand, a stack can be quite efficiently implemented as an array, whereas a dynamic stack as a linked list will usually be more costly because of the cost of memory allocation.

The following shows the preorder traversal with tail recursion elimination removing one recursive call and an explicit stack replacing the other. In this case, the explicit stack need only store pointers.

    void preorder(node_ptr root)
    {
        stack_type S;
        init_stack(S); // set to empty stack
        while (root != NULL || !is_empty_stack(S)) {
            if (root != NULL) {
                visit(root); // visit a tree node
                push(S, root->right); // save right subtree
                root = root->left; // go to left subtree
            }
            else
                root = pop(S); // get node from stack
        }
    }
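The stack primitives above were left abstract. If the maximum depth is known, a fixed-size array version is usually fastest. Here is one hedged sketch: node_ptr is a placeholder element type, MAX_DEPTH is an assumed bound that must cover the tree height, and there is no overflow checking:

```cpp
#include <cassert>

typedef int *node_ptr;     // placeholder element type for this sketch

const int MAX_DEPTH = 64;  // assumed bound: must exceed the tree height

struct stack_type {
    node_ptr items[MAX_DEPTH];
    int top;               // index of the next free slot
};

void init_stack(stack_type &S)           { S.top = 0; }
bool is_empty_stack(const stack_type &S) { return S.top == 0; }
void push(stack_type &S, node_ptr p)     { S.items[S.top++] = p; }
node_ptr pop(stack_type &S)              { return S.items[--S.top]; }
```

Each operation is a single array access and an increment or decrement, which is about as cheap as a stack can get.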

Collapsing recursive calls

If you can't be bothered changing a recursive algorithm to a loop or stack, here's a smaller optimization to consider. By channeling the spirit of loop unrolling, we can “collapse” one or more levels of recursion into sequential code. The method of “function call collapsing” can be applied to recursive functions in this limited sense. Obviously, it isn’t possible to collapse a recursive function call completely into inline code, but it is possible to collapse a few levels of recursive calls at a time, reducing the total number of recursive calls by a constant factor.

Moving the recursive base case higher. The simplest method is to test the base case one level higher up. In the simple implementation of the preorder traversal, the recursive base case is “root==NULL”. If this occurs, the function call does nothing. One simple method of avoiding these unnecessary function calls is to test for the base case before the recursive call. The new function becomes:

    void preorder(node_ptr root)
    {
        while (root != NULL) {
            visit(root);
            if (root->left != NULL) // Test moved up
                preorder(root->left);
            root = root->right;
        }
    }
Collapsing multiple levels of recursion. By converting multiple levels of recursive calls into sequential code, the function does much more work each time, but makes recursive calls less frequently, thereby reducing function call overhead. For example, the preorder traversal can be rewritten so that the current node and its two children are handled by the function, and then recursive calls are made for any of the children’s children:

    void preorder(node_ptr root)
    {
        if (root != NULL) {
            visit(root);
            if (root->left != NULL) { // do left child
                visit(root->left);
                preorder(root->left->left);
                preorder(root->left->right);
            }
            if (root->right != NULL) { // do right child
                visit(root->right);
                preorder(root->right->left);
                preorder(root->right->right);
            }
        }
    }

But alas, we've reverted here to a fully recursive version again, just to show function call collapsing. The above method should also be combined with (a) tail recursion elimination, and (b) a stack data structure. This is left as an exercise for the reader (thankfully), and as a project scope estimate, I suggest two weeks!

Use Parameters as local variables

Parameters to functions can be used as if they were local variables. Because of C++ call-by-value parameter passing of basic data types (not arrays), the modification of a parameter inside the function does not change the values of any variables not local to the function. This method saves on initialization time, and on stack space. In the example below, to zero an array, the size is counted down, rather than having a local variable counting up.

    void zero_array(int arr[], int n)
    {
        while (n > 0)
            arr[--n] = 0;
    }

This code also has the optimization of “looping down to zero”. Note that we have to be careful that this code doesn't access arr[n], but does correctly clear arr[0]. It works: the predecrement means the first store is to arr[n-1] and the last is to arr[0], after which n is zero and the loop exits. My brain was still on fire checking it, though.

Pass function parameters by reference

Passing objects or large parameters by value is an inefficiency. The C++ language provides a very convenient method of achieving pass-by-reference, by simply using & in the parameter declaration. One method of improving efficiency is to pass objects to functions as reference parameters.

Behind the scenes, pass-by-reference is like passing a single pointer as the parameter. This avoids not only the cost of copying a large object onto the stack, but also the cost of the copy constructor and destructor for the object within the function (i.e. the parameter is a separate object when passed by value).

A pass-by-value parameter can safely be changed to pass-by-reference only if the function does not modify the object. Fortunately, accidental modifications can be detected simply by qualifying the parameter declaration with const, forcing the compiler to report any modification of the object within the function. An example of the use of reference parameters in the definition of a Complex object is shown below:

    class Complex {
        double r, i;
      public:
        Complex & operator += (const Complex & c);
        // c is passed by reference for efficiency
        // The return type is also a reference
    };

    Complex & Complex::operator += (const Complex & c)
    {
        r += c.r; // add to both data fields
        i += c.i;
        return *this; // return reference to updated object
    }

Const reference parameters. Passing the argument by reference improves efficiency by avoiding the copying of big objects. Note that the parameter is declared “const” as well as “&” indicating a reference. This “const&” pattern is the common C++ idiom for passing what is logically an unmodified by-value object into a function as a faster reference type.

Returning References. This code also has a second optimization: reference return types. Making the return value a reference is also efficient, because the return statement does not invoke the copy constructor. Note that a returned reference is necessary only if the user of the Complex class uses complicated expressions such as x+=y+=z. If such expressions are not required, efficiency can be improved by making the return type void.

Objects Only. The use of references is best limited to class objects, and also to structures and unions. Arrays are already passed by reference in C++ and hence there is no need to change them. The use of references for scalar types (integers, float, double, and pointers) is unlikely to give much improvement, if any, and might even be slower for some.

Pitfall: Temporary Objects. Another disadvantage of using reference parameters for scalar types like “int” is the inefficiency caused if a constant value is passed as an argument (i.e. a number, not a variable). Paradoxically, passing a constant argument to a const reference parameter is not an error in C++; instead, a new temporary object of that type is created automatically by the compiler and its address is passed.
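A tiny example of the pitfall (the function name is mine):

```cpp
#include <cassert>

int add_one(const int &x)  // scalar passed by const reference
{
    return x + 1;
}

// The call add_one(41) is legal: the compiler materializes a hidden
// temporary int holding 41 and binds the reference to it. Correct,
// but no faster than plain pass-by-value for an int.
```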

Implicit “this” object. Note that the object to which a member function is applied is already passed by reference in a certain sense, because it is using the implicit “this” parameter. Hence, the simple types of member function calls are already efficiently using a hidden type of pass-by-reference of the object itself. Consider this code:

    int MyClass::fn() // member function
    {
        return x;
    }

Using a non-member friend function with an explicit reference parameter is no faster. This code will not be more efficient (and is probably less efficient):

    int fn(MyClass & object) // friend function
    {
        return object.x;
    }

Specialize functions with default arguments

Every default function argument is a place where you can optimize. Default arguments to functions are not a source of inefficiency in themselves, and cost no more than using a fixed-argument function and passing some constants explicitly. However, the use of default arguments indicates the possibility of improving efficiency by replacing a single function with a number of specialized functions.

How to do this? Instead of one function with a default argument, create two functions using function overloading. The specialization of the function into two separate functions will often make other optimization techniques possible, thus improving overall efficiency at the cost of some duplication of executable code. As an example of the possibilities that can exist, consider the function with default arguments:

    void indent(int n = 4) // default argument n=4
    {
        for (int i = 0; i < n; i++)
            cout.put(' ');
    }

Rewriting this single function as one general function and one specialized function leads to opportunities for optimization in the specialized function. In this case, loop unrolling can be employed:

    void indent() // Specialized function (n=4)
    {
        cout.put(' '); // Loop completely unrolled
        cout.put(' ');
        cout.put(' ');
        cout.put(' ');
    }

    void indent(int n) // General function
    {
        for (int i = 0; i < n; i++)
            cout.put(' ');
    }

Note that this optimization is also limited in scope, as there is no need to change any other code that calls the functions. The C++ compiler will automatically make the correct choice of which overloaded function to call. Another thought, for improved readability, is to name the specialized function differently (e.g., indent4), although this requires calls to the function to be changed. In any case, default arguments are certainly convenient, and the slight increase in efficiency should be balanced against the loss of good programming style.

Medium-Sized Slugs

There are a lot more examples of possible inefficiencies in C++ coding. Some of the types of errors that are “medium-sized” slugs include:

  • Automatic array initializations with constant data.
  • Loop test function calls (i.e. expensive loop conditional tests).
  • Member initializations in the constructor body (they should be in the initializer lists).
  • Program startup hidden initializations (global or static object constructors).
  • Small non-inline functions called frequently.
  • Busy wait loops.
  • Unnecessary code inside loops.
  • C++ classes wrapping simple data types (e.g. overuse of “smart pointers” or “smart integer” classes).
  • Overuse of standard string concatenation operations.
  • Recursion is almost always a slug.

Automatic Array Repeated Initialization

A simple example of unnecessary double initializations is any type of large local variable, such as an automatic array. When a function makes use of a large array variable with constant data, or even a large constant object, the variable should probably be declared as both “const” and “static”, even if it need not retain its value between calls. Consider the following code example:

    const char *convert(int day)
    {
        const char *days[] = { "Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday", "Sunday" };
        return days[day];
    }

The initialization of array “days” illustrates an inefficiency. The initialization of “days” occurs every time the convert function is entered. It would be much more efficient to declare “days” as a static variable to avoid it being re-initialized, and also “const” to help the compiler optimize.
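The repaired version looks like this; the static array is initialized once, and being const, the compiler can place it in read-only storage:

```cpp
#include <cassert>
#include <cstring>

const char *convert(int day)
{
    static const char *const days[] = {  // built once, not on every call
        "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday", "Sunday"
    };
    return days[day];
}
```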

Data Structure Double Initialization

If you have an initialization routine that does a lot of work, it sometimes becomes a slug by accident. I'm not talking about a single variable initialization, but the initialization of a large program data structure at startup, like a precomputed lookup-table or a perfect hashing algorithm. In the design patterns vocabulary, such a situation is a “singleton” data structure, where only a single object ever exists in the program. It's easy to lose track of whether its initialization routine has been called, and then it gets called twice (or more!).

An example would be some of the precomputation methods whereby a large lookup-table is initialized at program startup. For example, a 24-bit lookup table has been used elsewhere in this book to optimize AI activation functions such as GELU.

The way to avoid the slug of double-initialization is simply to track calls to the initialization routine. The idiom that I use is a local static variable of type bool at the start of the initialization function:

    static bool s_once = false;
    if (s_once) {
        yassert(!s_once);  // Should be once only
        return;  // Avoid double initialization!
    }
    s_once = true;

Another way is to actually count the calls with an integer, which is a generalization that works for additional scenarios:

    static int s_calls = 0;
    ++s_calls;
    if (s_calls > 1) {
        yassert(s_calls <= 1);
        return;  // Avoid double initialization!
    }

Note that I've shown how to wrap these multiple lines of code up into a single “yassert_once” macro in Chapter 41, if you want a simpler method.

Singleton global objects. If you've done the hard yards to declare a big data structure like this as its own class, then you can simply instantiate only one object (i.e. as a global). The C++ class infrastructure does well in ensuring that a constructor is only called once. Even so, it may be worthwhile to declare a static data member and use similar logic to ensure that initialization on this object isn't ever done twice.
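One common way to get this guarantee is a function-local static object (sometimes called a “Meyers singleton”). Since C++11, the language guarantees the local static is constructed exactly once, even with multiple threads. The class and function names below are illustrative, not from the book's code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical big lookup-table class, for illustration only.
class BigTable {
public:
    BigTable() : data_(256, 0) { ++s_constructions; }  // expensive init here
    std::size_t size() const { return data_.size(); }
    static int s_constructions;  // tracks how many times the ctor ran
private:
    std::vector<int> data_;
};
int BigTable::s_constructions = 0;

// "Meyers singleton": the local static is constructed exactly once,
// on the first call, no matter how many times get_table() is called.
BigTable &get_table()
{
    static BigTable table;
    return table;
}
```

Every caller goes through get_table(), so there is no separate “init” routine to accidentally call twice.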

In any of these situations, it's a worthwhile investment of a couple of CPU instructions, an increment and a test, to avoid accidentally running the whole routine again. Since the code is virtually identical in all cases, you could hide these few statements behind a standard C++ preprocessor macro with a name of your choosing, which also avoids copy-paste typos. Or you could even use an inline function with the “return” statement changed to throwing an exception.
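Such a macro might be sketched as follows (the macro name is just a placeholder; it works only in functions returning void, since it expands to a bare “return”):

```cpp
#include <cassert>

// Sketch of a once-only guard macro (name is a placeholder).
// Expands to the static-bool idiom shown above, and returns
// early if the enclosing void function has already been run.
#define INIT_ONCE_GUARD() \
    do { \
        static bool s_once = false; \
        if (s_once) return; \
        s_once = true; \
    } while (0)

static int g_init_count = 0;  // for demonstration only

void init_tables()
{
    INIT_ONCE_GUARD();
    ++g_init_count;  // the expensive initialization work goes here
}
```

Calling init_tables() repeatedly then runs the body only once.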

Busy waiting for input

Humans are very slow compared to computers. In particular, a computer can do much work in the background, even when handling the (slow) interactive input of a human. Hence, one method of improving efficiency is to perform background processing while awaiting input, instead of using blocking input that waits for a keypress before doing anything. Note that blocking input functions such as std::cin and scanf cannot be used for this kind of keypress polling.

A common example of this idea is chess-playing programs that “think” during their opponent’s time. The computer can continue its game-tree analysis while waiting for the player to press a key or click a mouse. The C++ standard provides no simple standardized function for non-blocking input. In general, there are two ways:

  • Keyboard polling API calls (non-portable).
  • Multi-threading with input on one thread and processing on another.

There are various non-portable ways to poll for key presses. For example, on Windows there are the “_getch” and “_kbhit” functions (the older names “getch” and “kbhit” are deprecated in favor of the underscore versions). Assuming you've found a workable polling API call, at some regular interval, perhaps before each node of the game tree is analyzed, the chess program checks if a key has been pressed. If a key has been pressed, the chess program stores information about its current analysis, and processes the user's keystroke. Unless the key press completes the user’s move, the background analysis can continue after processing the key.

Overall, there's no simple and standardized way to do non-blocking input in C++. This is probably because of C’s ancestry, where it was difficult to poll the keyboard on a traditional UNIX line terminal. Multi-threading can be used in C++ to achieve the result instead.
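The multi-threaded approach can be sketched as below. One thread stands in for the input handler (a real program would block on std::getline or a keypress there), while the main thread keeps computing until an atomic flag signals that input arrived. The function names and the 10-millisecond stand-in delay are illustrative:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Sketch of the two-thread approach: one thread handles (blocking)
// input, the other keeps computing until input arrives.
std::atomic<bool> g_input_arrived{false};

long long compute_until_input()
{
    long long work_done = 0;
    while (!g_input_arrived.load(std::memory_order_relaxed)) {
        ++work_done;  // one unit of background analysis
    }
    return work_done;
}

long long run_demo()
{
    std::thread input_thread([] {
        // Real code would block here, e.g. std::getline(std::cin, line);
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        g_input_arrived.store(true, std::memory_order_relaxed);
    });
    long long work = compute_until_input();
    input_thread.join();
    return work;
}
```

The background thread does useful work for the whole time the “user” is deciding, rather than sitting idle in a blocking read.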

Slow disk I/O

The cost of performing I/O on disk files can make up a large proportion of the run-time cost of some programs. For reducing the amount of data to be read from or written to the disk, the main methods are:

  • Use smaller records.
  • Cache frequently used records.
  • Buffer multiple reads or writes.
  • Compress data.
  • Use better data structures.

A very simple method of reducing disk I/O is to reduce the size of records being read or written. This can be achieved using many of the methods to create smaller objects. There are various methods in C++ to reduce a class object's byte size: unions, bit-fields, packing, smaller data types, or reordering data members.
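For example, simply reordering members and choosing smaller types can shrink a record noticeably. Exact sizes are platform-dependent, so the structs below are illustrative:

```cpp
#include <cstdint>

// Poorly laid out record: padding is inserted around the small
// members to satisfy the alignment of the double.
struct RecordBig {
    char flag;     // 1 byte + padding before the double
    double value;  // 8 bytes
    char code;     // 1 byte + tail padding
};

// Same data, with the largest member first and smaller fixed-width
// types for the flags, so the padding mostly disappears.
struct RecordSmall {
    double value;       // 8 bytes
    std::uint8_t flag;  // 1 byte
    std::uint8_t code;  // 1 byte + small tail padding
};
```

On a typical 64-bit ABI, RecordBig is 24 bytes but RecordSmall is 16, a one-third reduction in disk traffic for the same data.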

Caching is useful if some records are being read more often than others. It is a very general idea and there are many possible implementations. You can even create your own caching mechanism.

It may be possible to keep all of the most frequently used records in main memory, writing them to disk only at the end of the program (even caching records in memory and writing them to disk for every modification will still avoid the cost of multiple disk reads).

If this method cannot be used, try using several memory locations for record I/O, and whenever a read operation is required, examine these in-memory records first. If any of them is the required record, the cost of a disk read is avoided. Caching always has a slight overhead, and may increase run-time slightly if the desired records are rarely in memory; however, it will never increase the amount of disk I/O and the computational overhead is likely to be small compared to the cost of reading a record from disk.
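A toy sketch of the caching idea: keep previously read records in an in-memory map, and only go to the “disk” (here a stand-in function that counts its calls) on a cache miss. All names here are illustrative:

```cpp
#include <string>
#include <unordered_map>

static int g_disk_reads = 0;  // counts simulated disk accesses

// Stand-in for an expensive disk read, for illustration only.
std::string read_record_from_disk(int key)
{
    ++g_disk_reads;
    return "record-" + std::to_string(key);
}

// Cached read: check the in-memory map first; hit the "disk"
// only when the record isn't already cached.
std::string read_record(int key)
{
    static std::unordered_map<int, std::string> cache;
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;  // cache hit: no disk I/O
    std::string rec = read_record_from_disk(key);
    cache.emplace(key, rec);
    return rec;
}
```

A real implementation would bound the cache size (e.g. with an LRU eviction policy), but even this unbounded version never does more disk reads than the uncached code.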

When reading or writing multiple contiguous records, disk I/O can be sped up by transferring a number of records each time. The advantage is that buffering multiple operations reduces the number of disk seek operations. For example, when using &lt;stdio.h&gt;, the buffering can be changed using the setbuf and setvbuf functions.
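A minimal sketch of enlarging a stdio buffer with setvbuf (the 64 KB size and the function name are arbitrary choices; setvbuf must be called before any I/O on the stream, and the buffer must outlive the stream):

```cpp
#include <cstdio>
#include <vector>

// Sketch: give a stdio stream a larger buffer so that many small
// fputs calls are batched into fewer OS-level writes.
bool write_with_big_buffer(const char *filename)
{
    std::FILE *fp = std::fopen(filename, "wb");
    if (!fp) return false;
    static std::vector<char> buffer(64 * 1024);  // 64 KB, arbitrary size
    std::setvbuf(fp, buffer.data(), _IOFBF, buffer.size());  // before any I/O
    for (int i = 0; i < 1000; ++i) {
        std::fputs("record\n", fp);  // accumulates in the buffer
    }
    return std::fclose(fp) == 0;  // flushes the buffer to disk
}
```

The static buffer is a quick way to satisfy the lifetime requirement in this sketch; a real program would tie the buffer's lifetime to the stream's.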

Another alternative is to use other low-level I/O functions, such as the POSIX open, read and write system calls. However, this method reduces portability of the code.

When the amounts of data being read are quite massive, the level of disk I/O can be reduced by compressing the data in the file. Read and write operations then have the overhead of uncompressing or compressing the data, but the cost of this computation may well be less than that of the disk I/O (or it might also be more; be careful!). However, methods of compressing data are beyond the scope of this book.

The use of a different data structure for data in disk files is often worthwhile. In particular, if the disk file is being searched, then many search algorithms are applicable. For example, binary search can be performed on a direct access file if the data is sorted. However, even binary search is inefficient for large disk files, and data structures specifically intended for disk data should be used. The B-tree is a commonly used data structure, and hashing is another possibility. Unfortunately, these algorithms are highly advanced and again beyond the scope of this book.

Incorrect choice of loop

Although the choice of loop is largely a matter of style, there is an important difference between the post-tested “do” loop, and the pre-tested “for” and “while” loops. The loop condition of a do-while loop is not evaluated on the first iteration and the loop body is always executed at least once. However, a for or while loop condition is evaluated before the first iteration and the loop body need not be executed at all. A common form of minor inefficiency is using a pre-tested loop whose body is always executed on the first iteration, such as:

    bool done = false;
    while(!done) {
        // ....
    }

It is more efficient to use the do loop, which avoids a single evaluation of the loop condition:

    bool done = false;
    do {
        // ....
    } while(!done);

The use of the correct type of loop is also helpful to the optimizer. It is valuable to know that a code segment is always executed once.

Infinite loops are control flow structures that can also be detected and used by the optimizer. Hence, you should code an infinite loop explicitly by using one of the common idioms:

    for(;;)       // Forever
    while(1)      // Common
    do..while(1)  // Not commonly used

This allows the compiler to generate efficient code, because you've made it easy for the compiler to recognize the loop as infinite.

Exit loops and functions early

Control structures should be exited as soon as possible, including function paths and loops. This means judicious use of return for functions and break or continue for loops.

Using “return” as early as possible in a function is efficient. It prevents unnecessary code being executed. Testing for edge cases at the start of a function is an example of using the return statement to do “easy cases first” or “simple cases first” optimizations.

Exit loops early. Similarly, both break and continue are efficient, as no more of a loop is executed than is necessary. For example, consider the code using a Boolean variable “done” to indicate the end of the loop, as in:

    done = false;
    while (!done) {
        ch = get_user_choice();
        if (ch == 'q')
            done = true;
        else
            ... // rest of loop
    }

The faster code has a break statement used to exit the loop immediately:

    while (1) { // Infinite loop
        ch = get_user_choice();
        if (ch == 'q')
            break; // EXIT EARLY!
        else
            ... // rest of loop
    }

Unfortunately, the overuse of jump statements such as break and continue can make the control flow of a program less clear, but professional C++ programmers are used to these statements being used often.

More Slug Repellent

There are plenty of other optimizations in the chapters on compile-time optimizations, code transformations, loop optimizations, and vectorization. Well, actually most of the book! Nevertheless, here's a list of some more C++ code optimization techniques for you to consider. Some of the bigger ideas:

  • Use “move constructors” instead of copy constructors where appropriate (since C++11).
  • Use static data members where appropriate, so they are initialized once only.
  • Use std::sort rather than qsort.
  • Don't put try..catch inside an inner loop that's a bottleneck.
  • Use std::bitset for bit sets or bit vectors.
  • Use the “iterators” design pattern rather than returning a full scan of a data structure all at once (saves memory and allows early exit).
  • Consider basic C++ arrays instead of std::vector if it has a fixed size (known at compile-time) or its maximum size is small enough.
  • Consider C++20 coroutines where appropriate for the architecture.
  • Structure of arrays (SoA) data layout is more vectorizable than Array of Structures (AoS).

And some of the smaller optimizations:

  • Commonly used object or struct fields should be first. On some platforms, smaller offsets from the start of an object are accessed faster. Also, the very first field has offset zero, which is optimized away, so put the most used field first.
  • Avoid long else-if sequences. You are effectively doing linear search on the problem space in a long block of if-else-if statements. The best alternative is to use a switch statement, if the conditions are constants. For non-constant conditions or string comparisons, consider tabularizing the options and/or using heuristics to bifurcate the search space (e.g. start with a switch on the first letter of a string).
  • Use compact numeric ranges for switch. If the case numbers are close together, the compiler will probably use a lookup-table in assembler. If the cases are sparse, it can be forced to do an if-else-if equivalent in machine code.
  • Correct choice of loop. If the condition is true at the first iteration, use do-while loops.
  • Instead of range checking a signed integer with “i>=0 && i < MAX” use a typecast with “(unsigned)i<MAX” because negatives become large unsigned positives, and a cast from int to unsigned int isn't a real instruction at run-time.
  • Enable the FTZ (“flush-to-zero”) and/or DAZ (“denormals-are-zero”) floating-point modes on your CPU, even though they violate the IEEE 754 standard. You probably don't care about tiny floating-point numbers in your weight or probability calculations.
  • Enable GCC's floating-point arithmetic speedup options: -ffast-math, -fno-math-errno, -fno-trapping-math, and -ffinite-math-only.
  • bsearch is slow. Choose a better method.
  • Use static_assert rather than assert (e.g. to check data type sizes).
  • Copy arrays by wrapping them in a dummy struct and using ordinary C++ struct assignment. It might be faster than memcpy.
  • Use memcpy rather than memmove if you're sure the arguments won't overlap.
  • Move local non-static objects outside of a critical loop. Reuse the same object rather than re-running constructors and destructors every loop iteration. Add a “reset” member function if needed.
  • Use scaling factors that are a power-of-two, so that multiplication or division can be a bitshift.
  • Specialize a function with a void and non-void version if you find yourself ignoring the return value sometimes. This avoids all of the calculations to determine the return value inside the void function, because the function itself cannot tell whether or not the caller will use its return value.
  • Prefer pre-increment (++i) to post-increment (i++) for non-scalar values. And it's better to use pre-increment even for “int” types, even though it's the same, just to get into the habit.
  • Use the GCC __builtin_unreachable() statement and the “noreturn” function attribute to help the GCC optimizer identify dead code paths, allowing unreachable code removal (not that we care that much) and also enabling path-specific optimizations on the remaining live paths (e.g. compile-time constant propagation).
  • Test the first character of two strings directly with character tests before calling strcmp.
  • Replace calls to “round”, “floor” or “ceil” functions with a type cast to int (as an approximation).
  • Consider using the simpler putchar/putc/fputc or puts/fputs functions rather than printf or fprintf.
  • Write your own versions of abs and fabs/fabsf (but benchmark it).
  • Avoid the floating-point pow function for computing integer powers.
  • Instead of strlen("literal") declare it as an initialized char[] array variable and use sizeof(arr)-1.
  • Merge a large number of function parameters into an object. Don't pass 10 Boolean flags as differently named function parameters. Create an object or structure and make them fields instead.
  • Avoid calling strlen in a “for” loop conditional. Compute strlen before the loop, or test for the null byte.
  • Merge multiple Boolean function parameters into a bit set packed into an int or long. The gain from passing fewer values as function arguments will be partly offset by the cost of packing and unpacking bits, but it should still come out ahead.
  • Use int type mostly, not char or short. Maybe prefer int to size_t, too.
  • Specialize functions being called with a constant for an argument using a template function with an integer field. This will increase code size, but the constant will be propagated more at compile-time, and you also don't have the cost of passing it as an argument.
  • Add “noexcept” specifiers to functions wherever it applies, because this allows the compiler to know not to worry about adding any extra exception handling code.
  • If you're “searching” an array or set of constant integers, known at compile-time, consider “proceduralization” by putting the numbers as cases in a switch. (Trust the compiler engineers.)
  • Consider writing your own faster atoi/itoa functions, as the standard libraries need to handle lots of rare cases, making them slower. (I'm not sure I agree and you might want to benchmark.)
  • Don't overuse “alignas” to specify address alignments if you don't need them, as the enforcement of alignment requirements can impose runtime cost.
  • sprintf is a slow and unsafe function. snprintf is safer but still slow. Find another way.
  • Post-increment can be faster in pointer arithmetic, so prefer using the normal idiom “*ptr++” rather than “*++ptr” to scan a vector.
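As one example from the list above, the unsigned-cast range check can be sketched and verified like this (MAX is an illustrative bound):

```cpp
#include <cassert>

const int MAX = 100;

// Two-comparison version of the range check for [0, MAX).
bool in_range_slow(int i) { return i >= 0 && i < MAX; }

// One-comparison version: a negative i wraps around to a huge
// unsigned value, so the single "<" test covers both bounds,
// and the int-to-unsigned cast costs nothing at run-time.
bool in_range_fast(int i) { return (unsigned)i < (unsigned)MAX; }
```

Both versions agree on every input; the second simply gives the compiler one less branch to emit.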

References

  1. Agner Fog, 2023, Optimizing software in C++: An optimization guide for Windows, Linux, and Mac platforms, PDF: https://www.agner.org/optimize/optimizing_cpp.pdf
  2. Kurt Guntheroth, 2016, Optimized C++: Proven Techniques for Heightened Performance, O'Reilly Media, https://www.amazon.com/dp/1491922060
  3. Dov Bulka and David Mayhew, 1999, Efficient C++: Performance Programming Techniques, https://www.amazon.com/dp/0201379503
  4. Fedor G. Pikus, 2021, The Art of Writing Efficient Programs: An advanced programmer's guide to efficient hardware utilization and compiler optimizations using C++ examples, Packt Publishing, https://www.amazon.com/dp/1800208111
  5. ISO/IEC, Feb 15, 2006, Technical Report on C++ Performance, ISO/IEC TR 18015:2006(E), https://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf (Design of the C++ language from an efficiency perspective, including discussion of virtual functions and other language features.)
  6. Nicolai M. Josuttis, 2012, The C++ Standard Library: A Tutorial and Reference, Second Edition, Supplementary Chapter, https://www.amazon.com/Standard-Library-Tutorial-Reference-2nd/dp/0321623215, PDF (extra chapter): http://www.cppstdlib.com/cppstdlib_supplementary.pdf (C++ optimizations such as bit sets and user-defined memory allocators.)
  7. Bjarne Stroustrup, 2013, The Essence of C++ with examples in C++84, C++98, C++11, and C++14, PDF Slides: http://www.staroceans.org/e-book/essenceOfC++.pdf
  8. Wikibooks, 2023, Optimizing C++/Writing efficient code/Performance improving features, Wikibooks, https://en.wikibooks.org/wiki/Optimizing_C%2B%2B/Writing_efficient_code/Performance_improving_features
  9. Dave Abrahams et al., 2003, Technical Report on C++ Performance, http://web.archive.org/web/20040608203404/http://www.research.att.com/~bs/performanceTR.pdf
  10. Jakob Engblom, 2001, Getting the Least Out of Your C Compiler, https://www.engbloms.se/publications/engblom-esc-sf-2001.pdf
  11. Jon Louis Bentley, 1982, Writing Efficient Programs, Prentice Hall.
  12. Thomas Plum and Jim Brodie, 1985, Efficient C, Plum Hall Inc.
  13. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, 1986, Compilers—Principles, Techniques and Tools, Addison-Wesley.
  14. Donald E. Knuth, 1973, The Art of Computer Programming (Vol. 3): Sorting and Searching, Addison-Wesley.
  15. James O. Coplien, 1992, Advanced C++ Programming Styles and Idioms, Addison-Wesley.
  16. Jonathan S. Shapiro, 1991, A C++ Toolkit, Prentice Hall.
  17. Bjarne Stroustrup, 1991, The C++ Programming Language (2nd edition), Addison-Wesley.
