Aussie AI
Appendix 1: C++ Slug Catalog
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
“Life moves pretty fast.
If you don’t stop and look around
once in a while, you could miss it.”
— Ferris Bueller’s Day Off, 1986.
Slug Hunting Advice
This appendix is about speeding up your C++ programs through general improvements to sequential or parallel coding. For AI-specific techniques for speeding up your Transformer's inference of your model (e.g. quantization, pruning), see the various chapters in Part V (Chapters 28-36).
Before we begin with anything that's actually useful, I have to introduce the obligatory wrist-slapping politically-correct deslugging advice for programmers. Hence, here are some general nuggets of advice when attempting to speed up your program:
- Profile twice, code once. Performance profiling tools exist for a reason.
- Don't micro-optimize. Unless you're into that kind of thing. But really, try to sit on your hands.
- Do macro-optimize. Think about your data structures and algorithms.
- Optimizing introduces new bugs. 100% guaranteed! Don't optimize the night before your release. Re-run your test suite.
- Don't optimize exception handling. Tweaking rarely-executed code is a poor use of your genius.
- Use open source third-party libraries that have already been optimized by others.
Or just ignore that advice and go crazy. It's just too much fun optimizing when the alternative is dreary debugging. Pro tip: it's even more fun writing a book on optimizing!
Where to hunt slugs? Some of the common large-scale issues with coding inefficiency in typical C++ programs include:
- Function call hierarchies
- Nested loops
- Overuse of memory allocation
- Constructor and destructor inefficiencies
- Inefficient algorithms (e.g. linear search of arrays)
- Unnecessary overhead or wrappers
- Recursion. After you've coded up your university assignments (Tower of Hanoi, anyone?), please forget recursion exists.
C++ Speedup Techniques: Some of the general ways to speed up C++ programs at the design structure or algorithmic level include:
- Faster data structures (e.g. hash tables).
- Faster algorithms (e.g. fix linear search to something faster like, you know, hashing again).
- Parallelize via multi-threading, multi-process, multi-core, multi-GPU, multi-something.
- Vectorization (parallelize your important loops)
- Precompute expensive functions into a lookup table at compile-time (e.g. activation functions).
- Cache any complex calculations to trade extra space for time savings (e.g. KV caching).
- Change floating-point to integer operations (quantization, anyone?)
- Replace recursion with iteration. Subtract ten bonus points if you need to do this.
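As a concrete illustration of the precompute-into-a-lookup-table idea above, here is a minimal sketch (the function names and quantization parameters are my own, not from this book) that tabulates a sigmoid activation over all 256 possible quantized uint8 inputs, trading a small table for repeated expf calls at inference time:

```cpp
#include <cmath>
#include <cassert>

// Hypothetical sketch: precompute sigmoid for every possible uint8 value,
// as might be done in quantized inference. Table is filled once at startup.
static float g_sigmoid_table[256];

void init_sigmoid_table(float scale, float zero_point)
{
    for (int q = 0; q < 256; q++) {
        float x = scale * (static_cast<float>(q) - zero_point);  // dequantize
        g_sigmoid_table[q] = 1.0f / (1.0f + std::exp(-x));       // expensive, but only once
    }
}

inline float fast_sigmoid(unsigned char q)
{
    return g_sigmoid_table[q];  // O(1) lookup, no exp() call at runtime
}
```

The same pattern works for GELU or any other expensive activation; with constexpr machinery the table can even be built at compile-time.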
Some of the high-level C++ coding optimizations include:
- Flatten function call hierarchies (stop wrapping everything so much, and inline the small functions at the bottom).
- Optimize loops, especially nested loops (e.g. move loop-invariant code out, loop unrolling, vectorization, etc.)
- Templates are effectively a compile-time optimization that improves speed at the cost of code space.
- Reduce memory allocation (use less memory overall or replace memory allocation with temporary stack buffers).
- Operator strength reduction (e.g. replace “*” with “+”, a pipe dream of all AI engineers).
- Declare variables as close as possible to where they are used. This avoids instantiating objects that aren't needed on some paths.
- Use pointer arithmetic, especially for loops over arrays.
- Bitwise operations are fast, but the basic C++ integer operations are also fast nowadays. Benchmark, don't assume.
- Use short-circuiting of the && and || operators, and also the ternary ?: operator, to avoid expensive function calls.
And finally, some things you might forget (and some that are forgettable):
- Benchmark any important changes (e.g. operator strength reductions).
- Turn up your C++ optimizer. There are higher settings you could try.
- Add compile-time optimization hints (e.g.
constexpr
andrestrict
). - Overclock your PC (like a gamer).
- Sell your car to buy a better GPU.
- Put every function in a header file and make them all
inline
. - Reorder your
case
labels. Surely it helps. - Change
i++
to++i
in everyone else's code.
C++ Class Slugs
The C++ class features are designed to add encapsulation and modularity, while retaining speed, but there's still plenty of ways that slugs can crawl into your classes. C++ class optimizations include:
- Ensure small member functions are inline, especially those that do “get” and “set”.
- Add inline to other friend or non-class functions (especially if small or commonly used).
- Pass objects to functions using “const&” (pass-by-reference), rather than pass-by-value.
- Watch out for temporary objects. These can occur in simple assignments or function call expressions, or in weird ways like accidentally making your overloaded assignment operator have the wrong type.
- Use reference variables instead of copying objects into temporary variables.
- Take care templating class objects (e.g. when using the std::vector class for a vector of your class objects). Lots of hidden calls to constructors and destructors may arise in resizing.
- Use the initializer list in the constructor for initializing data members.
- Use friend functions for faster accesses to internal object data.
- Block accidental calls to the copy constructor or class assignment operator (i.e., if you aren't defining them, make a dummy version that is “private” with a “void” function body).
- Avoid returning objects if you can. Return a reference if it's safe to do so.
- Take care with “wrapper” classes like “smart pointers”, “smart integers” or “smart buffers”. Usually, they're safer but slower. How smart is that?
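To make the pass-by-const-reference advice above concrete, here is a small sketch (class and function names are my own, purely for illustration) that counts copy-constructor calls, showing that pass-by-value copies the object while pass-by-const-reference does not:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Illustrative class with an instrumented copy constructor.
struct Blob {
    std::string payload;
    static int copies;   // counts copy-constructor calls (for demonstration)
    Blob(const std::string &s) : payload(s) {}
    Blob(const Blob &other) : payload(other.payload) { ++copies; }
};
int Blob::copies = 0;

std::size_t by_value(Blob b) { return b.payload.size(); }             // copies the object
std::size_t by_const_ref(const Blob &b) { return b.payload.size(); }  // no copy
```

For a class with dynamic allocation (like std::string here), that hidden copy means an allocation per call.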
Bypass interfaces with friend functions
Using friend
functions may be faster because they can bypass class getter and setter member functions.
If a class
declaration has a good deal of private data, it is common C++ style to declare an
interface of public member functions to access private data. Although the class interface
can be quite efficient if member functions are declared as inline
, the need to call a
function to access a data value can still make it inefficient in some cases. The use of
friend
functions and friend
classes can be efficient because this bypasses the class
interface. For example, a member function to set a data member may perform some
range checking on the value, but if we can be sure that a particular function will not use
incorrect data, a friend
function can be used to bypass this checking.
friend
functions (or friend
classes) should not be considered unless the function needs
very fast access to data members, and the member functions to access the data perform
other computations. Note that a member function, with its special privileges, also
bypasses the class interface (because it is part of it), and friend
functions should not be
used where member functions would be more appropriate. Programming style is the
consideration here, as they would both have similar efficiency.
A good example of friend
function efficiency occurs when an operator function
operates on two different classes, such as when we need an operator
that multiplies a Matrix
object
by a Vector
object to yield a new Vector
.
Assume that both classes have member
functions to access individual elements of the Vector
or Matrix
.
Consider the declaration of the multiply function as neither a class
member nor a friend
function, as in:
const int N = 10;  // Number of elements in vector/matrix

class Vector {
    double data[N];
public:
    double get_element(int i) const { return data[i]; }
    void set_element(int i, double value) { data[i] = value; }
};

class Matrix {
    double data[N][N];
public:
    double get_element(int i, int j) const { return data[i][j]; }
};

Vector operator * (const Matrix& m, const Vector& v)
{
    Vector temp;  // multiply matrix by vector
    for (int i = 0; i < N; i++) {  // for each row
        double sum = 0.0;  // sum of N multiplications
        for (int j = 0; j < N; j++) {
            sum += m.get_element(i, j) * v.get_element(j);
        }
        temp.set_element(i, sum);  // store new vector element
    }
    return temp;  // return new vector
}
This can be horribly inefficient because the operator*() function must go through both class interfaces to access elements. Although the inline accessors mean it isn't necessarily any less efficient here, if range checking of the array index i were present in the member functions to set or access the elements, this would cause inefficiency.
Note that if the Vector class overloaded the [] operator instead of using a get_element member function, this would make no difference to efficiency: notational convenience is gained, but the operator[] function has the same cost as any other function.
One alternative to consider is to make the operator* function a member of the Vector class, but this will still mean using the interface for the Matrix class. A more efficient solution is to make the operator* function a friend of both the Matrix and Vector classes, thus allowing it direct access to their individual data elements, bypassing any range checking on array indices. The more efficient version, using a friend function, is:
const int N = 10;  // Number of elements in vector/matrix

class Matrix;  // forward declaration

class Vector {
    double data[N];
public:
    friend Vector operator * (const Matrix& m, const Vector& v);
};

class Matrix {
    double data[N][N];
public:
    friend Vector operator * (const Matrix& m, const Vector& v);
};

Vector operator * (const Matrix& m, const Vector& v)
{
    Vector temp;  // multiply matrix by vector
    for (int i = 0; i < N; i++) {  // for each row
        double sum = 0.0;  // sum of N multiplications
        for (int j = 0; j < N; j++) {
            sum += m.data[i][j] * v.data[j];  // access data directly
        }
        temp.data[i] = sum;  // store new vector element
    }
    return temp;  // return new vector
}
The disadvantage of using friend functions is the same as their advantage: they pierce class encapsulation. Because a friend function directly makes use of hidden private data members, any change to the class may require a change to the definition of the friend function. By contrast, the first version of the operator* function used the “get_element” member functions of both Vector and Matrix, so it would need no changes, provided the “get_element” functions were correctly changed within the class.
Avoid Virtual Functions
Object-oriented programming purists will hate me for this section.
C++ virtual
functions are a wonderful incarnation of OOP
and they can be beautiful and elegant.
But you need to avoid them sometimes if speed is your goal.
They're also very fast function calls, even though done dynamically.
Although virtual
function calls seem
like they're complicated and possibly slow, they're
actually very carefully designed to be very fast to call in C++ class hierarchies.
There's lots of painstaking work for compiler designers to get them to compile correctly,
but their runtime efficiency is great for programmers.
The implementation is effectively a small lookup table with function pointers.
It's a couple more assembler statements before the function call,
and the overhead of calling a function will dwarf that cost.
So, why do I say to review your use of virtual
functions?
Because they're an optimizer blocker.
Since they're a dynamic runtime function call,
there's much less opportunity for the C++ compile-time optimizations to remove these calls.
Indeed, the compiler cannot always determine what function is being called
and you can lose these speedups:
- inline functions
- constexpr function evaluation
Hence, I say you have to choose carefully in the use of virtual
functions.
Avoid them for speed-critical functions,
and don't use them only for good OOP style if you don't really need them.
But also, don't be afraid of using them in other instances
because they're only marginally slower than a non-inlined function call.
Kudos to the C++ language designers for that!
Avoid unnecessary virtual function calls
The use of virtual
functions, when they are not needed, is obviously inefficient.
virtual
functions are needed only when dealing with pointers or references to objects
of unknown type. If the program never uses pointers or references to objects, or if it does
not have any derived classes, no function needs to be virtual
and the use of virtual
wastes space. In addition, because virtual
functions relate only to the use of derived
classes, declaring any functions as virtual
in a class that has no derived classes is also
unnecessarily inefficient.
One common situation where virtual
may appear necessary, but need not be,
occurs with redefining a member function in a derived class. This does not necessarily
mean that the function must be defined as virtual
in the base class (nor in the derived
class — the virtual
keyword is never needed in the derived class). Of course, if the
program starts using pointers or references to these classes, the functions may need to be
virtual
, in which case it may be better style to declare the member function as
virtual
.
A call to a virtual
function need not always be a “real” virtual
call. For
example, passing an object by reference (either as a reference or as a pointer type) can
occur when changing functions to pass-by-reference for efficiency improvement. Any
calls to virtual
functions inside that (not necessarily virtual
) function will be such
that the compiler cannot know that an ordinary function call to the member function
would suffice. It does not perform any global analysis to determine that all arguments to
the function are base objects, and not derived objects. For example, in the following
code, it isn’t clear that the call to the (virtual
) print
function could be replaced by
an ordinary call:
void print_base_object(Base &object)
{
    object.print();
}
The overhead of virtual function calls can be removed whenever the programmer can
be sure that only one type of pointer/reference to an object is being used. In particular,
whenever a programmer can be sure that a pointer/reference to a base class object points
to a particular object, the qualified member function name can be used. For example,
the virtual
call uses:
p->print();
And the more efficient code that avoids a virtual
function call is:
p->Base::print();
An example of extra information making this change possible occurs when a program
uses a number of different (homogeneous) linked lists, with each linked list containing
the same type of object (one with base objects, one with derived objects). When implementing a print_list
function to print out a linked list, you can write it generally to
call a virtual
-declared print_object
function:
void LinkedList::print_list()
{
    for (Base *temp = head; temp != NULL; temp = temp->next())
        temp->print_object();
}
This means that each call to print_object
has the run-time overhead of a virtual
function call. A more efficient alternative is to make use of the knowledge that each list
must contain the same type of object, and have two different print_list
functions
(i.e. use a virtual
function to do the dirty work of printing the objects).
void Base::print_list_hidden()
{
    for (Base *temp = this; temp != NULL; temp = temp->next())
        temp->Base::print_object();
}

void Derived::print_list_hidden()
{
    for (Derived *temp = this; temp != NULL; temp = (Derived*)temp->next())
        temp->Derived::print_object();
}

void LinkedList::print_list()
{
    if (head != NULL)
        head->print_list_hidden();  // call virtual function
}
With this approach, all of the lower-level calls to print_object
can be bound at compile-time and the
only virtual
call is the call to print_list_hidden
at the very top. Hence, by using our knowledge
about the linked lists, we have reduced the number of run-time virtual
function
calls.
Specialize inherited member functions
In an inheritance hierarchy, the derived class is a specialized version of the base class. This means that member functions inherited from the base class can often be rewritten more efficiently to make use of the known special features of the derived class objects.
Example: Triangular Matrix Algebra.
As an example, consider a class “UTMatrix
” (upper triangular matrix) which is
derived from class “Matrix
” and represents matrices where all elements below the main diagonal are
zero.
The general matrix “add
” function of the Matrix
class is inherited by the
UTMatrix
class, and it will work correctly. However, this inherited function is inefficient
and it is more efficient to add a new member function to the UTMatrix
class to
add two upper triangular matrices avoiding all additions involving elements below the
diagonal (because they are known to be zero).
In fact, it is also more efficient to write special functions to add ordinary
matrices to upper triangular matrices. The computation of the determinant of a triangular
matrix is also more efficient than that for a general square matrix, so this member
function should also be rewritten in the UTMatrix
class.
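The triangular-matrix idea above can be sketched in a few lines. This is my own minimal illustration of the technique (the real UTMatrix class in a library would be larger); the specialized add() simply starts the inner loop at the diagonal, skipping the elements known to be zero:

```cpp
#include <cassert>

const int N = 4;  // small size for illustration

// General square matrix with an O(N*N) add().
struct Matrix {
    double data[N][N];
    Matrix() {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                data[i][j] = 0.0;
    }
    void add(const Matrix &m) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                data[i][j] += m.data[i][j];
    }
};

// Upper triangular matrix: everything below the diagonal is zero.
struct UTMatrix : public Matrix {
    void add(const UTMatrix &m) {       // specialized version
        for (int i = 0; i < N; i++)
            for (int j = i; j < N; j++) // skip known-zero elements
                data[i][j] += m.data[i][j];
    }
};
```

The specialized version does roughly half the additions of the inherited one.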
Example: Complex Numbers.
As another example, consider a class “Imaginary
” (imaginary numbers) derived from
another class “Complex
” (complex numbers). For all operations involving Imaginary
objects, it is certain that the real part of the complex number is zero. Hence, it is more
efficient to rewrite all inherited operations that use the real part of a Complex
object,
such as: addition, multiplication, norm, etc.
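A tiny sketch of the complex-number example, with names and a norm operation of my own choosing: the specialized version exploits the knowledge that the real part is zero, saving a multiply and an add.

```cpp
#include <cassert>

// General complex number.
struct Complex {
    double re, im;
    Complex(double r, double i) : re(r), im(i) {}
    double norm2() const { return re * re + im * im; }  // general case
};

// Imaginary number: real part known to be zero.
struct Imaginary : public Complex {
    Imaginary(double i) : Complex(0.0, i) {}
    double norm2() const { return im * im; }  // re*re term skipped
};
```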
The main disadvantage of specializing member functions is that the code reuse advantage of inheritance is negated; more programmer time must be spent on recoding the specialized member functions. Other disadvantages are the increased probability of error, more special cases to test, and an increase in executable code size.
Assignment Operator Return Type
The return type of the overloaded assignment operator should usually be a reference type
or void
. A common mistake is to make it return a class object. Consider the following
class declaration:
class Integer {
private:
    int val;
public:
    Integer operator = (const Integer &x);
    // ...
};

Integer Integer::operator = (const Integer &x)
{
    val = x.val;   // copy data
    return *this;  // return left operand
}
This declaration of the assignment operator to return an object permits expressions using the result of assignment, such as:
Integer x, y, z;
x = x + (y = z);  // embedded assignment
x = y = z;        // multiple assignment
However, it needlessly calls the constructor and destructor for a temporary object, leading
to inefficiency, and occasionally to error. The correct declaration of the assignment
operator is to return a const
reference to Integer
. This simply requires an &
in the
return type declaration, as follows:
const Integer& Integer::operator = (const Integer &x)
{
    // ... same as above
}
Note that const is used because a non-const reference return type is slightly undesirable: it allows the very strange (and probably incorrect) multiple assignment:
(x = y) = z;
Although the failure to declare the return type as a reference above was a slug, rather than a bug,
it can be more dangerous. For a MyString
class with dynamic allocation, using a return
type of MyString
instead of MyString&
will cause a temporary object to be created at the
return
statement, using the copy constructor with “*this
” as the argument. If the
copy constructor is defined correctly, this is often just an instance of inefficiency, but it
may also lead to fatal errors related to temporary objects. If
the copy constructor isn't defined correctly, the programmer has an error with an increased level of complexity caused by temporary objects.
Return Type Void:
Note that it may be far better simply to declare the return type of the assignment
operator as void
, rather than a reference type. Although this prohibits embedded assignments in expressions and
also multiple assignments, these are poor style anyway and should probably be
discouraged. Using return type void
is also slightly more efficient because no value
need be returned. However, returning the reference type is the more common C++ idiom.
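The void-return alternative just described can be sketched as follows (the class name is my own, chosen to avoid clashing with the Integer class above): plain assignment still works, but “x = y = z;” no longer compiles.

```cpp
#include <cassert>

// Assignment operator returning void: simplest and slightly cheaper,
// at the cost of disallowing embedded and multiple assignments.
class Integer2 {
    int val;
public:
    Integer2(int v = 0) : val(v) {}
    void operator=(const Integer2 &x) { val = x.val; }  // returns nothing
    int value() const { return val; }
};
```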
Singleton Classes
A one-instance class is one from which only one object will ever be defined. These are called “singletons” in the “design patterns” parlance.
In this situation
the class can be defined very efficiently by making use of compile-time initialization
with data members declared as “static
” members.
An example is a hash table implementation of a symbol table (e.g. in a compiler keyword table or an AI vocabulary table used by the tokenizer), where only one symbol table will ever be used. The crucial fragment from this code is:
class SymbolTable {
private:
    Node *table[TABLE_SIZE];  // Hash table - array of pointers
public:
    SymbolTable();  // constructor
};

//-----------------------------------------------------------
// Constructor - initialize the hash table to empty
//-----------------------------------------------------------
SymbolTable::SymbolTable()
{
    for (int i = 0; i < TABLE_SIZE; i++)  // all pointers are NULL
        table[i] = NULL;
}
If there will only be one hash table, the constructor is needlessly inefficient. A more
efficient version declares the hash table as a static
data member and the implicit
initialization to zero will set all the pointers to NULL
at compile-time. The efficient code
for a one-instance hash table is:
class SymbolTable {  // ONE INSTANCE ONLY
private:
    static Node *table[TABLE_SIZE];  // Compile-time initialization
public:
    SymbolTable() { }  // constructor does nothing
};
Temporary Objects and Destruction
Temporary objects are created automatically by the compiler in a number of situations. This is a similar idea to that of a C++ compiler generating temporary values for intermediate results of a computation. However, a temporary with class type will have its constructor and destructor activated, so temporary objects can be quite expensive.
For example, try the following class to demonstrate
how a temporary object is defined for intermediate expression results, particularly that
returned by the +
operator:
#include <iostream>
using std::cout;

class Integer {
private:
    int val;
public:
    Integer() { val = 0; cout << "Constructor\n"; }
    ~Integer() { cout << "Destructor\n"; }
    Integer(const Integer &x) { val = x.val; cout << "Copy Constructor\n"; }
    void operator=(int x) { val = x; }
    void operator=(const Integer &x) { val = x.val; }
    friend Integer operator+(Integer &x, Integer &y);
};

Integer operator+(Integer &x, Integer &y)
{
    Integer temp;  // user-defined temporary
    temp.val = x.val + y.val;
    return temp;   // creates compiler temporary
}

int main()
{
    Integer i, j, k;
    k = i + j;
}
There are 4 calls to the ordinary constructor corresponding to i
, j
, k
, and temp
; there is
a single call to the copy constructor that occurs when the return
statement creates a
temporary object for the object returned from operator +
. This temporary object is the
result of i+j
and is then assigned to k
.
In this case there is poor performance but no errors related to temporary objects, and in most cases temporary objects are transparent to the programmer for a correctly defined class (i.e., one having both an assignment operator and a copy constructor). However, if the programmer unwittingly stores a reference or pointer to members of a temporary object, there may be errors in a later use of that reference or pointer. The problem is that temporary objects can be destroyed by the compiler as soon as they have been used in the computation, after which the reference or pointer is no longer valid. However, since the timing of the destruction of temporaries is undefined, some compilers will not exhibit an error for such code because they leave the destruction of temporaries till late; it depends on how aggressively a particular compiler performs its internal code optimization.
Overloaded Postfix Increment Operator
The postfix increment operator (x++
) is a big slimy slug.
I'm not talking about your for
loop with “i++
” versus “++i
” for an integer, which is the same
on any compiler since about the 1990s,
despite the endless online arguments about it.
I'm talking about overloaded increment and decrement operators for classes.
In C++ you can declare separate prefix and postfix increment overloaded operators for a class,
by putting an extra dummy “int
” parameter in the postfix version.
You can also leave out a postfix version, and the prefix version will be called for both usages.
The default call to prefix versions is not a slug, but a potential bug if you copy-paste code or use postfix ++
in template code.
Also, returning the current object for the prefix increment operator is only a minor slug,
because you're returning a reference to the current object (and a reference is really just a pointer).
Postfix operations are much worse. They are slower than airport queues at Thanksgiving.
The semantics of the postfix increment operator (x++
) in the C++ language are effectively:
1. Create a temporary copy of your object.
2. Increment the current object.
3. Return the temporary object.
If you actually do this big shemozzle for a class object, you've got a whole lot of processing happening on a temporary object that's probably not even used. Maybe the optimizer will cut a lot of it as dead code, or maybe not. With the horrors of that echoing in your mind, here's my first suggestion:
Don't even declare postfix overloaded operators for your class.
Don't overload the postfix increment operator.
In fact, you can stop it being used by declaring a dummy version that is “private
” (stops external usage) with a “void
” function body
(stops internal usages).
class MyClass {
    // ...
private:
    void operator++(int) { }  // Postfix denied!
    void operator--(int) { }
};
Void Return Type:
Note that attempts to call a postfix ++
operator on a class type may occur in
template instantiation with your type.
If it's your template, change the template code to use prefix operators.
If you really must define an overloaded postfix increment or decrement operator,
then here's my second suggestion:
Make the return type “void”
Hence, a basic usage of “x++
” will compile and work correctly.
Not only will it be efficient to not return anything,
but the compiler will also ensure that nothing more fancy will run.
A compilation error will block any use of postfix ++
that relies on the operator returning the old object.
In other words, this will be fine:
x++;
But this will get a compiler error alerting you to a problem:
y = x++; // Error
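Putting the void-return advice together, here is a minimal sketch (class name is my own illustration): the prefix operator returns a cheap reference, while the postfix operator returns nothing, so “c++;” compiles but “y = c++;” is a compile-time error.

```cpp
#include <cassert>

// Counter with the recommended increment operator signatures.
class Counter {
    int val;
public:
    Counter(int v) : val(v) {}
    Counter& operator++() { ++val; return *this; }  // prefix: reference, cheap
    void operator++(int) { ++val; }                 // postfix: returns void
    int value() const { return val; }
};
```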
Standard Vector Object Resizing
The standard vector
class is usually very efficient for basic data types,
but you need to take care if you instantiate it with a class type.
The risk is that you'll have hidden calls to this class type's constructors and destructors,
potentially for every element of the vector
,
under various circumstances.
This slug is a type of “hidden copy constructor call” problem.
If you don't manage the size of the standard C++ vector
class objects in the initialization
or via the “reserve
” method,
there can be a lot of hidden resizing happening behind the scenes
whenever you are adding elements to the vector.
This will at least be doing bitwise copies of the elements of each vector.
But it's even worse if the vector contains complex objects with a defined
copy constructor.
When it's resizing the vector
, it will call the copy constructor for each and every
object that is an element of the vector
because it needs to move them all.
Even for basic data types there can be some cost to copying the data when resizing.
You can take control of this with the “reserve
” function, so that the vector
object
doesn't need to keep resizing itself if you're adding to it.
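A short sketch of the reserve idea (the function here is my own illustration): fixing the capacity up front means push_back never has to reallocate and copy every existing element mid-loop.

```cpp
#include <cassert>
#include <vector>

// One allocation up front; no hidden resizing during the loop.
std::vector<int> build_squares(int n)
{
    std::vector<int> v;
    v.reserve(n);   // capacity fixed before any push_back
    for (int i = 0; i < n; i++)
        v.push_back(i * i);
    return v;
}
```

For a vector of class objects, this is exactly what avoids the cascade of copy-constructor and destructor calls described above.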
Skipping Destructor Cleanup
It's really good OOP coding style for your destructor to carefully clean up every resource your object needed, and you know, beautiful coding idioms are just so very important. I certainly wouldn't want to be the person to tell you to do some ugly hack, even if it made everything a whole boatload faster. Umm, really, I wouldn't want to, but if you promise not to tell anyone you heard it from me...
Typically, destructor cleanup means calling “delete
” on allocated memory
used by the data members, and for complex objects, it may also mean closing files.
And I often find that the cost of the destructor starts becoming significant
in its own right.
And one destructor call can trigger lots more, like roaches, only without the social skills.
If you call “delete
” on any member objects or worse, arrays-of-objects, then those destructors get called,
and this triggers a whole blam of code that cascades down the object hierarchy.
Here's a thought: don't cleanup!
This is an optimization worth considering in some cases:
- Batch jobs
- Re-launching server daemons
- Program is shutting down anyway
If your program is a run-once batch job,
and it's not going to be running again with a new request,
or even if it's an AI inference server process that handles 1,000 user queries,
after which another copy will launch in its place,
then you can make like a teenager, and don't cleanup.
Thumb your nose at Valgrind and comment out all those delete
lines in your destructors.
Let the memory leak!
Program exit is a special case that you can detect. If your program is exiting “cleanly” then it does destructor calls to all of the global objects, and so on. And you usually know in the code when the program is shutting down, whether from a user choice, a timeout or limit exceeded, or something internal like an assertion failure. One idea is to use a global Boolean flag that says “I'm shutting down” and then check it inside all of the main destructors:
MyClass::~MyClass()
{
    if (g_aussie_im_shutting_down)
        return;  // Skip!
    ...  // Lots of stylistically beautiful code
}
Is it safe? What happens if you just skip all the cleanup? Well, nothing bad in many cases. The operating system cleans up the allocated memory as part of reclaiming all of the memory. Files are a bit more of a complicated story. Standard C++ shutdown should also properly close any files opened for reading, although you might possibly lose some buffered output written to a log file, so maybe you should still flush buffers or close those files.
This idea of skipping destructors isn't always workable. It's not always clear that ending the process will properly save buffered output in closing files. As another more complex example, if there's an abnormal disconnect from a database session or a remote network connection hangup (e.g. socket session not ended properly), there might be some other consequences, like error messages in the logs locally or for the remote peer.
Initializer lists for member objects
When a class declaration contains a class object as one of its members it is important to
use the correct method of initialization to retain efficiency. Consider the declaration of a
class B
containing a member object from class A
:
class A {
private:
    int val;
public:
    A() { val = 0; }
    A(int x) { val = x; }
    void operator = (int i) { val = i; }
};

class B {
private:
    A a;  // member is itself an object
public:
    B() { a = 1; }  // INEFFICIENT
};
Declaring an object of type B
will cause the default constructor for the member object of
type A
to be invoked immediately before the default constructor for B
. Then the =
operator for class A
is used to set the member object, a
. Hence, the constructor for B
involves a call to A
’s default constructor and a call to the assignment operator. The call to
A
’s default constructor is redundant and should be avoided.
Fortunately, C++ provides a convenient syntax for passing arguments to constructors
of member objects. The default constructor for B
should be recoded to use the initializer list:
B() : a(1) { } // EFFICIENT
This initialization syntax causes the constant 1 to be passed to the constructor for the member object, a (the constructor accepting the int parameter is called, instead of the default constructor). Thus, instead of calling the default constructor and the assignment operator for A, only the int constructor for A is called.
This initialization method is efficient whenever calling the default constructor for a member object is not appropriate, for instance, when the member object is initialized by a call to the assignment operator within the main object’s constructor (as above, where B’s constructor assigned to its member of type A). This form of initialization can be used for any type of data member (i.e. not only class objects), although it will be neither more nor less efficient than assignment for built-in types. The special initialization syntax should be used wherever it is applicable, since it can never be less efficient than assignment to the data members within the constructor, and will often be more efficient.
Initializer lists for base objects
Base objects. Similar efficiency considerations apply to constructors in derived classes, since the data member(s) in the base class act like an object member. The constructor for the base class is always called when a derived class object is constructed. When the default constructor for the base class is of no use to a derived class object, it is more efficient to pass arguments directly to a non-default base class constructor, using the special initialization syntax. The same syntax applies as for data member initialization, except that the type name of the base class is used instead of the name of a data member. A contrived example of this form of initialization is:
class Derived : public Base {
public:
    Derived() : Base(0) { }   // Call Base(int) constructor
};
Avoid temporary objects
In the same way that temporary integer variables are used to compute an integer expression, so too are temporary objects used in non-trivial expressions involving class objects. For example, consider this code where the Complex class has defined the + and = operators:
Complex c1, c2, c3;
c1 = c2 + c3;
This is likely to create a temporary Complex object as the result of the addition, and this temporary object is then passed as an operand to the = operator. In other words, the expression is actually evaluated as:

c1.operator=( operator+(c2, c3) );
A temporary object must be created to store the “+” sub-expression computed for the second argument, which is then passed to the “=” operator. Whether the operands to operator= are passed by reference or by value has no effect on whether a temporary is created in this situation (it will only affect the creation of new objects inside the operator= function).
One (rather inelegant) method of avoiding this creation of temporaries is to create a specialized function to handle it:
void AssignThree(Complex &c1, Complex &c2, Complex &c3);
...
AssignThree(c1, c2, c3);   // c1 = c2 + c3;
The function should probably be a friend function to allow efficient access to the data members of the three Complex objects.
The problems with this solution are its very poor style (because the neatness of the use of overloaded operators is lost), and also its non-general character. More complicated expressions will still generate temporaries, unless more special functions are added as friend functions, leading to even worse style. This “cure” is perhaps worse than the disease.
Avoid temporaries via extra member functions
There are situations where the removal of temporaries does not lead to poor style. Consider the following definition of a minimal complex class:
class complex {
private:
    double re;   // real part
    double im;   // imaginary part
public:
    // Constructors
    complex() { re = 0.0; im = 0.0; }
    complex(double r) { re = r; im = 0.0; }
    complex(double r, double i) { re = r; im = i; }
    // Copy constructor (const& so temporaries can bind)
    complex(const complex &c) { re = c.re; im = c.im; }
    // Overloaded assignment operator
    void operator = (const complex &d) { re = d.re; im = d.im; }
    // Overloaded + operator
    friend complex operator + (const complex &c1, const complex &c2);
};

inline complex operator + (const complex &c1, const complex &c2)
{
    return complex(c1.re + c2.re, c1.im + c2.im);
}
Consider this class definition when used in the following code sequence:
complex c1, c2;
c1 = 2.0;
c2 = c1 + 3.0;
The effect is identical to:
c1 = complex(2.0);        // invoke "double" constructor for 2.0
c2 = c1 + complex(3.0);   // invoke "double" constructor for 3.0
The C++ compiler automatically creates two temporary objects from the double constants, calling the double constructor to do so. The inefficiency of creating a temporary object and calling the constructor can be avoided by adding a few more functions to the class declaration:
void operator = (double d) { re = d; im = 0.0; }
friend complex operator + (double d, const complex &c2);
friend complex operator + (const complex &c1, double d);
If these functions are present, then the double constants are passed directly to the double parameters of these functions. No temporary object is created, and hence the constructor is not called. Note that two symmetric versions of operator+ are required because the C++ compiler cannot assume that the commutativity of + holds for user-defined class objects.
By making the “interface” efficient for mixing complex and double variables, the creation of temporaries has been reduced. This can be generalized: it is better to provide member or friend functions to class X for a specific parameter type Y, than to provide only a constructor to create new X’s from Y’s.
Declare objects close to use
The C++ language allows variable declarations to appear almost anywhere within a program. Although the placement of variable declarations may seem unrelated to efficiency, it can have some effect when objects with non-trivial constructors are declared. For efficiency reasons, an object should be declared as close to its first use as possible. In particular, the C style of declaring all variables at the top of a function is often inefficient. Consider the C++ code below:
void dummy(...)
{
    complex c;   // create object
    if (...) {
        ....     // use c
    }
}

The complex object is not used if the condition in the if statement is false, yet the constructor and destructor for the unused object are called needlessly.
Declare Objects with Full Initialization
Another consideration is that objects should not be declared until there is enough information to construct them fully. For example, given a user-defined class “complex”, consider the following code:
complex c;   // construct c
// ....
c = 1.0;     // initialize c
This is less efficient than calling the correct constructor directly by using:
complex c(1.0); // construct and initialize c
The first code sequence involves a call to the default constructor and the overloaded operator=, whereas the second declaration calls only the (double) constructor for the complex class.
Unfortunately, there are practical limits to the extent to which objects can be declared near their first use. If the first use of an object is inside a compound statement, and the object must also be used outside the compound statement, the scope resolution rules prevent the declaration from being placed inside the compound statement. For example, consider the code below:
double d;
complex c;
while (....) {
    cin >> d;   // get double value from user
    c = d;      // set complex number
}
cout << c;      // print the complex number
In this sequence, it would be more efficient to declare “c” inside the loop block using the direct call to the double constructor:
complex c(d);
However, this would prevent the use of c outside the scope of the braces. This limitation is an unfortunate consequence of the programming language design choice to make braces both the method of grouping statements and the scoping mechanism in C++ (but there are many more important advantages supporting this decision). Unfortunately, it is not even possible to remove the braces in the above example using the comma operator:
while(....) cin >> d, complex c(d); // FAILS: compilation error
C++ syntax prevents a declaration from being an operand of the comma operator.
Nothing Constructors. What we really want is a way to declare a class type variable, but not run its constructor. I'm not aware of a good way to do this. One way would be to use pointers and dynamically allocated “complex” objects, which is successful and standardized, but this adds extra memory management overhead.
Here's a thought. Maybe something like this works? Declare a dummy constructor with a dummy parameter type:
class Banana { };
// Added inside the complex class declaration:
complex(Banana b) { }   // nothing!
Then your call to the dummy constructor is hopefully optimized to nothing:
Banana b;
complex c(b);   // Nothing!
Data Member Optimizations
These optimizations apply to C++ objects or structures. There are various ways to speed up the data accesses and writes to a data member in an object.
Avoid bit-fields. Bit-fields are a special C++ feature designed to reduce space in an object or structure.
struct node {
    unsigned int visited : 1;   // bit-field
};
Avoid bit-fields if you want runtime speedup. They are great at reducing memory size, but often at the cost of extra run-time overhead on any accesses to these fields. Hence, for improved efficiency, at the cost of space wastage, remove the “:1” qualification and change to a small data type such as bool, char, or unsigned char.
Memory alignment: If there are mixed size data members, or some with “alignas” alignment settings, then memory alignment issues can needlessly create an oversize object. This is more of a problem in terms of unnecessary space usage, but it also adds inefficiency in the need to initialize or copy the extra padding bytes for large arrays of objects.
The general rules for minimizing size are to: (a) order members from large to small, and (b) group like size data types together.
Most used data member first. The machine code for an access to a structure or object's data fields usually involves a base address of the object, to which is added an offset that is specific to each field. References to the first field of a structure can often be more efficient because there is no need to add an offset (i.e., the offset is zero). Hence, the most used class data member or structure field should be placed first in the declarations.
Order data members by usage. It's not just the first data member whose order matters. Memory access issues such as data locality, predictive caching and memory access pipelining mean that all of the most-used data members should be close together in an object. In very large objects, there are some platforms where smaller offsets are more quickly calculated, such as data members with less than 128 or 256 as their offset. Hence, a simple optimization is to order the data member declarations according to their usage.
Function Slugs
Functions are an important building block of your code. Some ways to get the slugs out of functions include:
- Declare small functions inline.
- Avoid recursion.
- Pass objects by reference.
- Avoid function pointers.
- Specialize functions with default arguments.
Avoid Function Pointers
C++ allows a data type called a “function pointer” or a “pointer to a function” as part of its standard language. These are carefully type controlled, so they are reasonably efficient. However, they are not any faster than regular function calls just because they're a fancy pointer construct, and there's a simple reason that they're not super-efficient: they're still function calls!
A call through a function pointer must execute the whole function call sequence. It's not much worse than a standard function call, but there's another problem: function pointers make it difficult for the C++ compiler to get rid of the function call. The use of a function pointer obscures much of the normal compile-time optimization logic. Hence, function pointers can be less efficient for:
- inline functions
- constexpr functions
- Intrinsic functions
In summary, they're a neat feature of C++, but not an efficiency gain. Use function pointers if they are convenient, but not as a speedup.
Change recursion to iteration
Recursion is an elegant method of problem solving, but often incurs unnecessary function call overhead. Where possible, recursion should be replaced with an iterative algorithm. For example, the famous example of a recursive “factorial” function would always be coded as a loop by professional programmers.
Fibonacci numbers. With a little insight, many recursive algorithms can be coded without recursion. For example, the Fibonacci number sequence (1,1,2,3,5,8,13,...) is defined by having the next number as the sum of the previous two numbers, with the following recursive rules:
Fib(0) = 1 Fib(1) = 1 Fib(n) = Fib(n−1) + Fib(n−2)
This has the obvious and very elegant recursive implementation:
int fibonacci(int n)
{
    if (n <= 1)
        return 1;
    else
        return fibonacci(n - 1) + fibonacci(n - 2);
}
However, there is no need to use recursion here, and a short loop is adequate. A non-recursive computation of the Fibonacci numbers is shown below:
int fibonacci(int n)
{
    int small = 1, large = 1;        // F0 = F1 = 1
    while (n > 1) {
        int temp = small + large;    // Fn = Fn-1 + Fn-2
        small = large;
        large = temp;
        n--;
    }
    return large;
}
Binary Trees. There are many examples of common algorithms that are unnecessarily coded using recursion. Almost all linked list algorithms can be coded without recursion, as can the most common binary search tree operations: search, insertion and deletion. For example, the recursive implementation of tree insertion is:
// Note: Tree is assumed to be a pointer typedef, e.g. typedef struct node *Tree;
void insert(Tree *root, Tree new_node)
{
    if (*root == NULL)        // Found bottom of tree
        *root = new_node;     // insert here
    else {
        if (new_node->data <= (*root)->data)
            insert(&(*root)->left, new_node);
        else
            insert(&(*root)->right, new_node);
    }
}
The non-recursive version of binary tree insertion is given below. It is somewhat less elegant, uses a few more variables, but should be more efficient.
void insert(Tree *root, Tree new_node)
{
    Tree temp = *root;
    if (temp == NULL)          // empty tree special case
        *root = new_node;
    else {
        for (;;) {
            if (new_node->data <= temp->data) {   // go left?
                if (temp->left == NULL) {         // leaf?
                    temp->left = new_node;        // insert it
                    return;                       // finished
                }
                else
                    temp = temp->left;            // go left
            }
            else {                                // going right
                if (temp->right == NULL) {        // leaf?
                    temp->right = new_node;       // insert it
                    return;                       // finished
                }
                else
                    temp = temp->right;           // go right
            }
        }
    }
}
I'm sorry, Professor! Your recursive code is short and beautifully elegant, but mine is longer, uglier, and faster! Maybe I shouldn't tell my Professor that I've never coded a binary tree since finishing my degree? Hash tables are the name of the game.
Eliminating tail recursion
Recursion is rarely a good solution, but some types of recursive algorithms are not easy to change to loops, because they would require a stack data structure to do so. If a stack is needed, there may be little gain in removing recursion fully — it depends on how efficiently recursion is implemented by the compiler on the builtin C++ function call stack, versus your skill in hand-coding a stack data structure.
In these situations, a simpler optimization is still possible without a stack. Partial recursion elimination without the need for a stack is possible via the elimination of “tail recursion.” Tail recursion occurs when the last action of the recursive procedure is to call itself.
A simple modification changes this last recursive call to become a loop back to the top of the current invocation. For example, consider the preorder traversal of a binary tree. The simplest recursive algorithm is:
void preorder(node_ptr root)
{
    if (root != NULL) {
        visit(root);
        preorder(root->left);
        preorder(root->right);   // Tail recursion here
    }
}
Tail recursion can be eliminated by replacing the if statement with a while loop. The transformation effectively reduces recursion by half, as the second recursive call is eliminated. This reduction in recursion is achieved with virtually no extra overhead!
void preorder(node_ptr root)
{
    while (root != NULL) {    // while loop replaces if
        visit(root);
        preorder(root->left);
        root = root->right;   // Move to right subtree
    }
}
Replacing recursion with a stack
Some recursive algorithms cannot be easily replaced by iterative loop equivalents. For example, in the preorder binary tree traversal above, we were unable to remove both of the recursive calls. In these situations, recursion can be replaced with an algorithm using a stack data structure.
All recursive algorithms can be replaced by a stack because recursive algorithms are actually using an implicit stack (the program stack of function calls). Whether use of a stack will be more efficient than a recursive algorithm depends on a number of factors. The choice of a stack over recursion is machine-dependent. In particular, it is quite likely that the program stack is supported by efficient low-level instructions and that (recursive) function calls are executed very efficiently. Can you do better?
On the other hand, recursion requires that much information be stored on the stack (i.e. parameters, automatic local variables, machine registers), whereas an algorithm making use of an explicit stack will usually only need to store a few items, making it potentially faster than the function call stack. If the maximum size of the required stack is known beforehand, a stack can be quite efficiently implemented as an array, whereas a dynamic stack as a linked list will usually be more costly because of the cost of memory allocation.
The following shows the preorder traversal with tail recursion elimination removing one recursive call and an explicit stack replacing the other. In this case, the explicit stack need only store pointers.
void preorder(node_ptr root)
{
    stack_type S;
    init_stack(S);   // set to empty stack
    while (root != NULL || !is_empty_stack(S)) {
        if (root != NULL) {
            visit(root);            // visit a tree node
            push(S, root->right);   // save right subtree
            root = root->left;      // go to left subtree
        }
        else
            root = pop(S);          // get node from stack
    }
}
Collapsing recursive calls
If you can't be bothered changing a recursive algorithm to a loop or stack, here's a smaller optimization to consider. By channeling the spirit of loop unrolling, we can “collapse” one or more levels of recursion into sequential code. The method of “function call collapsing” can be applied to recursive functions in this limited sense. Obviously, it isn’t possible to collapse a recursive function call completely into inline code, but it is possible to collapse a few levels of recursive calls at a time, reducing the total number of recursive calls by a constant factor.
Moving the recursive base case higher. The simplest method is to test the base case one level higher up. In the simple implementation of the preorder traversal, the recursive base case is “root==NULL”. If this occurs, the function call does nothing. One simple method of avoiding these unnecessary function calls is to test for the base case before the recursive call. The new function becomes:
void preorder(node_ptr root)
{
    while (root != NULL) {
        visit(root);
        if (root->left != NULL)   // Test moved up
            preorder(root->left);
        root = root->right;       // Tail recursion as a loop
    }
}
Collapsing multiple levels of recursion. By converting multiple levels of recursive calls into sequential code, the function does much more work each time, but makes recursive calls less frequently, thereby reducing function call overhead. For example, the preorder traversal can be rewritten so that the current node and its two children are handled by the function, and then recursive calls are made for any of the children’s children:
void preorder(node_ptr root)
{
    if (root != NULL) {
        visit(root);
        if (root->left != NULL) {    // do left child
            visit(root->left);
            preorder(root->left->left);
            preorder(root->left->right);
        }
        if (root->right != NULL) {   // do right child
            visit(root->right);
            preorder(root->right->left);
            preorder(root->right->right);
        }
    }
}
But alas, we've reverted here to a fully recursive version again, just to show function call collapsing. The above method should also be combined with (a) tail recursion elimination, and (b) a stack data structure. This is left as an exercise for the reader (thankfully), and as a project scope estimate, I suggest two weeks!
Use Parameters as local variables
Parameters to functions can be used as if they were local variables. Because of C++ call-by-value parameter passing of basic data types (not arrays), the modification of a parameter inside the function does not change the values of any variables not local to the function. This method saves on initialization time, and on stack space. In the example below, to zero an array, the size is counted down, rather than having a local variable counting up.
void zero_array(int arr[], int n)
{
    while (n > 0)
        arr[--n] = 0;
}
This code also has the optimization of “looping down to zero”. Note that we have to be careful that this code doesn't access arr[n], but does correctly clear arr[0]. I think it works correctly, but my brain is on fire trying to check it.
Pass function parameters by reference
Passing objects or large parameters by value is an inefficiency.
The C++ language provides a very convenient method of achieving pass-by-reference, by
simply using &
in the parameter declaration.
One method of improving efficiency is to pass objects to functions as reference
parameters.
Behind the scenes, pass-by-reference is like passing a single pointer as the parameter. This avoids not only the cost of copying a large object onto the stack, but also the cost of the copy constructor and destructor for the object within the function (i.e. the parameter is a separate object when passed by value).
A function parameter can be changed to pass-by-reference only if the function does not change the object. Fortunately, modifications to parameters can be detected simply by qualifying the parameter declaration with const, thus forcing the compiler to warn about any modifications to the object within the function. An example of the use of reference parameters in the definition of a Complex object is shown below:
class Complex {
    double r, i;
public:
    Complex & operator += (const Complex &c);
    // c is passed by reference for efficiency
    // The return type is also a reference
};

Complex & Complex::operator += (const Complex &c)
{
    r += c.r;        // add to both data fields
    i += c.i;
    return *this;    // return reference to updated object
}
Const reference parameters. Passing the argument by reference improves efficiency by avoiding copies of big objects. Note that the parameter is declared “const” as well as “&” indicating a reference. This “const&” pattern is the common C++ idiom for passing an unmodified object into a function as a faster reference type, rather than by value.
Returning References. This code also has a second optimization: reference return types. Making the return value a reference is also efficient, because the return statement does not invoke the copy constructor. Note that a returned reference is necessary only if the user of the Complex class uses complicated expressions such as x+=y+=z. If such expressions are not required, efficiency can be improved by making the return type void.
Objects Only. The use of references is best limited to class objects, and also to structures and unions. Arrays are already passed by reference in C++ and hence there is no need to change them. The use of references for scalar types (integers, float, double, and pointers) is unlikely to give much improvement, if any, and might even be slower for some.
Pitfall: Temporary Objects. Another disadvantage of using reference parameters for scalar types like “int” is the inefficiency caused if a constant value is passed as an argument (i.e. a number, not a variable). Paradoxically, passing a constant argument to a reference parameter is not an error in C++ (provided the reference is const); instead, a new temporary object of this type is created automatically by the compiler and its address passed.
Implicit “this” object. Note that the object to which a member function is applied is already passed by reference in a certain sense, because it uses the implicit “this” parameter. Hence, simple member function calls are already efficiently using a hidden form of pass-by-reference for the object itself. Consider this code:

int MyClass::fn()   // member function
{
    return x;
}
It is not faster to use a non-member friend function call with an explicit reference parameter. This code will not be more efficient (and is probably less efficient):

int fn(MyClass &object)   // friend function
{
    return object.x;
}
Specialize functions with default arguments
Every default function argument is a place where you can optimize. Default arguments to functions are not a source of inefficiency in themselves, and cost no more than using a fixed-argument function and passing some constants explicitly. However, the use of default arguments indicates the possibility of improving efficiency by replacing a single function with a number of specialized functions.
How to do this? Instead of one function with a default argument, create two functions using function overloading. The specialization of the function into two separate functions will often make other optimization techniques possible, thus improving overall efficiency at the cost of some duplication of executable code. As an example of the possibilities that can exist, consider the function with default arguments:
void indent(int n = 4)   // default argument n=4
{
    for (int i = 0; i < n; i++)
        cout.put(' ');
}
Rewriting this single function as one general function and one specialized function leads to opportunities for optimization in the specialized function. In this case, loop unrolling can be employed:
void indent()            // Specialized function (n=4)
{
    cout.put(' ');       // Loop completely unrolled
    cout.put(' ');
    cout.put(' ');
    cout.put(' ');
}

void indent(int n)       // General function
{
    for (int i = 0; i < n; i++)
        cout.put(' ');
}
Note that this optimization is also limited in scope, as there is no need to change any other code that calls the functions. The C++ compiler will automatically make the correct choice of which overloaded function to call. Another thought for improved readability is to name the specialized function differently (e.g., indent4), although this requires calls to the function to be changed. However, default arguments are certainly convenient, and the slight increase in efficiency should be balanced against the loss of good programming style.
Medium-Sized Slugs
There are a lot more examples of possible inefficiencies in C++ coding. Some of the types of errors that are “medium-sized” slugs include:
- Automatic array initializations with constant data.
- Loop test function calls (i.e. expensive loop conditional tests).
- Member initializations in the constructor body (they should be in the initializer lists).
- Program startup hidden initializations (global or static object constructors).
- Small non-inline functions called frequently.
- Busy wait loops.
- Unnecessary code inside loops.
- C++ classes wrapping simple data types (e.g. overuse of “smart pointers” or “smart integer” classes).
- Overuse of standard string concatenation operations.
- Recursion is almost always a slug.
Automatic Array Repeated Initialization
A simple example of unnecessary double initializations is any type of large local variable, such as an automatic array. When a function makes use of a large array variable with constant data, or even a large constant object, the variable should probably be declared as both “const” and “static”, even if it need not retain its value between calls. Consider the following code example:
char *convert(int day)
{
    char *days[] = {
        "Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"
    };
    return days[day];
}
The initialization of the array “days” illustrates an inefficiency: the initialization of “days” occurs every time the convert function is entered. It would be much more efficient to declare “days” as a static variable to avoid it being re-initialized, and also “const” to help the compiler optimize.
Data Structure Double Initialization
If you have an initialization routine that does a lot of work, it sometimes becomes a slug by accident. I'm not talking about a single variable initialization, but the initialization of a large program data structure at startup, like a precomputed lookup-table or a perfect hashing algorithm. In the design patterns vocabulary, such a situation is a “singleton” data structure, where only a single object ever exists in the program. It's easy to lose track of whether its initialization routine has been called, and then it gets called twice (or more!).
An example would be some of the precomputation methods whereby a large lookup-table is initialized at program startup. For example, a 24-bit lookup table has been used elsewhere in this book to optimize AI activation functions such as GELU.
The way to avoid the slug of double-initialization is simply to track calls to the initialization routine. The idiom that I use is a local static variable of type bool at the start of the initialization function:
static bool s_once = false;
if (s_once) {
    yassert(!s_once);   // Should be once only
    return;             // Avoid double initialization!
}
s_once = true;
Another way is to actually count the calls with an integer, which is a generalization that works for additional scenarios:
static int s_calls = 0;
++s_calls;
if (s_calls > 1) {
    yassert(s_calls <= 1);
    return;             // Avoid double initialization!
}
Note that I've shown how to wrap these multiple lines of code up into a single “yassert_once” macro in Chapter 41, if you want a simpler method.
Singleton global objects. If you've done the hard yards to declare a big data structure like this as its own class, then you can simply instantiate only one object (i.e. as a global). The C++ class infrastructure does well in ensuring that a constructor is only called once. Even so, it may be worthwhile to declare a static data member and use similar logic to ensure that initialization on this object isn't ever done twice.
In any of these situations, it's a worthwhile investment of a couple of CPU instructions, an increment and a test, to avoid accidentally running the whole routine again. Since the code is virtually identical for all cases, to avoid copy-paste typos, you could even hide these few statements behind a standard C++ preprocessor macro with a name of your choosing. Or you could even use an inline function with the “return” statement changed to throwing an exception.
Busy waiting for input
Humans are very slow compared to computers. In particular, a computer can do much work in the background, even when handling the (slow) interactive input of a human. Hence, one method of improving efficiency is to perform background processing while awaiting input, instead of using blocking input that waits for a keypress before doing anything. In other words, you can't use std::cin or scanf for non-blocking keypress polling.
A common example of this idea is chess-playing programs that “think” during their opponent’s time. The computer can continue its game-tree analysis while waiting for the player to press a key or click a mouse. The C++ standard provides no simple standardized function for non-blocking input. In general, there are two ways:
- Keyboard polling API calls (non-portable).
- Multi-threading with input on one thread and processing on another.
There are various non-portable ways to poll for key presses. For example, on Windows there are the “_getch” and “_kbhit” functions (the older unprefixed names “getch” and “kbhit” are deprecated). Assuming you've found a workable polling API call, at some regular interval, perhaps before each node of the game tree is analyzed, the chess program checks if a key has been pressed. If a key has been pressed, the chess program stores information about its current analysis, and processes the user's keystroke. Unless the key press completes the user’s move, the background analysis can continue after processing the key.
Overall, there's no simple and standardized way to do non-blocking input in C++. This is probably because of C’s ancestry, where it was difficult to poll the keyboard on a traditional UNIX line terminal. Multi-threading can be used in C++ to achieve the result instead.
Slow disk I/O
The cost of performing I/O on disk files can make up a large proportion of the run-time cost of some programs. To reduce the amount of data read from or written to the disk, the main methods are:
- Use smaller records.
- Cache frequently used records.
- Buffer multiple reads or writes.
- Compress data.
- Use better data structures.
A very simple method of reducing disk I/O is to reduce the size of records being read or written. This can be achieved using many of the methods to create smaller objects. There are various methods in C++ to reduce a class object's byte size: unions, bit-fields, packing, smaller data types, or reordering data members.
Caching is useful if some records are being read more often than others. It is a very general idea and there are many possible implementations. You can even create your own caching mechanism.
It may be possible to keep all of the most frequently used records in main memory, writing them to disk only at the end of the program (even caching records in memory and writing them to disk for every modification will still avoid the cost of multiple disk reads).
If this method cannot be used, try using several memory locations for record I/O, and whenever a read operation is required, examine these in-memory records first. If any of them is the required record, the cost of a disk read is avoided. Caching always has a slight overhead, and may increase run-time slightly if the desired records are rarely in memory; however, it will never increase the amount of disk I/O and the computational overhead is likely to be small compared to the cost of reading a record from disk.
When reading or writing multiple contiguous records, disk I/O can be sped up by reading a number of records each time. The advantage is that buffering multiple operations reduces the number of disk seek operations. For example, when using <stdio.h>, the buffering can be changed using the setbuf and setvbuf functions.
Another alternative is to use lower-level I/O functions, such as the Linux open, read, and write functions. However, this method reduces the portability of the code.
When the amounts of data being read are quite massive, the level of disk I/O can be reduced by compressing the data in the file. Read and write operations then have the overhead of uncompressing or compressing the data, but the cost of this computation may well be less than that of the disk I/O (or it might also be more; be careful!). However, methods of compressing data are beyond the scope of this book.
The use of a different data structure for data in disk files is often worthwhile. In particular, if the disk file is being searched, then many search algorithms are applicable. For example, binary search can be performed on a direct access file if the data is sorted. However, even binary search is inefficient for large disk files, and data structures specifically intended for disk data should be used. The B-tree is a commonly used data structure, and hashing is another possibility. Unfortunately, these algorithms are highly advanced and again beyond the scope of this book.
Incorrect choice of loop
Although the choice of loop is largely a matter of style, there is an important difference between the post-tested “do” loop, and the pre-tested “for” and “while” loops. The loop condition of a do-while loop is not evaluated before the first iteration, and the loop body is always executed at least once. However, a for or while loop condition is evaluated before the first iteration, and the loop body need not be executed at all.
A common form of minor inefficiency is declaring loops that are always executed the first time, such as:
    bool done = false;
    while (!done) {
        // ....
    }
It is more efficient to use the do loop, which avoids a single evaluation of the loop condition:
    bool done = false;
    do {
        // ....
    } while (!done);
The use of the correct type of loop is also helpful to the optimizer. It is valuable to know that a code segment is always executed once.
Infinite loops are control flow structures that can also be detected and used by the optimizer. Hence, you should code an infinite loop explicitly by using one of the common idioms:
    for (;;)          // Forever
    while (1)         // Common
    do .. while (1)   // Not commonly used
This allows the compiler to generate efficient code, because you've made it easy for the compiler to recognize the loop as infinite.
Exit loops and functions early
Control structures should be exited as soon as possible, including function paths and loops. This means judicious use of return for functions and break or continue for loops.
Using “return” as early as possible in a function is efficient. It prevents unnecessary code from being executed. Testing for edge cases at the start of a function is an example of using the return statement to do “easy cases first” or “simple cases first” optimizations.
Exit loops early. Similarly, both break and continue are efficient, as no more of a loop is executed than is necessary. For example, consider the code using a Boolean variable “done” to indicate the end of the loop, as in:
    done = false;
    while (!done) {
        ch = get_user_choice();
        if (ch == 'q')
            done = true;    // end the loop
        else
            ... // rest of loop
    }
The faster code has a break statement used to exit the loop immediately:
    while (1) {   // Infinite loop
        ch = get_user_choice();
        if (ch == 'q')
            break;    // EXIT EARLY!
        else
            ... // rest of loop
    }
Unfortunately, the overuse of jump statements such as break and continue can make the control flow of a program less clear, but professional C++ programmers are used to these statements being used often.
More Slug Repellent
There are plenty of other optimizations in the chapters on compile-time optimizations, code transformations, loop optimizations, and vectorization. Well, actually most of the book! Nevertheless, here's a list of some more C++ code optimization techniques for you to consider. Some of the bigger ideas:
- Use “move constructors” instead of copy constructors where appropriate (since C++11).
- Use static data members where appropriate, so they are initialized once only.
- Use std::sort rather than qsort.
- Don't put try..catch inside an inner loop that's a bottleneck.
- Use std::bitset for bit sets or bit vectors.
- Use the “iterators” design pattern rather than returning a full scan of a data structure all at once (saves memory and allows early exit).
- Consider basic C++ arrays instead of std::vector if it has a fixed size (known at compile-time) or its maximum size is small enough.
- Consider C++20 coroutines where appropriate for the architecture.
- Structure of Arrays (SoA) data layout is more vectorizable than Array of Structures (AoS).
And some of the smaller optimizations:
- Commonly used object or struct fields should be first. On some platforms, smaller offsets from the start of an object are accessed faster. Also, the very first field has offset zero, which is optimized away, so put the most used field first.
- Avoid long else-if sequences. You are effectively doing linear search on the problem space in a long block of if-else-if statements. The best alternative is to use a switch statement, if the conditions are constants. For non-constant conditions or string comparisons, consider tabularizing the options and/or using heuristics to bifurcate the search space (e.g. start with a switch on the first letter of a string).
- Use compact numeric ranges for switch. If the case numbers are close together, the compiler will probably use a lookup table in assembler. If the cases are sparse, it can be forced to do an if-else-if equivalent in machine code.
- Correct choice of loop. If the condition is true at the first iteration, use do-while loops.
- Instead of range checking a signed integer with “i >= 0 && i < MAX”, use a typecast with “(unsigned)i < MAX”, because negatives become large unsigned positives, and a cast from int to unsigned int isn't a real instruction at run-time.
- Enable the FTZ (“flush-to-zero”) and/or DAZ (“denormals-are-zero”) floating-point modes on your CPU, even though they violate the IEEE 754 standard. You probably don't care about tiny floating-point numbers in your weight or probability calculations.
- Enable GCC's floating-point arithmetic speedup options: -ffast-math, -fno-math-errno, -fno-trapping-math, and -ffinite-math-only.
- bsearch is slow. Choose a better method.
- Use static_assert rather than assert (e.g. to check data type sizes).
- Copy arrays by wrapping them in a dummy struct and using C++ struct bitwise assignment. It might be faster than memcpy.
- Use memcpy rather than memmove if you're sure the arguments won't overlap.
- Move local non-static objects outside of a critical loop. Reuse the same object rather than re-running constructors and destructors every loop iteration. Add a “reset” member function if needed.
- Use scaling factors that are a power-of-two, so that multiplication or division can be a bitshift.
- Specialize a function with a void and non-void version if you find yourself ignoring the return value sometimes. This avoids all of the calculations to determine the return value inside the void function, because the function itself cannot tell whether or not the caller will use its return value.
- Prefer pre-increment (++i) to post-increment (i++) for non-scalar values. And it's better to use pre-increment even for “int” types, even though it's the same, just to get into the habit.
- Use the GCC __builtin_unreachable() statement and the “noreturn” function attribute to help the GCC optimizer identify dead code paths, allowing unreachable code removal (not that we care that much) and also better path-specific optimizations on other live paths (e.g. compile-time constant propagation).
- Test the first character of two strings directly with character tests before calling strcmp.
- Replace calls to the “round”, “floor” or “ceil” functions with a type cast to int (as an approximation).
- Consider using the simpler putchar/putc/fputc or puts/fputs functions rather than printf or fprintf.
- Write your own versions of abs and fabs/fabsf (but benchmark it).
- Avoid the floating-point pow function for computing integer powers.
- Instead of strlen("literal"), declare it as an initialized char[] array variable and use sizeof(arr)-1.
- Merge a large number of function parameters into an object. Don't pass 10 Boolean flags as differently named function parameters. Create an object or structure and make them fields instead.
- Avoid calling strlen in a “for” loop conditional. Compute strlen before the loop, or test for the null byte.
- Merge multiple Boolean function parameters into a bit set packed into an int or long. The gain from passing fewer values as function arguments will be offset by the cost of packing and unpacking bits, but it should still be better.
- Use the int type mostly, not char or short. Maybe prefer int to size_t, too.
- Specialize functions being called with a constant for an argument using a template function with an integer template parameter. This will increase code size, but the constant will be propagated more at compile-time, and you also don't have the cost of passing it as an argument.
- Add “noexcept” specifiers to functions wherever it applies, because this allows the compiler to know not to add any extra exception handling code.
- If you're “searching” an array or set of constant integers, known at compile-time, consider “proceduralization” by putting the numbers as cases in a switch. (Trust the compiler engineers.)
- Consider writing your own faster atoi/itoa functions, as the standard libraries need to handle lots of rare cases, making them slower. (I'm not sure I agree and you might want to benchmark.)
- Don't overuse “alignas” to specify address alignments if you don't need them, as the enforcement of alignment requirements can impose runtime cost.
- sprintf is a slow and unsafe function. snprintf is safer but still slow. Find another way.
- Post-increment can be faster in pointer arithmetic, so prefer the normal idiom “*ptr++” rather than “*++ptr” to scan a vector.
References
- Agner Fog, 2023, Optimizing software in C++: An optimization guide for Windows, Linux, and Mac platforms, PDF: https://www.agner.org/optimize/optimizing_cpp.pdf
- Kurt Guntheroth, 2016, Optimized C++: Proven Techniques for Heightened Performance, O'Reilly Media, https://www.amazon.com/dp/1491922060
- Dov Bulka and David Mayhew, 1999, Efficient C++: Performance Programming Techniques, https://www.amazon.com/dp/0201379503
- Fedor G. Pikus, 2021, The Art of Writing Efficient Programs: An advanced programmer's guide to efficient hardware utilization and compiler optimizations using C++ examples, Packt Publishing, https://www.amazon.com/dp/1800208111
- ISO/IEC, Feb 15, 2006, Technical Report on C++ Performance, ISO/IEC TR 18015:2006(E), https://www.open-std.org/jtc1/sc22/wg21/docs/TR18015.pdf (Design of the C++ language from an efficiency perspective, including discussion of virtual functions and other language features.)
- Nicolai M. Josuttis, 2012, The C++ Standard Library: A Tutorial and Reference, Second Edition, Supplementary Chapter, https://www.amazon.com/Standard-Library-Tutorial-Reference-2nd/dp/0321623215, PDF (extra chapter): http://www.cppstdlib.com/cppstdlib_supplementary.pdf (C++ optimizations such as bit sets and user-defined memory allocators.)
- Bjarne Stroustrup, 2013, The Essence of C++ with examples in C++84, C++98, C++11, and C++14, PDF Slides: http://www.staroceans.org/e-book/essenceOfC++.pdf
- Wikibooks, 2023, Optimizing C++/Writing efficient code/Performance improving features, Wikibooks, https://en.wikibooks.org/wiki/Optimizing_C%2B%2B/Writing_efficient_code/Performance_improving_features
- Dave Abrahams et al., 2003, Technical Report on C++ Performance, http://web.archive.org/web/20040608203404/http://www.research.att.com/~bs/performanceTR.pdf
- Jakob Engblom, 2001, Getting the Least Out of Your C Compiler, https://www.engbloms.se/publications/engblom-esc-sf-2001.pdf
- Jon Louis Bentley, 1982, Writing Efficient Programs, Prentice Hall.
- Thomas Plum and Jim Brodie, 1985, Efficient C, Plum Hall Inc.
- Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, 1986, Compilers—Principles, Techniques and Tools, Addison-Wesley.
- Donald E. Knuth, 1973, The Art of Computer Programming (Vol. 3): Sorting and Searching, Addison-Wesley.
- James O. Coplien, 1992, Advanced C++ Programming Styles and Idioms, Addison-Wesley.
- Jonathan S. Shapiro, 1991, A C++ Toolkit, Prentice Hall.
- Bjarne Stroustrup, 1991, The C++ Programming Language (2nd edition), Addison-Wesley.