Aussie AI Blog
Multithreading Optimizations Overview
-
26th March, 2025
-
by David Spuler, Ph.D.
C++ Multithreading Optimizations
Multithreading is the art of parallelizing on a multicore CPU,
often as part of low latency programming.
Threads have been around since at least the 1990s (e.g., POSIX threads), before most CPUs even had "cores," but recent advancements have made them much easier to code.
C++11 introduced a standardized thread library, std::thread (along with std::mutex and std::atomic), and C++17 added more advanced parallelization modes, such as parallel execution policies for the standard algorithms.
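For example, here is a minimal C++11 sketch (the worker function and its printing are just placeholders) that launches two threads and waits for them to finish:
#include <iostream>
#include <mutex>
#include <thread>

std::mutex print_mutex;   // protects std::cout from interleaved output

void worker(int id) {
    std::lock_guard<std::mutex> lock(print_mutex);
    std::cout << "Hello from thread " << id << "\n";
}

int main() {
    std::thread t1(worker, 1);   // launch two threads (C++11 std::thread)
    std::thread t2(worker, 2);
    t1.join();                   // wait for both to finish
    t2.join();
}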
What is Multithreading?
In this discussion, threads run on the CPU, and you can have many threads per CPU (or per "core"). Multithreading and multicore programming are largely the same thing, or at least they're in the same ballpark.
Other types of threads can differ quite a lot. For example, there is also a slightly different idea of "threads" on GPUs in the CUDA C++ programming language. You can run 1024 threads in a single thread block on an NVIDIA GPU, but you might not want to do that on your CPU lest you run out of stack space. CUDA C++ allows 1024 threads by allocating a quite restricted amount of GPU memory (sometimes called VRAM) to the call stack of each GPU thread in a grid. Hence, stack overflow is a thing on GPUs, too.
How Not to Multithread
If you're looking for a short career as a multithreading programmer, here are some suggestions:
- Launch as many CPU threads as you possibly can, ideally one per vector element, just like you do in a low-level GPU kernel for AI inference.
- Put huge buffer objects as local variables on your call stack, and then launch multiple threads running that function.
- Fix your huge local buffer variables by making them static, because that function won't ever get run twice at the same time.
- Use mutexes around every access to all your variables, just to be safe.
- Recursion will get you fired in any coding job, except university lecturer, so it's best to pretend you've never heard of it.
High-Level Multithreading Optimization
The first point, above all else, is that multithreading is a high-level optimization in itself. Hence, you want to be judicious in your choices of where to use your threads, and at what level.
Some of the issues that control how much concurrency a multithreaded architecture actually achieves include:
- Abstraction level choices for splitting the work across threads.
- Thread pool design pattern — avoid creating and destroying threads.
- Thread specializations — e.g., producer and consumer threads model.
- Message-passing design pattern to avoid locking — e.g., with a paired future and promise (see the sketch after this list).
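As a small illustration of the last point, here is a minimal sketch of one-shot message passing with a paired std::promise and std::future, so the consumer blocks on the future instead of polling a shared variable under a mutex:
#include <future>
#include <iostream>
#include <thread>

int main() {
    std::promise<int> result_promise;
    std::future<int> result_future = result_promise.get_future();

    // Producer thread: computes a value and passes it through the promise.
    std::thread producer([&result_promise] {
        result_promise.set_value(42);
    });

    // Consumer (main thread): blocks until the value arrives; no explicit mutex.
    std::cout << "Received: " << result_future.get() << "\n";
    producer.join();
}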
Focusing on the data can also be useful to optimize:
- Multithreading-friendly data structures — e.g., queues (esp. lock-free versions).
- Maximize read-only "immutable" data usage — avoids blocking concurrent readers.
- Advanced data structure read-write models — copy-on-write, versioned data structures.
- Shard data across threads — reduces needed synchronizations (or other types of data partitioning); see the sketch after this list.
- Reduce disk writes — e.g., use in-memory logging with occasional disk writes.
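As a minimal sketch of data sharding, each thread below sums its own slice of a vector into its own output slot, so no locking is needed until the final single-threaded merge:
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1'000'000, 1);
    const unsigned num_threads = 4;                  // assumption: 4 shards
    std::vector<long long> partial(num_threads, 0);  // one slot per thread, no sharing
    std::vector<std::thread> threads;                // (adjacent slots can still false-share
                                                     //  a cache line; padding fixes that)
    const std::size_t chunk = data.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t == num_threads - 1) ? data.size() : begin + chunk;
        threads.emplace_back([&, t, begin, end] {
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
        });
    }
    for (auto& th : threads) th.join();

    long long total = std::accumulate(partial.begin(), partial.end(), 0LL);
    std::cout << "Total: " << total << "\n";
}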
Ways to optimize by focusing on the execution pathways include:
- Slowpath removal — keep the hot path small and tight.
- Defer error handling — most error code is uncommonly executed (i.e., a slowpath), so avoid, defer, or combine error detection code branches (see the sketch after this list).
- Cache warming — keep the hotpath bubbling away.
- Full hotpath optimizations — e.g., for HFT, the hotpath is not just "trade" but actually the full latency from data feed ingestion to execution, so it's actually "receive-analyze-decide-and-trade."
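As a small sketch of slowpath separation (the Packet type and its handling here are hypothetical), the rare error branch is marked with the C++20 [[unlikely]] attribute so the generated hot path stays small:
#include <cstdio>

struct Packet { bool valid; int payload; };   // hypothetical packet type

int handle_error(const Packet&) {             // slowpath: one cold error handler
    std::puts("bad packet");
    return -1;
}

// The rare error branch is marked [[unlikely]] (C++20) so the compiler keeps
// the hot path small and tight, with error handling pushed out of the way.
int process_packet(const Packet& p) {
    if (!p.valid) [[unlikely]]
        return handle_error(p);
    return p.payload * 2;                     // stand-in for the real hotpath work
}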
Some of the more pragmatic points include:
- How many threads?
- How long should each thread run?
- When to exit a thread versus waiting.
There's no wrong or right answer to these questions, as they depend on the application and the problem you're trying to solve.
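For the "how many threads?" question, a common starting point (not a rule) is one thread per hardware core, which you can query at run time:
#include <iostream>
#include <thread>

int main() {
    unsigned n = std::thread::hardware_concurrency();   // may return 0 if unknown
    if (n == 0) n = 4;                                   // fallback guess (assumption)
    std::cout << "Planning for " << n << " worker threads\n";
}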
Low-Level Multithreading Optimization
There are various ways to modify how you run threads in order to optimize their concurrency speed. These are not as impactful as the higher-level thread choices, but are still important. Some methods to change the lower-level thread architectures include:
- Core pinning (processor affinity) — every popular thread can have a favorite core.
- Early unlocking — e.g., copy data to local variables, release lock, then do the computations (see the sketch after this list).
- Cache locality improvements (L1 cache and memory prefetch cache)
- Branch reductions — keep the instruction pointer on the straight-and-narrow.
- Lock-free algorithms — avoiding mutex overhead and blocked thread delays.
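Here is a minimal sketch of the early-unlocking idea: copy the shared data while holding the lock, release it, and only then do the expensive computation (the computation here is just a placeholder):
#include <mutex>
#include <vector>

std::mutex data_mutex;
std::vector<double> shared_data;    // protected by data_mutex

double expensive_compute(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x * x;   // stand-in for a long computation
    return sum;
}

double process_snapshot() {
    std::vector<double> local_copy;
    {
        std::lock_guard<std::mutex> lock(data_mutex);
        local_copy = shared_data;           // quick copy while holding the lock
    }                                       // lock released here, before the slow part
    return expensive_compute(local_copy);   // long computation runs unlocked
}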
Ways to avoid slow-downs in multithreading, and therefore increase speed:
- Minimizing thread launch and shutdown overheads.
- Releasing locks early by avoiding unnecessary computation, I/O waits, etc.
- Minimizing context switches
- Memory reductions (e.g., allocated memory; reduce thread-specific call stack size).
- Avoid spinlocks (busy waiting) or mitigate them with exponential backoff methods.
- Avoiding "false sharing" from overlap of CPU memory prefetch cache lines (e.g., use
alignas(64)
to separate unrelated atomics). - Check
std::lock_guard
is not unnecessarily delaying the unlock (i.e., till it goes out-of-scope).
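For the false-sharing point, a minimal sketch is to align each thread's counter to its own cache line (64 bytes is a common line size, though not guaranteed on every CPU):
#include <atomic>

// Each counter gets its own 64-byte slot, so two threads updating different
// counters don't keep invalidating each other's cache line.
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

PaddedCounter per_thread_counters[8];    // e.g., one counter per worker thread

void record_event(int thread_id) {
    per_thread_counters[thread_id].value.fetch_add(1, std::memory_order_relaxed);
}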
Lock-Free Algorithms
Lock-free programming is a method of optimizing multithreaded code to avoid locks (i.e., mutexes). The speed advantages arise from avoiding:
- The overhead of the mutexes themselves.
- Lost performance from threads blocked awaiting a resource.
The main disadvantage of lock-free programming:
- Your brain will explode.
The internet is littered with articles about failed attempts to write lock-free algorithms, even by some of the best programmers. There are many ways to go wrong in the quest to get rid of mutexes.
Note that "lock-free" programming does not mean that you just search up "mutex" in vi, and then hit the "dd" button. No, lock-free programming is not just sequential programming. Instead, the idea is to switch to a faster concurrency method than mutexes, so this is the main idea:
- std::mutex — lock-based programming.
- std::atomic — lock-free programming.
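To make the contrast concrete, here is a minimal sketch of the same shared counter written both ways:
#include <atomic>
#include <mutex>

// Lock-based version: correct, but every increment pays the mutex overhead.
std::mutex counter_mutex;
long counter_locked = 0;

void increment_locked() {
    std::lock_guard<std::mutex> lock(counter_mutex);
    ++counter_locked;
}

// Lock-free version: a single atomic read-modify-write operation.
std::atomic<long> counter_atomic{0};

void increment_atomic() {
    counter_atomic.fetch_add(1, std::memory_order_relaxed);
}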
The overall idea is to use an "atomic" operation instead of a mutex. To make this work, it's usually a quite complex atomic operation, such as a "Compare-And-Swap" (CAS) operation.
This is how a CAS operation works; it is a number of steps all done atomically in one unbreakable sequence:
- Access a variable (that you want to set atomically).
- Compare it to the "old" or "expected" value.
- If it's equal to the old value, then successfully update to the new value (and done).
- If it's not equal to the old value, someone else has already updated it, so we fail (and then loop around and retry).
What a mouthful!
Fortunately, C++ has the std::atomic class (since C++11) to take care of all that.
The main routines to use for a CAS instruction are:
- std::atomic::compare_exchange_weak
- std::atomic::compare_exchange_strong
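Here is a minimal sketch of the CAS retry loop, using compare_exchange_weak to atomically double a shared value (any update that depends on the old value follows the same pattern):
#include <atomic>

std::atomic<int> shared_value{1};

// Classic CAS retry loop: compare_exchange_weak refreshes 'expected' with the
// current value whenever it fails, so we just recompute the update and retry.
void atomic_double() {
    int expected = shared_value.load();
    while (!shared_value.compare_exchange_weak(expected, expected * 2)) {
        // another thread updated the value first; loop with the new 'expected'
    }
}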
Note that you will also need to know about "memory orders" around atomic primitives, as controlled via the std::memory_order enumeration (also in the <atomic> header).
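For example, a common pattern is release/acquire publication: the writer stores its data, then sets a flag with release ordering, and any reader that sees the flag with acquire ordering is guaranteed to also see the data (a minimal sketch):
#include <atomic>
#include <cassert>

int payload = 0;                     // ordinary (non-atomic) data
std::atomic<bool> ready{false};

void writer() {
    payload = 42;                                   // write the data first...
    ready.store(true, std::memory_order_release);   // ...then publish the flag
}

void reader() {
    if (ready.load(std::memory_order_acquire)) {    // if the flag is seen...
        assert(payload == 42);                      // ...the data write is visible too
    }
}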
Portability issues.
There are also a variety of non-standard methods to achieve lock-free programming with primitives in older code platforms, or in a platform-specific manner. Some of the primitives are:
- InterlockedCompareExchange — Win32 version in <winnt.h>.
- OSAtomicCompareAndSwapInt — iOS/Mac variant in <OSAtomic.h>.
- __atomic_compare_exchange — GCC built-in version.
Note that the std::atomic class is not actually guaranteed to use a lock-free atomic operation on every platform. It's a good idea to test your platform using the is_lock_free member function (or the C++17 is_always_lock_free constant) as part of your initialization or self-testing code:
std::atomic<int> probe;
assert(probe.is_lock_free());   // or: static_assert(std::atomic<int>::is_always_lock_free);
Sequential C++ Code Optimizations
An important point about the code running in any thread: it's just C++ code. Each thread is running a sequential set of instructions, with its own call stack. Hence, all of the many ways to optimize normal C++ code also apply to all of the code in the thread.
Hence, all of the basic ideas for C++ code optimizations apply:
- Compile-time processing — constexpr, constinit (see Chapter 11: Compile-Time Optimizations); a small sketch follows this list.
- Operator efficiency — e.g. replace multiply with bitshift or addition (see Chapter 10: Arithmetic Optimizations)
- Data type optimizations — e.g. integers versus floating-point.
- Memory optimizations — cache warming (prefetching), memory reductions (see Chapter 14: Memory Optimizations).
- Loop optimizations — e.g. loop unrolling, code hoisting, and many more (see Chapter 15: Loop Vectorization).
- Compiler hints — e.g., [[likely]] statements.
- Function call optimizations — e.g., inlining, always_inline, etc.
- C++ class-level optimizations — e.g., specializing member functions (see Appendix 1: C++ Slugs).
- Algorithm improvements — various non-concurrency improvements, such as precomputation, caching, approximations, etc. (see Chapter 13: Algorithm Speed-Ups and Chapter 18: Parallel Data Structures).
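As a tiny example of the compile-time point, constexpr work is done once by the compiler instead of repeatedly by every thread at run time:
#include <array>

// Build a small lookup table of squares entirely at compile time (C++17).
constexpr std::array<int, 16> make_squares() {
    std::array<int, 16> table{};
    for (int i = 0; i < 16; ++i)
        table[i] = i * i;
    return table;
}

constexpr auto squares = make_squares();
static_assert(squares[12] == 144);   // checked by the compiler, zero run-time cost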
So, the bad news is that once you've coded your multithreaded algorithm, you still have to go and do all the other types of sequential optimizations. Oh, come on, who are we kidding? — it's loads of bonus fun (see also Appendix 1: C++ Slug Hunting).
Combining Multithreading and SIMD CPU Instructions
You can double up! It's totally allowed, and you can even put it on your resume. The idea is basically this structure:
- Multithreading architecture — higher-level CPU parallelization.
- SIMD instructions — lower-level CPU vectorization.
Some of the main CPU architectures with SIMD parallelization include:
- AVX — x86 (e.g., Intel or AMD)
- ARM Neon — iOS/Mac
Note that there are variants of each of these SIMD architectures, available on different chips. For example, on x86 there is the original AVX (256-bit registers), AVX-2 (which extends integer operations to 256 bits), AVX-512 (you can figure it out), and the newer AVX-10. Read more about AVX vectorization techniques in Chapter 30. Vectorization.
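As a minimal sketch (assuming an AVX-capable x86 CPU and the right compiler flags, such as -mavx), here is a vectorized float addition that processes eight elements per instruction; the multithreading layer would then split the array across threads, each running this kind of loop on its own chunk:
#include <immintrin.h>
#include <cstddef>

// Vectorized float addition: 8 floats per 256-bit AVX instruction.
void add_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);    // unaligned 8-float load
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i)                         // scalar tail for leftover elements
        out[i] = a[i] + b[i];
}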
Combining Multithreading and GPU Vectorization
If you've sold your car to buy a PC that has both a fast CPU and a high-end NVIDIA GPU, there's good news to think about while you ride the bus: both chips run at the same time. (Wow, in parallel, even.)
In fact, there are "threads" on both the CPU and the GPU. However, C++ CPU threads are much higher-level than the CUDA C++ threads on the GPU. The idea is:
- CPU threads — big chunks of work.
- GPU threads — very granular computations.
CPU threads are not that granular, and you use them to do quite large chunks of work, not just one addition instruction. For example, you might have threads pulling incoming user requests off the queue, and a thread might handle the entire user request, perhaps launching some other threads on the CPU or GPU to do so.
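Here is a minimal sketch of that pattern, with each worker thread pulling a whole request off a shared queue and handling it (the Request type and the handling step are placeholders):
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

struct Request { std::string text; };       // hypothetical user request

std::queue<Request> request_queue;
std::mutex queue_mutex;
std::condition_variable queue_cv;
bool shutting_down = false;

// Each worker thread runs this loop, handling one whole request at a time.
void worker_loop() {
    for (;;) {
        Request req;
        {
            std::unique_lock<std::mutex> lock(queue_mutex);
            queue_cv.wait(lock, [] { return shutting_down || !request_queue.empty(); });
            if (shutting_down && request_queue.empty())
                return;
            req = std::move(request_queue.front());
            request_queue.pop();
        }
        // ... handle the entire request here (a big chunk of work), outside the lock ...
    }
}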
There are some parallels between coding CPU and GPU threads:
- Both types of threads have a call stack.
- Both have "global" or "shared" memory to use across threads.
- Overhead of thread launches and exits are a thing for both CPU and GPU threads.
Note that there's also a new generation of "mini-GPUs" called a Neural Processing Unit (NPU), which aren't as powerful as a fully-fledged GPU. NPUs tend to be used on "AI Phones" and other "edge" devices, which aren't as powerful as a PC. Most of the comments about combining C++ multithreading and GPU coding also apply to the use of NPUs, except a little slower.
Going for the Triple-Double
You can even triple up your parallelism:
- Multithreading/multicore (CPU)
- SIMD instructions (CPU)
- GPU vectorization
Is there a way to do even more levels of parallelism in just one C++ program? Yes, of course:
- Linux processes (at a higher level).
- Networking communications (the NIC runs parallel, too).
There are some optimizations of those things, too.
Advanced O/S and Networking Optimizations
It doesn't end with the C++ code. There are other things you can optimize:
- Process priorities — be nice and turn yours up to eleven! (See the sketch after this list.)
- Linux system processes — turn off the various Linux system processes that you don't need (so they don't compete for CPU time).
- Kernel bypass — direct NIC manipulations.
- Overlap communications and compute — e.g. PCIe bus GPU-to-memory upload/download.
- Networking technologies — e.g. TcpDirect and Onload.
- Linux kernel optimizations — e.g., network buffer settings; disable writes that update the "file access date" when reading a file.
- Linux system settings — ensure you don't have accounting or security modes on.
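For the process-priority point, one POSIX-flavored sketch on Linux is to lower the process's nice value at startup (raising priority usually needs elevated privileges):
#include <cstdio>
#include <sys/resource.h>

int main() {
    // Lower the "nice" value (i.e., raise scheduling priority) for this process.
    // Negative values usually require root or CAP_SYS_NICE on Linux.
    if (setpriority(PRIO_PROCESS, 0, -10) != 0)
        std::perror("setpriority");
    // ... start the latency-critical work here ...
}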
There's also some other items on the advanced menu:
- Buy a bigger box
- Get a faster SSD disk (e.g., NVMe)
- Overclock your CPU (and the GPU)
- Assembly language
- Microwave communications
- FPGA
There's always more, but I've run out of room in your web browser. Here are some references to read.
References
Here are some of the Aussie AI blog articles on code optimizations:
- Low latency programming, Aussie AI Blog
- 500+ Techniques for LLM Inference Optimization, Aussie AI Blog
- CUDA C++ Optimization, Aussie AI Blog
And here are some book chapters, for Generative AI in C++ (see full text online), with more coverage of general C++ speed-ups:
Part II: Basic C++ Optimizations
Chapter 8. Bitwise Operations
Chapter 9. Floating Point Arithmetic
Chapter 10. Arithmetic Optimizations
Chapter 11. Compile-Time Optimizations
Chapter 12. Pointer Arithmetic
Chapter 13. Algorithm Speedups
Chapter 14. Memory Optimizations
Part III: Parallel C++ Optimizations
Chapter 15. Loop Vectorization
Chapter 16. Hardware Acceleration
Chapter 17. AVX Intrinsics
Chapter 18. Parallel Data Structures
Part V: Optimizing Transformers in C++
Chapter 28. Deslugging AI Engines
Chapter 29. Caching Optimizations
Chapter 30. Vectorization
Chapter 31. Kernel Fusion
Chapter 32. Quantization
Chapter 33. Pruning
Chapter 34. MatMul/GEMM
Chapter 35. Lookup Tables & Precomputation
Chapter 36. AI Memory Optimizations
Appendices
Appendix 1: C++ Slug Catalog
Low Latency C++ Book
The new Low Latency C++ coding book:
Get your copy from Amazon: Low Latency C++