Aussie AI Blog
Low Latency Programming
-
January 23, 2025
-
by David Spuler, Ph.D.
Low Latency Programming
Low latency programming means writing code so that an algorithm completes its task in the shortest possible time. In many cases, this is effectively the "user response time" or the "round-trip time" for a computation.
The main uses of low latency programming include:
- AI kernels — latency is the time between submitting a query and starting to get the answer back.
- Embedded devices — the system must respond quickly, in real time (e.g., an autonomous self-driving car is effectively one large embedded device).
- High-Frequency Trading (HFT) — latency is the time it takes to submit, execute, and complete a trade.
- Game engines — latency determines whether the characters and environment move quickly enough to respond to user inputs and keep up with the frame rate.
The main programming language used for all of these low latency algorithms is my favorite one. I've written books on it!
C++ for Low Latency Programming
I'm a fan of C++, so you can take this with some grains of salt. The main programming languages for low latency are:
- C++
- C
- Rust
- Assembly
- Hardware acceleration
C++ is under the hood in most of the above cases. Most AI engines are Python at the top level, but C++ in the low-level kernels doing all those matrix multiplications. Game engines have historically been written in C++, at least for all the low-level stuff dealing with frame rates and 3D animation. Similarly, high-frequency trading usually runs in C++ at the bottom level.
You can also use C, which is the longstanding precursor to C++. The C programming language is obviously fast, as that was its key design point. C is not necessarily any faster than C++: if you used only a C-like subset of C++, the two would run at the same speed. However, using C does avoid the temptation to use some of the slower features that are available in the higher levels of C++.
Rust is a language that we C++ programmers refuse to talk about much. We'll only learn Rust if absolutely forced to do so. Apparently, Rust is also fast, and more memory safe than C++. But there's also Safe C++, profiles, hardened standard C++ libraries, and other variants of C++ to compete against Rust, so it's a whole big shemozzle.
Assembly language is faster than any of these higher-level languages. If you speak directly to the machine, there are various ways to speed up code. But it's a very low-level way of programming, and harder to learn, so the best method is to focus on optimizing only the main hot paths with assembly.
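As a tiny example of confining assembly to one hot-path operation, here's a sketch (an illustration only, assuming x86-64 and GCC/Clang inline-assembly syntax) that reads the CPU's timestamp counter, a common trick for very cheap latency measurements in low-latency code:

```c++
#include <cstdint>

// Read the x86-64 timestamp counter with a single instruction.
// Hedged example: assumes GCC/Clang inline assembly on an x86-64 CPU.
inline uint64_t read_tsc() {
    uint32_t lo, hi;
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));  // EDX:EAX = cycle counter
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

In portable code you would normally reach for the __rdtsc() intrinsic or std::chrono instead; the point is that the assembly stays confined to one tiny, well-understood function.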
Hardware acceleration is the last option: just buy a better rig. The main types of silicon to consider include:
- GPUs — AI, anyone? Data centers for cloud AI backends have the biggest GPUs. Or there are gaming desktop PCs with lower-end GPUs.
- FPGA — this is common in high-frequency trading and quant trading.
Plus, there's always that CPU to consider.
CPU versus GPU
With all this fuss about NVIDIA GPUs for AI, you might think that a GPU is what you need. Not so fast! The characteristics of AI engines and LLMs that make super-duper GPUs the mainstay of acceleration are:
- Huge numbers of arithmetic computations, and
- Highly parallelizable algorithms.
AI engines are number-crunching beasts, mostly doing vector dot products, matrix-vector multiplications, and matrix-matrix multiplications. Here's the thing about GPUs:
GPUs have throughput, not low latency!
You didn't hear this from me, but GPUs actually run slow. The clock speed of a high-end GPU is often around 1GHz, whereas a high-end gaming PC has a CPU clock speed of 4GHz or more. So, if you couldn't parallelize an algorithm, it would run slower on a GPU than on a CPU. The key point is this:
Throughput + Parallelization = Low Latency
AI algorithms are very amenable to parallelization. And GPUs have high throughput of parallel operations on all those cores. A multi-core CPU has a dozen cores, but a big GPU can have thousands. Hence, it crunches data in parallel with high throughput, and the net effect is that a GPU runs AI algorithms with very low latency.
Which explains why those data center GPUs cost more than your car!
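To make the throughput-plus-parallelization point concrete, here's a minimal C++ sketch (just an illustration, not from any particular engine) that splits a dot product across CPU worker threads. The same idea, scaled up to thousands of GPU cores, is why the slowness of each individual core stops mattering:

```c++
#include <numeric>
#include <thread>
#include <vector>

// Minimal sketch: parallelize a dot product across CPU threads.
// Each thread sums its own chunk; the partial sums are combined at the end.
double parallel_dot(const std::vector<float>& a, const std::vector<float>& b,
                    unsigned num_threads = std::thread::hardware_concurrency()) {
    if (num_threads == 0) num_threads = 1;
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const size_t chunk = a.size() / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const size_t begin = t * chunk;
        const size_t end = (t + 1 == num_threads) ? a.size() : begin + chunk;
        workers.emplace_back([&, begin, end, t] {
            double sum = 0.0;
            for (size_t i = begin; i < end; ++i)
                sum += static_cast<double>(a[i]) * b[i];
            partial[t] = sum;  // each thread writes only its own slot: no locking
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Each core still runs at its usual clock speed; the wall-clock latency drops because the work is spread out, which is exactly the throughput-beats-clock-speed trade-off that GPUs make.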
AI Engines
As already examined above, AI engines have an algorithm structure that's perfect for GPUs. The basic points about AI inference algorithms are:
- Process all of that data, and
- Hardly any alternate pathways.
Yes, for every word that an LLM throws out, it has to crunch through multiplication operations on every single number in the model. And that's just for one word. This process repeats over and over, and there are very few ways to shortcut the arithmetic without losing accuracy.
In fact, there are two main phases in AI inference with different latency characteristics:
- Prompt processing phase ("prefill") — process all the input tokens.
- Decoding phase — emit the answer words.
The prefill phase has these characteristics:
- Parallel processing of every token in the input text.
- Compute-bound (because of that parallelization).
The decoding phase has opposite characteristics:
- Sequential algorithm (one output token at a time, called "autoregression").
- Memory-bound (loading the entire model each time).
The situation with compute-bound versus memory-bound is a little more nuanced in the decoding phase. It's memory-bound overall, but the sub-components of a layer have slightly different characteristics during the decoding phase:
- Attention module — memory-bound (model weights and KV cache data)
- Feed-forward network (FFN) — compute-bound (model weights)
Hence, the sequence of two matrix multiplications in the FFN (also known as the Multi-Layer Perceptron or MLP) is an intense computation. However, the attention mechanism is memory-bound, mainly from needing to load the "KV cache" data and less so from needing model weights. The attention behavior dominates the FFN computations, which is why the decoding phase is memory-bound overall.
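To see why decoding is memory-bound, here's a deliberately oversimplified sketch (a toy, not a real inference engine): the "model" is just one square weight matrix, and every decoding step streams every weight out of memory to produce a single token:

```c++
#include <cstddef>
#include <vector>

// Toy model: a single square weight matrix standing in for all model weights.
struct ToyModel {
    size_t dim;
    std::vector<float> weights;  // dim * dim values, row-major
};

// One decoding step: a full matrix-vector multiply, so every weight is read
// from memory while doing only one multiply-add per weight (memory-bound).
std::vector<float> decode_step(const ToyModel& m, const std::vector<float>& x) {
    std::vector<float> y(m.dim, 0.0f);
    for (size_t r = 0; r < m.dim; ++r) {
        const float* row = &m.weights[r * m.dim];
        float sum = 0.0f;
        for (size_t c = 0; c < m.dim; ++c)
            sum += row[c] * x[c];
        y[r] = sum;
    }
    return y;
}

// Autoregressive decoding: strictly sequential, one token at a time, and the
// entire weight matrix is re-read from memory on every single iteration.
void toy_decode(const ToyModel& m, std::vector<float> state, int num_tokens) {
    for (int t = 0; t < num_tokens; ++t)
        state = decode_step(m, state);  // cannot be parallelized across tokens
}
```

The prefill phase, by contrast, can batch all the prompt tokens into one big matrix-matrix multiply, which gives it far more arithmetic per weight loaded, and that is what makes prefill compute-bound instead.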
High-Frequency Trading
HFT and quant trading algorithms have some peculiar characteristics with regard to low latency programming. The main point to consider is that there are conceptually two main code pathways:
- Cold path — analyze, but don't trade.
- Hot path — trigger a trade.
And here's the weird part:
- Cold path — very common.
- Hot path — rarely executed.
This is different from most other types of algorithms, where the main path to optimize is also the common path. For non-HFT apps, you crank up the profiler, run the whole app, find where it's spinning the most CPU cycles, and optimize that code.
Not for HFT!
For HFT, the hot path is the rare path. Despite what people think from the name, the algorithm actually trades much less frequently than it decides not to trade. Once the analysis decides to trigger a trade, that is a very hot path, and every step must execute with minimal latency. There are multiple actions for a single trade: initiation, network submission, processing, and finalization. The whole round-trip latency of this trade execution hot path is hyper-critical.
But the analysis part of the HFT code can't be slow either. The hot path is not just "trade" and should really be thought of as "analyze-and-trade." We can't have the analysis phase running too slowly, or we'll miss the opportunity to trade. So, it's true that once a trade is triggered, that pathway must be super hot, but the analysis phase cannot be a laggard either. Optimizing the analysis phase has an element of normal performance profiling of code hot spots, along with extra network latency issues from the data gathering phase via exchange network connections.
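Here's a sketch of the hot/cold split (hypothetical names, not a real trading API). One possible trick, shown purely as an illustration, is to hint the compiler that the rare trade branch is the one whose code layout should be favored, for example with the C++20 [[likely]] attribute:

```c++
// Hypothetical HFT-style event handler; MarketUpdate, should_trade, and
// send_order are made-up placeholders, not a real trading API.
struct MarketUpdate { double bid, ask; };

inline bool should_trade(const MarketUpdate& m) {
    return (m.ask - m.bid) < 0.01;  // toy signal: trade on a very tight spread
}

inline void send_order(const MarketUpdate& m) {
    (void)m;  // real code would format and transmit the order with minimal work
}

void on_market_data(const MarketUpdate& m) {
    if (should_trade(m)) [[likely]] {  // statistically rare, but this is the
        send_order(m);                 // latency-critical path to optimize
    } else {
        // Cold path: analyze and record only. Still can't be sluggish, or the
        // next trading opportunity will be missed.
    }
}
```

The unusual part is exactly the point made above: the branch that almost never runs is the one being optimized, which is backwards compared with ordinary profile-guided tuning.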
Low Latency AI Application Optimizations
I've written whole books on C++ efficiency, and some of the chapters of the book Generative AI in C++ are free to read on this website (see Generative AI book overview page). The AI-specific speedups for LLM inference engines include:
- Attention algorithms (e.g., memory-efficiency with Flash attention or paged attention)
- Model compression
- Quantization (see the sketch after this list)
- Integer arithmetic (e.g., low-bit quantization)
- Pruning
- Sparsifying data (e.g., activation sparsity)
- Distillation
- Small Language Models (SLMs)
- Caching
- Tensors
- Kernel fusion
- Zero-multiplication models
- Neural Architecture Search (NAS)
- Matrix multiplication (MatMul) optimizations
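As a tiny illustration of the quantization item above, here's a minimal sketch of symmetric 8-bit quantization (one possible scheme among many, not how any particular engine does it): floats are mapped to int8 with a single scale factor, so the data takes 4x less memory and can feed integer arithmetic kernels:

```c++
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric int8 quantization: value ~= data[i] * scale.
struct QuantizedVector {
    std::vector<int8_t> data;
    float scale;
};

QuantizedVector quantize_int8(const std::vector<float>& v) {
    float max_abs = 0.0f;
    for (float x : v) max_abs = std::max(max_abs, std::fabs(x));
    QuantizedVector q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    q.data.reserve(v.size());
    for (float x : v)
        q.data.push_back(static_cast<int8_t>(std::lround(x / q.scale)));
    return q;
}

// Recover an approximate float value (lossy, by design).
inline float dequantize_at(const QuantizedVector& q, size_t i) {
    return q.data[i] * q.scale;
}
```

Real quantization schemes add per-channel scales, zero points, and calibration, but the memory-size arithmetic is the same.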
C++ Low Latency Optimizations
If you're not doing AI, or even if you are, there are also a lot of general C++ efficiency improvements to consider:
- Algorithm optimizations
- Fast data structures (e.g., basic hashing, vector hashing, permuted arrays)
- Very fast data structures (e.g., perfect hashing, lookup tables, bit vectors, Bloom filters)
- Parallel data structures
- Hybrid data structures (e.g., hashing with an interleaved doubly-linked list for fast scanning)
- Precomputed table lookup (see the sketch after this list)
- Source code synthesis for precomputation
- Lazy evaluation
- Parallelization
- Worker kernels (processing job queues)
- Compile-time optimizations (e.g., const and constexpr)
- Template Meta-Programming (TMP) (amazing but weird)
- Caching
- Approximations
- Simple case first
- Common case first
- Incremental algorithms
- Multithreading
- SIMD CPU instructions (e.g., AVX parallelization, ARM Neon, etc.)
- Bit-level fiddling computations (fun and fast!)
- Floating-point tricks
- Arithmetic optimizations
- Operator strength reduction (use plus, not the star)
- Integer arithmetic (not floating-point)
- Pointer arithmetic
- Compiler auto-optimizations
- Vectorization
- Function inlining
- Recursion removal (ugh!)
- Memory optimizations
- Memory usage reductions
- Data size reductions
- Data compaction/compression
- Heap memory operation limits
- Contiguous memory blocks
- Fast memory operations
- Loop optimizations (e.g., loop fusion, loop fission, loop unrolling)
- C++ class slugs (to avoid)
- Function slugs
- More C++ slugs
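As a small example combining two items from the list (precomputed table lookup, plus constexpr compile-time optimization), here's a sketch (an illustration only, assuming C++17) that builds a byte-popcount table at compile time, so the run-time cost is a single array index:

```c++
#include <array>
#include <cstdint>

// Build a 256-entry table of bit counts at compile time (C++17 constexpr).
constexpr std::array<uint8_t, 256> make_popcount_table() {
    std::array<uint8_t, 256> table{};
    for (int i = 0; i < 256; ++i) {
        uint8_t count = 0;
        for (int v = i; v != 0; v >>= 1)
            count += static_cast<uint8_t>(v & 1);
        table[i] = count;
    }
    return table;
}

constexpr auto kPopcountTable = make_popcount_table();  // computed by the compiler

// Run-time cost: one table lookup, no loops, no branches.
inline int popcount_byte(uint8_t b) {
    return kPopcountTable[b];
}
```

Whether a table beats a hardware popcount instruction depends on the platform; the general pattern of precomputing at compile time applies to any pure function over a small domain.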
Did I miss any? (Yes, hundreds. There are literally 500 ways to speed up an AI engine.)
Serving and Deployment Optimizations
If your software has to do multiple things at once, such as talking to multiple users or communicating with multiple stock trading platforms, then there are many system-level practicalities that affect latency.
Some of the issues to consider in the whole end-to-end latency of a request going through a system include:
- DNS lookup time
- Connection handshake time
- SSL time
- Load balancing
- Round-robin DNS
- Parallelization (multiple servers)
- Utility servers
- Caching (e.g., etags)
- CDNs
- Database lookup time
- Database indexes
- Keep-warm server architectures
Building a low-latency system is more than just coding up some C++. You have to put together a bunch of off-the-shelf components. For more information, see Generative AI Deployment Architectures (book chapter).
Network Optimization
If your algorithm has to talk between two computers, there's a network in between. The time spent sending data across the wire and back is a key part of the latency. Faster algorithms need to optimize the network traffic. The main techniques for network optimization include:
- Higher bandwidth network connections
- Advanced network protocols
- Compressing network data sizes
- Spreading bandwidth usage over time (avoiding peaks)
- Overlapping computation and communications
- Direct access to peripherals (local and remote)
- Direct access to memory (local and remote)
- Sticky sessions (keeps session data local)
There's a whole book that needs to be written about network optimizations!
Intentional Slowness
Although latency is important, it is worth noting that there are times to go slow. The main point is that humans are slower than computers, so the algorithm often has to slow down the user interface so that the human user can keep up.
Game engines are a particular example of this. The computer has to move all of the game characters and enemies fast, yes, but also not too fast. The user's character cannot move faster than the user can react and provide inputs. Similarly, the enemies cannot move too fast, or the user will not be able to evade them or destroy them.
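A minimal sketch of this idea is a fixed-rate game loop (an assumed example, not any particular engine) that deliberately sleeps until the next frame boundary, so the game never runs faster than the player and the display can keep up with:

```c++
#include <chrono>
#include <thread>

// Fixed-rate loop: the per-frame work may finish well inside the frame budget,
// but the loop intentionally waits out the remainder of each frame.
void run_game_loop(bool& running) {
    using clock = std::chrono::steady_clock;
    constexpr auto frame_budget = std::chrono::milliseconds(16);  // roughly 60 FPS
    auto next_frame = clock::now() + frame_budget;
    while (running) {
        // update_world(); render_frame();  // hypothetical per-frame work
        std::this_thread::sleep_until(next_frame);  // the intentional slowness
        next_frame += frame_budget;
    }
}
```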
AI engines don't really have this problem in classic text-to-text LLMs. The only concern with excessive speed is text output appearing too fast to be read. However, other types of AI models, such as speech and video, need outputs in the right speed range: not too slow, but also not too fast.
High-frequency trading is one area that doesn't really have a "human in the loop." There's no real need to intentionally slow down the execution of a trade. However, there is a need to avoid over-trading too fast, lest the algorithm fail to notice some sort of failure. But this is less common than simply needing to go as fast as possible. Reporting a trade back to a supervising user is the last step, and not in the critical path.
More AI Research Topics
Read more about: