Efficient Attention Algorithms
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
The self-attention mechanism is computationally expensive, being perhaps a case of “too much of a good thing.” The quadratic complexity of self-attention in the sequence length is well-known, and it becomes a performance bottleneck of a vanilla Transformer engine (a minimal code sketch of this quadratic loop appears after the list below).
There has been much research on optimizing attention into a linear-complexity algorithm, often called a “linear attention” algorithm, as well as on simply making the standard computation faster. The main approaches have been:
- Parallel attention optimizations — going beyond standard Multi-Head Attention (MHA).
- Efficient attention algorithms — faster ways to compute exact attention, notably “Flash Attention.”
- Removing some attention heads — called “attention head pruning.”
- Approximations of attention — not paying attention to every token.
- Sparse attention — using sparse matrices for attention.
- Low-rank matrices — using smaller matrices and matrix algebra.
- Alternative architectures — using other ideas instead of attention.
- QKV code optimizations — advanced coding speedups such as “KV caching” and “QKV tensor merging.”
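Before looking at these, it helps to see the quadratic baseline that all of them are trying to beat. The following is a minimal sketch, not production engine code: it assumes row-major float matrices and a hypothetical attention_scores() helper, and it computes only the raw query-key scores of scaled dot-product attention. The nested query/key loops fill an n-by-n score matrix, which is exactly where the O(n^2) cost in the sequence length n comes from.

// Minimal sketch of the vanilla attention score computation (illustrative only).
// Q and K are n x d matrices of query and key vectors, stored row-major.
// Every query attends to every key, giving n*n scores: O(n^2 * d) work.
#include <cmath>
#include <vector>

std::vector<float> attention_scores(const std::vector<float>& Q,
                                    const std::vector<float>& K,
                                    int n, int d)
{
    std::vector<float> scores(static_cast<size_t>(n) * n);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    for (int i = 0; i < n; ++i) {            // for each query token...
        for (int j = 0; j < n; ++j) {        // ...scan every key token
            float dot = 0.0f;
            for (int k = 0; k < d; ++k) {
                dot += Q[i * d + k] * K[j * d + k];
            }
            scores[static_cast<size_t>(i) * n + j] = dot * scale;
        }
    }
    return scores;  // softmax and the weighted sum over V follow (also O(n^2))
}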
Some of these are discussed more fully in other chapters. For example, KV caching is covered in Chapter 29, and attention head pruning appears under “width pruning” in Chapter 48. Although we'll now examine some of these interesting attention speedup methods in more detail, let's be honest in admitting that these are mostly research techniques. The main way to optimize attention in industry practice: more GPUs.
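Still, to give a flavor of how the sparse and approximate approaches in the list above change the arithmetic, here is a second minimal sketch, this time of local (sliding-window) attention, in which each query attends only to its w nearest predecessors. The function name, causal window, and memory layout are illustrative assumptions rather than code from any particular engine; the point is that only n*w scores are computed, so the cost is linear in the sequence length for a fixed window size.

// Minimal sketch of local (sliding-window) attention scores (illustrative only).
// Each query attends only to the previous w tokens (a causal window),
// so only n*w scores are computed: O(n * w * d) work, linear in n for fixed w.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> local_attention_scores(const std::vector<float>& Q,
                                          const std::vector<float>& K,
                                          int n, int d, int w)
{
    std::vector<float> scores(static_cast<size_t>(n) * w, 0.0f);
    const float scale = 1.0f / std::sqrt(static_cast<float>(d));
    for (int i = 0; i < n; ++i) {
        const int lo = std::max(0, i - w + 1);   // window of the last w tokens
        for (int j = lo; j <= i; ++j) {          // skip all keys outside it
            float dot = 0.0f;
            for (int k = 0; k < d; ++k) {
                dot += Q[i * d + k] * K[j * d + k];
            }
            scores[static_cast<size_t>(i) * w + (j - lo)] = dot * scale;
        }
    }
    return scores;  // stores only the banded (sparse) part of the score matrix
}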