Advances in Transformer Architectures

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Here are some of the notable research advances that have become commonplace in the industry.

Decoder-only architectures (e.g. GPT): As mentioned above, one of the major architectural improvements was the shift from the original encoder-decoder design to decoder-only architectures. Note that encoder-decoder models are still valuable for some use cases, such as foreign language translation (although GPT can also do this).
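
For illustration, here is a minimal sketch of the autoregressive loop that a decoder-only engine runs, with the model and the token-selection rule abstracted as hypothetical function types (these names and signatures are assumptions for the example, not any particular engine's API):

    #include <functional>
    #include <vector>

    using Tokens = std::vector<int>;
    using Logits = std::vector<float>;

    // The model viewed from the outside: tokens in, next-token logits out.
    // (Hypothetical signature; a real engine also threads a KV cache through.)
    using DecoderOnlyModel = std::function<Logits(const Tokens&)>;
    // The token-selection rule: greedy argmax, top-k sampling, etc.
    using TokenPicker = std::function<int(const Logits&)>;

    // Decoder-only generation: the prompt and every generated token flow through
    // the same stack of decoder layers; there is no separate encoder pass.
    Tokens generate(const DecoderOnlyModel& model, const TokenPicker& pick,
                    Tokens tokens, int max_new_tokens, int eos_token) {
        for (int i = 0; i < max_new_tokens; ++i) {
            Logits logits = model(tokens);   // run the full decoder stack
            int next = pick(logits);         // choose the next output token
            tokens.push_back(next);          // feed it back in for the next step
            if (next == eos_token) break;    // stop on end-of-sequence
        }
        return tokens;
    }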

Pre-Norm beats Post-Norm: Probably the first improvement over the original 2017 vanilla Transformer architecture was moving normalization so that it applies to each sublayer's inputs (“pre-norm”) rather than to its outputs (“post-norm”). This change sped up training because it fixed a training instability in the original Transformer that had required a “warmup period” at the beginning of training. Modern commercial engines use pre-norm, but various research papers continue to revisit post-norm with some success.
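
The difference is simply where the normalization sits relative to the residual connection. Here is a minimal sketch, using a basic LayerNorm without learned scale/bias and with the attention or feed-forward sublayer passed in as a function (the names are illustrative only):

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vec = std::vector<float>;
    // Stand-in for the attention or feed-forward sublayer.
    using Sublayer = std::function<Vec(const Vec&)>;

    // Basic LayerNorm over one vector (no learned scale/bias, for brevity).
    Vec layer_norm(const Vec& x) {
        float mean = 0.0f;
        for (float v : x) mean += v;
        mean /= x.size();
        float var = 0.0f;
        for (float v : x) var += (v - mean) * (v - mean);
        var /= x.size();
        const float inv_std = 1.0f / std::sqrt(var + 1e-5f);
        Vec out(x.size());
        for (std::size_t i = 0; i < x.size(); ++i) out[i] = (x[i] - mean) * inv_std;
        return out;
    }

    // Residual (skip) connection: element-wise addition.
    Vec add(const Vec& a, const Vec& b) {
        Vec out(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
        return out;
    }

    // Post-norm (original 2017 Transformer): residual add first, then normalize the result.
    Vec post_norm_block(const Vec& x, const Sublayer& sublayer) {
        return layer_norm(add(x, sublayer(x)));
    }

    // Pre-norm (modern engines): normalize the input to the sublayer;
    // the residual path itself is left untouched.
    Vec pre_norm_block(const Vec& x, const Sublayer& sublayer) {
        return add(x, sublayer(layer_norm(x)));
    }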

Quantization: This is so prevalent it hardly needs saying. Quantization converts weights from 32-bit floats to smaller data types, such as 16-bit floating-point or integers. There are numerous quantized versions of the major open source models available in the repos. Quantization to 8-bit integer, and even down to 4-bit integer, has become commonplace. This improves inference efficiency tremendously at the cost of a small loss of model accuracy (e.g. a slightly higher perplexity).
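
As a simple illustration, here is a sketch of symmetric per-tensor INT8 quantization. Real quantization schemes are more sophisticated (per-channel scales, zero points, grouped 4-bit formats), so treat this as the basic idea only:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Symmetric per-tensor INT8 quantization: map floats into [-127, 127]
    // with a single scale factor chosen from the largest absolute weight.
    struct QuantizedTensor {
        std::vector<int8_t> data;
        float scale;   // dequantize with: float_value = data[i] * scale
    };

    QuantizedTensor quantize_int8(const std::vector<float>& weights) {
        float max_abs = 0.0f;
        for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
        QuantizedTensor q;
        q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
        q.data.reserve(weights.size());
        for (float w : weights) {
            int v = static_cast<int>(std::round(w / q.scale));
            q.data.push_back(static_cast<int8_t>(std::clamp(v, -127, 127)));
        }
        return q;
    }

    // Recover an approximate float weight (lossy: this is where the small
    // accuracy/perplexity cost comes from).
    float dequantize(const QuantizedTensor& q, std::size_t i) {
        return q.data[i] * q.scale;
    }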

Pruning: Pruning is the removal of small or less important weights to reduce model size. Various types of model pruning are widely supported in model frameworks such as PyTorch and TensorFlow. Unstructured pruning means removing or zeroing individual weights that are too small. Structured pruning means removing whole Transformer components, such as entire layers (layer pruning). In all types of pruning, removing weights allows the engine to run lighter and faster, with a reasonable trade-off in model accuracy.
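
Unstructured magnitude pruning is the simplest case: zero out any weight below a chosen threshold. A minimal sketch follows; note that the resulting zeros only translate into a speedup if the engine stores or skips them efficiently (e.g. sparse kernels), which is one reason structured pruning is attractive.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Unstructured magnitude pruning: zero out every weight whose absolute
    // value is below the threshold. Returns how many weights were pruned.
    std::size_t prune_small_weights(std::vector<float>& weights, float threshold) {
        std::size_t pruned = 0;
        for (float& w : weights) {
            if (std::fabs(w) < threshold) {
                w = 0.0f;
                ++pruned;
            }
        }
        return pruned;
    }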

Flash Attention: There have been many attempts to overcome the quadratic complexity of attention on long contexts. Flash Attention, followed by Flash Attention 2, appears to have emerged as the best approach and is starting to be implemented on major engine platforms.
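
Flash Attention itself is a fused, tiled GPU kernel, so any CPU-side code is only illustrative. The sketch below shows the “online softmax” idea it builds on: a single query streams over the keys and values with a running maximum and running sum, so the full n-by-n attention score matrix is never materialized (single head, no masking, purely a conceptual sketch, not the actual kernel):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<float>;

    // One query attending over a sequence of keys/values, streamed one at a time
    // with a running maximum and running sum ("online softmax"). The full n*n
    // score matrix is never stored; Flash Attention tiles and fuses this pattern
    // inside a GPU kernel.
    Vec streaming_attention(const Vec& q, const std::vector<Vec>& keys,
                            const std::vector<Vec>& values) {
        const std::size_t d = q.size();
        float running_max = -INFINITY;
        float running_sum = 0.0f;
        Vec acc(values.empty() ? 0 : values[0].size(), 0.0f);

        for (std::size_t i = 0; i < keys.size(); ++i) {
            // Scaled dot-product score of the query against this key.
            float score = 0.0f;
            for (std::size_t j = 0; j < d; ++j) score += q[j] * keys[i][j];
            score /= std::sqrt(static_cast<float>(d));

            // If a new maximum appears, rescale what has been accumulated so far.
            const float new_max = std::max(running_max, score);
            const float correction = std::exp(running_max - new_max);
            const float weight = std::exp(score - new_max);

            for (std::size_t j = 0; j < acc.size(); ++j)
                acc[j] = acc[j] * correction + weight * values[i][j];
            running_sum = running_sum * correction + weight;
            running_max = new_max;
        }

        if (running_sum > 0.0f)
            for (float& a : acc) a /= running_sum;   // final softmax normalization
        return acc;
    }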

Rotary Positional Embeddings (RoPE): The RoPE method of encoding token positions is becoming a standard way to (a) handle longer contexts efficiently, and (b) help models get better at understanding or generating long texts (“length generalization”).
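
Conceptually, RoPE rotates consecutive pairs of dimensions in each query and key vector by an angle proportional to the token's position, using a geometric frequency schedule. A simplified per-vector sketch using the common 10000-based frequencies (real implementations precompute the sines and cosines):

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Rotary positional embedding: rotate consecutive pairs of dimensions of a
    // query or key vector by an angle that depends on the token's position.
    // Applied in place.
    void apply_rope(std::vector<float>& vec, int position) {
        const std::size_t d = vec.size();            // assumed even
        for (std::size_t i = 0; i + 1 < d; i += 2) {
            const float freq = std::pow(10000.0f, -static_cast<float>(i) / d);
            const float angle = position * freq;
            const float c = std::cos(angle);
            const float s = std::sin(angle);
            const float x0 = vec[i];
            const float x1 = vec[i + 1];
            vec[i]     = x0 * c - x1 * s;            // 2-D rotation of the pair
            vec[i + 1] = x0 * s + x1 * c;
        }
    }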

KV Caching: There are various ways to implement KV caching, and open questions about how much to cache, but the general idea that a KV cache is required for efficient inference is now widely accepted.
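
The core data structure is simple: for each layer, store the key and value vectors of every token processed so far, so a new token only computes its own K and V and attends over the stored history. A minimal sketch is below; the cache grows linearly with context length, which is where the “how much to cache” question comes from.

    #include <cstddef>
    #include <vector>

    // A minimal per-layer KV cache: keeps the key and value vectors of every
    // token seen so far, so each new token computes only its own K and V and
    // attends over the stored history instead of reprocessing the whole prefix.
    struct KVCache {
        std::vector<std::vector<float>> keys;     // one entry per past token
        std::vector<std::vector<float>> values;   // parallel to 'keys'

        void append(const std::vector<float>& k, const std::vector<float>& v) {
            keys.push_back(k);
            values.push_back(v);
        }
        std::size_t length() const { return keys.size(); }   // context length so far
    };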

Flash Decoding: The decoding algorithm is the last part of the Decoder block, where the next output token is chosen. Flash Decoding, a new fast decoding algorithm from the team that created Flash Attention, is now garnering attention.
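
For context, the simplest decoding choice is greedy decoding: pick the token with the highest logit (top-k, top-p, and beam search are the usual alternatives). A minimal sketch of that baseline selection step (this is not the Flash Decoding algorithm itself):

    #include <cstddef>
    #include <vector>

    // Greedy decoding: pick the single highest-scoring token from the logits.
    // Top-k, top-p and beam search are the usual alternatives.
    int greedy_decode(const std::vector<float>& logits) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < logits.size(); ++i) {
            if (logits[i] > logits[best]) best = i;
        }
        return static_cast<int>(best);
    }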

 
