Aussie AI
FAQs on Transformer Architecture
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Here are some top-level questions about the architecture of modern Transformers.
What is attention? Attention is an important underpinning concept in how LLMs work. The idea is for the model to focus its “attention” on particular tokens in a sequence of words, and the parts needing the most attention are amplified with larger attention weights. More about attention is found in the next chapter, if I haven't lost yours by then.
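To make that concrete, here is a minimal C++ sketch of scaled dot-product attention for a single head, assuming the Q, K, and V matrices are simple vectors of rows; real engines use optimized matrix kernels, so treat this as illustrative only:

    #include <algorithm>
    #include <cmath>
    #include <vector>

    using Vec = std::vector<float>;
    using Mat = std::vector<Vec>;   // matrix stored as a vector of rows

    // Scaled dot-product attention for one head: softmax(Q K^T / sqrt(d)) V
    Mat attention(const Mat& Q, const Mat& K, const Mat& V)
    {
        size_t n = Q.size(), d = Q[0].size(), dv = V[0].size();
        Mat out(n, Vec(dv, 0.0f));
        for (size_t i = 0; i < n; i++) {
            // Score query i against every key, scaled by sqrt(d)
            Vec scores(n);
            for (size_t j = 0; j < n; j++) {
                float dot = 0.0f;
                for (size_t k = 0; k < d; k++) dot += Q[i][k] * K[j][k];
                scores[j] = dot / std::sqrt((float)d);
            }
            // Softmax the scores into attention weights
            float mx = *std::max_element(scores.begin(), scores.end());
            float sum = 0.0f;
            for (float& s : scores) { s = std::exp(s - mx); sum += s; }
            for (float& s : scores) s /= sum;
            // Output is the attention-weighted sum of the value vectors
            for (size_t j = 0; j < n; j++)
                for (size_t k = 0; k < dv; k++)
                    out[i][k] += scores[j] * V[j][k];
        }
        return out;
    }

Note the nested loops over the n tokens: every query is scored against every key, which is where the quadratic cost discussed below comes from.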
What is prefill? It's a Clayton's encoder: the encoder you have when you don't have an encoder. Don't worry, it's an Aussie joke. Prefill is an encoder-like phase at the start of inference for decoder-only architectures (e.g. GPT-2). There's no encoder, so the first step in a decoder-only architecture is to process the input prompt so as to “prefill” the internal embeddings with known data. It's very similar to having an encoder, but it's inside the decoder. The second phase that the decoder runs is then the “decoding” phase, which emits one token at a time.
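Purely as a rough sketch (the prefill and decode_one functions here are hypothetical placeholders, not any real engine's API), the two phases look something like this:

    #include <functional>
    #include <vector>

    // Hypothetical two-phase inference loop for a decoder-only model.
    // 'prefill' ingests the whole prompt in one pass (the encoder-like phase),
    // and 'decode_one' emits the next token given the sequence so far.
    std::vector<int> run_inference(
        const std::vector<int>& prompt,
        std::function<void(const std::vector<int>&)> prefill,
        std::function<int(const std::vector<int>&)> decode_one,
        int max_new_tokens)
    {
        prefill(prompt);   // Phase 1: prefill the internal state from the prompt
        std::vector<int> tokens = prompt;
        for (int i = 0; i < max_new_tokens; i++)
            tokens.push_back(decode_one(tokens));   // Phase 2: decode one token at a time
        return tokens;
    }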
What are linear and quadratic attention? These are statements about the efficiency, or lack thereof, of the “attention” phase of a Transformer. The vanilla 2017 Transformer had quadratic or O(n^2) complexity in the length of the input, which is slow for a long token prompt. Various research papers have modified the attention architecture to achieve linear O(n) complexity (“linear attention”), usually by approximating the full attention computation. Other optimizations, notably Flash Attention, keep the exact quadratic arithmetic but reorganize it to reduce memory accesses, which is much faster in practice.
What are pre-norm and post-norm? This refers to the placement of the normalization module relative to the attention and feed-forward components in a Transformer layer. The original 2017 Transformer used post-norm, with normalization applied after each component's output, and the first GPT followed this “post-norm” architecture, but training it proved somewhat unstable. Various researchers subsequently confirmed that changing the Transformer architecture to “pre-norm”, with normalization before the attention heads and FFN, removed the instability and thereby allowed for faster training. GPT-2 was subsequently released with a pre-norm architecture. Although the general view is that pre-norm is preferred, I'm still seeing some research papers that argue the opposite, so this is somewhat unresolved.
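The difference is easiest to see as code. Here is a hedged sketch where 'sublayer' stands in for either the attention or FFN component and 'norm' is the normalization, both passed in as placeholders:

    #include <functional>
    #include <vector>

    using Vec = std::vector<float>;
    using Fn = std::function<Vec(const Vec&)>;

    // Residual ("skip") connection: elementwise add
    Vec add(const Vec& a, const Vec& b)
    {
        Vec c(a.size());
        for (size_t i = 0; i < a.size(); i++) c[i] = a[i] + b[i];
        return c;
    }

    // Post-norm (original 2017 Transformer, GPT-1): normalize after the residual add
    Vec post_norm_block(const Vec& x, Fn sublayer, Fn norm)
    {
        return norm(add(x, sublayer(x)));
    }

    // Pre-norm (GPT-2 and most later models): normalize first, residual add outside
    Vec pre_norm_block(const Vec& x, Fn sublayer, Fn norm)
    {
        return add(x, sublayer(norm(x)));
    }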
What are BatchNorm and LayerNorm? These are normalization modules. BatchNorm came first and normalizes each element of the vector across a whole batch of inputs. LayerNorm came later and instead normalizes across all the elements of a single vector (i.e. layerwise), and it is broadly regarded as having advantages over BatchNorm in Transformers.
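Here is a minimal LayerNorm sketch in C++, assuming one embedding vector and learned scale and shift parameters (gamma and beta); production code would vectorize this:

    #include <cmath>
    #include <vector>

    // LayerNorm: normalize one vector to zero mean and unit variance,
    // then apply the learned scale (gamma) and shift (beta) parameters.
    std::vector<float> layer_norm(
        const std::vector<float>& x,
        const std::vector<float>& gamma,
        const std::vector<float>& beta,
        float epsilon = 1e-5f)
    {
        size_t n = x.size();
        float mean = 0.0f;
        for (float v : x) mean += v;
        mean /= (float)n;
        float variance = 0.0f;
        for (float v : x) variance += (v - mean) * (v - mean);
        variance /= (float)n;
        float denom = std::sqrt(variance + epsilon);
        std::vector<float> out(n);
        for (size_t i = 0; i < n; i++)
            out[i] = gamma[i] * ((x[i] - mean) / denom) + beta[i];
        return out;
    }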
What's an igloo? Oh, you mean SwiGLU? That's a Swish function in a Gated Linear Unit (GLU). It's one of the many possible “activation functions” that you can choose. There's also RELU, GELU, leaky RELU, and a bunch more in research papers. See the chapter on activation functions, or just skip it, because I still have my doubts that these fancy functions are worth the effort.
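For the curious, here is a minimal sketch of Swish and the SwiGLU gating step, assuming the two input vectors are the results of the FFN's two separate linear projections (an illustrative sketch only, not library code):

    #include <cmath>
    #include <vector>

    // Swish (a.k.a. SiLU) activation: x * sigmoid(x)
    float swish(float x)
    {
        return x / (1.0f + std::exp(-x));
    }

    // SwiGLU: a Gated Linear Unit where the gating half goes through Swish.
    // 'a' and 'b' are the outputs of two separate linear projections of the input.
    std::vector<float> swiglu(const std::vector<float>& a, const std::vector<float>& b)
    {
        std::vector<float> out(a.size());
        for (size_t i = 0; i < a.size(); i++)
            out[i] = swish(a[i]) * b[i];   // elementwise gating
        return out;
    }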
What are autoregressive and non-autoregressive? The standard Transformer with the GPT architecture has an “autoregressive” decoding algorithm when it emits tokens. This means that it sends its own output back to itself (“auto”) and loops around again (“regressive”). The simplest decoding method is for the decoder to emit one token, and then it adds that new token onto the end of the input sequence, creating a longer “input sequence”, which is then processed again by the entire decoder stack to spit out the next one. In a word: sloooow. But very smart. Generally, non-autoregressive decoding algorithms, such as parallel decoding, will be faster than the default autoregressive mode, but possibly less accurate.
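In code, that loop looks roughly like this, where the next_token function is a placeholder for running the entire decoder stack over the sequence:

    #include <functional>
    #include <vector>

    // Greedy autoregressive decoding: each new token is appended to the
    // sequence, and the lengthened sequence is fed back through the model.
    std::vector<int> autoregressive_decode(
        std::vector<int> tokens,
        std::function<int(const std::vector<int>&)> next_token,
        int max_new_tokens,
        int end_of_text_id)
    {
        for (int i = 0; i < max_new_tokens; i++) {
            int next = next_token(tokens);   // re-run the full decoder stack
            tokens.push_back(next);          // the output becomes part of the input
            if (next == end_of_text_id) break;
        }
        return tokens;
    }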
What is overfitting? When you put on a jacket and your hands don't appear. No, wait. Overfitting is an obscure statistical concept that the approximation (i.e. the AI model) fits the training data too well, is too specific, and cannot generalize its insight to newer data. Any further attempt to explain this will just get me into trouble, because overfitting is something that everyone sort-of understands, but no-one can explain properly. The way I think about it, which isn't fully accurate but is a useful approximation, is that an overfitting model has “too much” capability to predict with too much specificity. Overfitting also doesn't necessarily mean that the model has too many parameters, or that it could have been just as smart with fewer weights. At the very least, overfitting is better than underfitting, which means the model can't predict much of anything.
What are linear and bilinear layers? The standard Transformer layer has a Feed Forward Network (FFN) component that consists of two linear layers. The term “linear layer” is a fancy way of saying matrix multiplication (similarly “linear projection”), where a matrix of weights is multiplied against a vector of probabilities (embedding vector) to get an updated vector of probabilities (with extra geniusness added). The default Transformer FFNs do a linear layer twice, with an activation function applied to the vector as an extra step between them (usually RELU). A “bilinear layer” is an FFN that's lost its in-between activation function, so it just does two matrix multiplies. Bilinear layers are not normally used in a Transformer, although researchers have tried.
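Here is what that looks like as a rough C++ sketch, with a naive matrix-vector “linear layer” and the RELU step between the two; remove the RELU line and you have a bilinear layer:

    #include <algorithm>
    #include <vector>

    using Vec = std::vector<float>;
    using Mat = std::vector<Vec>;   // weight matrix stored by rows

    // Linear layer (linear projection): matrix-vector multiply plus bias
    Vec linear(const Mat& W, const Vec& bias, const Vec& x)
    {
        Vec y(W.size());
        for (size_t i = 0; i < W.size(); i++) {
            float sum = bias[i];
            for (size_t j = 0; j < x.size(); j++) sum += W[i][j] * x[j];
            y[i] = sum;
        }
        return y;
    }

    // Standard Transformer FFN: linear, RELU activation, linear
    Vec ffn(const Mat& W1, const Vec& b1, const Mat& W2, const Vec& b2, const Vec& x)
    {
        Vec hidden = linear(W1, b1, x);
        for (float& v : hidden) v = std::max(0.0f, v);   // RELU (drop this line for "bilinear")
        return linear(W2, b2, hidden);
    }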
What is masking? Well, it's not bit masks, if that's what you're thinking. It refers to attention masks for tokens. In an encoder-decoder Transformer, the encoder is allowed not only to examine tokens backwards, but also to look ahead and see all of the future tokens in the sequence. However, the decoder is a naughty child that is only allowed to look backwards at the tokens it has already output. No copying allowed! This is done in the decoder's “attention” module by “masking” the lookahead tokens so that the decoder can't see them (no matter how hard it tries to peek). Encoders have a non-masked attention allowing lookahead, whereas decoders have a “masked attention” module only allowing look-backwards.
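In code, the mask is usually applied to the attention scores before the softmax, by setting the “future” positions to negative infinity so they receive zero weight. A minimal sketch:

    #include <limits>
    #include <vector>

    // Causal (decoder) attention mask: for each query position i, blank out
    // every score for key positions j > i (the lookahead tokens), so that
    // after the softmax those future positions get zero attention weight.
    void apply_causal_mask(std::vector<std::vector<float>>& scores)
    {
        const float neg_inf = -std::numeric_limits<float>::infinity();
        for (size_t i = 0; i < scores.size(); i++)             // query position
            for (size_t j = i + 1; j < scores[i].size(); j++)  // future key positions
                scores[i][j] = neg_inf;
    }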