Aussie AI

What are Q, K and V?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

What are Q, K and V?

Each attention head has a significant amount of computational work to do in each iteration. The attention mechanism works at runtime by using three different vectors:

  • Q — Query
  • K — Key
  • V — Value

All three of these vectors are acted upon by parameters that are learned during training. In fact, the ability to learn how to pay attention to different tokens is deeply enmeshed in the intelligence of LLMs as they process text sequences. These calculations on the three vectors occur during runtime processing.

The attention block performs multiple levels of computations on the QKV matrices. In pseudocode, the QKV attention mechanism looks like:

  // 3 linear projections
  Q = linear-projection(WQ, Input);
  K = linear-projection(WK, Input);
  V = linear-projection(WV, Input);

  // Combine Q and K (scaled dot product)
  QKCombined = MatMul(Q, Transpose(K));
  QKCombined = QKCombined / sqrt(dk);   // dk is the head dimension

  // Softmax normalization
  QKCombined = Softmax(QKCombined);

  // Merge V in too
  QKVCombined = MatMul(QKCombined, V);
  return QKVCombined;
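
To make that concrete, here's a minimal C++ sketch of a single attention head, assuming row-major matrices stored as std::vector<std::vector<float>>. The helper names (matmul, transpose, softmax_rows, attention_head) are illustrative, not from any particular library, and a real engine would use an optimized GEMM kernel instead of these naive loops:

  #include <algorithm>
  #include <cmath>
  #include <vector>

  using Matrix = std::vector<std::vector<float>>;  // row-major: [rows][cols]

  // Matrix-by-matrix multiplication: returns A * B
  Matrix matmul(const Matrix& A, const Matrix& B) {
      size_t n = A.size(), k = B.size(), m = B[0].size();
      Matrix C(n, std::vector<float>(m, 0.0f));
      for (size_t i = 0; i < n; i++)
          for (size_t j = 0; j < m; j++)
              for (size_t x = 0; x < k; x++)
                  C[i][j] += A[i][x] * B[x][j];
      return C;
  }

  // Transpose: returns A^T
  Matrix transpose(const Matrix& A) {
      Matrix T(A[0].size(), std::vector<float>(A.size()));
      for (size_t i = 0; i < A.size(); i++)
          for (size_t j = 0; j < A[i].size(); j++)
              T[j][i] = A[i][j];
      return T;
  }

  // Softmax normalization, applied to each row separately
  void softmax_rows(Matrix& M) {
      for (auto& row : M) {
          float maxval = *std::max_element(row.begin(), row.end());
          float sum = 0.0f;
          for (float& v : row) { v = std::exp(v - maxval); sum += v; }  // stable exp
          for (float& v : row) v /= sum;  // each row now sums to 1.0
      }
  }

  // One attention head. Input is [seq_len x d_model]; WQ, WK and WV are
  // [d_model x d_k] weight matrices (static, learned during training).
  Matrix attention_head(const Matrix& Input, const Matrix& WQ,
                        const Matrix& WK, const Matrix& WV) {
      // 3 linear projections (dynamic, recomputed for every input)
      Matrix Q = matmul(Input, WQ);
      Matrix K = matmul(Input, WK);
      Matrix V = matmul(Input, WV);

      // Combine Q and K, scaled by sqrt of the head dimension
      Matrix scores = matmul(Q, transpose(K));  // [seq_len x seq_len]
      float scale = 1.0f / std::sqrt((float)WQ[0].size());
      for (auto& row : scores)
          for (float& v : row) v *= scale;

      softmax_rows(scores);       // softmax normalization
      return matmul(scores, V);   // merge V in too -> [seq_len x d_k]
  }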

Are QKV vectors or matrices? Both, really. Let me explain. Q, K and V are often mentioned as “vectors” and I've also been calling them vectors above, but that's not the full story. They can be called “vectors” because each of Q, K, and V ends up as a vector for each token. The key point is at the end: for each token.

There are three QKV vectors for each token. However, for a sequence of multiple tokens that make up the prompt, each of Q, K, and V has a vector for every token, so the structure is really a vector-of-vectors, which is a matrix. Hence, the Q, K and V calculations produce three matrices, which are called Q, K, and V.

Aren't you glad that you asked, now?
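
As a concrete illustration of the shapes, reusing the Matrix alias from the sketch above (the sizes here are hypothetical, purely for illustration):

  // Hypothetical sizes, not from the book:
  const int seq_len = 10;  // number of tokens in the prompt
  const int d_k = 64;      // dimensions of each attention vector
  // One Q vector per token gives a vector-of-vectors, i.e. a matrix:
  Matrix Q(seq_len, std::vector<float>(d_k));  // Q is [seq_len x d_k]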

What are the QKV weight matrices? These are three other matrices, not the QKV matrices, so there are six matrices floating around inside your GPU. Three are dynamically computed, and three are static parts of the model. The computations that produce Q, K and V each involve a matrix of weights, and these three weight matrices are named after the QKV matrix to which they apply (i.e. WQ, WK, and WV). These three weight matrices are:

    (a) learned during training, and

    (b) static during inference.

Hence, the intelligence in the attention mechanism is trained into these three weight matrices, and the three QKV matrices are dynamically computed from this learned attention during runtime inference.
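
In C++ terms, one way to picture the six matrices is below. This is a hypothetical layout with illustrative struct names, not the book's engine, again reusing the Matrix alias from the sketch above:

  // Static: the three weight matrices, learned during training,
  // stored in the model file, and read-only during inference.
  struct AttentionWeights {
      Matrix WQ, WK, WV;
  };

  // Dynamic: the three QKV matrices, recomputed at runtime for
  // every input sequence, and never stored in the model file.
  struct AttentionActivations {
      Matrix Q, K, V;
  };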

In more detail, what's actually happening is that the input matrix has one dimension of the sequence length (in tokens) and one dimension of the embedding size (an internal model meta-parameter). Each of Q, K, and V has its own (static) matrix of weights. The input into the attention block is the same for all three computations. This input matrix is separately multiplied in a matrix-by-matrix multiplication (MatMul), using each of the three different weight matrices, to give the three resulting Q, K and V matrices.
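
The shape arithmetic works out like this, with hypothetical sizes chosen purely for illustration:

  // Input: [seq_len x d_model] = [10 x 512]   (one embedding row per token)
  // WQ:    [d_model x d_model] = [512 x 512]  (static, learned weights)
  Matrix Q = matmul(Input, WQ);  // Q: [seq_len x d_model] = [10 x 512]
  // The same Input times WK and WV gives K and V with the same shape.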

Yes, matrices. These three Q, K and V matrices are dynamically computed values during processing, and thus don't contain any learned data or static weights themselves. However, the QKV matrices are still two-dimensional matrices indexed by:

    (a) token sequence (in the prompt), and

    (b) embedding vector dimensions.

Technically, the Q, K and V matrices are “linear projections” of the input sequence based on the (static) parameters in the three weight matrices. And the creation of the three QKV matrices is just the first step inside the attention block, with multiple subsequent steps that combine the Q, K and V matrices back together. The attention block is analogous to a “mini-model” within the overall model, because it has its own trained weights and dynamically computes QKV matrices of probability-like values that indicate how much “attention” each token should pay to the others.

Are QKV used for inference or training? Both. The computations of the Q, K and V matrices occur in both training and inference. During training, the weights in the three related weight matrices are updated (and put into the model file at the end), whereas for inference, the weights are static. The QKV matrices themselves are not part of the model file, because they contain dynamic calculations during both training and inference.

Are QKV used by encoders or decoders? Both. There are attention mechanisms in both encoders and decoders. In the vanilla Transformer, the encoder mechanism allows “lookahead” over the whole input, whereas the decoder uses “masked attention” that disallows attention to upcoming tokens. Masked attention means the decoder can only look backwards at already-emitted tokens. Hence, the encoder pays attention to the input prompt, whereas the decoder can only directly consider the output tokens. Just to confuse matters further, there's also “cross attention,” where the decoder indirectly gets information about the prompt, but only via the encoder's work.
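
Masked attention is simple to sketch in C++: before the softmax step, every score for a “future” position is set to negative infinity, so it normalizes to zero attention. This reuses the Matrix alias from the sketch above, and the function name apply_causal_mask is illustrative:

  #include <limits>

  // Masked ("causal") attention: token i may only attend to tokens 0..i.
  // 'scores' is the [seq_len x seq_len] matrix of QK dot products;
  // apply this before the softmax step.
  void apply_causal_mask(Matrix& scores) {
      const float NEG_INF = -std::numeric_limits<float>::infinity();
      for (size_t i = 0; i < scores.size(); i++)
          for (size_t j = i + 1; j < scores[i].size(); j++)
              scores[i][j] = NEG_INF;  // zero attention to future tokens
  }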

What about decoder-only models? Yes, there are differences in attention architectures between encoder-only models (e.g. BERT) and decoder-only models (e.g. GPT). In a decoder-only model, the encoder does not provide input to the decoder layers (because there's no encoder at all), and the “cross attention” capabilities are therefore removed from the decoder. However, the decoder-only architecture still uses masked attention without lookahead, and outputs are based entirely on the already-output tokens.

 
