Aussie AI
Positional Encoding
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Positional Encoding
The way I think about positional encoding is that it's a kind of work-around. A weird feature of the attention mechanism is that, left to itself, the Transformer loses track of the relative positions of words. And the ordering of words in a sentence somewhat matters!
To help the attention mechanism pay attention not just to the words, but also to their positions in the sentence, positional encoding is used to put more “position” information into the input sequences. Technically, positional encoding creates a vector that is added into the “embeddings” vector as an extra step.
The conversion of tokens to embeddings is a complex learned procedure, which is then combined with positional encoding. The vanilla Transformer used trigonometric sine and cosine functions, without any trainable parameters, but there are various alternative positional encoding algorithms. The positional encoding computation adds some extra mathematical values to the embedding vectors at the very end of the embedding step. This occurs dynamically at runtime, and the positional encoding values are not learned weights. See Chapter 27 for more about embeddings and positional encoding.
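As a concrete illustration, here is a minimal C++ sketch of the vanilla sinusoidal scheme, where dimension 2i uses sin(pos / 10000^(2i/d_model)) and dimension 2i+1 uses cos(pos / 10000^(2i/d_model)), added element-wise into each token's embedding vector. The function names and the simple vector-of-vectors layout are illustrative assumptions, not code from the book.

    #include <cmath>
    #include <vector>

    // Sketch: compute the sinusoidal positional encoding for one position
    // and add it element-wise into that token's embedding vector.
    // (Names and layout are illustrative, not the book's implementation.)
    void add_positional_encoding(std::vector<float>& embedding, int pos, int d_model)
    {
        for (int i = 0; i < d_model; i += 2) {
            // Denominator 10000^(i/d_model), shared by the sin/cos pair.
            float denom = std::pow(10000.0f, (float)i / (float)d_model);
            embedding[i] += std::sin((float)pos / denom);       // even dimension
            if (i + 1 < d_model) {
                embedding[i + 1] += std::cos((float)pos / denom); // odd dimension
            }
        }
    }

    // Usage: apply to every token embedding in the input sequence,
    // computed dynamically at runtime with no learned weights involved.
    void add_positional_encodings(std::vector<std::vector<float>>& embeddings, int d_model)
    {
        for (int pos = 0; pos < (int)embeddings.size(); ++pos) {
            add_positional_encoding(embeddings[pos], pos, d_model);
        }
    }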
Interestingly, this work-around may not even be necessary. There has been some theoretical research suggesting that positional encoding can be omitted completely, and that Transformers are apparently capable of learning positional context without the extra hints. The approach is amusingly named NoPE (“no positional encoding”), but this is still early research (see Chapter 27).