Aussie AI

Attention Head Approximation

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Much of the research into attention heads has focused on attention head pruning (removing redundant or under-utilized attention heads) or on reducing the quadratic cost of attention in terms of sequence length (which is related to non-autoregressive algorithms). However, there are also “simple” or “approximate” attention mechanisms that have been proposed as replacements for the original Transformer components.

Note that the default Transformer attention mechanism is “full attention,” where every token attends to every other token. This is what creates the quadratic complexity: N tokens each attend to N tokens, giving N*N computational mappings.
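
To make the N*N cost concrete, below is a minimal C++ sketch (illustrative only, not code from the book) that computes the raw scaled dot-product scores for full attention, assuming the query and key vectors for all N tokens have already been computed; the softmax, masking, and value-weighted sum are omitted. The nested loops over i and j are the source of the quadratic cost.

    #include <cstddef>
    #include <cmath>
    #include <vector>

    // Minimal sketch: raw attention scores for full attention.
    // Q and K each hold N token vectors of dimension d.
    // The double loop over i and j gives the O(N*N) cost.
    std::vector<std::vector<float>> full_attention_scores(
            const std::vector<std::vector<float>>& Q,
            const std::vector<std::vector<float>>& K,
            int d)
    {
        const std::size_t N = Q.size();
        std::vector<std::vector<float>> scores(N, std::vector<float>(N, 0.0f));
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));
        for (std::size_t i = 0; i < N; ++i) {        // each query token...
            for (std::size_t j = 0; j < N; ++j) {    // ...attends to every key token
                float dot = 0.0f;
                for (int k = 0; k < d; ++k) {
                    dot += Q[i][k] * K[j][k];
                }
                scores[i][j] = dot * scale;          // scaled dot-product score
            }
        }
        return scores;
    }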

The idea of approximation is to process fewer of these mappings. Some of the approximate attention approaches include:

  • Local attention: the idea is that each token pays attention only to a few of the preceding tokens. This has also been called “sliding window attention,” because the scope of attention moves along with the current position (see the sliding-window mask in the sketch after this list). This method is fast but has obvious implications for accuracy.
  • Sparse attention: instead of having every token attend to every other token, various attention algorithms reduce this mapping by attending to fewer tokens. Numerous research papers restrict attention to tokens in particular patterns. The cost reduction depends roughly on the sparsity level, but accuracy also declines with sparsity. For example, attending only to every second token is a simplistic idea in this vein, which would roughly halve the cost (see the strided mask in the sketch after this list), but there are much more sophisticated methods in the research literature.
  • Random attention: one type of sparse attention is to pay attention to tokens chosen in a random pattern. Although this sounds silly at first, it is an example of a stochastic (probabilistic) algorithm working in AI, though the model does suffer some accuracy degradation.
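
To make the masking idea concrete, here is a minimal C++ sketch (illustrative only) of two mask predicates: a sliding-window (local) mask where each query token attends only to the previous few tokens, and a simplistic strided mask that attends only to every second prior token. The function names and the window parameter are hypothetical, not taken from any particular paper; the savings come from skipping the masked-out score computations entirely, not from computing and then discarding them.

    #include <cstddef>

    // Minimal sketch: mask predicates deciding whether query position i
    // may attend to key position j. Both assume causal (decoder-style)
    // attention, so only positions j <= i are allowed.

    // Local / sliding-window attention: attend only to the last 'window' tokens.
    inline bool local_attention_allowed(std::size_t i, std::size_t j, std::size_t window)
    {
        return j <= i && i - j < window;
    }

    // Simplistic strided sparse attention: attend only to every second prior
    // token, roughly halving the number of scores computed.
    inline bool strided_attention_allowed(std::size_t i, std::size_t j)
    {
        return j <= i && (i - j) % 2 == 0;
    }

    // Usage inside the score loops: skip masked-out pairs instead of computing them.
    //   for (std::size_t i = 0; i < N; ++i)
    //       for (std::size_t j = 0; j <= i; ++j)
    //           if (local_attention_allowed(i, j, 256)) { /* compute scores[i][j] */ }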

Note that in addition to these various faster attention algorithms, there are also research papers that perform extra computation to get a slower-but-smarter attention algorithm. For example, one such approach is “double attention.”

 
