Aussie AI

Long Context

  • Last Updated 3 December, 2024
  • by David Spuler, Ph.D.

Context window size is the number of input tokens that a model can process at once. Early models, including the original ChatGPT, had small context windows of about 2,048 tokens. Each token is usually a part-word or a whole word, so this meant inputs of roughly 1,000-2,000 words.
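
As a rough back-of-the-envelope illustration (the words-per-token ratio is an assumption here; real tokenizers vary, with common estimates in the 0.6-0.75 range for English), a simple conversion shows how window sizes map to word counts:

```python
# Rough estimate of how many English words fit in a context window.
# The 0.6 words-per-token ratio is an assumption; real tokenizers vary.
WORDS_PER_TOKEN = 0.6

def approx_words(context_tokens: int) -> int:
    """Approximate word capacity for a given context window size."""
    return int(context_tokens * WORDS_PER_TOKEN)

print(approx_words(2048))    # ~1,200 words
print(approx_words(32000))   # ~19,000 words
print(approx_words(100000))  # ~60,000 words
```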

Why seek a longer input size? Because a typical report might run to 5,000 words, and a full-length novel is around 100,000 words, or even 200,000 words in the "epic sci-fi" genre.

Newer models have been increasing the context window size. For example, GPT-4 has a 32k window size, or about 32,000 tokens, enough to handle a small novella or novelette of perhaps 15,000-20,000 words. Anthropic reportedly has a Claude model with a 100k context window. MosaicML's open-source MPT family includes "MPT-7B-StoryWriter-65k+" with a 65,000-token window size.

Why is there a context size limitation? One of the main bottlenecks is the "quadratic" cost of the self-attention mechanism, and there are various ways to optimize attention to overcome it. However, it is not the only bottleneck. Alperovich (2023) describes the "secret sauce" for long contexts as fixing three main bottlenecks (a minimal sketch of the quadratic attention cost follows the list):

  • Quadratic attention cost (in the input token size)
  • Quadratic size of internal tensors (in the model dimension)
  • Positional embedding cost
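
To make the first bottleneck concrete, here is a minimal sketch of vanilla scaled dot-product self-attention in NumPy: the intermediate score matrix has shape (n, n) for n input tokens, so both its memory and compute grow quadratically with sequence length. This is illustrative only, omitting batching, multiple heads, and causal masking.

```python
import numpy as np

def naive_self_attention(Q, K, V):
    """Vanilla scaled dot-product attention.

    Q, K, V: (n, d) arrays for n tokens with head dimension d.
    The intermediate score matrix is (n, n), which is the source
    of the quadratic cost in the number of input tokens.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                   # (n, n) -- quadratic in n
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # (n, d)

# Doubling the context length quadruples the number of attention scores:
d = 64
for n in (1024, 2048, 4096):
    Q = K = V = np.random.randn(n, d).astype(np.float32)
    out = naive_self_attention(Q, K, V)
    print(n, "tokens ->", n * n, "attention scores")
```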

Some of the techniques relevant to processing and generating longer token sequences are covered in the sections below.

Length Generalization: Accuracy with Longer Contexts

Speed is not the only problem with long contexts. Vanilla Transformers are also not particularly good at generalizing to context lengths longer than those seen during training. This ability is known as "length generalization" (or "length extrapolation"), and improving accuracy on long inputs and longer outputs is an area of active research.

One class of methods being analyzed to improve length generalization is "scratchpad" or "chain-of-thought" algorithms. The idea is that the AI inference engine emits rough summaries to an internal scratchpad at regular intervals, and these are merged into subsequent inference, so that the model helps itself keep track of its own reasoning over a longer output sequence. A rough sketch of this loop follows.
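
As an illustration of the general scratchpad idea (not any specific paper's algorithm), the loop below generates a long output in chunks and carries a rolling summary forward into each subsequent prompt. The `generate` and `summarize` functions are hypothetical stand-ins for calls to an LLM inference engine.

```python
def generate(prompt: str, max_tokens: int) -> str:
    """Hypothetical LLM call that continues the prompt."""
    raise NotImplementedError  # plug in your inference engine here

def summarize(text: str) -> str:
    """Hypothetical LLM call that condenses text into a short summary."""
    raise NotImplementedError

def scratchpad_generate(task: str, chunks: int = 5, chunk_tokens: int = 512) -> str:
    """Generate a long output in chunks, carrying a scratchpad summary
    forward so the model can keep track of its own earlier output."""
    scratchpad = ""   # rolling summary of what has been written so far
    output = []
    for _ in range(chunks):
        prompt = (
            f"Task: {task}\n"
            f"Notes so far (scratchpad): {scratchpad}\n"
            "Continue the output:\n"
        )
        chunk = generate(prompt, max_tokens=chunk_tokens)
        output.append(chunk)
        # Merge the new chunk into the scratchpad for the next iteration.
        scratchpad = summarize(scratchpad + "\n" + chunk)
    return "".join(output)
```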

Research papers on "length generalization" include:

Industry Articles on Long Context Length

Real-world industry models have started offering longer context windows:

Research on Quadratic Attention Cost

Linearizing the attention algorithm to avoid the quadratic cost of attention processing is an area with a massive research base and numerous proposed algorithms. Faster attention algorithms include sparse attention and Flash Attention. See research on attention optimization methods.
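
As one simple member of the sparse-attention family (a causal sliding-window variant, not FlashAttention itself), the sketch below restricts each token to attending only to the previous `window` tokens, reducing the score computation from O(n^2) to O(n * window). It is a single-head, unbatched illustration under those assumptions.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window: int = 256):
    """Causal sliding-window (local) attention.

    Each query position i attends only to keys in [i - window + 1, i],
    so the total number of scores is O(n * window) rather than O(n^2).
    Q, K, V: (n, d) arrays; illustrative single-head version.
    """
    n, d = Q.shape
    out = np.zeros_like(V)
    for i in range(n):
        lo = max(0, i - window + 1)
        scores = Q[i] @ K[lo:i + 1].T / np.sqrt(d)  # at most `window` scores
        scores -= scores.max()                       # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum()                     # softmax over local keys
        out[i] = weights @ V[lo:i + 1]
    return out

# Usage: a 1024-token context with a 256-token local window.
n, d = 1024, 64
Q = K = V = np.random.randn(n, d).astype(np.float32)
out = sliding_window_attention(Q, K, V, window=256)
print(out.shape)  # (1024, 64)
```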

Research on Positional Encoding Optimization

One of the less obvious bottlenecks for long contexts is the positional encoding algorithm. See research on positional encoding optimizations and removing positional encoding.
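
For context, here is the standard sinusoidal positional encoding from the original Transformer, written as a minimal NumPy sketch; long-context research studies alternatives such as relative and rotary encodings (RoPE), and in some work removes positional encodings entirely.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> np.ndarray:
    """Classic sinusoidal encoding:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    Returns a (num_positions, d_model) array (d_model assumed even)."""
    positions = np.arange(num_positions)[:, None]            # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)   # (n, d_model/2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Encodings for a 4096-token context with model dimension 512:
pe = sinusoidal_positional_encoding(4096, 512)
print(pe.shape)  # (4096, 512)
```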

Research on Context Length

Research on making "longer" Transformer models includes:

More AI Research

Read more about: