
Long Context Research

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Context size is the number of input tokens that a model can process. Early models, even ChatGPT, had small context sizes of about 2,048 tokens. Each token is usually a part-word or a whole word, so this meant inputs of only about 1,000-2,000 words.
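As a rough back-of-the-envelope illustration (not code from the book), the tokens-per-word ratio can be used to check whether a document is likely to fit in a context window. The estimate_tokens helper and the 1.3 tokens-per-word ratio below are illustrative assumptions; a real engine would ask its tokenizer for an exact count.

    #include <iostream>
    #include <sstream>
    #include <string>

    // Hypothetical helper: rough token estimate, assuming ~1.3 tokens per English word.
    // A real tokenizer (e.g. BPE) is needed for exact counts.
    size_t estimate_tokens(const std::string& text) {
        std::istringstream iss(text);
        std::string word;
        size_t words = 0;
        while (iss >> word) ++words;
        return static_cast<size_t>(words * 1.3 + 0.5);  // round to nearest
    }

    int main() {
        const size_t context_limit = 2048;  // an early ChatGPT-sized window
        std::string doc = "The quick brown fox jumps over the lazy dog.";
        size_t approx = estimate_tokens(doc);
        std::cout << "Approx. tokens: " << approx
                  << (approx <= context_limit ? " (fits)" : " (too long)") << "\n";
        return 0;
    }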

Why seek a longer context size? Because the context is not just the user's current prompt, but also everything else that the engine must process. Note also that context is not only input text but also output text, because the Transformer must track its own prior output while generating the next token. Some examples where long context matters include (see the sketch after this list):

  • Completions: for the engine to write long text responses or creative essays, it needs to track the context of what it has already written, and its output text is also processed as input context as it goes along.
  • Chatbots or Q&A: The full length of the text to be processed is not just the user's current question, but also the full contextual history of the prior conversation.
  • Retrieval Augmented Generation (RAG): the extra retrieved document “chunks” must be processed as context in addition to the user's query.
  • Editing: the document to be analyzed is input context. A user's report could be 5,000 words, and a full-length novel is 50,000-100,000 words, or even 200,000 words if it's in the “epic sci-fi” genre.
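To make that accounting concrete, here is a minimal sketch of how a chatbot with RAG might assemble its full input context, dropping the oldest conversation turns when the token budget is exceeded. This is not the book's engine code: the count_tokens heuristic, the build_context function, and the 2,048-token budget are all illustrative assumptions.

    #include <deque>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical token counter; a real engine would call its tokenizer here.
    size_t count_tokens(const std::string& text) {
        return text.size() / 4 + 1;  // crude ~4 characters per token heuristic
    }

    // Build the full context: system prompt + RAG chunks + chat history + query.
    // The history is taken by value so that a trimmed copy can be built;
    // the oldest turns are dropped first if the token budget is exceeded.
    std::string build_context(const std::string& system_prompt,
                              const std::vector<std::string>& rag_chunks,
                              std::deque<std::string> history,
                              const std::string& user_query,
                              size_t max_context_tokens) {
        auto total = [&]() {
            size_t n = count_tokens(system_prompt) + count_tokens(user_query);
            for (const auto& c : rag_chunks) n += count_tokens(c);
            for (const auto& h : history) n += count_tokens(h);
            return n;
        };
        while (total() > max_context_tokens && !history.empty()) {
            history.pop_front();  // drop the oldest conversation turn
        }
        std::string ctx = system_prompt + "\n";
        for (const auto& c : rag_chunks) ctx += c + "\n";
        for (const auto& h : history) ctx += h + "\n";
        ctx += user_query;
        return ctx;
    }

    int main() {
        std::deque<std::string> history = {"User: Hi", "Bot: Hello!"};
        std::vector<std::string> chunks = {"Retrieved chunk about context windows."};
        std::string ctx = build_context("You are a helpful assistant.",
                                        chunks, history,
                                        "How big is the context window?", 2048);
        std::cout << ctx << "\n";
        return 0;
    }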

Newer models have been increasing the context size. For example, GPT-4 has a 32k window size, about 32,000 tokens, which will handle a small novella or novelette of maybe 15,000-20,000 words. Anthropic reportedly has a Claude model with a 100k context size, which will hold most document sizes. MosaicML has an open-source model called “MPT-7B-StoryWriter-65k+” with a 65,000-token window size.

Why is there a context size limitation? One of the main bottlenecks is the "quadratic" cost of the self-attention mechanism (a naive version is sketched after the list below), and there are various ways to optimize attention to overcome this limitation. However, it is not the only bottleneck, and Alperovich (2023) describes the "secret sauce" of long contexts as fixing three main bottlenecks:

  • Quadratic attention cost (in the input token size)
  • Quadratic size of internal tensors (in the model dimension)
  • Positional embedding cost
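The quadratic attention cost is easy to see in a naive implementation of attention scoring: each of the n query tokens is compared against all n key tokens, producing an n x n score matrix. The simplified single-head sketch below (illustrative only, with scaling but no masking or softmax) makes the O(n^2) work and memory explicit.

    #include <cmath>
    #include <vector>

    // Naive single-head attention scores: for n tokens with head dimension d,
    // this computes an n x n matrix of scaled dot products -- O(n^2 * d) work
    // and O(n^2) memory, which is the core long-context bottleneck.
    std::vector<std::vector<float>>
    attention_scores(const std::vector<std::vector<float>>& Q,   // n x d queries
                     const std::vector<std::vector<float>>& K) { // n x d keys
        const size_t n = Q.size();
        const size_t d = n ? Q[0].size() : 0;
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));
        std::vector<std::vector<float>> scores(n, std::vector<float>(n, 0.0f));
        for (size_t i = 0; i < n; ++i) {          // each query token...
            for (size_t j = 0; j < n; ++j) {      // ...against every key token
                float dot = 0.0f;
                for (size_t k = 0; k < d; ++k) dot += Q[i][k] * K[j][k];
                scores[i][j] = dot * scale;
            }
        }
        return scores;  // softmax and the weighted sum over V would follow
    }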

Some of the engine optimization techniques relevant to processing and generating longer token sequences include:

  • Faster attention algorithms (e.g., local or sliding-window attention, sketched after this list)
  • Autoregression optimizations
  • Tokenization algorithms
  • Token pruning
  • Embeddings pruning
  • Length pruning
  • Positional encoding optimization
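As a hedged illustration of the first category, local or sliding-window attention restricts each token to the most recent W tokens, so the work drops from roughly n x n to n x W. The sketch below is a simplified version of the idea, not a production kernel; the function name and data layout are assumptions.

    #include <cmath>
    #include <vector>

    // Sliding-window ("local") attention scores: token i attends only to keys
    // in [i - window + 1, i], so the work is O(n * window * d) instead of
    // O(n^2 * d). Each output row holds only its local (causal) window.
    std::vector<std::vector<float>>
    local_attention_scores(const std::vector<std::vector<float>>& Q,  // n x d
                           const std::vector<std::vector<float>>& K,  // n x d
                           size_t window) {
        const size_t n = Q.size();
        const size_t d = n ? Q[0].size() : 0;
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));
        std::vector<std::vector<float>> scores(n);
        for (size_t i = 0; i < n; ++i) {
            size_t start = (i + 1 > window) ? i + 1 - window : 0;
            scores[i].resize(i - start + 1);
            for (size_t j = start; j <= i; ++j) {   // only the local window
                float dot = 0.0f;
                for (size_t k = 0; k < d; ++k) dot += Q[i][k] * K[j][k];
                scores[i][j - start] = dot * scale;
            }
        }
        return scores;
    }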

 
