Aussie AI

Lookup Decoding

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

What is Lookup Decoding?

Lookup decoding is the idea of searching existing text for hints about which tokens should come next. The matched text is used to "speculate" about future tokens by forming a "draft" sequence, which the large model then "verifies", but with the advantage of doing this in parallel (rather than serially, as in normal autoregressive decoding with the large model).

This "draft then verify in parallel" idea is used in other techniques. This is a similar idea to aggressive decoding, which uses the input prompt in this way, mainly for the editing use case. Similarly, speculative decoding generates a draft text sequence, so lookup decoding can also be thought of as a special case of speculative decoding. Lookup decoding differs from these approaches by finding its draft sequence in other texts, by searching a prebuilt document datastore or similar text search methods.

Types of Lookup Decoding

There are two main types of lookup decoding:

  • Retrieval lookup decoding
  • Prompt lookup decoding

Retrieval lookup decoding searches for the current token sequence in a datastore, such as a collection of documents. In prompt lookup decoding, the search is restricted to the text of the input prompt, on the assumption that the output text will contain sub-sequences of the input (often true for RAG prompts).

Research on Lookup Decoding

Research papers on lookup decoding are listed in the sections below, grouped by its two main variants.

Prompt Lookup Decoding

A special case of lookup decoding is "prompt lookup decoding", which searches the input prompt itself for sequences to use as the draft, which can then be verified in parallel for a wall-clock speedup. This method requires no additional datastore or cache of documents beyond the submitted input prompt and its included text content.

This prompt lookup method works well when sub-sequences of the input prompt, especially the "context", are likely to reappear in the output. A good example is RAG prompts, where the output text typically contains parts of the "chunks" returned by the RAG retriever/datastore, as they appear in the context.
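
As a rough illustration, here is a self-contained Python sketch of the draft-proposal step in prompt lookup decoding: the most recent n-gram of tokens is searched for earlier in the prompt (plus any tokens generated so far), and the tokens that followed the earlier match are proposed as the draft. The function name, parameters, and default values are illustrative, not taken from any particular implementation.

    # Draft-proposal step of prompt lookup decoding: find the latest earlier
    # occurrence of the final n-gram, and propose the tokens after that match.
    # ngram_size and max_draft_len are illustrative tuning parameters.

    def prompt_lookup_draft(tokens, ngram_size=3, max_draft_len=10):
        if len(tokens) < ngram_size:
            return []
        ngram = tokens[-ngram_size:]
        for start in range(len(tokens) - ngram_size - 1, -1, -1):
            if tokens[start:start + ngram_size] == ngram:
                follow = start + ngram_size
                return tokens[follow:follow + max_draft_len]
        return []   # no match found: fall back to ordinary one-token decoding

    # Toy example: token IDs standing in for a RAG prompt whose chunk text
    # reappears later, so the earlier continuation becomes the draft.
    context = [5, 9, 2, 7, 3, 9, 2, 7, 3, 11, 4, 8, 9, 2, 7]
    print(prompt_lookup_draft(context))   # -> [3, 11, 4, 8, 9, 2, 7]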

Like all variations on draft-then-verify-in-parallel, this method trades extra computation for a latency speedup, making it applicable to parallel GPU implementations, but less effective for resource-constrained on-device inference such as AI PCs or AI phones. However, this method does not require a smaller "draft model"; its sequential overhead is primarily a relatively simple text search for sub-sequences within the input prompt.

Research on Prompt Lookup Decoding

  • Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
  • Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Inference with Reference: Lossless Acceleration of Large Language Models. arXiv:2304.04487 [cs.CL] https://arxiv.org/abs/2304.04487
  • vLLM, 2024, Speculative decoding in vLLM, https://docs.vllm.ai/en/stable/models/spec_decode.html (vLLM has speculative decoding and also looks for n-grams in the prompt, which is prompt lookup decoding.)
  • Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
  • Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
  • Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, 2 Dec 2024, PLD+: Accelerating LLM inference by leveraging Language Model Artifacts, https://arxiv.org/abs/2412.01447

Retrieval Lookup Decoding

Retrieval lookup decoding uses a token sequence from a retrieved document as the draft for speculative decoding. An obvious source is the token sequences inside a RAG datastore, although for RAG the retrieved chunks also appear in the prompt, so prompt lookup decoding may work as well or better in that case. More generally, retrieval lookup decoding can gather the draft tokens from any datastore, web search, or other data source integration.
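
As a rough sketch under simplifying assumptions, the lookup step might use a pre-built index mapping n-grams of token IDs to the continuations observed in a document datastore. Real systems compact and deduplicate such a datastore (see the CREST paper below); this toy Python index and its function names are purely illustrative.

    # Toy n-gram index over a document datastore, mapping each n-gram of token
    # IDs to the continuation that followed it; the lookup returns that
    # continuation as the speculative draft (an empty list means no speculation).

    def build_ngram_index(documents, ngram_size=3, continuation_len=8):
        index = {}
        for doc in documents:                  # each doc is a list of token IDs
            for i in range(len(doc) - ngram_size):
                key = tuple(doc[i:i + ngram_size])
                # Simplest policy: keep the first continuation seen per n-gram.
                index.setdefault(key, doc[i + ngram_size:i + ngram_size + continuation_len])
        return index

    def retrieval_draft(tokens, index, ngram_size=3):
        return index.get(tuple(tokens[-ngram_size:]), [])

    # Toy example: the datastore contains a sequence continuing the current context.
    docs = [[1, 2, 3, 4, 5, 6, 7, 8], [9, 9, 1, 2, 3, 10, 11]]
    index = build_ngram_index(docs)
    print(retrieval_draft([20, 21, 1, 2, 3], index))   # -> [4, 5, 6, 7, 8]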

  • Sophia Ho, Jinsol Park, Patrick Wang, 8 Aug 2024, CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2408.04678
