Aussie AI

Lookup Decoding

  • Last Updated 17 March, 2025
  • by David Spuler, Ph.D.

What is Lookup Decoding?

Lookup decoding is the technique of searching existing text for hints about which tokens should come next. The matched text is used to "speculate" about future tokens, producing a "draft" sequence that can then be "verified" by the large model, with the advantage that verification runs in parallel (rather than serially, as in the large model's normal autoregressive decoding).

This "draft then verify in parallel" idea is used in other techniques. This is a similar idea to aggressive decoding, which uses the input prompt in this way, mainly for the editing use case. Similarly, speculative decoding generates a draft text sequence, so lookup decoding can also be thought of as a special case of speculative decoding. Lookup decoding differs from these approaches by finding its draft sequence in other texts, by searching a prebuilt document datastore or similar text search methods.

Types of Lookup Decoding

There are two main types of lookup decoding:

  • Retrieval lookup decoding
  • Prompt lookup decoding

Retrieval lookup decoding searches for the most recently generated tokens in a datastore, such as a collection of documents. In prompt lookup decoding, the search is restricted to the text of the input prompt, which assumes that the output text will contain sub-sequences of the input (often true of RAG prompts).

Research on Lookup Decoding

Research papers include:

  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092, Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations, including a section on lookup decoding.)
  • Apoorv Saxena, November 2023, Prompt Lookup Decoding, https://github.com/apoorvumang/prompt-lookup-decoding
  • Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei, 10 Apr 2023, Inference with Reference: Lossless Acceleration of Large Language Models, https://arxiv.org/abs/2304.04487 (Uses lookup in the prompt to speedup decoding.)
  • Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He, 4 Apr 2024 (v2), REST: Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2311.08252
  • Sophia Ho, Jinsol Park, Patrick Wang, 8 Aug 2024, CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2408.04678
  • Qianhui Zhao, Li Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Borui Zhang, Runlin Guo, Jia Li, 24 Feb 2025, CodeSwift: Accelerating LLM Inference for Efficient Code Generation, https://arxiv.org/abs/2502.17139 (Uses draft sequences from a datastore of code to achieve parallel inference, similar to prompt lookup decoding or retrieval lookup decoding.)
  • Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh, 27 Feb 2025, Long-Context Inference with Retrieval-Augmented Speculative Decoding, https://arxiv.org/abs/2502.20330
  • Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, Hao Wang, 5 Mar 2025, RASD: Retrieval-Augmented Speculative Decoding, https://arxiv.org/abs/2503.03434

Prompt Lookup Decoding

A special case of this technique is "prompt lookup decoding," which searches the input prompt itself for sequences to use as the draft, which can then be verified in parallel for a wall-clock speedup. This method requires no additional datastore or document cache beyond the submitted input prompt and its included text content.

Prompt lookup works well when sub-sequences of the input prompt, especially the "context," are likely to appear in the output. A good example is RAG prompts, where the output text will often contain parts of the "chunks" placed into the context by the RAG retriever/datastore.

Like all variations of draft-then-verify-in-parallel, this method spends extra computation to gain a latency speedup, making it well suited to parallel GPU implementations but less effective for resource-constrained on-device inference, such as on AI PCs or AI phones. However, the method does not require a smaller "draft model": its only sequential overhead is a relatively simple text search for sub-sequences within the input prompt.
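
To show what that search step can look like, here is a minimal sketch of n-gram matching against the current context, in the spirit of prompt lookup decoding. It is an illustrative sketch under assumed inputs, not code from any of the implementations cited below; a draft found this way would feed a verification loop like the one sketched earlier.

```python
def prompt_lookup_draft(tokens, ngram_size=3, draft_len=10):
    """Match the trailing n-gram of the sequence against earlier text
    (the prompt and any output so far) and propose the tokens that
    followed the match as the draft. Returns [] on a miss."""
    if len(tokens) <= ngram_size:
        return []
    ngram = tokens[-ngram_size:]
    # Scan right-to-left for the most recent earlier occurrence,
    # excluding the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == ngram:
            follow = start + ngram_size
            return tokens[follow:follow + draft_len]
    return []
```

Restricting the scan to the prompt's token range gives strict prompt-only lookup; matching against the whole context (prompt plus generated output, as above) is a common variant.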

Research on Prompt Lookup Decoding

  • Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
  • Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei, 10 Apr 2023, Inference with Reference: Lossless Acceleration of Large Language Models, https://arxiv.org/abs/2304.04487
  • vLLM, 2024, Speculative decoding in vLLM, https://docs.vllm.ai/en/stable/models/spec_decode.html (vLLM has speculative decoding and also looks for n-grams in the prompt, which is prompt lookup decoding.)
  • Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
  • Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
  • Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, 2 Dec 2024, PLD+: Accelerating LLM inference by leveraging Language Model Artifacts, https://arxiv.org/abs/2412.01447
  • Hongxuan Zhang, Zhining Liu, Yao Zhao, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen, 4 Jun 2024 (v2), Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster, https://arxiv.org/abs/2311.08263 (Use of Jacobi parallel decoding with Chain-of-Thought.)
  • Razvan-Gabriel Dumitru, Minglai Yang, Vikas Yadav, Mihai Surdeanu, 13 Feb 2025, CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality, https://arxiv.org/abs/2502.08923 https://github.com/RazvanDu/CopySpec (Prompt lookup decoding done across the entire chat history across multiple user queries.)
  • Qianhui Zhao, Li Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Borui Zhang, Runlin Guo, Jia Li, 24 Feb 2025, CodeSwift: Accelerating LLM Inference for Efficient Code Generation, https://arxiv.org/abs/2502.17139 (Uses draft sequences from a datastore of code to achieve parallel inference, similar to prompt lookup decoding or retrieval lookup decoding.)
  • Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh, 27 Feb 2025, Long-Context Inference with Retrieval-Augmented Speculative Decoding, https://arxiv.org/abs/2502.20330

Retrieval Lookup Decoding

Retrieval lookup decoding uses a token sequence from a retrieved document as the draft for speculative decoding. An obvious source of drafts is the token sequences inside a RAG datastore, although in that setting the retrieved chunks also appear in the prompt, so prompt lookup decoding would also work, possibly better. In general, retrieval lookup decoding can gather its draft tokens from any datastore, web search, or data source integration.
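
As an illustration of the retrieval variant, here is a minimal sketch that indexes a tokenized document collection by n-grams and fetches draft continuations from that datastore. The datastore layout and function names are assumptions for illustration; real systems keep multiple candidate continuations per key and use more efficient index structures.

```python
def build_ngram_datastore(documents, ngram_size=3, draft_len=10):
    """Index tokenized documents by n-gram, keeping one draft-length
    continuation per n-gram (real systems keep and rank many)."""
    store = {}
    for doc_tokens in documents:
        for i in range(len(doc_tokens) - ngram_size):
            key = tuple(doc_tokens[i:i + ngram_size])
            if key not in store:
                store[key] = doc_tokens[i + ngram_size:i + ngram_size + draft_len]
    return store

def retrieval_lookup_draft(store, tokens, ngram_size=3):
    """Fetch a draft continuation for the current trailing n-gram;
    returns [] on a datastore miss."""
    if len(tokens) < ngram_size:
        return []
    return store.get(tuple(tokens[-ngram_size:]), [])
```

The drafts returned here plug into the same parallel verification loop as before; only the source of the draft tokens changes.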

  • Sophia Ho, Jinsol Park, Patrick Wang, 8 Aug 2024, CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2408.04678
  • Qianhui Zhao, Li Zhang, Fang Liu, Xiaoli Lian, Qiaoyuanhe Meng, Ziqian Jiao, Zetong Zhou, Borui Zhang, Runlin Guo, Jia Li, 24 Feb 2025, CodeSwift: Accelerating LLM Inference for Efficient Code Generation, https://arxiv.org/abs/2502.17139 (Uses draft sequences from a datastore of code to achieve parallel inference, similar to prompt lookup decoding or retrieval lookup decoding.)
  • Guanzheng Chen, Qilong Feng, Jinjie Ni, Xin Li, Michael Qizhe Shieh, 27 Feb 2025, Long-Context Inference with Retrieval-Augmented Speculative Decoding, https://arxiv.org/abs/2502.20330
  • Guofeng Quan, Wenfeng Feng, Chuzhan Hao, Guochao Jiang, Yuewei Zhang, Hao Wang, 5 Mar 2025, RASD: Retrieval-Augmented Speculative Decoding, https://arxiv.org/abs/2503.03434
