Aussie AI
Lookup Decoding
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
What is Lookup Decoding?
Lookup decoding is the technique of looking up text sequences to get hints about what the next tokens should be. The matched text is used to "speculate" about future tokens, generating a "draft" sequence that can then be "verified" by the large model, with the advantage that verification runs in parallel (rather than serially, as in the normal autoregressive decoding of the large model).
This "draft then verify in parallel" idea is used in other techniques. This is a similar idea to aggressive decoding, which uses the input prompt in this way, mainly for the editing use case. Similarly, speculative decoding generates a draft text sequence, so lookup decoding can also be thought of as a special case of speculative decoding. Lookup decoding differs from these approaches by finding its draft sequence in other texts, by searching a prebuilt document datastore or similar text search methods.
Types of Lookup Decoding
There are two main types of lookup decoding:
- Retrieval lookup decoding
- Prompt lookup decoding
Retrieval lookup decoding searches for the most recent tokens in a datastore, such as a collection of documents. In prompt lookup decoding, the search is only within the text of the input prompt, which assumes that the output text will contain sub-sequences of the input (often true in RAG prompts).
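The lookup step itself is the same in both cases; here is a minimal sketch in illustrative Python (find_draft and its parameters are assumptions, not any standard API), where the source tokens come from either the input prompt (prompt lookup) or a retrieved document (retrieval lookup):

def find_draft(source, recent, ngram=3, max_draft=8):
    """Search the source tokens for the last n-gram of generated tokens,
    and return the tokens that follow the match as a speculative draft."""
    if len(recent) < ngram:
        return []
    key = recent[-ngram:]
    # Linear scan for simplicity; a real datastore would use a prebuilt index.
    for start in range(len(source) - ngram + 1):
        if source[start:start + ngram] == key:
            follow = start + ngram
            return source[follow:follow + max_draft]
    return []  # no match: fall back to normal autoregressive decoding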
Research on Lookup Decoding
Research papers include:
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092, Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including a section on lookup decoding.)
- Apoorv Saxena, November 2023, Prompt Lookup Decoding, https://github.com/apoorvumang/prompt-lookup-decoding
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei, 10 Apr 2023, Inference with Reference: Lossless Acceleration of Large Language Models, https://arxiv.org/abs/2304.04487 (Uses lookup in the prompt to speedup decoding.)
- Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He, 4 Apr 2024 (v2), REST: Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2311.08252
- Sophia Ho, Jinsol Park, Patrick Wang, 8 Aug 2024, CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2408.04678
Prompt Lookup Decoding
A special case of this is "prompt lookup decoding" which searches the input prompt itself for sequences to use as the draft, which can be verified in parallel for a wall clock speedup. This method does not use any additional datastore or cache of documents, beyond the submitted input prompt and its included text content.
This prompt lookup method works well if sub-sequences of the input prompt, especially the "context", are going to appear in the output. A good example of this situation is RAG prompts where the output text will contain parts of the "chunks" from the RAG retriever/datastore, as they appear in the context.
Like all of the draft-then-verify-in-parallel variations, this method trades extra computation for a latency speedup, making it applicable to parallel GPU implementations, but less effective for resource-constrained on-device inference such as AI PCs or AI phones. However, this method does not require a smaller "draft model"; its sequential overhead is primarily a relatively simple text search for sub-sequences within the input prompt.
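As a usage example, the Hugging Face Transformers library supports prompt lookup decoding (at the time of writing) via the prompt_lookup_num_tokens argument to generate(); the model name and prompt below are only illustrative:

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # any causal LM; a small model chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Prompt lookup decoding works best when the output repeats parts of the
# prompt, such as RAG prompts with long retrieved context chunks.
prompt = "The quick brown fox jumps over the lazy dog. The quick brown"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    prompt_lookup_num_tokens=10,  # maximum draft length copied from the prompt
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))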
Research on Prompt Lookup Decoding
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, and Furu Wei. 2023. Inference with Reference: Lossless Acceleration of Large Language Models. arXiv:2304.04487 [cs.CL] https://arxiv.org/abs/2304.04487
- vLLM, 2024, Speculative decoding in vLLM, https://docs.vllm.ai/en/stable/models/spec_decode.html (vLLM has speculative decoding and also looks for n-grams in the prompt, which is prompt lookup decoding.)
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- Lawrence Stewart, Matthew Trager, Sujan Gonugondla, Stefano Soatto, 2024, The N-Grammys: Accelerating autoregressive inference with learning-free batched speculation, 4th NeurIPS Efficient Natural Language and Speech Processing Workshop (ENLSP-IV 2024), https://www.amazon.science/publications/the-n-grammys-accelerating-autoregressive-inference-with-learning-free-batched-speculation (Use a variety of heuristics instead of a draft model, such as precalculated likelihoods, and also prompt lookup decoding using n-grams from the context tokens.)
- Shwetha Somasundaram, Anirudh Phukan, Apoorv Saxena, 2 Dec 2024, PLD+: Accelerating LLM inference by leveraging Language Model Artifacts, https://arxiv.org/abs/2412.01447
Retrieval Lookup Decoding
Retrieval lookup decoding is where a token sequence from a retrieved document is used as the draft for speculative decoding. An obvious example is the token sequences inside a RAG datastore, although since RAG chunks also appear in the prompt context, prompt lookup decoding would also find them, possibly more cheaply. More generally, retrieval lookup decoding can gather the draft tokens from any datastore, web search, or data source integration.
- Sophia Ho, Jinsol Park, Patrick Wang, 8 Aug 2024, CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding, https://arxiv.org/abs/2408.04678
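Below is a minimal sketch of the retrieval side, assuming a simple exact n-gram index over a tokenized document collection; real systems such as REST use more scalable suffix-based datastores, and the function names here are illustrative only.

from collections import defaultdict

def build_index(documents, ngram=3):
    """Map each n-gram in the corpus to (document, position) pairs."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(documents):
        for start in range(len(tokens) - ngram):
            key = tuple(tokens[start:start + ngram])
            index[key].append((doc_id, start + ngram))
    return index

def retrieve_draft(index, documents, recent, ngram=3, max_draft=8):
    """Look up the last n-gram of generated tokens in the datastore and
    return a continuation from the corpus as the speculative draft."""
    hits = index.get(tuple(recent[-ngram:]))
    if not hits:
        return []
    doc_id, follow = hits[0]  # a real system would rank multiple hits
    return documents[doc_id][follow:follow + max_draft]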
More Research on Decoding Algorithms
- Decoding algorithms (overview)
— Non-autoregressive decoding
— Greedy decoding
— Top-k decoding
— Top-p decoding
— Min-P Sampling
— Flash decoding
— Beam search decoding
— Edit decoding
— Contrastive decoding
— Constrained decoding
- Parallel decoding (overview)
— Blockwise parallel decoding
— n-gram parallel decoding
— Lookahead decoding
— Medusa decoding
— Consensus decoding
- Speculative decoding (overview)
— Generalized speculative decoding
— Aggressive decoding
— Lookup decoding
— Retrieval lookup decoding
— Prompt lookup decoding
— Self speculative decoding
— Tree speculative decoding
— Superposed decoding
— Hierarchical speculative decoding
— Heuristic speculative decoding
— Multi-token speculative decoding
— Sequential speculative decoding
More AI Research
Read more about: