Aussie AI
Beam Search Decoding
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
What is Beam Search Decoding?
Beam search decoding is an advancement upon greedy decoding that looks ahead a little further. Beam search maintains some alternative possibilities for sequences of tokens to outputs, which is analogous to having it consider a number of different words and phrases, before it finally makes a decision on which to output.
Research on Beam Search Decoding
Research papers on beam search decoding algorithms include:
- Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li Apr 2021, LightSeq: A High Performance Inference Library for Transformers, https://arxiv.org/pdf/2010.13887.pdf
- James Briggs Feb 25, 2021, The Three Decoding Methods For NLP, Towards Data Science https://towardsdatascience.com/the-three-decoding-methods-for-nlp-23ca59cb1e9d
- GC Garbacea, 2023, Neural Language Generation for Content Adaptation: Explainable, Efficient Low-Resource Text Simplification and Evaluation, Ph.D. thesis, Computer Science and Engineering, University of Michigan, https://deepblue.lib.umich.edu/bitstream/handle/2027.42/178028/garbacea_1.pdf?sequence=1 (Broad thesis with sections on beam search decoding optimizations and AI safety issues such as bias.)
- Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Guided open vocabulary image captioning with constrained beam search, 2017, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936–945, https://arxiv.org/abs/1612.00576
- Chris Hokamp and Qun Liu, 2017, Lexically constrained decoding for sequence generation using grid beam search. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1535–1546, https://arxiv.org/abs/1704.07138
- G Keren, Feb 2023, A Token-Wise Beam Search Algorithm for RNN-T, arXiv preprint arXiv:2302.14357, https://arxiv.org/abs/2302.14357
- Gian Wiher, Clara Meister, Ryan Cotterell, Mar 2022, On Decoding Strategies for Neural Text Generators, https://arxiv.org/abs/2203.15721 (An evaluation of a variety of decoding algorithms including beam search, top-k, and top-p.)
- Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. CoRR, abs/1610.02424, https://arxiv.org/abs/1610.02424 (An algorithm variant called "diverse beam search" decoding.)
- Ilya Sutskever, Oriol Vinyals, and Quoc V Le, 2014, Sequence to sequence learning with neural networks, arXiv preprint arXiv:1409.3215, https://arxiv.org/abs/1409.3215 (Early paper using a kind of beam search decoding and top-k decoding.)
- Kenton Murray, David Chiang, Aug 2018, Correcting Length Bias in Neural Machine Translation, https://arxiv.org/abs/1808.10006 (Brevity problems in beam search decoding.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia, 2024. SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, April 2024, Pages 932–949, https://doi.org/10.1145/3620666.3651335 https://dl.acm.org/doi/abs/10.1145/3620666.3651335 Code: https://github.com/flexflow/FlexFlow/
- Jared Lichtarge, Christopher Alberti, Shankar Kumar, Noam Shazeer, and Niki Parmar. 2018. Weakly supervised grammatical error correction using iterative decoding. CoRR, abs/1811.01710. https://arxiv.org/abs/1811.01710 (Beam search decoding with a high threshold to emit corrections.)
- Jindrich Libovicky, Jindrich Helcl, Marek Tlusty, Ondrej Bojar, and Pavel Pecina. 2016. CUNI system for WMT16 automatic post-editing and multimodal translation tasks. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 646–654, Berlin, Germany. https://arxiv.org/abs/1606.07481 (Post-editing of machine translation.)
- Daniel Dahlmeier, Hwee Tou Ng, 2012, A Beam-Search Decoder for Grammatical Error Correction, Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 568–578, Jeju Island, Korea, 12–14 July 2012, https://aclanthology.org/D12-1052.pdf
- Xiaoming (Jason) Cui, Ashraf Bhuiyan, 2023, Optimizing Transformer Model Inference on Intel® Processors, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html
- Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7371–7379. AAAI Press. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17329
- Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam, 10 Feb 2024, A Thorough Examination of Decoding Methods in the Era of LLMs, https://arxiv.org/abs/2402.06925 (Evaluates a number of decoding algorithms with several 7B models including Llama2-7B, and also with 4-bit and 8-bit quantization.)
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (The original Paged Attention and vLLM paper, focusing on optimizing memory size of the KV cache using methods similar to operating-system memory paging.)
- Zhaorun Chen, Zhuokai Zhao, Hongyin Luo, Huaxiu Yao, Bo Li, Jiawei Zhou, July 2024, HALC: Object Hallucination Reduction via Adaptive Focal-Contrast Decoding, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:7824-7846, 2024, https://proceedings.mlr.press/v235/chen24bi.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/chen24bi/chen24bi.pdf https://github.com/BillChan226/HALC
- Tinghui Zhu, Kai Zhang, Jian Xie, Yu Su, 4 Feb 2024 (v2), Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning, https://arxiv.org/abs/2401.17686
- Jungo Kasai, Keisuke Sakaguchi, Ronan Le Bras, Dragomir Radev, Yejin Choi, and Noah A. Smith. 2024. A Call for Clarity in Beam Search: How It Works and When It Stops. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 77–90, Torino, Italia. ELRA and ICCL. https://aclanthology.org/2024.lrec-main.7/ https://aclanthology.org/2024.lrec-main.7.pdf
- Zongyue Qin, Zifan He, Neha Prakriya, Jason Cong, Yizhou Sun, 25 Sep 2024, Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference, https://arxiv.org/abs/2409.16560
- Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release Latest, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
- Yejin Lee, Anna Sun, Basil Hosmer, Bilge Acun, Can Balioglu, Changhan Wang, Charles David Hernandez, Christian Puhrsch, Daniel Haziza, Driss Guessous, Francisco Massa, Jacob Kahn, Jeffrey Wan, Jeremy Reizenstein, Jiaqi Zhai, Joe Isaacson, Joel Schlosser, Juan Pino, Kaushik Ram Sadagopan, Leonid Shamis, Linjian Ma, Min-Jae Hwang, Mingda Chen, Mostafa Elhoushi, Pedro Rodriguez, Ram Pasunuru, Scott Yih, Sravya Popuri, Xing Liu, Carole-Jean Wu, 30 Sep 2024, Characterizing and Efficiently Accelerating Multimodal Generation Model Inference, https://arxiv.org/abs/2410.00215 (Analyzes the bottlenecks in inference, finding the usual problems of autoregression, but also more interesting issues such as that linear kernels can be expensive, and KV cache reordering is a bottleneck in beam search, and layer skipping is analyzed.)
- Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua, 8 Oct 2024 (v2), Efficient Inference for Large Language Model-based Generative Recommendation, https://arxiv.org/abs/2410.05165
- Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
More Research on Decoding Algorithms
- Decoding algorithms (overview)
— Non-autoregressive decoding
— Greedy decoding
— Top-k decoding
— Top-p decoding
— Min-P Sampling
— Flash decoding
— Beam search decoding
— Edit decoding
— Contrastive decoding
— Constrained decoding - Parallel decoding (overview)
— Blockwise parallel decoding
— n-gram parallel decoding
— Lookahead decoding
— Medusa decoding
— Consensus decoding - Speculative decoding (overview)
— Generalized speculative decoding
— Aggressive decoding
— Lookup decoding
— Retrieval lookup decoding
— Prompt lookup decoding
— Self speculative decoding
— Tree speculative decoding
— Superposed decoding
— Hierarchical speculative decoding
— Heuristic speculative decoding
— Multi-token speculative decoding
— Sequential speculative decoding
More AI Research
Read more about: