Aussie AI

Non-Autoregression Optimizations

  • Last Updated 8 December, 2024
  • by David Spuler, Ph.D.

One of the biggest obstacles to fast inference of Large Language Models (LLMs) is that they emit one token at a time (e.g. one word at a time). This limits parallelism and means that the entire model must be re-run multiple times, once for each word (or subword token).

Why Autoregression?

The reason for this limitation is that the next word to be output inherently depends on the words that came before it, which is an unavoidable property of human language. In LLM coding circles, this is called the "autoregression" problem, possibly because researchers tend to like big words.

Because of this issue, the LLM is designed so that each word it emits is fed back in as input to the model's next iteration, to help it emit the following word. And that's slow for several reasons (a code sketch of this loop follows the list):

  • The entire model must run once for every output token.
  • The model never produces two or more tokens in parallel.
  • The model cannot start working on the second token before it has finished the first, which limits pipelining (a type of parallelism).
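
To make the sequential dependency concrete, here is a minimal sketch of a greedy autoregressive decoding loop in Python. The "model" callable is a hypothetical stand-in for a full Transformer forward pass that returns next-token logits; it is not any particular library's API. Each iteration needs the previous token before it can start, which is exactly what blocks parallelism.

    # Minimal sketch of greedy autoregressive decoding (hypothetical "model").
    def greedy_decode(model, prompt_tokens, max_new_tokens, eos_token_id):
        tokens = list(prompt_tokens)
        for _ in range(max_new_tokens):
            logits = model(tokens)                   # full model pass for every new token
            next_token = max(range(len(logits)), key=lambda t: logits[t])
            tokens.append(next_token)                # output fed back in as input
            if next_token == eos_token_id:
                break
        return tokens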

There is a growing body of research on fixing this latency problem and achieving more parallelism. The research area is called "non-autoregression" optimization.

Tokens and Non-Autoregression

Although much of the research into autoregression involves major surgery to the LLM architecture, there's a simpler way to mitigate the inefficiency: bigger tokens. If tokens are longer, then fewer of them are emitted for the same amount of output, so the model runs for fewer iterations. This happens if the tokenizer chooses whole words rather than sub-words, or even treats common two-word phrases as single tokens (i.e. multi-word tokens). Longer tokens therefore reduce the overhead of autoregression, and they also shorten the input sequence, which further reduces model execution time (the Transformer's attention algorithm is well-known to be quadratic in the length of the input sequence).
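
As a toy illustration of why longer tokens help (the vocabularies below are hypothetical, not a real tokenizer), a greedy longest-match tokenizer emits fewer tokens when the vocabulary contains whole words or multi-word phrases, and each emitted token costs one full decoding iteration:

    # Toy greedy longest-prefix-match tokenizer over a fixed vocabulary.
    def tokenize_longest_match(text, vocab):
        tokens, i = [], 0
        while i < len(text):
            match = None
            for j in range(len(text), i, -1):           # try the longest prefix first
                if text[i:j] in vocab:
                    match = text[i:j]
                    break
            tokens.append(match if match else text[i])  # fall back to a single character
            i += len(tokens[-1])
        return tokens

    subword_vocab = {"the", " qu", "ick", " bro", "wn", " fox"}  # hypothetical sub-word vocabulary
    phrase_vocab = {"the quick", " brown fox"}                   # hypothetical multi-word tokens

    text = "the quick brown fox"
    print(len(tokenize_longest_match(text, subword_vocab)))  # 6 tokens = 6 decoding steps
    print(len(tokenize_longest_match(text, phrase_vocab)))   # 2 tokens = 2 decoding steps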

The downside is that longer tokens mean more unique tokens, which increases the vocabulary size. Parts of the model scale with the vocabulary size (notably the embedding and output layers), so a larger vocabulary makes the whole model larger, and it runs slower.

Therefore, longer tokens reduce latency by reducing the number of autoregressive steps, but increase latency by making the model larger overall. Maybe there's some happy trade-off here? Most of the current models seem to use a vocabulary of around 50,000 tokens, and the vocabulary size becomes one of the meta-parameters of the model.
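
As a rough back-of-the-envelope sketch of why vocabulary size matters (the numbers are illustrative only, not taken from any particular model), both the input embedding table and the output projection have one row per vocabulary entry, so the vocabulary-dependent parameter count grows linearly with the vocabulary size:

    # Illustrative only: parameters in the vocabulary-dependent layers.
    def vocab_param_count(vocab_size, hidden_dim, tied_embeddings=False):
        embedding = vocab_size * hidden_dim                               # input embedding table
        unembedding = 0 if tied_embeddings else vocab_size * hidden_dim   # output projection
        return embedding + unembedding

    hidden_dim = 4096
    for vocab in (32_000, 50_000, 100_000):
        params = vocab_param_count(vocab, hidden_dim)
        print(f"vocab={vocab:>7,}  vocab-dependent params={params / 1e6:,.0f}M")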

Research on Non-Autoregression Optimizations

Various strategies have been researched for easing the autoregressive bottleneck. Example strategies from the papers listed below include:

  • Non-autoregressive (parallel) decoding, which generates many output tokens in a single pass (e.g. mask-predict).
  • Iterative refinement, where a rough parallel output is revised over a small number of passes.
  • Semi-autoregressive and blockwise parallel decoding, which emit a block of several tokens per step.
  • Speculative and aggressive decoding, where cheaply drafted tokens are verified by the main model in parallel.
  • Shallow decoder architectures, which reduce the cost of each decoding step.
  • Longer or multi-word tokens, as discussed above, which reduce the number of decoding steps.
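
As a concrete example, here is a minimal sketch of the greedy draft-then-verify control flow behind blockwise, aggressive, and speculative decoding. The "model" callable is a hypothetical placeholder that returns the predicted next token at every position from one parallel forward pass, and this simplified version omits the probabilistic acceptance rule used in speculative sampling.

    # Minimal sketch of greedy draft-then-verify block decoding.
    def verify_draft(model, context, draft_tokens):
        # One forward pass scores the whole drafted block in parallel.
        predictions = model(context + draft_tokens)
        accepted = []
        for i, drafted in enumerate(draft_tokens):
            # The prediction for this position comes from the preceding position.
            expected = predictions[len(context) + i - 1]
            if drafted == expected:
                accepted.append(drafted)    # model agrees: commit the drafted token
            else:
                accepted.append(expected)   # first mismatch: keep the model's token and stop
                break
        return accepted                     # several tokens committed per large-model call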

General research papers on autoregression improvements include:

  • Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer, Mask predict: Parallel decoding of conditional masked language models, arXiv preprint arXiv:1904.09324, 2019, https://arxiv.org/abs/1904.09324
  • Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher, Non-autoregressive neural machine translation, arXiv preprint arXiv:1711.02281, 2017, https://arxiv.org/abs/1711.02281
  • Junliang Guo, Linli Xu, and Enhong Chen, Jointly masked sequence-to-sequence model for non-autoregressive neural machine translation, In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 376–385, 2020, https://aclanthology.org/2020.acl-main.36/
  • Jason Lee, Elman Mansimov, and Kyunghyun Cho, Deterministic non-autoregressive neural sequence modeling by iterative refinement, arXiv preprint arXiv:1802.06901, 2018, https://arxiv.org/abs/1802.06901
  • Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu, Hint-based training for non-autoregressive machine translation, arXiv preprint arXiv:1909.06708, 2019, https://arxiv.org/abs/1909.06708
  • Chenze Shao, Jinchao Zhang, Yang Feng, Fandong Meng, and Jie Zhou. Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 198–205, 2020, https://arxiv.org/abs/1911.09320
  • Zhiqing Sun, Zhuohan Li, Haoqing Wang, Di He, Zi Lin, and Zhihong Deng. Fast structured decoding for sequence models. Advances in Neural Information Processing Systems, 32, 2019, https://arxiv.org/abs/1910.11555
  • Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu, Non-autoregressive machine translation with auxiliary regularization, In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 5377–5384, 2019, https://arxiv.org/abs/1902.10245
  • Bingzhen Wei, Mingxuan Wang, Hao Zhou, Junyang Lin, Jun Xie, and Xu Sun. Imitation learning for non-autoregressive neural machine translation. arXiv preprint arXiv:1906.02041, 2019, https://arxiv.org/abs/1906.02041
  • Michiel de Jong, Yury Zemlyanskiy, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai, Fei Sha, and William Cohen. Fido: Fusion-in-decoder optimized for stronger performance and faster inference. arXiv preprint arXiv:2212.08153, Dec 2022, https://arxiv.org/abs/2212.08153
  • Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W Mahoney, et al. Full stack optimization of transformer inference: a survey. arXiv preprint arXiv:2302.14017, 2023, https://arxiv.org/abs/2302.14017
  • Chitwan Saharia, William Chan, Saurabh Saxena, and Mohammad Norouzi. 2020. Non-autoregressive machine translation with latent alignments. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1098–1108, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.07437
  • Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, and Omer Levy. 2020. Aligned cross entropy for non-autoregressive machine translation. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 3515–3523. PMLR. http://proceedings.mlr.press/v119/ghazvininejad20a.html
  • Marjan Ghazvininejad, Omer Levy, and Luke Zettlemoyer. 2020. Semi-autoregressive training improves mask-predict decoding. arXiv preprint arXiv:2001.08785, https://arxiv.org/abs/2001.08785
  • Lihua Qian, Hao Zhou, Yu Bao, Mingxuan Wang, Lin Qiu, Weinan Zhang, Yong Yu, and Lei Li. 2020. Glancing transformer for non-autoregressive neural machine translation. arXiv preprint arXiv:2008.07905, https://arxiv.org/abs/2008.07905
  • Jiatao Gu, Xiang Kong, Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade, Dec 2020, https://arxiv.org/abs/2012.15833
  • Chen, C., Borgeaud, S., Irving, G., Lespiau, J.-B., Sifre, L., and Jumper, J., Feb 2023, Accelerating large language model decoding with speculative sampling, arXiv preprint arXiv:2302.01318, https://arxiv.org/abs/2302.01318
  • Leviathan, Y., Kalman, M., and Matias, Y., Fast inference from transformers via speculative decoding, May 2023, https://arxiv.org/abs/2211.17192
  • Stern, M., Shazeer, N., and Uszkoreit, J., Nov 2018, Blockwise parallel decoding for deep autoregressive models, Advances in Neural Information Processing Systems, 31, https://arxiv.org/abs/1811.03115
  • Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann, Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers, arXiv preprint, 2023, https://arxiv.org/abs/2305.15805
  • Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. Instantaneous grammatical error correction with shallow aggressive decoding. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5937–5947, 2021. https://arxiv.org/abs/2106.04970, Code: https://github.com/AutoTemp/Shallow-Aggressive-Decoding (Aggressive decoding emits as many tokens as possible, combined with a shallow decoder architecture here.)
  • T. Ge, H. Xia, X. Sun, S. Chen, and F. Wei. Lossless acceleration for seq2seq generation with aggressive decoding. ArXiv, abs/2205.10350, 2022. https://arxiv.org/abs/2205.10350, Code: https://github.com/microsoft/unilm/tree/master/decoding (Aggressive decoding means emitting multiple tokens at a time, reducing autoregression; has a generalization that is similar to speculative decoding here.)
  • Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6114–6123. https://arxiv.org/abs/1904.09324 (Parallel decoding or "bidirectional" decoding, rather than left-to-right generation of tokens.)
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, https://arxiv.org/abs/1810.04805 (Rather than left-to-right, uses "bidirectional" decoding.)
  • Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento, Sep 2023, Uncovering mesa-optimization algorithms in Transformers, https://arxiv.org/abs/2309.05858 (Uses linear attention algorithm.)
  • X Li, S Chen, S Zhang, L Hou, Y Zhu, Z Xiao, 2023, Human Activity Recognition Using IR-UWB Radar: A Lightweight Transformer Approach, IEEE Geoscience and Remote Sensing Letters (Early Access), https://ieeexplore.ieee.org/document/10247554 (Linear attention.)
  • J Kasai, 2023, Towards Efficient, Customizable, and Communal Natural Language Processing, Ph.D. thesis, Computer Science and Engineering, University of Washington, https://www.proquest.com/openview/604084b574dcd05e41eb6e33682a3537/1 (More about shallow decoders.)
  • Y Chen, Y Li, A Xu, Q Sun, X Chen, C Xu, 2023, WAG-NAT: Window Attention and Generator Based Non-Autoregressive Transformer for Time Series Forecasting, ICANN 2023: Artificial Neural Networks and Machine Learning, pp. 293–304, https://link.springer.com/chapter/10.1007/978-3-031-44223-0_24, Code: https://github.com/cybisolated/WAG-NAT
  • S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
  • Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, and Aaron van den Oord. 2022. Step-unrolled denoising autoencoders for text generation. International Conference on Learning Representations. https://arxiv.org/abs/2112.06749
  • Y Zhang, Y Zhang, L Cui, G Fu, Oct 2023, Non-autoregressive Text Editing with Copy-aware Latent Alignments, arXiv preprint arXiv:2310.07821, https://arxiv.org/pdf/2310.07821.pdf
  • S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
  • Tri Dao, Daniel Haziza, Francisco Massa, Grigory Sizov, October 13, 2023, Flash-Decoding for long-context inference, PyTorch Blog, https://pytorch.org/blog/flash-decoding/
  • Jesse Mu, Xiang Lisa Li, and Noah Goodman. July 2023. Learning to compress prompts with gist tokens. arXiv preprint arXiv:2304.08467. https://arxiv.org/abs/2304.08467 (Prompt compression.)
  • Yassir Fathullah, Puria Radmard, Adian Liusie, Mark J. F. Gales, 2024, Who Needs Decoders? Efficient Estimation of Sequence-Level Attributes with Proxies, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics Volume 1: Long Papers, pages 1478–1496 March 17-22, 2024, https://aclanthology.org/2024.eacl-long.89.pdf (Non-autoregressive decoding methods in special use cases such as machine language translation.)
  • Ruchao Fan, 2024, Improving the Accuracy and Inference Efficiency for Low-resource Automatic Speech Recognition, Ph.D thesis, Electrical and Computer Engineering, University of California Los Angeles, https://escholarship.org/content/qt9281v84q/qt9281v84q_noSplash_28de3ba38c8c7a613d2fa945d28c1613.pdf (Uses bidirectional autoregressive predicting encoding for speech recognition.)
  • Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, Gao Huang, 2024, Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis, https://openaccess.thecvf.com/content/CVPR2024/papers/Ni_Revisiting_Non-Autoregressive_Transformers_for_Efficient_Image_Synthesis_CVPR_2024_paper.pdf Code: https://github.com/LeapLabTHU/ImprovedNAT
  • Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao, 16 Apr 2024 (v2), Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding, https://arxiv.org/abs/2402.11809 (Semi-autoregressive draft model with parallel verification.)
  • Feng Li, Jingxian Chen, Xuejun Zhang, 2023, A Survey of Non-Autoregressive Neural Machine Translation, Electronics 2023, 12(13), 2980, https://doi.org/10.3390/electronics12132980, https://www.mdpi.com/2079-9292/12/13/2980 https://www.mdpi.com/2079-9292/12/13/2980/pdf?version=1688953962 (A survey of language translation with non-autoregressive architectures.)
  • Yufan Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu, Kunpeng Wang, Wenlai Zhao, Guangwen Yang, 8 Aug 2023 (v2), RecycleGPT: An Autoregressive Language Model with Recyclable Module, https://arxiv.org/abs/2308.03421 (Uses the idea of guessing the next token based on only a few preceding tokens with extra layers inside a Transformer.)
  • Aishwarya P S, Pranav Ajit Nair, Yashas Samaga, Toby Boyd, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 26 Mar 2024 (v3), Tandem Transformers for Inference Efficient LLMs, https://arxiv.org/abs/2402.08644 (A two-model architecture with a small autoregressive model and a larger model with non-autoregressive block decoding, which is similar to big-little inference and speculative decoding methods.)
  • Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Defossez, Jade Copet, Gabriel Synnaeve, Yossi Adi, Jan 2024, Masked Audio Generation using a Single Non-Autoregressive Transformer https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT/ https://arxiv.org/pdf/2401.04577.pdf Code: https://github.com/facebookresearch/audiocraft/blob/main/docs/MAGNET.md
  • Qi Zhang, Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang, 2024, Look Ahead or Look Around? A Theoretical Comparison Between Autoregressive and Masked Pretraining, https://openreview.net/pdf?id=2rPoTgEmjV Code: https://github.com/PKU-ML/LookAheadLookAround (Evaluates autoregressive and masked methods in training.)
  • Y Lin, Oct 2023, ProNet: Progressive Neural Network for Multi-Horizon Time Series Forecasting, arXiv preprint arXiv:2310.19322, https://arxiv.org/pdf/2310.19322.pdf
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019. https://arxiv.org/abs/1901.02860
  • Ashwin K. Vijayakumar, Michael Cogswell, Ramprasaath R. Selvaraju, Qing Sun, Stefan Lee, David J. Crandall, and Dhruv Batra. 2018. Diverse beam search for improved description of complex scenes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 7371–7379. AAAI Press. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17329
  • Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li Apr 2021, LightSeq: A High Performance Inference Library for Transformers, https://arxiv.org/pdf/2010.13887.pdf
  • Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard H. Hovy. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In Proc. of EMNLP, 2019. https://arxiv.org/abs/1909.02480.
  • Xiaosong Jia, Shaoshuai Shi, Zijun Chen, Li Jiang, Wenlong Liao, Tao He, Junchi Yan, 21 Mar 2024, AMP: Autoregressive Motion Prediction Revisited with Next Token Prediction for Autonomous Driving, https://arxiv.org/abs/2403.13331
  • David Spuler, March 2024, Chapter 26. Decoding Algorithms, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Z Wang, L Wang, J Su, J Yao, Z Tu, 2023, Revisiting Non-Autoregressive Translation at Scale, https://arxiv.org/abs/2305.16155
  • S Norouzi, R Hosseinzadeh, F Perez, 2023, DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers for Machine Translation, https://aclanthology.org/2023.findings-acl.542/
  • Raphael Shu, Jason Lee, Hideki Nakayama, and Kyunghyun Cho. 2020, Latent-variable non-autoregressive neural machine translation with deterministic inference using a delta posterior. In Proc. of AAAI, 2020. https://arxiv.org/abs/1908.07181
  • Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn, Oct 2022, EdiT5: Semi-Autoregressive Text-Editing with T5 Warm-Start, https://arxiv.org/abs/2205.12209
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. https://openai.com/blog/sparse-transformers, 2019, https://arxiv.org/abs/1904.10509
  • Yisheng Xiao, Lijun Wu, Junliang Guo, Juntao Li, Min Zhang, Tao Qin, Tie-yan Liu, 6 Jul 2023 (v2), A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond, https://arxiv.org/pdf/2204.09269.pdf
  • Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Li, S., Unanue, I.J., Piccardi, M. (2024). LayerGLAT: A Flexible Non-autoregressive Transformer for Single-Pass and Multi-pass Prediction. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol 14942. Springer, Cham. https://doi.org/10.1007/978-3-031-70344-7_14 https://link.springer.com/chapter/10.1007/978-3-031-70344-7_14 https://github.com/lsj72123/layer-GLAT
  • David Spuler, March 2024, Tokens and Non-Autoregression, in Generative AI in C++, https://www.aussieai.com/book/ch26-tokens-auto-regression
  • Du Cunxiao, 2024, Towards Faster Inference of Transformers: Strategies for Accelerating Decoding Processes, Ph.D. thesis, Computer Science, School of Computing and Information Systems, Singapore Management University, https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1611&context=etd_coll (Examines non-autoregressive decoding, speculative decoding and attention optimizations.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang, 2 Dec 2024, RandAR: Decoder-only Autoregressive Visual Generation in Random Orders, https://arxiv.org/abs/2412.01827 https://rand-ar.github.io/ (Attempt to parallelize image generation decoding by randomizing the order at which to create patches of an image.)
  • Yefei He, Feng Chen, Yuanyu He, Shaoxuan He, Hong Zhou, Kaipeng Zhang, Bohan Zhuang, 5 Dec 2024, ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality, https://arxiv.org/abs/2412.04062
