Aussie AI
Attention Optimization
-
Last Updated 21 March, 2025
-
by David Spuler, Ph.D.
What is Attention?
Attention is the mechanism whereby an LLM assigns weights to the other tokens or words in the input sequence, so as to determine each output token. The attention mechanism is one of the major breakthroughs that allowed advanced AI to take shape. After all, the seminal 2017 Transformer paper was titled "Attention is all you need" (Vaswani et al., 2017). It used a type of self-attention called "scaled dot-product attention" (a minimal sketch appears after the list below). However, the self-attention mechanism is also computationally expensive, and it's perhaps a case of "too much of a good thing." The quadratic complexity of self-attention in a vanilla Transformer is well known, and there has been much research on how to optimize attention down to a linear-complexity algorithm. The main attempts have been:
- Efficient attention algorithms: the use of faster, modified memory-efficient attention algorithms.
- Removing some attention heads, called attention head pruning.
- Approximate attention (e.g., local attention).
- Alternative architectures, instead of attention (in next-gen architectures).
- Code optimizations such as KV caching.
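To make the quadratic cost concrete, here is a minimal NumPy sketch of scaled dot-product attention (illustrative only, not taken from any particular framework). The QK^T score matrix has one row and one column per token, so both compute and memory grow quadratically with the sequence length.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Vanilla (global) attention: O(n^2) in the sequence length n."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                  # (n, n) score matrix -- the quadratic part
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
        return weights @ V                               # (n, d_v) weighted sum of values

    # Toy usage: 8 tokens with dimension 16 (random data for illustration only)
    n, d = 8, 16
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, n, d))
    print(scaled_dot_product_attention(Q, K, V).shape)   # (8, 16)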
Types of Attention Algorithms
Some of the main classes of attention algorithms include:
- Dot product attention
- Scaled dot product attention
- Self-attention
- Additive attention
- Multiplicative attention
Some of the more recent advancements:
- Flash attention
- Paged attention
- Star attention
- Flex attention
- Ring attention
- Lightning attention
- Mechanistic interpretability
- Medusa attention
Types of Efficient Attention Algorithms
For the above types of attention, there has been much research on how to compute them faster. There are two basic strategies:
- Memory-efficient attention — the same number of multiplications, but faster memory access patterns.
- Approximate attention methods — avoiding some of the multiplications, which makes it an approximation.
Some of the memory-efficient attention algorithms focused on improved performance via access patterns include:
- Multi-Head Attention (MHA) — the basic parallelization method.
- Multi-Query Attention (MQA) — shared heads and KV cache data.
- Group-Query Attention (GQA) — also shared attention heads and KV cache data.
- Flash Attention — various memory management improvements.
- Paged attention — also a memory cost reduction.
The default method of computing attention for every token, called "global attention" or "full attention," is quadratic in complexity in terms of the input token length. Some of the ways to speed things up, aiming for linear complexity by avoiding some of the computations, include (a sketch of local attention follows this list):
- Sparse attention — sparse matrices for attention.
- Local attention — only computing attention weightings for nearby tokens.
- Linear attention — several methods of reducing attention computations to linear complexity (rather than quadratic).
- Tree attention — hierarchical tree-based organization of attention.
- Low-rank matrices — factorizing to smaller matrices that are then used for attention.
- Approximate attention — various other types of approximate arithmetic.
- Hybrid local-global attention — e.g., alternating layers of local/linear and global attention.
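As a concrete illustration of the local attention idea mentioned above, here is a minimal sketch (hypothetical code, not from any of the cited papers) in which each token attends only to a fixed-size causal window of nearby tokens, reducing the cost from O(n^2) to O(n * window).

    import numpy as np

    def local_attention(Q, K, V, window=4):
        """Sliding-window (local) attention sketch: each token attends only to the
        previous `window` tokens, so cost is O(n * window) rather than O(n^2)."""
        n, d_k = Q.shape
        out = np.zeros_like(V)
        for i in range(n):
            lo = max(0, i - window + 1)                  # causal window [lo, i]
            scores = Q[i] @ K[lo:i+1].T / np.sqrt(d_k)
            scores -= scores.max()
            w = np.exp(scores)
            w /= w.sum()
            out[i] = w @ V[lo:i+1]
        return out

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 12, 16))
    print(local_attention(Q, K, V, window=4).shape)      # (12, 16)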
KV Caching Optimizations
A lot of the computation in the attention heads can be avoided by caching the KV data, so that the keys and values computed for earlier tokens (notably during the prefill phase) are not recomputed at every decoding step; a minimal sketch follows the list below. The types of KV caching in LLMs include:
- KV caching
- KV caching in early exit
- KV cache compression
- KV cache sparsity
- KV cache token pruning
- KV cache eviction policies
- KV cache quantization
- KV cache layer fusion
- KV cache layer pruning
- KV cache reuse
- KV cache global (multi-query KV caching)
- Prefix KV cache
- Session KV cache (multi-turn KV caching)
- Substring KV cache (Lengthwise-fused KV caching)
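The core KV caching idea can be sketched in a few lines: during autoregressive decoding, each new token appends one key row and one value row to the cache, instead of recomputing keys and values for the whole sequence at every step. This is an illustrative toy for a single attention head; the projection weights Wq, Wk, Wv are hypothetical stand-ins.

    import numpy as np

    def attend(q, K, V):
        """Single-query attention over the cached keys/values."""
        scores = q @ K.T / np.sqrt(q.shape[-1])
        scores -= scores.max()
        w = np.exp(scores)
        w /= w.sum()
        return w @ V

    d = 16
    rng = np.random.default_rng(0)
    Wk, Wv, Wq = rng.normal(size=(3, d, d))              # toy projection weights (hypothetical)
    K_cache = np.zeros((0, d))
    V_cache = np.zeros((0, d))
    for step in range(5):
        x = rng.normal(size=d)                           # embedding of the newest token
        K_cache = np.vstack([K_cache, x @ Wk])           # append one key row
        V_cache = np.vstack([V_cache, x @ Wv])           # append one value row
        y = attend(x @ Wq, K_cache, V_cache)             # attention over all cached rows
    print(y.shape, K_cache.shape)                        # (16,) (5, 16)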
Token Pruning Research
One way to pay attention to fewer tokens is to get rid of some tokens altogether, which is called "token pruning." For text inputs, this is like deleting some unimportant words, and for image or vision transformers, it is akin to ignoring unimportant parts of the input image (a toy sketch follows the list below). Research areas include:
- Token pruning
- Token merging
- Dynamic token pruning
- Image token pruning
- Vision token pruning
- Token skipping
- Token dropping
- Prompt compression
- Context compression
- Length pruning
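A toy sketch of the token pruning idea, using one simple hypothetical criterion (real methods use more sophisticated importance scores): drop the cached tokens that have received the least total attention weight so far.

    import numpy as np

    def prune_tokens(K, V, attn_weights, keep_ratio=0.5):
        """Toy token pruning: keep only the keys/values that received the most
        attention (summed over queries), dropping the rest from future steps."""
        importance = attn_weights.sum(axis=0)             # total attention each token received
        n_keep = max(1, int(len(importance) * keep_ratio))
        keep = np.sort(np.argsort(importance)[-n_keep:])  # indices of most-attended tokens, in order
        return K[keep], V[keep], keep

    rng = np.random.default_rng(0)
    n, d = 8, 16
    K, V = rng.normal(size=(2, n, d))
    attn = rng.random((n, n))
    attn /= attn.sum(axis=-1, keepdims=True)              # stand-in for softmax attention weights
    K_small, V_small, kept = prune_tokens(K, V, attn)
    print(kept)                                           # indices of the 4 retained tokens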
Milestone Research Papers on Attention Algorithms
Various major papers on generative AI attention include:
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, Attention is all you need, 2017, arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762 (This is where the trouble all started...)
- Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit, Sep 2016, A Decomposable Attention Model for Natural Language Inference, https://arxiv.org/abs/1606.01933 ("Decomposable attention" was the precursor to self-attention.)
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai, 23 Dec 2023 (v3), GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245 (Group Query Attention)
- Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150 (Multi-Query Attention)
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (Paged Attention paper.)
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Tri Dao, September 11, 2023, Medusa: Simple framework for accelerating LLM generation with multiple decoding heads, https://www.together.ai/blog/medusa (Medusa attention paper.)
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention
Memory-Efficient Attention Algorithms
Memory-efficient attention algorithms are an inference optimization method that improves the QKV matrix computations by using better memory access patterns. Memory bottlenecks can be reduced by avoiding repeated accesses to the same data and by reducing the amount of temporary data. One class of attention optimizations improves memory efficiency by changing the access order and computation, while the alternative class of "linear" or "local" attention algorithms simply accesses fewer weights during the computations.
Types of memory-optimized attention include:
- Multi-Query Attention
- Group Query Attention
- Flash attention
- Paged attention
Multi-Head Attention (MHA)
Multi-head attention was the first parallelization of attention, introduced in the original 2017 Transformer paper. It is the concept that created "attention heads," rather than one big attention module.
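A simplified sketch of multi-head attention (illustrative only; the per-head Q/K/V and output projections are omitted for brevity): the model dimension is split across heads, each head runs scaled dot-product attention independently, and the head outputs are concatenated.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def multi_head_attention(Q, K, V, n_heads):
        """Split the model dimension into n_heads independent heads,
        run scaled dot-product attention in each, then concatenate."""
        n, d_model = Q.shape
        d_head = d_model // n_heads
        outputs = []
        for h in range(n_heads):
            sl = slice(h * d_head, (h + 1) * d_head)
            scores = Q[:, sl] @ K[:, sl].T / np.sqrt(d_head)
            outputs.append(softmax(scores) @ V[:, sl])
        return np.concatenate(outputs, axis=-1)          # (n, d_model)

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 8, 64))
    print(multi_head_attention(Q, K, V, n_heads=8).shape)   # (8, 64)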
Research papers on MHA:
- Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, Ivan Titov, 7 Jun 2019 (v2), Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, https://arxiv.org/abs/1905.09418
- Shuhao Gu, Yang Feng, 31 Aug 2019, Improving Multi-Head Attention with Capsule Networks, https://arxiv.org/abs/1909.00188
- Joris Baan, Maartje ter Hoeve, Marlies van der Wees, Anne Schuth, Maarten de Rijke, 10 Nov 2019, Understanding Multi-Head Attention in Abstractive Summarization, https://arxiv.org/abs/1911.03898
- Sneha Mehta, Huzefa Rangwala, Naren Ramakrishnan, 10 Aug 2020 (v2), Low Rank Factorization for Compact Multi-Head Self-Attention, https://arxiv.org/abs/1912.00835
- Puneesh Deora, Rouzbeh Ghaderi, Hossein Taheri, Christos Thrampoulidis, 19 Oct 2023, On the Optimization and Generalization of Multi-head Attention, https://arxiv.org/abs/2310.12680
- Sitan Chen, Yuanzhi Li, 6 Feb 2024, Provably learning a multi-head attention layer, https://arxiv.org/abs/2402.04084
- Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
- DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024. https://arxiv.org/abs/2405.04434 Code: https://github.com/deepseek-ai/DeepSeek-V2 (Introduces various architectural optimizations, notably RoPE handling and KV cache compression via low-rank matrices.)
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Seungrok Jung. 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm Blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- David Spuler, March 2024, Multi-Head Attention, in Generative AI in C++, https://www.aussieai.com/book/ch20-multi-head-attention
- Agarwal, Saurabh, Aug 2024, Minimizing Data Movement in Machine Learning Systems, Ph.D. Thesis, Computer Sciences, University of Wisconsin--Madison, https://digital.library.wisc.edu/1711.dl/MKLIYRPB24A5R9D https://search.library.wisc.edu/digital/AMKLIYRPB24A5R9D PDF: https://asset.library.wisc.edu/1711.dl/QXSTVAIXECHQA8L/R/file-62b54.pdf?dl https://www.proquest.com/openview/c1ae2a92106d7ec681a7296cd163e0c1/1 (Dataflow optimization in training and also "clustered head attention" for memory-efficient inference, an extension of multi-head attention similar to layer-wise head fusion/pruning.)
- R Parthasarathy, R Shuttleworth, Sep 2024, Analyzing Inference Optimizations for Transformers, 6.5930 Final Project Report, https://reeceshuttle.me/assets/6.5930_Project.pdf
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
- Seul-Ki Yeom, Tae-Ho Kim, 3 Dec 2024, UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices, https://arxiv.org/abs/2412.02344 (Shared attention matrix generalizes MHA with fused attention matrixes across layers.)
- LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20
- NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
- Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
- Qingyun Jin, Xiaohui Song, Feng Zhou, Zengchang Qin, 30 Dec 2024, Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA, https://arxiv.org/abs/2412.20677
- F. Wang and M. Shen, "TileMap: Mapping Multi-Head Attention on Spatial Accelerators with Tile-based Analysis," 2024 IEEE 42nd International Conference on Computer Design (ICCD), Milan, Italy, 2024, pp. 653-660, doi: 10.1109/ICCD63220.2024.00104. https://ieeexplore.ieee.org/abstract/document/10818092 (Operator fusion applied to MHA.)
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
Multi-Query Attention (MQA)
Multi-Query Attention is an optimization to the attention algorithm that focuses on reducing the cost of the KV cache data by sharing a single set of keys and values across all of the query heads. Both the computation cost and the memory storage cost are improved by this better handling of the KV cache data.
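A minimal sketch of the MQA idea (illustrative only): many query heads all attend over a single shared key/value head, so the KV cache stores one (n, d_head) pair of tensors rather than one pair per head.

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def multi_query_attention(Q_heads, K_shared, V_shared):
        """MQA sketch: every query head attends over ONE shared key/value head,
        so the KV cache stores (n, d_head) once instead of once per head."""
        d_head = K_shared.shape[-1]
        outs = []
        for Qh in Q_heads:                               # Qh: (n, d_head)
            scores = Qh @ K_shared.T / np.sqrt(d_head)   # (n, n)
            outs.append(softmax(scores) @ V_shared)
        return np.concatenate(outs, axis=-1)             # (n, n_heads * d_head)

    rng = np.random.default_rng(0)
    n, d_head, n_heads = 8, 16, 8
    Q_heads = rng.normal(size=(n_heads, n, d_head))
    K_shared, V_shared = rng.normal(size=(2, n, d_head))
    print(multi_query_attention(Q_heads, K_shared, V_shared).shape)  # (8, 128)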
Research papers on MQA:
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai, 23 Dec 2023 (v3), GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Sai Sena Chinnakonduru, Astarag Mohapatra, 15 Jul 2024, Weighted Grouped Query Attention in Transformers, https://arxiv.org/abs/2407.10855
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
- Hugging Face, 2024, Optimizing LLMs for Speed and Memory, https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization
- Nils Graef, 18 Apr 2024, Transformer tricks: Removing weights for skipless transformers, https://arxiv.org/abs/2404.12362
- Zihao Ye, Ruihang Lai, Bo-Ru Lu, Chien-Yu Lin, Size Zheng, Lequn Chen, Tianqi Chen, Luis Ceze, Feb 2, 2024 , Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, https://flashinfer.ai/2024/02/02/cascade-inference.html
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen) Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
- LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20
- NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
- Qingyun Jin, Xiaohui Song, Feng Zhou, Zengchang Qin, 30 Dec 2024, Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA, https://arxiv.org/abs/2412.20677
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Dr. Ashish Bamania, Feb 2025, Multi-Head Latent Attention Is The Powerful Engine Behind DeepSeek: A deep dive Into DeepSeek’s innovative Attention mechanism that makes its LLMs so good https://levelup.gitconnected.com/multi-head-latent-attention-is-the-powerful-engine-behind-deepseek-0ecfd29e0b04 (MLA versus GQA/MQA attention and how MLA achieves KV cache compression.)
Group-Query Attention (GQA)
Group-Query Attention (GQA) further extends MQA by grouping the query heads so that each group shares one set of keys and values, further optimizing the handling of the KV cache data in the attention heads of a Transformer architecture.
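A minimal sketch of the GQA idea (illustrative only): query heads are partitioned into groups, and each group shares one KV head, interpolating between MHA (one KV head per query head) and MQA (one KV head in total).

    import numpy as np

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    def grouped_query_attention(Q_heads, K_groups, V_groups):
        """GQA sketch: query heads are split into groups, and all query heads in a
        group share one K/V head -- a middle ground between MHA and MQA."""
        n_heads, n_groups = len(Q_heads), len(K_groups)
        heads_per_group = n_heads // n_groups
        d_head = K_groups.shape[-1]
        outs = []
        for h, Qh in enumerate(Q_heads):
            g = h // heads_per_group                     # which KV group this head uses
            scores = Qh @ K_groups[g].T / np.sqrt(d_head)
            outs.append(softmax(scores) @ V_groups[g])
        return np.concatenate(outs, axis=-1)

    rng = np.random.default_rng(0)
    n, d_head = 8, 16
    Q_heads = rng.normal(size=(8, n, d_head))            # 8 query heads
    K_groups = rng.normal(size=(2, n, d_head))           # only 2 KV heads
    V_groups = rng.normal(size=(2, n, d_head))
    print(grouped_query_attention(Q_heads, K_groups, V_groups).shape)  # (8, 128)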
Research papers on GQA:
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Jesus Rodriguez, Apr 22, 2024, Some Technical Notes About Llama 3: New tokenizer, optimized pretraining and some other details about Meta AI’s new model, Towards AI, https://pub.towardsai.net/some-technical-notes-about-llama-3-042c0b19db14
- Shuming Shi, Enbo Zhao, Deng Cai, Leyang Cui, Xinting Huang, Huayang Li, 16 Jan 2024, Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models, https://arxiv.org/abs/2401.08294 Source: https://github.com/inferflow/inferflow
- Omri Mallis, February 25, 2024 , Techniques for KV Cache Optimization in Large Language Models, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
- Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai, 23 Dec 2023 (v3), GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245
- Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- Ankit Patel, June 14, 2024, NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models, NVIDIA Blog, https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/ (NVIDIA releases Nemotron-4 340B model, under an open source license, for the creation of synthetic data, with a decoder-only architecture using grouped-query attention and RoPE.)
- NVIDIA, June 2024, Nemotron-4 340B Technical Report, https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T_0.pdf (Architecture is decoder-only with GQA, SentencePiece tokenizer, causal attention masks, RoPE, 96 layers, 96 heads, 8 KV heads, 256,000 vocabulary, 18432 internal dimension, context window 4096, and uses squared RELU.)
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903 (A survey of inference optimization methods and further analysis of Medusa-type speculative decoding and KV cache compression. Also explores hardware co-design, ML compilers and LLM-assisted code debugging.)
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Sai Sena Chinnakonduru, Astarag Mohapatra, 15 Jul 2024, Weighted Grouped Query Attention in Transformers, https://arxiv.org/abs/2407.10855
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Felippe Vieira Zacarias, Kiran Palli, Sudharshan Vazhkudai, Evelyn Grevelink, July 2024, Analyzing LLM performance: The impact of high-bandwidth memory on model inference, https://www.micron.com/content/dam/micron/global/public/documents/products/product-flyer/llm-inference-engineering-report.pdf
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Oshin Dutta, Ritvik Gupta, Sumeet Agarwal, 2024, Efficient LLM Pruning with Global Token-Dependency Awareness and Hardware-Adapted Inference, https://openreview.net/pdf?id=cqhAzteLzc
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Hugging Face, 2024, Optimizing LLMs for Speed and Memory, https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization
- Nils Graef, 18 Apr 2024, Transformer tricks: Removing weights for skipless transformers, https://arxiv.org/abs/2404.12362
- Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn, 2 Sep 2024, Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching, https://arxiv.org/abs/2409.01141
- R Parthasarathy, R Shuttleworth, Sep 2024, Analyzing Inference Optimizations for Transformers, 6.5930 Final Project Report, https://reeceshuttle.me/assets/6.5930_Project.pdf
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20
- NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
- Qingyun Jin, Xiaohui Song, Feng Zhou, Zengchang Qin, 30 Dec 2024, Align Attention Heads Before Merging Them: An Effective Way for Converting MHA to GQA, https://arxiv.org/abs/2412.20677
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Dr. Ashish Bamania, Feb 2025, Multi-Head Latent Attention Is The Powerful Engine Behind DeepSeek: A deep dive Into DeepSeek’s innovative Attention mechanism that makes its LLMs so good https://levelup.gitconnected.com/multi-head-latent-attention-is-the-powerful-engine-behind-deepseek-0ecfd29e0b04 (MLA versus GQA/MQA attention and how MLA achieves KV cache compression.)
Linear Attention
Linear attention is a group of LLM inference optimizations that reduce the number of attention matrix computations. These are mostly approximate methods that perform fewer arithmetic multiplication operations; common types include local attention and sliding window attention, as well as kernelized approximations of softmax attention (see the sketch after the list below).
Linear attention is not so much one particular algorithm as a general class of optimizations: any method that reduces attention from quadratic complexity in the input prompt length down to linear complexity, although the term is sometimes used to refer specifically to local attention.
Some examples of linear attention algorithms:
- Local attention
- Sliding window attention
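One well-known family of linear attention replaces the softmax with a kernel feature map so the computation can be reordered right-to-left; below is a minimal sketch in the spirit of Katharopoulos et al. (2020), using a simplified ReLU(x)+1 feature map (the paper itself uses ELU(x)+1).

    import numpy as np

    def linear_attention(Q, K, V, eps=1e-6):
        """Kernelized linear attention sketch: replace softmax(QK^T)V with
        phi(Q) (phi(K)^T V), computed right-to-left so the cost is
        O(n * d^2) instead of O(n^2 * d)."""
        phi = lambda x: np.maximum(x, 0) + 1.0           # simple positive feature map
        Qp, Kp = phi(Q), phi(K)                          # (n, d)
        KV = Kp.T @ V                                    # (d, d) -- no n x n matrix is formed
        Z = Qp @ Kp.sum(axis=0)                          # (n,) normalizer
        return (Qp @ KV) / (Z[:, None] + eps)

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 8, 16))
    print(linear_attention(Q, K, V).shape)               # (8, 16)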
Research papers on linear attention methods:
- Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, Yiran Zhong, 31 May 2024, You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet, https://arxiv.org/abs/2405.21022
- Jialong Guo, Xinghao Chen, Yehui Tang, Yunhe Wang, 19 May 2024, SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization, https://arxiv.org/abs/2405.11582 Code: https://github.com/xinghaochen/SLAB Code: https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SLAB (Progressively changes LayerNorm to BatchNorm during training.)
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Uses local attention versus global attention at different layers.)
- Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. ConvBERT: Improving BERT with Span-based Dynamic Convolution, NeurIPS, 33, 2020. https://arxiv.org/abs/2008.02496 (Replaces quadratic attention heads with linear convolution.)
- Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin, 11 Jun 2024, When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models, https://arxiv.org/abs/2406.07368 Code: https://github.com/GATECH-EIC/Linearized-LLM
- Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong, 3 Apr 2024, Linear Attention Sequence Parallelism, https://arxiv.org/abs/2404.02882 Code: https://github.com/OpenNLPLab/LASP
- Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, Yunhe Wang, 1 Apr 2024 (v2), DiJiang: Efficient Large Language Models through Compact Kernelization, https://arxiv.org/abs/2403.19928 (Using the Monte Carlo method to achieve a linear attention approximation.)
- Rares Dolga, Marius Cobzarenco, David Barber, 4 Mar 2024 (v2), Latent Attention for Linear Time Transformers, https://arxiv.org/abs/2402.17512
- Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge, 21 Feb 2024, Linear Transformers are Versatile In-Context Learners, https://arxiv.org/abs/2402.14180
- Desh Raj, July 11, 2021, A round-up of linear transformers, https://desh2608.github.io/2021-07-11-linear-transformers/
- Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret, 31 Aug 2020 (v3), Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention, https://arxiv.org/abs/2006.16236
- Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang, 1 Aug 2022, Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization, https://arxiv.org/abs/2208.00579
- Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang, 1 Sep 2023 (v2), FLatten Transformer: Vision Transformer using Focused Linear Attention, https://arxiv.org/abs/2308.00442
- Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra, 13 Mar 2024 (v2), Linear attention is (maybe) all you need (to understand transformer optimization), https://arxiv.org/abs/2310.01082
- S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” CoRR, vol. abs/2006.04768, 2020. https://arxiv.org/abs/2006.04768 (Low-rank approximation of attention.)
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Markus N. Rabe, Charles Staats, 10 Oct 2022 (v3), Self-attention Does Not Need O(n^2) Memory, https://arxiv.org/abs/2112.05682
- Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 28 Jul 2024 (v2), Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003 Project: https://github.com/zcli-charlie/Awesome-KV-Cache
- Minh Lenhat, Viet Anh Nguyen, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy, 10 Aug 2024, SAMSA: Efficient Transformer for Many Data Modalities, https://arxiv.org/abs/2408.05391 https://github.com/HySonLab/SAMSA
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
- Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou, 23 Aug 2024, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 (Training using low-rank matrices to approximate attention.)
- Josh Alman, Zhao Song, 9 May 2023 (v2), Fast Attention Requires Bounded Entries, https://arxiv.org/abs/2302.13214 (Low-rank matrices in attention for fast inference.)
- Jiuxiang Gu, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou, 26 May 2024, Tensor Attention Training: Provably Efficient Learning of Higher-order Transformers, https://arxiv.org/abs/2405.16411 (Higher-order attention using tensors to generalize QKV matrices.)
- Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang, 21 Feb 2024, Do Efficient Transformers Really Save Computation? https://arxiv.org/abs/2402.13934
- Michael Zhang, Kush Bhatia, Hermann Kumbong, Christopher Ré, 6 Feb 2024, The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry, https://arxiv.org/abs/2402.04347
- Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
- Yue Niu, Saurav Prakash, Salman Avestimehr, 1 Mar 2024, ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys, https://arxiv.org/abs/2403.02352
- Rundong Zuo, Guozhong Li, Rui Cao, Byron Choi, Jianliang Xu, and Sourav S Bhowmick. 2024. DARKER: Efficient Transformer with Data-Driven Attention Mechanism for Time Series. Proc. VLDB Endow. 17, 11 (July 2024), 3229–3242. https://doi.org/10.14778/3681954.3681996 https://dl.acm.org/doi/abs/10.14778/3681954.3681996
- Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu, 11 Sep 2024, Gated Slot Attention for Efficient Linear-Time Sequence Modeling, https://arxiv.org/abs/2409.07146 https://github.com/sustcsonglin/flash-linear-attention https://huggingface.co/fla-hub
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Gabriel Mongaras, Trevor Dohm, Eric C. Larson, 27 Sep 2024, Cottention: Linear Transformers With Cosine Attention, https://arxiv.org/abs/2409.18747 (Nearest-neighbor to replace Softmax attention, for near-linear attention.)
- Costin-Andrei Oncescu, Sanket Purandare, Stratos Idreos, Sham Kakade, 16 Oct 2024, Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond, https://arxiv.org/abs/2410.12982
- Anonymous authors, Oct 2024, Dense Attention: No-Compromise Exact All NxN Interactions Algorithm with O(N) Space and Time Complexity, https://openreview.net/pdf?id=2bIQBDSfRk
- Théodor Lemerle, Harrison Vanderbyl, Vaibhav Srivastav, Nicolas Obin, Axel Roebel, 30 Oct 2024, Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis, https://arxiv.org/abs/2410.23320 https://theodorblackbird.github.io/blog/demo_lina/
- 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Zekun Hao, David W. Romero, Tsung-Yi Lin, Ming-Yu Liu, 12 Dec 2024, Meshtron: High-Fidelity, Artist-Like 3D Mesh Generation at Scale, https://arxiv.org/abs/2412.09548 https://research.nvidia.com/labs/dir/meshtron/ (Optimizations to avoid the quadratic Transformer cost, in both training and inference, include "hourglass neural architecture" analogous to widthwise pruning or slimming, sliding window attention, rolling KV cache, truncated sequence training, and a "robust sampling strategy" that is effectively a type of constrained decoding based on mesh layouts.)
- Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele, 30 Oct 2024, TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters, https://haiyang-w.github.io/tokenformer.github.io/ (Unique novel token-based attention mechanism.)
- Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong, 15 Jan 2024 (v2), Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models, https://arxiv.org/abs/2401.04658 https://github.com/OpenNLPLab/lightning-attention
- Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, Yiran Zhong, 19 Jan 2024 (v2), TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer, https://arxiv.org/abs/2307.14995 https://github.com/OpenNLPLab/TransnormerLLM (Lightning attention first version.)
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
Flash Attention
Flash attention is a type of memory-efficient attention algorithm that improves the speed of LLM inference. Flash attention is a very successful optimization and has been implemented in several inference frameworks. It optimizes attention by paying attention (pun intended) to the memory access cost of the various QKV operations, reducing the overall memory traffic, which in turn reduces the time cost of computing attention.
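The central trick can be sketched in plain NumPy: process the keys and values in tiles while maintaining a running ("online") softmax, so the full n x n score matrix is never materialized. This is only an illustration of the math; the real FlashAttention kernels fuse this into a single GPU pass over fast SRAM tiles.

    import numpy as np

    def tiled_attention(Q, K, V, block=4):
        """Sketch of the tiling idea behind FlashAttention: process key/value blocks
        while maintaining a running softmax, avoiding the full n x n score matrix."""
        n, d = Q.shape
        out = np.zeros((n, V.shape[-1]))
        m = np.full(n, -np.inf)                          # running row-wise max of scores
        l = np.zeros(n)                                  # running softmax denominator
        for start in range(0, n, block):
            Kb, Vb = K[start:start+block], V[start:start+block]
            S = Q @ Kb.T / np.sqrt(d)                    # (n, block) scores for this tile
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)                    # rescale previous accumulators
            P = np.exp(S - m_new[:, None])
            l = l * scale + P.sum(axis=1)
            out = out * scale[:, None] + P @ Vb
            m = m_new
        return out / l[:, None]

    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 8, 16))
    print(tiled_attention(Q, K, V).shape)                # (8, 16)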
Research papers on Flash Attention:
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
- Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan, 17 May 2024, Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers, https://arxiv.org/abs/2405.10480
- Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren, 21 Apr 2024, SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile, https://arxiv.org/abs/2404.13528 (Choosing optimal tensor memory layouts to optimize low-level operator kernels.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- M Pagliardini, D Paliotta, M Jaggi, F Fleuret, 2023, Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention, https://openreview.net/pdf?id=UINHuKeWUa
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Splash: Sparse Flash Attention, 2024, https://github.com/google/jax/blob/main/jax/experimental/pallas/ops/tpu/splash_attention/splash_attention_kernel.py
- 1 Jun 2023, Faster Causal Attention Over Large Sequences Through Sparse Flash Attention, Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret, https://arxiv.org/abs/2306.01160
- Wu, R., Zhu, X., Chen, J. et al. 2024, SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer. J Supercomput (2024). https://doi.org/10.1007/s11227-024-05890-8, https://link.springer.com/article/10.1007/s11227-024-05890-8 (A modified version of Flash Attention on a supercomputer.)
- 8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2, HippoML Blog, Jan 2024, https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482
- Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu, Dec 2023, Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models, https://arxiv.org/abs/2311.03687 (Benchmarks model speed for training, fine-tuning and inference with various optimizations such as ZeRO, quantization, offloading/recomputation, and Flash Attention.)
- Gavin Li, Nov 19, 2023, Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique, AI Advances https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
- Amazon, Oct 2023, MistralLite Model, https://huggingface.co/amazon/MistralLite
- NVIDIA, Developer Guide (CuDNN), https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html PDF: https://docs.nvidia.com/deeplearning/cudnn/pdf/cuDNN-Developer-Guide.pdf
- Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
- Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Karl Stratos, 2024, Efficient Attention, https://karlstratos.com/notes/attention.pdf
- Together AI, Nov 13, 2023, Announcing Together Inference Engine – the fastest inference available, https://www.together.ai/blog/together-inference-engine-v1
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Neal Vaidya, Nick Comly, Joe DeLaere, Ankit Patel and Fred Oh, Sep 9th 2023, NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs, NVIDIA Blog, https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Team PyTorch: Horace He, Driss Guessous, Yanbo Liang, Joy Dong, August 07, 2024, FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention, https://pytorch.org/blog/flexattention/
- Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 28 Jul 2024 (v2), Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption, https://arxiv.org/abs/2407.18003 Project: https://github.com/zcli-charlie/Awesome-KV-Cache
- Barna Saha, Christopher Ye, July 2024, I/O Complexity of Attention, or How Optimal is FlashAttention? Proceedings of the 41st International Conference on Machine Learning, PMLR 235:43024-43042, 2024, https://proceedings.mlr.press/v235/saha24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/saha24a/saha24a.pdf
- Hugging Face, 2024, Optimizing LLMs for Speed and Memory, https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Together AI, July 17, 2023, Introducing Together AI Chief Scientist Tri Dao, as he releases FlashAttention-2 to speed up model training and inference, https://www.together.ai/blog/tri-dao-flash-attention
- Seungrok Jung. 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm Blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P. Woodruff, Amir Zandieh, 1 Dec 2023 (v3), HyperAttention: Long-context Attention in Near-Linear Time, https://arxiv.org/abs/2310.05869
- Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb, 6 Sep 2024, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 https://github.com/apple/ml-sigmoid-attention
- Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu, 11 Sep 2024, Gated Slot Attention for Efficient Linear-Time Sequence Modeling, https://arxiv.org/abs/2409.07146 https://github.com/sustcsonglin/flash-linear-attention https://huggingface.co/fla-hub
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
- Vijay Thakkar and Fred Oh, Jul 11, 2024, Next Generation of FlashAttention, https://developer.nvidia.com/blog/next-generation-of-flashattention/
- Christian Mills, September 15, 2024, GPU MODE Lecture 12: Flash Attention, https://christianjmills.com/posts/cuda-mode-notes/lecture-012/
- Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang, 26 Sep 2024 (v2), INT-FlashAttention: Enabling Flash Attention for INT8 Quantization, https://arxiv.org/abs/2409.16997 https://github.com/INT-FlashAttention2024/INT-FlashAttention
- Jintao Zhang, Jia wei, Pengle Zhang, Jun Zhu, Jianfei Chen, 3 Oct 2024, SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration, https://arxiv.org/abs/2410.02367 (Quantized attention outperforms Flash attention.)
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Anonymous authors, Oct 2024, Dense Attention: No-Compromise Exact All NxN Interactions Algorithm with O(N) Space and Time Complexity, https://openreview.net/pdf?id=2bIQBDSfRk
- Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu, 22 Oct 2024, FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs, https://arxiv.org/abs/2410.16663
- Umar Jamil, Nov 2024, Flash Attention derived and coded from first principles with Triton (Python), https://www.youtube.com/watch?v=zy8ChVd_oTM
- Markus Rabe, Carl Case, November 14, 2024, Rethinking LLM Inference: Why Developer AI Needs a Different Approach, https://www.augmentcode.com/blog/rethinking-llm-inference-why-developer-ai-needs-a-different-approach
- Akshat Sharma, Hangliang Ding, Jianping Li, Neel Dani, Minjia Zhang, 27 Nov 2024, Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache, https://arxiv.org/abs/2411.18077
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
- LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20
- Vincent Abbott, Gioele Zardini, 4 Dec 2024, FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness, https://arxiv.org/abs/2412.03317
- Benjamin Marie, Aug 13, 2024, FlexAttention: A Flexible Pytorch API for Implementing Attention Optimizations. It’s going to be easier to optimize attention computation. https://medium.com/@bnjmn_marie/flexattention-a-flexible-pytorch-api-for-implementing-attention-optimizations-bc4fae65eb9d
- Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan, 11 Dec 2024, TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs, https://arxiv.org/abs/2412.08585
- Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli, 19 Dec 2024 (v2), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663 (Encoder-only BERT model updated with modern optimizations including Flash attention, bias removal, RoPE, pre-norm, and GeGLU, a GELU variant, hybrid local-global attention, and zero padding removal.)
- zhuzilin, Jan 2025, ring-flash-attention: Ring attention implementation with flash attention, https://github.com/zhuzilin/ring-flash-attention
- Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong, 15 Jan 2024 (v2), Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models, https://arxiv.org/abs/2401.04658 https://github.com/OpenNLPLab/lightning-attention
- Zhendong Zhang, 14 Jan 2025 (v2), Flash Window Attention: speedup the attention computation for Swin Transformer, https://arxiv.org/abs/2501.06480
- Pradeep Ramani, Jason Knight, Philippe Tillet, Thomas Raoux, Paweł Szczerbuk and Peter Bell, Feb 05, 2025, OpenAI Triton on NVIDIA Blackwell Boosts AI Performance and Programmability, https://developer.nvidia.com/blog/openai-triton-on-nvidia-blackwell-boosts-ai-performance-and-programmability/
- Nuno Gonçalves, Marcos Treviso, André F. T. Martins, 17 Feb 2025, AdaSplash: Adaptive Sparse Flash Attention, https://arxiv.org/abs/2502.12082
- DeepSeek, Feb 2025 (accessed), FlashMLA: FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences serving, https://github.com/deepseek-ai/FlashMLA
- Nickie Louise, February 24, 2025, DeepSeek launches FlashMLA: A breakthrough in AI speed and efficiency for NVIDIA GPUs, https://techstartups.com/2025/02/24/deepseek-launches-flashmla-a-breakthrough-in-ai-speed-and-efficiency-for-nvidia-gpus/
- M Beck, K Pöppel, P Lippe, S Hochreiter, Mar 2025, Tiled Flash Linear Attention: More Efficient Linear RNN and xLSTM Kernels, ICLR 2025 review, https://openreview.net/pdf?id=P6CPFdexbk
Paged Attention
Paged attention is a memory-efficient attention algorithm for LLM inference. It reduces the memory cost, and thereby the computation cost, of managing the attention matrices and the KV cache during inference.
Paged attention is another successful optimization of attention algorithms, also focused on memory access costs. By mimicking the concept of "paging" from operating system memory management within the KV cache, it reduces memory fragmentation and significantly reduces the overall time cost of attention.
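A toy sketch of the paging idea (illustrative only; this is not the vLLM API): KV rows live in fixed-size blocks ("pages") allocated from a shared pool, and each sequence keeps a block table mapping its logical token positions to physical blocks, which avoids reserving one large contiguous KV buffer per sequence.

    import numpy as np

    class PagedKVCache:
        """Toy paged KV cache: fixed-size blocks from a shared pool plus a
        per-sequence block table (hypothetical names, not the vLLM interface)."""
        def __init__(self, n_blocks, block_size, d):
            self.block_size = block_size
            self.k_pool = np.zeros((n_blocks, block_size, d))
            self.v_pool = np.zeros((n_blocks, block_size, d))
            self.free = list(range(n_blocks))
            self.block_table = {}     # seq_id -> list of physical block ids
            self.length = {}          # seq_id -> number of cached tokens

        def append(self, seq_id, k, v):
            table = self.block_table.setdefault(seq_id, [])
            pos = self.length.get(seq_id, 0)
            if pos % self.block_size == 0:               # current block full: grab a new page
                table.append(self.free.pop())
            blk, off = table[-1], pos % self.block_size
            self.k_pool[blk, off] = k
            self.v_pool[blk, off] = v
            self.length[seq_id] = pos + 1

        def gather(self, seq_id):
            """Reassemble this sequence's K and V for the attention kernel."""
            blocks = self.block_table[seq_id]
            n = self.length[seq_id]
            K = self.k_pool[blocks].reshape(-1, self.k_pool.shape[-1])[:n]
            V = self.v_pool[blocks].reshape(-1, self.v_pool.shape[-1])[:n]
            return K, V

    cache = PagedKVCache(n_blocks=16, block_size=4, d=8)
    rng = np.random.default_rng(0)
    for _ in range(6):                                   # cache 6 tokens for sequence 0
        cache.append(0, rng.normal(size=8), rng.normal(size=8))
    K, V = cache.gather(0)
    print(K.shape)                                       # (6, 8)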
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165
- Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437 (Further optimizes paged attention algorithm for KV caching in attention, by storing the KV cache in contiguous memory and using underlying system paging.)
- Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren, 21 Apr 2024, SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile, https://arxiv.org/abs/2404.13528 (Choosing optimal tensor memory layouts to optimize low-level operator kernels.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Omri Mallis, February 25, 2024 , Techniques for KV Cache Optimization in Large Language Models, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
- Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W.H. Lau, 30 May 2024 (v3), RelayAttention for Efficient Large Language Model Serving with Long System Prompts, https://arxiv.org/abs/2402.14808 (Reduces the number of memory accesses for attention computations and the KV cache.)
- Nikhil Jha, Kevin Wang, 2023, https://people.eecs.berkeley.edu/~kubitron/courses/cs262a-F23/projects/reports/project1010_paper_64287652274076362722.pdf (Extends Paged Attention to a global multi-query KV cache and also implements prefix KV caching.)
- Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- vLLM, 2024, vLLM Paged Attention, https://docs.vllm.ai/en/stable/dev/kernel/paged_attention.html
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Karl Stratos, 2024, Efficient Attention, https://karlstratos.com/notes/attention.pdf
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Yu, Lingfan, 2024, Improve Language Model Serving Efficiency With Fine-Grained and Stateful Scheduling, Ph.D. Thesis, Department of Computer Science, New York University, ProQuest Dissertations & Theses, 31139782, https://www.proquest.com/openview/7200cdfc0906f1d4edb8008b4368bcf9 PDF: https://cs.nyu.edu/media/publications/lingfan_yu_phd_thesis.pdf (Examines efficiency of batching methods and how to create a "stateful" version with cached multi-turn conversation history using session-based KV caching.)
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
- Cade Daniel, Chen Shen, Eric Liang and Richard Liaw , June 22, 2023, How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, https://www.anyscale.com/blog/continuous-batching-llm-inference
- Zihao Ye,, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, Luis Ceze, Feb 2, 2024, Accelerating Self-Attentions for LLM Serving with FlashInfer, https://flashinfer.ai/2024/02/02/introduce-flashinfer.html
- Seungrok Jung. 15, Mar 2024, Large language model inference optimizations on AMD GPUs, ROCm Blogs, https://rocm.blogs.amd.com/artificial-intelligence/llm-inference-optimize/README.html
- Zhuohan Li, Oct 2024, Empowering Large Language Models with Efficient and Automated Systems, Ph.D. Thesis, Computer Science, University of California, Berkeley, https://escholarship.org/content/qt2kp379f3/qt2kp379f3.pdf (Examines pipeline parallelism and paged attention.)
- Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- OpenVINO™ toolkit, Sep 26, 2024, How To Efficiently Serve Today’s Large Language Models, https://medium.com/openvino-toolkit/how-to-efficiently-serve-todays-large-language-models-b3f1e8d33fdf
- LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
- Anjali Shah, Kshitiz Gupta, Jiahong Liu and Haohang Huang, Dec 11, 2024, NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching, https://developer.nvidia.com/blog/nvidia-tensorrt-llm-now-accelerates-encoder-decoder-models-with-in-flight-batching/
- NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
Medusa Attention
See Medusa attention research.
Attention-Free LLM Inference
Attention-free optimizations are LLM modifications that avoid the classic attention module entirely. Various alternatives to the attention computation have been tried, but none are yet widely used.
Since computing attention is so expensive, maybe we can just avoid it? That's the idea behind these "attention-free" methods:
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Graph Attention
Graph attention applies the attention mechanism over graph-structured data, so attention weights are computed between connected nodes rather than over a flat sequence. It can be viewed as a generalization of classic LLM self-attention, which operates on sequences of tokens without any explicit structure.
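For intuition, here is a minimal single-head, GAT-style layer in NumPy, where the adjacency matrix masks the attention scores so that each node only attends to its neighbours; the function name, LeakyReLU slope, and dimensions are illustrative rather than taken from any particular paper.

    import numpy as np

    def graph_attention(x, adj, w, a):
        """Minimal single-head GAT-style layer: scores come from concatenated
        node features and are masked by the adjacency matrix."""
        h = x @ w                                       # (N, d_out) projected node features
        n = h.shape[0]
        scores = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                e = np.concatenate([h[i], h[j]]) @ a
                scores[i, j] = np.maximum(0.2 * e, e)   # LeakyReLU, slope 0.2
        scores = np.where(adj > 0, scores, -1e9)        # mask out non-edges
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)   # softmax over each node's neighbours
        return weights @ h                              # aggregate neighbour features

    n, d_in, d_out = 5, 8, 4
    x = np.random.randn(n, d_in)
    adj = np.eye(n) + np.diag(np.ones(n - 1), 1)        # a small chain graph with self-loops
    w = np.random.randn(d_in, d_out)
    a = np.random.randn(2 * d_out)
    print(graph_attention(x, adj, w, a).shape)          # (5, 4)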
The idea of integrating graph representations of knowledge into attention has received some research attention:
- Lena Sasal, Daniel Busby, Abdenour Hadid, 29 Aug 2024, TempoKGAT: A Novel Graph Attention Network Approach for Temporal Graph Analysis, https://arxiv.org/abs/2408.16391
- Akshay Kolli, Reza Azadeh, Kshitj Jerath, 27 Aug 2024, Graph Attention Inference of Network Topology in Multi-Agent Systems, https://arxiv.org/abs/2408.15449
Bilinear Attention
Bilinear attention computes attention scores with a bilinear form over pairs of inputs (rather than a plain dot product), capturing richer pairwise interactions between two sets of features. This can improve accuracy, but it is not widely used in LLM inference.
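As a minimal sketch of the general idea (not the specific BAFLineDP or BAM formulations), the code below scores pairs drawn from two feature sets with a low-rank bilinear form s_ij = x_i^T (U V^T) y_j; all names and shapes are illustrative.

    import numpy as np

    def bilinear_attention(x, y, u, v):
        """Bilinear attention between two feature sets x and y: scores come from a
        low-rank bilinear form instead of a plain dot product."""
        scores = (x @ u) @ (y @ v).T                     # (Nx, Ny) bilinear scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over y for each x_i
        return weights @ y                               # attend from x over y

    nx, ny, dx, dy, rank = 6, 9, 32, 48, 8
    x, y = np.random.randn(nx, dx), np.random.randn(ny, dy)
    u, v = np.random.randn(dx, rank), np.random.randn(dy, rank)
    print(bilinear_attention(x, y, u, v).shape)          # (6, 48)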
A novel bilinear approach for attention:
- Shaojian Qiu, Huihao Huang, Jianxiang Luo, Yingjie Kuang, Haoyu Luo, 11 Feb 2024, BAFLineDP: Code Bilinear Attention Fusion Framework for Line-Level Defect Prediction, https://arxiv.org/pdf/2402.07132
- Philipp Froehlich, Heinz Koeppl, 13 Feb 2024 (v2), Graph Structure Inference with BAM: Introducing the Bilinear Attention Mechanism, https://arxiv.org/abs/2402.07735
Mechanistic Interpretability
See Mechanistic interpretability research.
Attention Steering
See: Attention steering research.
Attention Sink
The "attention sink" is the surprising fact that LLMs tend to place a lot of attention on the very first input token (e.g., the first word in a sentence). This effect is not programmed into LLMs, but arises by accident in the way that LLMs are trained. Sometimes this is correct, but it is also often inappropriate to place so much weight on the first token. For example, the first word of the user's query can be important, but the first token in an LLM input is often part of a global instruction prompt or a RAG chunk, so this effect is undesirable in such cases. Various papers examine this weird effect, and try to mitigate it.
- Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin, 14 Oct 2024, When Attention Sink Emerges in Language Models: An Empirical View, https://arxiv.org/abs/2410.10781 https://github.com/sail-sg/Attention-Sink
- Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei, 17 Oct 2024, Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs, https://arxiv.org/abs/2410.13835
- Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, Kehong Yuan, 25 Jan 2025, RotateKV: Accurate and Robust 2-Bit KV Cache Quantization for LLMs via Outlier-Aware Adaptive Rotations, https://arxiv.org/abs/2501.16383 (INT2 KV caching with special handling of outliers, RoPE, and attention sinks, and the resulting architecture works in Chain-of-Thought.)
- Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie, 4 Feb 2025, On the Emergence of Position Bias in Transformers, https://arxiv.org/abs/2502.01951
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Chunked Attention
Chunked attention optimizes LLM inference by splitting the attention computation into fixed-size blocks (chunks). Computation then proceeds block-by-block, and some work on entire blocks can be skipped. This is similar to "chunked prefill" and "batched inference" ideas, in terms of simplifying the scheduling of the computations required.
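As an illustration of block-level computation, the sketch below evaluates single-query attention one fixed-size chunk of keys at a time, using a running (online) softmax so the full row of scores is never materialized at once; the chunk size and names are arbitrary choices for the example.

    import numpy as np

    CHUNK = 16

    def chunked_attention(q, k, v):
        """Compute softmax(q k^T / sqrt(d)) v chunk-by-chunk over the keys, keeping
        only a running max, denominator, and weighted sum (an online softmax)."""
        d = q.shape[-1]
        m = -np.inf                  # running maximum of the scores seen so far
        denom = 0.0                  # running softmax denominator
        acc = np.zeros_like(v[0])    # running weighted sum of values
        for start in range(0, k.shape[0], CHUNK):
            kc, vc = k[start:start + CHUNK], v[start:start + CHUNK]
            s = kc @ q / np.sqrt(d)              # scores for this chunk only
            m_new = max(m, s.max())
            scale = np.exp(m - m_new)            # rescale earlier partial results
            p = np.exp(s - m_new)
            denom = denom * scale + p.sum()
            acc = acc * scale + p @ vc
            m = m_new
        return acc / denom

    n, d = 100, 64
    q = np.random.randn(d)
    k, v = np.random.randn(n, d), np.random.randn(n, d)
    s = k @ q / np.sqrt(d)                       # reference: full (unchunked) attention
    ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ v
    print(np.allclose(chunked_attention(q, k, v), ref))   # True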
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang, 21 Aug 2024, FocusLLM: Scaling LLM's Context by Parallel Decoding, https://arxiv.org/abs/2408.11745 Code: https://github.com/leezythu/FocusLLM
- Kyle Wiggers, September 11, 2024, Mistral releases Pixtral 12B, its first multimodal model, https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/
Mixture-of-Heads (MoH) Attention (or MoA)
MoH combines the Mixture-of-Experts (MoE) architecture with standard Multi-Head Attention (MHA), i.e., MHA + MoE = MoH! A router chooses which attention heads to activate for each token, rather than always running all heads. It has also been called Mixture-of-Attention (MoA) in some papers.
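A minimal sketch of the routing idea, loosely following the MoH-style papers listed below (the shapes, router, and top-k value are made up for the example): a small router scores the heads for each token, and only the top-k heads are evaluated, weighted by the router probabilities.

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def moh_attention(x, wq, wk, wv, wo, router, top_k=2):
        """Mixture-of-Heads sketch: per token, route to the top-k attention heads
        and sum their outputs weighted by the router probabilities."""
        n, d = x.shape
        gate = softmax(x @ router)                     # (n, num_heads) routing weights
        out = np.zeros((n, d))
        for t in range(n):
            for h in np.argsort(gate[t])[-top_k:]:     # only the top-k heads for token t
                q = x[t] @ wq[h]                       # (d_head,)
                k, v = x @ wk[h], x @ wv[h]            # (n, d_head)
                attn = softmax(k @ q / np.sqrt(q.shape[0]))
                out[t] += gate[t, h] * ((attn @ v) @ wo[h])
        return out

    n, d, num_heads, d_head = 10, 64, 8, 16
    x = np.random.randn(n, d)
    wq, wk, wv = (np.random.randn(num_heads, d, d_head) for _ in range(3))
    wo = np.random.randn(num_heads, d_head, d)
    router = np.random.randn(d, num_heads)
    print(moh_attention(x, wq, wk, wv, wo, router).shape)   # (10, 64)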
- Asankhaya Sharma, 26 Jul 2024, Patched MOA: optimizing inference for diverse software development tasks, https://arxiv.org/abs/2407.18521
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
- Dr. Ashish Bamania, Oct 27, 2024, Amazing Things Happen When Attention Heads Are Supercharged Using Mixture-Of-Experts: A deep dive into how the Attention mechanism works and how it is being enhanced by the Mixture-of-Experts architecture, resulting in Mixture-of-Head Attention (MoH) that makes our existing LLMs more efficient than ever. https://levelup.gitconnected.com/amazing-things-happen-when-attention-heads-are-supercharged-using-mixture-of-experts-b55a6b9a0ac8
- Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean, 23 Jan 2017, Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, https://arxiv.org/abs/1701.06538
- Xiaofeng Zhang, Yikang Shen, Zeyu Huang, Jie Zhou, Wenge Rong, Zhang Xiong, 11 Oct 2022, Mixture of Attention Heads: Selecting Attention Heads Per Token, https://arxiv.org/abs/2210.05144 https://aclanthology.org/2022.emnlp-main.278/
- Peng Jin, Bo Zhu, Li Yuan, Shuicheng Yan, 15 Oct 2024, MoH: Multi-Head Attention as Mixture-of-Head Attention, https://arxiv.org/abs/2410.11842 https://github.com/SkyworkAI/MoH
- Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber, 30 Sep 2024 (v3), SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, https://arxiv.org/abs/2312.07987
- Hao Peng, Roy Schwartz, Dianqi Li, Noah A. Smith, 13 May 2020, A Mixture of h−1 Heads is Better than h Heads, https://arxiv.org/abs/2005.06537
- Xun Wu, Shaohan Huang, Wenhui Wang, Furu Wei, 23 Apr 2024, Multi-Head Mixture-of-Experts, https://arxiv.org/abs/2404.15045
- Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, Arsalan Shahid, 30 Oct 2024 (v3), The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities, https://arxiv.org/abs/2408.13296
Block Attention
Block attention optimizes LLM attention at the granularity of fixed-size blocks. Working block-by-block allows more efficient QKV tensor calculations, and entire blocks of computation can be pruned or skipped, especially in "block-sparse" attention optimizations.
Block attention is somewhat similar to chunked attention, in that it focuses on blocks within the attention computation. It is also related to block-level quantization and other similar block optimizations.
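As a concrete example of the block-sparse variant, the sketch below tiles the attention score matrix into fixed-size blocks and only computes the tiles enabled by a block mask; for clarity the full score matrix is still materialized, which a real kernel would avoid, and the block size and names are illustrative.

    import numpy as np

    BLOCK = 8

    def block_sparse_attention(q, k, v, block_mask):
        """Block-sparse attention sketch: the n x n score matrix is tiled into
        BLOCK x BLOCK tiles and only tiles marked True in block_mask are computed."""
        n, d = q.shape
        nb = n // BLOCK
        scores = np.full((n, n), -1e9)                 # skipped tiles stay effectively at -inf
        for bi in range(nb):
            for bj in range(nb):
                if block_mask[bi, bj]:
                    qs = q[bi * BLOCK:(bi + 1) * BLOCK]
                    ks = k[bj * BLOCK:(bj + 1) * BLOCK]
                    scores[bi * BLOCK:(bi + 1) * BLOCK,
                           bj * BLOCK:(bj + 1) * BLOCK] = qs @ ks.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    n, d = 32, 64
    q, k, v = (np.random.randn(n, d) for _ in range(3))
    nb = n // BLOCK
    block_mask = np.tril(np.ones((nb, nb), dtype=bool))    # causal masking at block granularity
    print(block_sparse_attention(q, k, v, block_mask).shape)   # (32, 64)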
Research on block attention methods:
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Zhenyu Li, Yike Zhang, Tengyu Pan, Yutao Sun, Zhichao Duan, Junjie Fang, Rong Han, Zixuan Wang, Jianyong Wang, 21 Aug 2024, FocusLLM: Scaling LLM's Context by Parallel Decoding, https://arxiv.org/abs/2408.11745 Code: https://github.com/leezythu/FocusLLM
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003
- Kyle Wiggers, September 11, 2024, Mistral releases Pixtral 12B, its first multimodal model, https://techcrunch.com/2024/09/11/mistral-releases-pixtral-its-first-multimodal-model/
- Amr Elmeleegy, Shang Zhang and Jie-Fang Zhang, Nov 21, 2024, NVIDIA TensorRT-LLM Multiblock Attention Boosts Throughput by More Than 3x for Long Sequence Lengths on NVIDIA HGX H200, https://developer.nvidia.com/blog/nvidia-tensorrt-llm-multiblock-attention-boosts-throughput-by-more-than-3x-for-long-sequence-lengths-on-nvidia-hgx-h200/
- Shantanu Acharya, Fei Jia, Boris Ginsburg, 26 Nov 2024, Star Attention: Efficient LLM Inference over Long Sequences, https://arxiv.org/abs/2411.17116
- East Sun, Yan Wang, Lan Tian, 17 Oct 2024 (v4), Block-Attention for Efficient RAG, https://arxiv.org/abs/2409.15355
- Hao Liu, Matei Zaharia, Pieter Abbeel, 27 Nov 2023 (v4), Ring Attention with Blockwise Transformers for Near-Infinite Context, https://arxiv.org/abs/2310.01889 https://github.com/lhao499/llm_large_context (Original paper for ring attention.)
- Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze, 2 Jan 2025, FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving, https://arxiv.org/abs/2501.01005
- Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu, 18 Feb 2025, MoBA: Mixture of Block Attention for Long-Context LLMs, https://arxiv.org/abs/2502.13189 https://github.com/MoonshotAI/MoBA
- Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han, 20 Feb 2025, LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention, https://arxiv.org/abs/2502.14866
- Emily Xiao, Chin-Jou Li, Yilin Zhang, Graham Neubig, Amanda Bertsch, 11 Mar 2025, Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention, https://arxiv.org/abs/2503.08640
Low-Rank Matrix Attention
Low-rank matrix attention uses low-rank matrix factorization, or tensor decomposition, to compute LLM attention faster. Fewer multiplications are needed, but the result is an approximation with potential accuracy loss. Approximating the attention matrices with low-rank factors is also one route to linear-complexity attention.
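A Linformer-style sketch of the idea (shapes and names are illustrative): project the n keys and values down to r << n summary rows with projection matrices (learned, in the actual method), so the score matrix is n x r instead of n x n.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def low_rank_attention(q, k, v, e_proj, f_proj):
        """Low-rank (Linformer-style) attention sketch: keys and values are
        compressed along the sequence axis before the usual softmax attention."""
        k_low = e_proj @ k                               # (r, d) compressed keys
        v_low = f_proj @ v                               # (r, d) compressed values
        scores = q @ k_low.T / np.sqrt(q.shape[-1])      # (n, r) instead of (n, n)
        return softmax(scores) @ v_low                   # (n, d)

    n, d, r = 1024, 64, 32
    q, k, v = (np.random.randn(n, d) for _ in range(3))
    e_proj, f_proj = np.random.randn(r, n), np.random.randn(r, n)
    print(low_rank_attention(q, k, v, e_proj, f_proj).shape)   # (1024, 64)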
Research papers on the use of low-rank matrix factorization and tensor decomposition for attention:
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin, 8 May 2024, Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, https://arxiv.org/abs/2405.05219 (Attention optimization using multiple low-rank matrices.)
- Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou, 23 Aug 2024, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 (Training using low-rank matrices to approximate attention.)
- Josh Alman, Zhao Song, 9 May 2023 (v2), Fast Attention Requires Bounded Entries, https://arxiv.org/abs/2302.13214 (Low-rank matrices in attention for fast inference.)
- Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
- Yue Niu, Saurav Prakash, Salman Avestimehr, 1 Mar 2024, ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys, https://arxiv.org/abs/2403.02352
- S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” CoRR, vol. abs/2006.04768, 2020. https://arxiv.org/abs/2006.04768 (Low-rank approximation of attention.)
- Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020. https://arxiv.org/abs/2006.16236 (Linear O(N) attention algorithm based on low-rank factorization.)
- Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Low-rank matrix attention allows up to 100k context windows.)
- Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
- Sneha Mehta, Huzefa Rangwala, Naren Ramakrishnan, 10 Aug 2020 (v2), Low Rank Factorization for Compact Multi-Head Self-Attention, https://arxiv.org/abs/1912.00835
- Bosheng Qin, Juncheng Li, Siliang Tang, Yueting Zhuang, 24 Nov 2022, DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention, https://arxiv.org/abs/2211.16368
- Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, 26 Dec 2024, Multi-matrix Factorization Attention, https://arxiv.org/abs/2412.19255
General Research Papers on Attention Optimization
Papers that specifically aim to optimize attention include:
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020. https://arxiv.org/abs/2009.06732 (Broad survey of Transformer efficiency that contains details about attention and fixing its quadratic cost.)
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
- Jianpeng Cheng, Li Dong, Mirella Lapata, Sep 2016, Long Short-Term Memory-Networks for Machine Reading, https://arxiv.org/abs/1601.06733
- Minh-Thang Luong, Hieu Pham, Christopher D. Manning, Sep 2016, Effective Approaches to Attention-based Neural Machine Translation, https://arxiv.org/abs/1508.04025
- Alex Graves, Greg Wayne, Ivo Danihelka, Dec 2014, Neural Turing Machines, https://arxiv.org/abs/1410.5401
- Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In The International Conference on Machine Learning (ICML), 2020, https://arxiv.org/abs/2001.04451 (Sparse attention with locality-sensitive hashing.)
- Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021, https://arxiv.org/abs/2003.05997 (Sparse attention.)
- Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020. https://arxiv.org/abs/2006.16236 (Linear O(N) attention algorithm based on low-rank factorization.)
- Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020. https://arxiv.org/abs/2006.04768 (Low-rank matrix O(N) linear complexity method.)
- Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020. https://arxiv.org/abs/2007.14062 (Sparse linear attention algorithm.)
- Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
- Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller, Rethinking Attention with Performers, In International Conference on Learning Representations (ICLR), 2020. https://arxiv.org/abs/2009.14794 (Novel method for linear attention complexity.)
- Iz Beltagy, Matthew E Peters, Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. https://arxiv.org/abs/2004.05150 (Linear attention over long context sequences.)
- Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2020. https://arxiv.org/abs/2011.04006, Code: https://github.com/google-research/long-range-arena (Evaluate models over long context sequences.)
- Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. SMYRF: Efficient Attention using Asymmetric Clustering. Advances in Neural Information Processing Systems, 33:6476–6489, 2020. https://arxiv.org/abs/2010.05315 (Approximation of attention using locality-sensitive hashing.)
- Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Ernie Chang, Yangyang Shi, Vikas Chandra, Sep 2023, Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition, https://arxiv.org/abs/2309.07988
- X Wang, P Guo, Y Zhang, 2023, Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer, ECML PKDD 2023: Machine Learning and Knowledge Discovery in Databases: Research Track pp 309–325, https://arxiv.org/abs/2201.05887
- Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, Jianguo Li, Sep 2023, Cure the headache of Transformers via Collinear Constrained Attention, arXiv preprint arXiv:2309.08646, https://arxiv.org/abs/2309.08646
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Low-rank matrix attention allows up to 100k context windows.)
- Y Chen, Y Li, A Xu, Q Sun, X Chen, C Xu, 2023, WAG-NAT: Window Attention and Generator Based Non-Autoregressive Transformer for Time Series Forecasting, ICANN 2023: Artificial Neural Networks and Machine Learning, pp. 293–304, https://link.springer.com/chapter/10.1007/978-3-031-44223-0_24, Code: https://github.com/cybisolated/WAG-NAT
- Together AI, Jul 28, 2023, Preparing for the era of 32K context: Early learnings and explorations, https://together.ai/blog/llama-2-7b-32k (Uses position interpolation and Flash Attention.)
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with sparse attention, linearized attention, and other attention optimization methods.)
- Hao Liu, Matei Zaharia, Pieter Abbeel, Oct 2023, Ring Attention with Blockwise Transformers for Near-Infinite Context, https://arxiv.org/abs/2310.01889
- H Lee, J Kim, J Willette, SJ Hwang, Oct 2023, SEA: Sparse Linear Attention with Estimated Attention Mask, arXiv preprint arXiv:2310.01777, https://browse.arxiv.org/pdf/2310.01777.pdf
- Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Uses grouped-query attention and sliding window attention.)
- G Xiao, Y Tian, B Chen, S Han, M Lewis, Sep 2023, Efficient Streaming Language Models with Attention Sinks, arXiv preprint arXiv:2309.17453, https://arxiv.org/abs/2309.17453
- Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used Multi-Query Attention.)
- Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu, 19 Mar 2024, DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, https://arxiv.org/abs/2403.13163 (Optimizes a deblurring Transformer with self-attention improvements and other optimizations.)
- Bohdan Ivaniuk-Skulskyi; Nadiya Shvai; Arcadi Llanza; Amir Nakib, 2024, Towards Lightweight Transformer Architecture: an Analysis on Semantic Segmentation, 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), https://ieeexplore.ieee.org/abstract/document/10467926 (Examines skip-attention and pool-unpool attention methods for faster inference.)
- Jiing-Ping Wang, Ming-Guang Lin, An-Yeu (Andy) Wu, 11 Apr 2024, LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer, https://arxiv.org/abs/2404.07519 (Approximate 4-bit vector dot product used as an estimate in attention heads, with computation reuse to 8-bit dot products.)
- Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz GUStavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Frietas, 11 Apr 2024, RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, Google Research, https://arxiv.org/abs/2404.07839 (Local attention.)
- Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075 (Sparse attention)
- Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 10 Apr 2024, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, https://arxiv.org/abs/2404.07143
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Seokju Yun, Dongheon Lee, Youngmin Ro, 4 Jun 2024, MetaMixer Is All You Need, https://arxiv.org/abs/2406.02021
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, Yiran Zhong, 31 May 2024, You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet, https://arxiv.org/abs/2405.21022
- Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan, 17 May 2024, Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers, https://arxiv.org/abs/2405.10480
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
- Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar, 10 May 2024, Linearizing Large Language Models, https://arxiv.org/abs/2405.06640 Code: https://github.com/TRI-ML/linear_open_lm
- Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437 (Further optimizes paged attention algorithm for KV caching in attention, by storing the KV cache in contiguous memory and using underlying system paging.)
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
- Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Uses local attention versus global attention at different layers.)
- Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti, Mar 2023, Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers, CVPR 2023, https://arxiv.org/abs/2303.13755 https://openaccess.thecvf.com/content/CVPR2023/papers/Wei_Sparsifiner_Learning_Sparse_Instance-Dependent_Attention_for_Efficient_Vision_Transformers_CVPR_2023_paper.pdf
- Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu, 19 Mar 2024, DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, https://arxiv.org/abs/2403.13163v1 (Implements "feature fusion" which is analogous to kernel fusion.) (Optimizes a deblurring Transformer with self-attention improvements and other optimizations.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber, 14 Dec 2023, SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, https://arxiv.org/abs/2312.07987 Code: https://github.com/robertcsordas/moe_attention
- David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le, Jan 2022, Primer: Searching for Efficient Transformers for Language Modeling, https://arxiv.org/abs/2109.08668
- Datadrifters, Nov 16, 2023, The World’s Fastest LLM Inference: 3x Faster Than vLLM and TGI, https://medium.com/@datadrifters/the-worlds-fastest-llm-inference-engine-3x-faster-than-vllm-and-tgi-a2ed9e33c55f
- Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c
- Hongyang Li, Yu Liu, Wanli Ouyang, and Xiaogang Wang. 2019. Zoom out-and-in network with map attention decision for region proposal and object detection. International Journal of Computer Vision 127, 3 (2019), 225–238. https://arxiv.org/abs/1709.04347 (Attention mechanism in image processing.)
- Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860 Code: https://github.com/kimiyoung/transformer-xl
- J Ainslie, T Lei, M de Jong, S Ontañón, 2023, Colt5: Faster long-range transformers with conditional computation, https://arxiv.org/abs/2303.09752
- H Zheng, 2023, Visual Content Manipulation With Correspondence Learning and Representation Learning, Ph.D. thesis, Department of Computer Science Arts, Sciences and Engineering, Edmund A. Hajim School of Engineering and Applied Sciences, University of Rochester, https://www.proquest.com/openview/2587c48ef34e3f7146132a08cf70e88f/1
- Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento, Sep 2023, Uncovering mesa-optimization algorithms in Transformers, https://arxiv.org/abs/2309.05858
- Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. 2019. BP-transformer: Modelling long-range context via binary partitioning. arXiv preprint arXiv:1911.04070 https://arxiv.org/abs/1911.04070 Code: https://github.com/yzh119/BPT
- Y Wang, Y Han, C Wang, S Song, Q Tian, G Huang, 2023, Computation-efficient Deep Learning for Computer Vision: A Survey, arXiv preprint arXiv:2308.13998, https://arxiv.org/abs/2308.13998
- Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm, 10 Jun 2024, Symmetric Dot-Product Attention for Efficient Training of BERT Language Models, https://arxiv.org/abs/2406.06366
- Piotr Kluska, Adri´an Castello, Florian Scheidegger, A. Cristiano I. Malossi, 2024, QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers https://openaccess.thecvf.com/content/CVPR2024W/eLVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf
- Splash: Sparse Flash Attention, 2024, https://github.com/google/jax/blob/main/jax/experimental/pallas/ops/tpu/splash_attention/splash_attention_kernel.py
- 1 Jun 2023, Faster Causal Attention Over Large Sequences Through Sparse Flash Attention, Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret, https://arxiv.org/abs/2306.01160
- Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini, 3 Apr 2024, Optimizing the Deployment of Tiny Transformers on Low-Power MCUs, https://arxiv.org/abs/2404.02945 (Uses an approach called "Fused Weight Self-Attention" that fuses some of the QKV matrices and also tiling in multi-head attention, along with 8-bit integer quantization and integerized Softmax.)
- Moein Heidari, Reza Azad, Sina Ghorbani Kolahi, René Arimond, Leon Niggemeier, Alaa Sulaiman, Afshin Bozorgpour, Ehsan Khodapanah Aghdam, Amirhossein Kazerouni, Ilker Hacihaliloglu, Dorit Merhof, 28 Mar 2024, Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights, https://arxiv.org/abs/2403.19882 Project: https://github.com/mindflow-institue/Awesome-Attention-Mechanism-in-Medical-Imaging (Survey of optimization techniques for Vision Transformers, with particular focus on attention optimizations.)
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai, 18 Feb 2024, In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness, https://arxiv.org/abs/2402.11639
- Rares Dolga, Marius Cobzarenco, David Barber, 4 Mar 2024 (v2), Latent Attention for Linear Time Transformers, https://arxiv.org/abs/2402.17512
- Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W.H. Lau, 29 Feb 2024 (v2), RelayAttention for Efficient Large Language Model Serving with Long System Prompts, https://arxiv.org/abs/2402.14808 Code: https://github.com/rayleizhu/vllm-ra (A new attention mechanism called RelayAttention.)
- Wu, R., Zhu, X., Chen, J. et al. 2024, SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer. J Supercomput (2024). https://doi.org/10.1007/s11227-024-05890-8, https://link.springer.com/article/10.1007/s11227-024-05890-8 (A modified version of Flash Attention on a supercomputer.)
- Fireworks.ai, Jan 9, 2024, FireAttention — Serving Open Source Models 4x faster than vLLM by quantizing with ~no tradeoffs, https://blog.fireworks.ai/fireattention-serving-open-source-models-4x-faster-than-vllm-by-quantizing-with-no-tradeoffs-a29a85ad28d0
- 8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2, HippoML Blog, Jan 2024, https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482
- Jungmin Yun, Mihyeon Kim, Youngbin Kim, 2023, Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification https://aclanthology.org/2023.findings-emnlp.909.pdf
- Yuzhen Mao, Martin Ester, Ke Li, 2023, IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs, NeurIPS 2023, https://neurips2023-enlsp.github.io/papers/paper_24.pdf
- NVIDIA, Developer Guide (CuDNN), https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html PDF: https://docs.nvidia.com/deeplearning/cudnn/pdf/cuDNN-Developer-Guide.pdf
- Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5156–5165. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/katharopoulos20a.html.
- Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamás Sarlós, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Łukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=Ua6zuk0WRH.
- Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pre-training Approach. arXiv:1907.11692 [cs], July 2019. URL http://arxiv.org/abs/1907.11692.
- Peter Izsak, Moshe Berchansky, and Omer Levy. How to Train BERT with an Academic Budget. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10644–10652, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.831. https://aclanthology.org/2021.emnlp-main.831
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992–10002, IEEE, 2021. https://arxiv.org/abs/2103.14030 Code: https://github.com/microsoft/Swin-Transformer
- Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
- Chen, Yilong ; Zhang, Linhao ; Shang, Junyuan ; Zhang, Zhenyu ; Liu, Tingwen ; Wang, Shuohuan ; Sun, Yu, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
- Toan Q. Nguyen, Julian Salazar, 2019, Transformers without Tears: Improving the Normalization of Self-Attention, https://arxiv.org/abs/1910.05895
- Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura, Mar 2022, FaceFormer: Speech-Driven 3D Facial Animation with Transformers, https://arxiv.org/abs/2112.05329
- S Dai, H Genc, R Venkatesan, B Khailany, 2023 Efficient Transformer Inference with Statically Structured Sparse Attention, https://ieeexplore.ieee.org/abstract/document/10247993
- J Du, J Jiang, J Zheng, H Zhang, D Huang, Y Lu, August 2023, Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs, ACM Transactions on Architecture and Code Optimization, https://dl.acm.org/doi/10.1145/3617689 PDF: https://dl.acm.org/doi/pdf/10.1145/3617689
- William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, 15 Nov 2023, Striped Attention: Faster Ring Attention for Causal Transformers, https://arxiv.org/abs/2311.09431
- David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- A Vyas, A Katharopoulos, 2020, Fast transformers with clustered attention, https://proceedings.neurips.cc/paper/2020/file/f6a8dd1c954c8506aadc764cc32b895e-Paper.pdf
- SC Kao, S Subramanian, G Agrawal, 2023, FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks https://dl.acm.org/doi/pdf/10.1145/3575693.3575747
- Deyi Xiong, Biao Zhang, and Jinsong Su. 2018. Accelerating neural Transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Long Papers, pages 1789–1798, Melbourne, Australia https://aclweb.org/anthology/P18-1166
- Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin, 2019, Adaptive attention span in transformers, CoRR, abs/1905.07799, http://arxiv.org/abs/1905.07799
- Hanhwi Jang, Joonsung Kim, Jae-Eon Jo, Jaewon Lee, and Jangwoo Kim. 2019. Mnnfast: A fast and scalable system architecture for memory-augmented neural networks. In Proceedings of the 46th International Symposium on Computer Architecture. 250–263. https://doi.org/10.1145/3307650.3322214
- Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, Yuan Xie, 2022, DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration, ASPLOS ’22, February 28 – March 4, 2022, Lausanne, Switzerland, PDF: https://dl.acm.org/doi/pdf/10.1145/3503222.3507738
- J Ding, S Ma, L Dong, X Zhang, S Huang 2023, Longnet: Scaling transformers to 1,000,000,000 tokens, https://arxiv.org/abs/2307.02486
- Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
- Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 29 May 2024 (v2), DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference, https://arxiv.org/abs/2404.00242 https://openreview.net/forum?id=HqfLHoX8bR
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Markus N. Rabe, Charles Staats, 10 Oct 2022 (v3), Self-attention Does Not Need O(n^2) Memory, https://arxiv.org/abs/2112.05682
- Shwai He, Guoheng Sun, Zheyu Shen, Ang Li, 22 Jun 2024, What Matters in Transformers? Not All Attention is Needed, https://arxiv.org/abs/2406.15786 https://github.com/Shwai-He/LLM-Drop
- Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
- Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao, 3 Jul 2024 (v2), Unveiling and Controlling Anomalous Attention Distribution in Transformers, https://arxiv.org/abs/2407.01601 (Examination of why the very first token in a sequence always gets more attention than others, including the effect of positional encoding, and its impact on KV cache compression.)
- Yuxin Liu; Wenxin Yu; Zhiqiang Zhang; Qi Wang; Lu Che, May 2024, Axial Attention Transformer for Fast High-quality Image Style Transfer, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 19-22 May 2024, https://ieeexplore.ieee.org/abstract/document/10558531
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Karl Stratos, 2024, Efficient Attention, https://karlstratos.com/notes/attention.pdf
- Myung Beom Her, Jisu Jeong, Hojoon Song, Ji-Hyeong Han, 5 Jul 2024, Batch Transformer: Look for Attention in Batch, https://arxiv.org/abs/2407.04218
- An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
- Sai Sena Chinnakonduru, Astarag Mohapatra, 15 Jul 2024, Weighted Grouped Query Attention in Transformers, https://arxiv.org/abs/2407.10855
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
- Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
- Team PyTorch: Horace He, Driss Guessous, Yanbo Liang, Joy Dong, August 07, 2024, FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention, https://pytorch.org/blog/flexattention/
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 28 Jul 2024 (v2), Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003 Project: https://github.com/zcli-charlie/Awesome-KV-Cache
- Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy, 10 Aug 2024, Eigen Attention: Attention in Low-Rank Space for KV Cache Compression, https://arxiv.org/abs/2408.05646
- Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li, 11 Aug 2024, Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, https://arxiv.org/abs/2408.05710 (Reduce the attention cost in diffusion models by what is effectively token merging between the Q and K data.)
- Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Nils Graef, Aarush Gupta, 2024 (accessed), Approximate attention: infinite context length with constant complexity per token [work in progress], https://github.com/OpenMachine-ai/transformer-tricks/blob/main/approximate.pdf
- Nils Graef, 18 Apr 2024, Transformer tricks: Removing weights for skipless transformers, https://arxiv.org/abs/2404.12362
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
- Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin, 8 May 2024, Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, https://arxiv.org/abs/2405.05219 (Attention optimization using multiple low-rank matrices.)
- Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
- Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
- Rundong Zuo, Guozhong Li, Rui Cao, Byron Choi, Jianliang Xu, and Sourav S Bhowmick. 2024. DARKER: Efficient Transformer with Data-Driven Attention Mechanism for Time Series. Proc. VLDB Endow. 17, 11 (July 2024), 3229–3242. https://doi.org/10.14778/3681954.3681996 https://dl.acm.org/doi/abs/10.14778/3681954.3681996
- Du Cunxiao, 2024, Towards Faster Inference of Transformers: Strategies for Accelerating Decoding Processes, Ph.D. thesis, Computer Science, School of Computing and Information Systems, Singapore Management University, https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1611&context=etd_coll (Examines non-autoregressive decoding, speculative decoding and attention optimizations.)
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et. al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb, 6 Sep 2024, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 https://github.com/apple/ml-sigmoid-attention
- Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li, 5 Sep 2024, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (This survey is about making attention mechanisms more performant, accurate and intelligent, rather than improving efficiency.)
- Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu, 11 Sep 2024, Gated Slot Attention for Efficient Linear-Time Sequence Modeling, https://arxiv.org/abs/2409.07146 https://github.com/sustcsonglin/flash-linear-attention https://huggingface.co/fla-hub
- Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
- Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 https://github.com/tonyzhang617/nomad-dist
- Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li, 23 Sep 2024, A-VL: Adaptive Attention for Large Vision-Language Models, https://arxiv.org/abs/2409.14846 (Separate handling of text and image attention modules.)
- Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 22 Sep 2024, EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models, https://arxiv.org/abs/2409.14595
- Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
- Ehsan Kabir, Md. Arafat Kabir, Austin R.J. Downey, Jason D. Bakos, David Andrews, Miaoqing Huang, 21 Sep 2024, FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs, https://arxiv.org/abs/2409.14023
- Gabriel Mongaras, Trevor Dohm, Eric C. Larson, 27 Sep 2024, Cottention: Linear Transformers With Cosine Attention, https://arxiv.org/abs/2409.18747 (Nearest-neighbor to replace Softmax attention, for near-linear attention.)
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
- Jintao Zhang, Jia wei, Pengle Zhang, Jun Zhu, Jianfei Chen, 3 Oct 2024, SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration, https://arxiv.org/abs/2410.02367 (Quantized attention outperforms Flash attention.)
- Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci, 28 Sep 2024, Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models, https://arxiv.org/abs/2409.19315 (Analog hardware implementation of self-attention with sliding window attention.)
- Yaniv Leviathan, Matan Kalman, Yossi Matias, 3 Oct 2024, Selective Attention Improves Transformer, https://arxiv.org/abs/2410.02703 (Allowing adjacent tokens to predict whether required for attention.)
- Josh Alman, Hantao Yu, 5 Oct 2024, Fundamental Limitations on Subquadratic Alternatives to Transformers, https://arxiv.org/abs/2410.04271
- Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin, 9 Oct 2024, Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions, https://arxiv.org/abs/2410.06577
- Ravindran Kannan, Chiranjib Bhattacharyya, Praneeth Kacham, David P. Woodruff, 7 Oct 2024, LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions, https://arxiv.org/abs/2410.05462
- Y Yang, TT Yang, Oct 2024, DAA: Dynamic attention allocation improves large-scale model reasoning, https://doi.org/10.21203/rs.3.rs-5025111/v1 https://assets-eu.researchsquare.com/files/rs-5025111/v1_covered_81d99ea9-82e8-494a-81e0-2ee73cbf1249.pdf?c=1728617590
- Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han, 14 Oct 2024, DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads, https://arxiv.org/abs/2410.10819 https://github.com/mit-han-lab/duo-attention
- Bettayeb, M., Halawani, Y., Khan, M.U. et al. Efficient memristor accelerator for transformer self-attention functionality. Sci Rep 14, 24173 (2024). https://doi.org/10.1038/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z.pdf
- Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou, 15 Oct 2024, Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix, https://arxiv.org/abs/2410.11261
- Anonymous authors, Oct 2024, Dense Attention: No-Compromise Exact All NxN Interactions Algorithm with O(N) Space and Time Complexity, https://openreview.net/pdf?id=2bIQBDSfRk
- Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu, 22 Oct 2024, FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs, https://arxiv.org/abs/2410.16663
- Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda, 23 Oct 2024, Stick-breaking Attention, https://arxiv.org/abs/2410.17980
- Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan, 23 Oct 2024, Value Residual Learning For Alleviating Attention Concentration In Transformers, https://arxiv.org/abs/2410.17897
- Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019, Augmenting self-attention with persistent memory. CoRR, abs/1907.01470, 2019. http://arxiv.org/abs/1907.01470
- Dr. Ashish Bamania, Oct 27, 2024, Amazing Things Happen When Attention Heads Are Supercharged Using Mixture-Of-Experts: A deep dive into how the Attention mechanism works and how it is being enhanced by the Mixture-of-Experts architecture, resulting in Mixture-of-Head Attention (MoH) that makes our existing LLMs more efficient than ever. https://levelup.gitconnected.com/amazing-things-happen-when-attention-heads-are-supercharged-using-mixture-of-experts-b55a6b9a0ac8
- Themistoklis Haris, 6 Nov 2024, kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers, https://arxiv.org/abs/2411.04013
- Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
- Ampazis, Nicholas, and Flora Sakketou. 2024. "Diversifying Multi-Head Attention in the Transformer Model" Machine Learning and Knowledge Extraction 6, no. 4: 2618-2638. https://doi.org/10.3390/make6040126 https://www.mdpi.com/2504-4990/6/4/126
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
- Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen, 17 Nov 2024, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, https://arxiv.org/abs/2411.10958 https://github.com/thu-ml/SageAttention
- Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang, 21 Nov 2024 (v2), LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts, https://arxiv.org/abs/2411.13009
- Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in Vision: A Survey. arXiv:2101.01169 https://arxiv.org/abs/2101.01169 (Survey of Vision transformers in 2021.)
- Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
- Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu, 20 Nov 2024, MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices, https://arxiv.org/abs/2411.17720
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv, 28 Nov 2024, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs,NVIDIA Research, https://arxiv.org/abs/2411.19146 (This is dynamic NAS on a vast scale in a search space of size 10^138, because the optimization is applied with low granularity to each block in attention and FFN subcomponents of each layer.)
- LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20
- Benjamin Marie, Aug 13, 2024, FlexAttention: A Flexible Pytorch API for Implementing Attention Optimizations. It’s going to be easier to optimize attention computation. https://medium.com/@bnjmn_marie/flexattention-a-flexible-pytorch-api-for-implementing-attention-optimizations-bc4fae65eb9d
- Ginés Carreto Picón, Illia Oleksiienko, Lukas Hedegaard, Arian Bakhtiarnia, Alexandros Iosifidis, 5 Dec 2024 (v2), Continual Low-Rank Scaled Dot-product Attention, https://arxiv.org/abs/2412.03214
- Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)
- Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan, 11 Dec 2024, TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs, https://arxiv.org/abs/2412.08585
- A. M. Tayeb and T. -H. Kim, "UNestFormer: Enhancing Decoders and Skip Connections with Nested Transformers for Medical Image Segmentation," in IEEE Access, doi: 10.1109/ACCESS.2024.3516079. https://ieeexplore.ieee.org/document/10795135 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10795135 (Hybrid CNN-Transformer architecture that uses "omni attention" by combining four types of attention: local, global, channel and spatial.)
- Haiyang Wang, Yue Fan, Muhammad Ferjad Naeem, Yongqin Xian, Jan Eric Lenssen, Liwei Wang, Federico Tombari, Bernt Schiele, 30 Oct 2024, TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters, https://haiyang-w.github.io/tokenformer.github.io/ (Unique novel token-based attention mechanism.)
- Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
- SK Jha, S Jha, R Ewetz, A Velasquez, 2024, On the Design of Novel XOR-based Attention Mechanism for Enhanced Efficiency of Transformers, DAC’24, June 23–27, 2024, San Francisco, CA, USA, https://sumitkumarjha.com/assets/pdf/2024_DAC_Jha_XOR_Attention_Transformer.pdf
- Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou, 23 Dec 2024, A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression, https://arxiv.org/abs/2412.17483
- Mingcong Song, Xinru Tang, Fengfan Hou, Jing Li, Wei Wei, Yipeng Ma, Runqiu Xiao, Hongjie Si, Dingcheng Jiang, Shouyi Yin, Yang Hu, Guoping Long, 24 Dec 2024, Tackling the Dynamicity in a Production LLM Serving System with SOTA Optimizations via Hybrid Prefill/Decode/Verify Scheduling on Efficient Meta-kernels, https://arxiv.org/abs/2412.18106
- Andrew Chan, Dec 12, 2024, Fast LLM Inference From Scratch: Pushing single-GPU inference throughput to the edge without libraries, https://andrewkchan.dev/posts/yalm.html
- Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
- Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, 26 Dec 2024, Multi-matrix Factorization Attention, https://arxiv.org/abs/2412.19255
- MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu, 14 Jan 2025, MiniMax-01: Scaling Foundation Models with Lightning Attention, https://arxiv.org/abs/2501.08313 https://github.com/MiniMax-AI (Content window over 1 million tokens.)
- Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao, 11 Jan 2025, Tensor Product Attention Is All You Need, https://arxiv.org/abs/2501.06425 https://github.com/tensorgi/T6
- Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)
- Pouya Hamadanian, Sadjad Fouladi, 20 Jan 2025, Glinthawk: A Two-Tiered Architecture for High-Throughput LLM Inference, https://arxiv.org/abs/2501.11779 https://github.com/microsoft/glinthawk (Separate memory-bound attention computation from other parts of model such as compute-bound FFNs, but only in the decoding phase (not prefill), whereby attention and KV cache management can be performed on a greater number of lower-end GPUs or CPU.)
- Jia Gao, Guiran Liu, Binrong Zhu, Shicheng Zhou, Hongye Zheng, Xiaoxuan Liao, 23 Jan 2025, Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer, https://arxiv.org/abs/2501.13467
- Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-Head Latent Attention (MHLA), and multi-token decoding.)
- Nathaniel Tomczak, Sanmukh Kuppannagari, 31 Jan 2025, Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques, https://arxiv.org/abs/2502.01659 (Approaching attention optimization as a graph-theoretical problem.)
- Terry Chen, Bing Xu and Kirthi Devleker, Feb 12, 2025, Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling, https://developer.nvidia.com/blog/automating-gpu-kernel-generation-with-deepseek-r1-and-inference-time-scaling/
- Yaohua Tang, Zhicheng Hu, Kun Cheng, Fan Mo, Qiheng Lv, Hua Wang, Zhi Chen, 24 Feb 2025 (v2), Round Attention: A Novel Round-Level Attention Mechanism to Accelerate LLM Inference, https://arxiv.org/abs/2502.15294
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
- Mozhgan Navardi, Romina Aalishah, Yuzhe Fu, Yueqian Lin, Hai Li, Yiran Chen, Tinoosh Mohsenin, 19 Feb 2025, GenAI at the Edge: Comprehensive Survey on Empowering Edge Devices, https://arxiv.org/abs/2502.15816
- Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou, 28 Feb 2025, FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference, https://arxiv.org/abs/2502.20766 (Prefill optimization that dynamically applies different attention patterns, including sparse attention, for KV computations, based on the input query.)
- Asif Razzaq, March 5, 2025, Qwen Releases QwQ-32B: A 32B Reasoning Model that Achieves Significantly Enhanced Performance in Downstream Task, https://www.marktechpost.com/2025/03/05/qwen-releases-qwq-32b-a-32b-reasoning-model-that-achieves-significantly-enhanced-performance-in-downstream-task/ (Features 32B parameters, 32K context length, 64 layers, RoPE, SwiGLU, RMSNorm, and attention enhancements.)
- Benjamin Spector, Aaryan Singhal, Dan Fu, Chris Ré, Mar 15, 2025, ThunderKittens Now on Blackwells! https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwell https://github.com/HazyResearch/ThunderKittens/blob/blackwell/kernels/attn/b200/b200.cu https://github.com/HazyResearch/ThunderKittens/blob/blackwell/kernels/matmul/B200/matmul.cu https://github.com/HazyResearch/ThunderKittens/blob/blackwell/kernels/matmul/FP8_B200/matmul.cu
Attention Head Approximation
Much of the research into attention heads has focused either on attention head pruning (removing redundant or under-utilized attention heads) or on reducing the quadratic cost of attention in terms of sequence length (related to non-autoregressive algorithms). However, there are also some "simpler" attention head designs that have been proposed as replacements for the original Transformer components.
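For illustration only (this sketch is mine, not taken from any of the papers below), attention head pruning can be viewed as skipping the computation of selected heads entirely and letting them contribute zero vectors to the concatenated multi-head output. All shapes and names here are hypothetical toy values.

    import numpy as np

    def multi_head_attention(X, Wq, Wk, Wv, Wo, keep_heads):
        """Toy multi-head attention where pruned heads are skipped entirely.
        X: (n, d_model); Wq/Wk/Wv: lists of per-head (d_model, d_head) matrices;
        Wo: (num_heads * d_head, d_model); keep_heads: boolean flag per head."""
        n, _ = X.shape
        d_head = Wq[0].shape[1]
        head_outputs = []
        for h in range(len(Wq)):
            if not keep_heads[h]:
                # Pruned head: no QK^T or softmax work, just a zero contribution.
                head_outputs.append(np.zeros((n, d_head)))
                continue
            Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
            scores = Q @ K.T / np.sqrt(d_head)                    # (n, n)
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
            head_outputs.append(weights @ V)
        return np.concatenate(head_outputs, axis=-1) @ Wo

    # Example: prune heads 1 and 3 of a toy 4-head layer.
    rng = np.random.default_rng(0)
    d_model, d_head, num_heads, n = 16, 4, 4, 8
    X = rng.standard_normal((n, d_model))
    Wq = [rng.standard_normal((d_model, d_head)) for _ in range(num_heads)]
    Wk = [rng.standard_normal((d_model, d_head)) for _ in range(num_heads)]
    Wv = [rng.standard_normal((d_model, d_head)) for _ in range(num_heads)]
    Wo = rng.standard_normal((num_heads * d_head, d_model))
    out = multi_head_attention(X, Wq, Wk, Wv, Wo, keep_heads=[True, False, True, False])
    print(out.shape)  # (8, 16)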
- Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie, Nov 2020, Simplified Self-Attention for Transformer-based End-to-End Speech Recognition, https://arxiv.org/abs/2005.10463
- Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1789–1798. Association for Computational Linguistics, https://arxiv.org/abs/1805.00631 (A simpler version of the attention heads.)
- Biao Zhang, Ivan Titov, Rico Sennrich, Aug 2019, Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention, https://arxiv.org/abs/1908.11365 (Merged simplified attention heads.)
- Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur, “A time-restricted self-attention layer for ASR,” in Proc. ICASSP. IEEE, 2018, pp. 5874–5878. https://ieeexplore.ieee.org/document/8462497
- Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. ConvBERT: Improving BERT with Span-based Dynamic Convolution, NeurIPS, 33, 2020. https://arxiv.org/abs/2008.02496 (Replaces quadratic attention heads with linear convolution.)
- T. J. Ham, S. J. Jung, S. Kim et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. of HPCA. IEEE, 2020, pp. 328–341. https://arxiv.org/abs/2002.10941 (Optimizes both the attention mechanism and exponentiation in softmax.)
- Forrest Iandola, Albert Shaw, Ravi Krishna, and Kurt Keutzer. 2020. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 124–135. Association for Computational Linguistics. https://arxiv.org/abs/2006.11316 (Replaces self-attention with grouped convolutions for a hybrid method.)
- Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite Transformer with Long-Short Range Attention. In International Conference on Learning Representations. https://arxiv.org/abs/2004.11886, Code: https://github.com/mit-han-lab/lite-transformer (Hybrid attention method that uses convolutions.)
- Ruokai Yin, Yuhang Li, Abhishek Moitra, Priyadarshini Panda, Dec 2022, Training Integer-Only Deep Recurrent Neural Networks https://arxiv.org/abs/2212.11791 (Integer-only version of RNNs called iRNN, with integer-only layer normalization, integer-only attention, and piecewise linear approximation for integer-only activation functions such as tanh and sigmoid.)
- Lucas D. Lingle, Sep 2023, Transformer-VQ: Linear-Time Transformers via Vector Quantization https://arxiv.org/abs/2309.16354 Code: https://github.com/transformer-vq/transformer_vq (Linear attention in an encoder-only Transformer.)
- Jiing-Ping Wang, Ming-Guang Lin, An-Yeu (Andy) Wu, 11 Apr 2024, LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer, https://arxiv.org/abs/2404.07519 (Approximate 4-bit vector dot product used as an estimate in attention heads, with computation reuse to 8-bit dot products.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
- Ruisi Cai1, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
- Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli, 16 Jul 2024, PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer, https://arxiv.org/abs/2407.11306
- Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Nils Graef, Aarush Gupta, 2024 (accessed), Approximate attention: infinite context length with constant complexity per token [work in progress], https://github.com/OpenMachine-ai/transformer-tricks/blob/main/approximate.pdf
- Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
- Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou, 23 Aug 2024, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 (Training using low-rank matrices to approximate attention.)
- Josh Alman, Zhao Song, 9 May 2023 (v2), Fast Attention Requires Bounded Entries, https://arxiv.org/abs/2302.13214 (Low-rank matrices in attention for fast inference.)
- Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
- David Spuler, March 2024, Attention Head Approximation, in Generative AI in C++, https://www.aussieai.com/book/ch20-attention-head-approximation
- Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
- Themistoklis Haris, 6 Nov 2024, kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers, https://arxiv.org/abs/2411.04013
- Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu, 4 Jan 2025, AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference, https://arxiv.org/abs/2501.02336 (Optimally skipping sublayer components in FFN and attention during prefill and decoding phases.)
- Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda, 23 Oct 2024, Stick-breaking Attention, https://arxiv.org/abs/2410.17980
- Shantanu Acharya, Fei Jia, Boris Ginsburg, 26 Nov 2024, Star Attention: Efficient LLM Inference over Long Sequences, https://arxiv.org/abs/2411.17116
- A. M. Tayeb and T. -H. Kim, "UNestFormer: Enhancing Decoders and Skip Connections with Nested Transformers for Medical Image Segmentation," in IEEE Access, doi: 10.1109/ACCESS.2024.3516079. https://ieeexplore.ieee.org/document/10795135 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10795135 (Hybrid CNN-Transformer architecture that uses "omni attention" by combining four types of attention: local, global, channel and spatial.)
- Pierre-Emmanuel Mazaré, Gergely Szilvasy, Maria Lomeli, Francisco Massa, Naila Murray, Hervé Jégou, Matthijs Douze, 12 Feb 2025, Inference-time sparse attention with asymmetric indexing, https://arxiv.org/abs/2502.08246
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Alternatives to Attention
Although attention performs very well, there are ongoing attempts to replace it with something better.
Research papers on what comes next after attention include:
- Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. In Advances in neural information processing systems (NeurIPS), 2020. https://arxiv.org/abs/2008.07669
- Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems, 34, 2021. https://arxiv.org/abs/2110.13985
- Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc V Le. Transformer quality in linear time. arXiv preprint arXiv:2202.10447, 2022, https://arxiv.org/abs/2202.10447 (Proposes a "gated attention unit" with an overlaying linear approximation.)
- Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. An attention free transformer. arXiv preprint arXiv:2105.14103, 2021. https://arxiv.org/abs/2105.14103 (A Transformer architecture without dot product self-attention using different processing of keys and values.)
- Irwan Bello. LambdaNetworks: Modeling long-range interactions without attention. arXiv preprint arXiv:2102.08602, 2021. https://arxiv.org/abs/2102.08602 (An alternative to self-attention using linear transforms.)
- Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137
- N Sevim, EO Özyedek, F Şahinuç, A Koç, 2022, Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier Layers arXiv preprint arXiv:2209.12816, https://arxiv.org/abs/2209.12816 (Replaces attention with a Fourier transform method.)
- Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stansilaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, May 2023, RWKV: Reinventing RNNs for the Transformer Era, https://arxiv.org/pdf/2305.13048.pdf, Code: https://github.com/BlinkDL/RWKV-LM (Hybrid RNN-Transformer that replaces QKV attention with Receptance Weighted Key Value (RWKV), which is similar to Flash Attention).
- Keyu An, Shiliang Zhang, Sep 2023, Exploring RWKV for Memory Efficient and Low Latency Streaming ASR, arXiv preprint arXiv:2309.14758, https://arxiv.org/pdf/2309.14758.pdf (Analysis of the RWKV Transformer-RNN hybrid architecture.)
- A Langedijk, H Mohebbi, G Sarti, W Zuidema, J Jumelet, Oct 2023, DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers, https://arxiv.org/abs/2310.03686, https://pure.rug.nl/ws/portalfiles/portal/799424386/2310.03686v1.pdf (Allows the decoder to cross-attend to earlier layers of the encoder, rather than only the final output layer.)
- J Alman, Z Song, Oct 2023, How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation, arXiv preprint arXiv:2310.04064, https://arxiv.org/abs/2310.04064 (Uses more advanced QKV attention mechanism with even more computations than vanilla Transformer.)
- Bohdan Ivaniuk-Skulskyi; Nadiya Shvai; Arcadi Llanza; Amir Nakib, 2024, Towards Lightweight Transformer Architecture: an Analysis on Semantic Segmentation, 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), https://ieeexplore.ieee.org/abstract/document/10467926 (Examines skip-attention and pool-unpool attention methods for faster inference.)
- Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli, 16 Jul 2024, PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer, https://arxiv.org/abs/2407.11306
- David Spuler, March 2024, Alternatives to Attention, in Generative AI in C++, https://www.aussieai.com/book/ch20-alternatives-attention
- Benjamin L. Badger, 2 Sep 2024, Masked Mixers for Language Generation and Retrieval, https://arxiv.org/abs/2409.01482
- Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 https://github.com/tonyzhang617/nomad-dist
- Gabriel Mongaras, Trevor Dohm, Eric C. Larson, 27 Sep 2024, Cottention: Linear Transformers With Cosine Attention, https://arxiv.org/abs/2409.18747 (Nearest-neighbor to replace Softmax attention, for near-linear attention.)
- A. M. Tayeb and T. -H. Kim, "UNestFormer: Enhancing Decoders and Skip Connections with Nested Transformers for Medical Image Segmentation," in IEEE Access, doi: 10.1109/ACCESS.2024.3516079. https://ieeexplore.ieee.org/document/10795135 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10795135 (Hybrid CNN-Transformer architecture that uses "omni attention" by combining four types of attention: local, global, channel and spatial.)
- Ali Behrouz, Peilin Zhong, Vahab Mirrokni, 31 Dec 2024, Titans: Learning to Memorize at Test Time, https://arxiv.org/abs/2501.00663 (Using "neural memory" rather than attention.)
- Xiaoran Liu, Ruixiao Li, Mianqiu Huang, Zhigeng Liu, Yuerong Song, Qipeng Guo, Siyang He, Qiqi Wang, Linlin Li, Qun Liu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu, 24 Feb 2025, Thus Spake Long-Context Large Language Model, https://arxiv.org/abs/2502.17129 (Impressive survey of many techniques to improve efficiency and accuracy of long context processing in both inference and training, covering text, video and multimodal models.)
Sliding Window Attention
Sliding Window Attention (SWA) is a type of local attention whereby each token attends only to its nearby neighbor tokens. This is very efficient, yielding a form of linear attention, but it can also cause a loss of accuracy.
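As a minimal sketch (assumptions and names are mine, not drawn from the papers below), the decode step of sliding window attention scores the new token's query against only the most recent W cached keys, so each generated token costs O(W) work instead of O(n):

    import numpy as np

    def swa_decode_step(q, k_cache, v_cache, window):
        """One decoding step of sliding window attention (SWA).
        q: (d,) query for the new token; k_cache/v_cache: (t, d) for all t
        previous tokens; only the last `window` entries are attended to."""
        K = k_cache[-window:]                     # (w, d) with w = min(t, window)
        V = v_cache[-window:]
        scores = K @ q / np.sqrt(q.shape[0])      # (w,) dot-product scores
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                  # softmax over the window only
        return weights @ V                        # (d,) attention output

    # Example: 100 cached tokens, window of 16 -> only 16 dot products per step.
    rng = np.random.default_rng(0)
    d, t, window = 8, 100, 16
    q = rng.standard_normal(d)
    k_cache = rng.standard_normal((t, d))
    v_cache = rng.standard_normal((t, d))
    print(swa_decode_step(q, k_cache, v_cache, window).shape)  # (8,)

In practice the KV cache for SWA is usually a fixed-size rolling buffer of W entries, so memory use also stays constant regardless of sequence length.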
Research papers on sliding window attention:
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, Dec 2023, Efficient Streaming Language Models with Attention Sinks https://arxiv.org/abs/2309.17453 Code: https://github.com/mit-han-lab/streaming-llm
- Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
- Omri Mallis, February 25, 2024 , Techniques for KV Cache Optimization in Large Language Models, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
- Mistral AI, https://github.com/mistralai/mistral-src
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang, 24 Jun 2024, Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache, https://arxiv.org/abs/2406.17808 (Extends the KV cache eviction policy in sliding window attention so that the KV partially looks back further than the window.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, 4 Jan 2024 (v2), LLM in a flash: Efficient Large Language Model Inference with Limited Memory, https://arxiv.org/abs/2312.11514 (Storing model parameters in flash memory on phones.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
- Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci, 28 Sep 2024, Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models, https://arxiv.org/abs/2409.19315 (Analog hardware implementation of self-attention with sliding window attention.)
- Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin, 9 Oct 2024, Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions, https://arxiv.org/abs/2410.06577
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
- AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20
- NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache
- Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun, 18 Dec 2024, LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer, https://arxiv.org/abs/2412.13871
- Zhendong Zhang, 14 Jan 2025 (v2), Flash Window Attention: speedup the attention computation for Swin Transformer, https://arxiv.org/abs/2501.06480
- Zeyu Tang, Zhenhao Chen, Loka Li, Xiangchen Song, Yunlong Deng, Yifan Shen, Guangyi Chen, Peter Spirtes, Kun Zhang, 5 Feb 2025, Reflection-Window Decoding: Text Generation with Selective Refinement, https://arxiv.org/abs/2502.03678 (Combination of sliding window attention with pausing.)
- Difan Deng, Marius Lindauer, 20 Feb 2025 (v2), Neural Attention Search, https://arxiv.org/abs/2502.13251 (Deciding whether a token deserves global attention, local attention, or sliding window attention, reducing KV caches.)
- Jiangning Wei, Lixiong Qin, Bo Yu, Tianjian Zou, Chuhan Yan, Dandan Xiao, Yang Yu, Lan Yang, Ke Li, Jun Liu, 14 Mar 2025, VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention, https://arxiv.org/abs/2503.11004
Tree Attention
Tree attention is the use of tree-structured data in LLM attention algorithms. This is not widely used on its own as an attention optimization, but can be used in situations where tree-structured data is already available, including beam search and various types of speculative decoding.
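As an illustrative sketch (the function name and tree layout are hypothetical, not from the papers below), tree attention can be expressed as a mask over a token tree, such as the candidate tree in speculative decoding, where each node attends only to itself and its ancestors:

    import numpy as np

    def tree_attention_mask(parents):
        """Build an attention mask for a token tree. parents[i] is the index of
        node i's parent, or -1 for the root. mask[i, j] is True iff node i may
        attend to node j (itself or an ancestor on its path to the root)."""
        n = len(parents)
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            j = i
            while j != -1:                 # walk up to the root, marking ancestors
                mask[i, j] = True
                j = parents[j]
        return mask

    # Example tree: node 0 is the root; 1 and 2 are children of 0; 3 is a child of 1.
    print(tree_attention_mask([-1, 0, 0, 1]).astype(int))
    # [[1 0 0 0]
    #  [1 1 0 0]
    #  [1 0 1 0]
    #  [1 1 0 1]]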
Research papers on tree attention:
- Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
- Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui, 1 May 2024, Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge, https://arxiv.org/abs/2405.00263 (Speculative decoding improvement by extending Medusa to tree attention with a cross attention block.)
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang, 14 Jun 2024, HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning, https://arxiv.org/abs/2406.09827 (Sparse attention using the top-k features and a tree-based structure.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge, 9 Aug 2024 (v2), Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters, https://arxiv.org/abs/2408.04093 Code: https://github.com/Zyphra/tree_attention
- Y Cong, 2024, Research for Enhancing Processing and Computational Efficiency in LLM, 2024 2nd International Conference on Image, https://www.atlantis-press.com/article/126004157.pdf
Local Attention
Local attention is an LLM inference optimization whereby each token's attention computations are based only on nearby tokens. This gives a faster attention algorithm with linear complexity, but it can also lose accuracy. A generalization is a hybrid local-global attention algorithm, with interleaved layers of local attention and global attention.
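A minimal sketch of local (banded) attention follows, assuming a symmetric window of w tokens on either side (names and shapes are illustrative). For clarity it builds a full score matrix and masks it; an optimized kernel would compute only the band, for O(n·w) work instead of O(n²):

    import numpy as np

    def local_attention(Q, K, V, w):
        """Local (banded) attention: token i attends only to tokens j with |i-j| <= w.
        Q, K, V: (n, d) matrices. Returns the (n, d) attention output."""
        n, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)                      # (n, n) full scores
        idx = np.arange(n)
        band = np.abs(idx[:, None] - idx[None, :]) <= w    # banded locality mask
        scores = np.where(band, scores, -np.inf)           # mask out distant tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
        return weights @ V

    rng = np.random.default_rng(0)
    n, d = 12, 8
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    print(local_attention(Q, K, V, w=2).shape)  # (12, 8)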
Research papers on local attention:
- Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz GUStavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Frietas, 11 Apr 2024, RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, Google Research, https://arxiv.org/abs/2404.07839
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Uses local attention versus global attention at different layers.)
- Mirko Farina, Usman Ahmad, Ahmad Taha, Hussein Younes, Yusuf Mesbah, Xiao Yu, Witold Pedrycz, 2024, Sparsity in transformers: A systematic literature review, Neurocomputing, Volume 582, 14 May 2024, 127468, https://www.sciencedirect.com/science/article/abs/pii/S092523122400239X (General survey of sparsity methods, and techniques that create sparsity.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs//2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- Difan Deng, Marius Lindauer, 20 Feb 2025 (v2), Neural Attention Search, https://arxiv.org/abs/2502.13251 (Deciding whether a token deserves global attention, local attention, or sliding window attention, reducing KV caches.)
Local-Global Attention
Local-global attention is the LLM inference optimization of using a hybrid of both local attention and global attention in alternating layers. Typically, more local layers are used for efficiency, such as a three-to-one ratio or higher, whereas adding global layers increases accuracy at some performance cost.
Combining the two is a partial trade-off between faster local attention and more accurate global attention. There are various complex ways to do this, or you can simply alternate, with some layers using local attention and others using global attention, as done by the Character.AI inference backend.
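As a toy sketch of such an alternating schedule (the helper function and the exact ratio are hypothetical, loosely in the spirit of the designs cited below), every fourth layer could use global attention while the remaining layers use cheaper local attention:

    def layer_attention_types(num_layers, global_every=4):
        """Assign 'local' or 'global' attention per layer; global_every=4 gives
        a 3:1 local-to-global ratio (layers 0, 4, 8, ... are global)."""
        return ["global" if i % global_every == 0 else "local"
                for i in range(num_layers)]

    # Example: a 12-layer model with three local layers between global layers.
    print(layer_attention_types(12))
    # ['global', 'local', 'local', 'local', 'global', 'local', 'local', 'local',
    #  'global', 'local', 'local', 'local']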
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
- Chull Hwan Song, Hye Joo Han, Yannis Avrithis, 16 Jul 2021, All the attention you need: Global-local, spatial-channel attention for image retrieval, https://arxiv.org/abs/2107.08000
- Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, Peter M. Atkinson, 26 Jun 2022 (v4), UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery, https://arxiv.org/abs/2109.08937
- Yiping Sun, 1 Jul 2024, A Global-Local Attention Mechanism for Relation Classification, 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), https://arxiv.org/abs/2407.01424
- Z. Zheng, G. An, D. Wu and Q. Ruan, Global and Local Knowledge-Aware Attention Network for Action Recognition, IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 1, pp. 334-347, Jan. 2021, doi: 10.1109/TNNLS.2020.2978613, https://ieeexplore.ieee.org/document/9050644 Code:https://github.com/ZhenxingZheng/attention-network
- Abhishek Srivastava, Sukalpa Chanda, Debesh Jha, Michael A. Riegler, Pål Halvorsen, Dag Johansen, Umapada Pal, 20 Nov 2021, PAANet: Progressive Alternating Attention for Automatic Medical Image Segmentation, https://arxiv.org/abs/2111.10618
- Jinpeng Li, Yichao Yan, Shengcai Liao, Xiaokang Yang, Ling Shao, 10 Jul 2021, Local-to-Global Self-Attention in Vision Transformers, https://arxiv.org/abs/2107.04735
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
- Fangyuan Xu, Tanya Goyal, Eunsol Choi, 8 Nov 2024, Recycled Attention: Efficient inference for long-context language models, https://arxiv.org/abs/2411.05787
- 01.AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Y Yang, TT Yang, Oct 2024, DAA: Dynamic attention allocation improves large-scale model reasoning, https://doi.org/10.21203/rs.3.rs-5025111/v1 https://assets-eu.researchsquare.com/files/rs-5025111/v1_covered_81d99ea9-82e8-494a-81e0-2ee73cbf1249.pdf?c=1728617590
- A. M. Tayeb and T. -H. Kim, "UNestFormer: Enhancing Decoders and Skip Connections with Nested Transformers for Medical Image Segmentation," in IEEE Access, doi: 10.1109/ACCESS.2024.3516079. https://ieeexplore.ieee.org/document/10795135 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10795135 (Hybrid CNN-Transformer architecture that uses "omni attention" by combining four types of attention: local, global, channel and spatial.)
- Hongjin Qian, Zheng Liu, Peitian Zhang, Zhicheng Dou, Defu Lian, 18 Dec 2024 (v2), Boosting Long-Context Management via Query-Guided Activation Refilling, https://arxiv.org/abs/2412.12486 (Maintaining two KV caches, one global, one local.)
- Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli, 19 Dec 2024 (v2), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663 (Encoder-only BERT model updated with modern optimizations including Flash attention, bias removal, RoPE, pre-norm, GeGLU (a GELU variant), hybrid local-global attention, and zero padding removal.)
Additive Attention
Additive attention computes attention scores by combining queries and keys with addition, typically inside a small feed-forward scorer, rather than with a dot product, which can make attention computations faster. It is also related to various "zero-multiplication" models that replace multiplication with addition, such as low-bit quantization or other multiplication-free models.
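For illustration, here is a minimal NumPy sketch of the classic additive (Bahdanau-style) attention score, which combines the query and key by addition inside a small feed-forward scorer. The weight names W1, W2, and v, and the dimensions used, are illustrative assumptions.

import numpy as np

def additive_attention_weights(q, K, W1, W2, v):
    # Additive (Bahdanau-style) scoring: score(q, k) = v^T tanh(W1 q + W2 k).
    # Query and key are combined by addition inside a small feed-forward
    # scorer, rather than by a dot product.
    scores = np.tanh(q @ W1 + K @ W2) @ v   # one score per key, shape (n,)
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()          # softmax: attention weights over keys

# Example: 6 keys of dimension 4, hidden scoring dimension 8 (illustrative)
rng = np.random.default_rng(0)
d, h, n = 4, 8, 6
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
W1, W2 = rng.standard_normal((d, h)), rng.standard_normal((d, h))
v = rng.standard_normal(h)
print(additive_attention_weights(q, K, W1, W2, v).sum())   # 1.0 (weights sum to one)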
Research papers on additive attention:
- Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang, Xing Xie, 5 Sep 2021 (v6), Fastformer: Additive Attention Can Be All You Need, https://arxiv.org/abs/2108.09084
- Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan; 2023, SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications, Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17425-17436, https://openaccess.thecvf.com/content/ICCV2023/html/Shaker_SwiftFormer_Efficient_Additive_Attention_for_Transformer-based_Real-time_Mobile_Vision_Applications_ICCV_2023_paper.html
- K. W. Cheuk, Y. -J. Luo, E. Benetos and D. Herremans, "Revisiting the Onsets and Frames Model with Additive Attention," 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 2021, pp. 1-8, doi: 10.1109/IJCNN52387.2021.9533407, https://ieeexplore.ieee.org/abstract/document/9533407
- Emilio Morales-Juarez, Gibran Fuentes-Pineda, 21 Aug 2024 (v2), Efficient generative adversarial networks using linear additive-attention Transformers, https://arxiv.org/abs/2401.09596
- Rickard Brännvall, 3 Oct 2023, The Inhibitor: ReLU and Addition-Based Attention for Efficient Transformers, https://arxiv.org/abs/2310.02041
- Tomek Korbak, 2020, Implementing additive and multiplicative attention in PyTorch, https://tomekkorbak.com/2020/06/26/implementing-attention-in-pytorch/
- Y. Wen, S. Chen and A. K. Shrestha, "Fast Vision Transformer via Additive Attention," 2024 IEEE Conference on Artificial Intelligence (CAI), Singapore, Singapore, 2024, pp. 573-574, doi: 10.1109/CAI59869.2024.00113, https://ieeexplore.ieee.org/abstract/document/10605345
- Gokul Srinivasagan, Simon Ostermann, 2024, HybridBERT - Making BERT Pretraining More Efficient Through Hybrid Mixture of Attention Mechanisms, https://aclanthology.org/2024.naacl-srw.30/ https://aclanthology.org/2024.naacl-srw.30.pdf https://github.com/gokulsg/HBERT/
Multiplicative Attention
Multiplicative attention computes attention scores with multiplication operations, such as a dot product or a bilinear product of the query and key vectors. Standard attention relies on full matrix multiplications, which require a cubic number of arithmetic multiplication operations. One cheaper approximation is the Hadamard (elementwise) matrix product, which only multiplies matrix elements in a pairwise manner.
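As a small example, below is a NumPy sketch of Luong-style multiplicative ("general") attention scoring, score(q, k) = q^T W k, a bilinear product of query and key. The function and parameter names are illustrative assumptions.

import numpy as np

def multiplicative_attention_weights(q, K, W):
    # Multiplicative (Luong "general") scoring: score(q, k) = q^T W k,
    # a bilinear product of the query with each key.
    scores = (q @ W) @ K.T                  # one score per key, shape (n,)
    weights = np.exp(scores - scores.max())
    return weights / weights.sum()          # softmax: attention weights over keys

# Example: query/key dimension 4, 6 keys (illustrative sizes)
rng = np.random.default_rng(0)
d, n = 4, 6
q = rng.standard_normal(d)
K = rng.standard_normal((n, d))
W = rng.standard_normal((d, d))
print(multiplicative_attention_weights(q, K, W).sum())     # 1.0 (weights sum to one)

When W is the identity matrix, this reduces to plain dot-product scoring, which is the form used in standard scaled dot-product attention.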
Research papers on multiplicative attention:
- Minh-Thang Luong, Hieu Pham, Christopher D. Manning, 20 Sep 2015 (v5), Effective Approaches to Attention-based Neural Machine Translation, https://arxiv.org/abs/1508.04025 https://aclanthology.org/D15-1166/ https://aclanthology.org/D15-1166.pdf
- Tomek Korbak, 2020, Implementing additive and multiplicative attention in PyTorch, https://tomekkorbak.com/2020/06/26/implementing-attention-in-pytorch/
More AI Research
Read more about:
- Attention head pruning
- Transformer architectures
- Transformer optimization
- Inference Optimizations
- Loop Optimizations
- Code Optimizations