Aussie AI

Attention Optimization

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

What is Attention?

Attention is the mechanism whereby an LLM assigns weights to the other tokens (or words) in the input sequence, so as to determine an output token. The attention mechanism is one of the major breakthroughs that allowed advanced AI to take shape. After all, the seminal 2017 Transformer paper was titled "Attention is all you need" (Vaswani et al., 2017). It used a type of self-attention called "scaled dot-product attention" (a minimal code sketch appears after the list below). However, the self-attention mechanism is also computationally expensive, and it's perhaps a case of "too much of a good thing". The quadratic complexity of self-attention in a vanilla Transformer is well known, and there has been much research on how to optimize attention down to a linear-complexity algorithm. The main approaches have been:

  • Efficient attention algorithms: faster, memory-efficient variants of the standard attention computation.
  • Removing some attention heads, called attention head pruning.
  • Approximate attention (e.g., local attention).
  • Alternative architectures, instead of attention (in next-gen architectures).
  • Code optimizations such as KV caching.
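
To make the quadratic cost concrete, here is a minimal C++ sketch of single-head scaled dot-product attention (no masking, batching, or weight projections), assuming Q, K, and V are stored as n rows of token vectors. The nested loops over all n queries and n keys are the O(n^2) bottleneck that the techniques below attack; this is illustrative code, not any particular framework's implementation.

    // Minimal sketch of single-head scaled dot-product attention:
    //   output = softmax(Q * K^T / sqrt(d)) * V
    // The double loop over queries and keys is the O(n^2) bottleneck.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<float>>;  // n rows of d floats

    Matrix scaled_dot_product_attention(const Matrix& Q, const Matrix& K, const Matrix& V) {
        const size_t n = Q.size();       // number of tokens
        const size_t d = Q[0].size();    // head dimension
        const size_t dv = V[0].size();   // value dimension
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));
        Matrix out(n, std::vector<float>(dv, 0.0f));

        for (size_t i = 0; i < n; ++i) {
            // Attention scores for query i against every key: Q[i] . K[j] / sqrt(d)
            std::vector<float> scores(n);
            for (size_t j = 0; j < n; ++j) {
                float dot = 0.0f;
                for (size_t k = 0; k < d; ++k) dot += Q[i][k] * K[j][k];
                scores[j] = dot * scale;
            }
            // Softmax over the scores (subtract the max for numerical stability).
            const float maxs = *std::max_element(scores.begin(), scores.end());
            float sum = 0.0f;
            for (float& s : scores) { s = std::exp(s - maxs); sum += s; }
            for (float& s : scores) s /= sum;
            // Output row i is the attention-weighted sum of the value vectors.
            for (size_t j = 0; j < n; ++j)
                for (size_t k = 0; k < dv; ++k)
                    out[i][k] += scores[j] * V[j][k];
        }
        return out;
    }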

Types of Attention Algorithms

Some of the main classes of attention algorithms include:

  • Dot product attention
  • Scaled dot product attention
  • Self-attention
  • Additive attention
  • Multiplicative attention

Some of the more recent advancements:

  • Star attention
  • Flex attention

Types of Efficient Attention Algorithms

For the above types of attention, there has been much research on how to compute them faster. There are two basic strategies:

  • Memory-efficient attention — the same number of multiplications, but faster memory access patterns.
  • Approximate attention methods — avoiding some of the multiplications, which makes it an approximation.

Memory-efficient attention algorithms that focus on improved performance via access patterns, such as Flash Attention and Paged Attention, are covered in their own sections further below.

The default method of computing attention for every token, called "global attention" or "full attention," is quadratic in complexity in terms of the input token length. Some of the ways to speed things up, aiming for linear complexity by avoiding some of the computations, include the following (a small sketch of local attention appears after the list):

  • Sparse attention — using sparse matrices for attention.
  • Local attention — only computing attention weightings for nearby tokens.
  • Linear attention — several methods of reducing attention computations to linear complexity (rather than quadratic).
  • Tree attention — hierarchical tree-based organization of attention.
  • Low-rank matrices — factorizing to smaller matrices that are then used for attention.
  • Approximate attention — various other types of approximate arithmetic.
  • Hybrid local-global attention — e.g., alternating layers of local/linear and global attention.
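
As an illustration of the local attention idea from the list above, here is a small C++ sketch of causal sliding-window attention, where each query only attends to the previous W tokens (W is a hypothetical window-size parameter), giving O(n*W) work instead of O(n^2). Real implementations differ in masking, dilation, and vectorization details; this is only a sketch of the idea.

    // Minimal sketch of causal sliding-window (local) attention.
    // Each query attends only to the previous W tokens (including itself),
    // so the total work is O(n*W) rather than O(n^2).
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<float>>;

    Matrix local_attention(const Matrix& Q, const Matrix& K, const Matrix& V, size_t W) {
        const size_t n = Q.size();
        const size_t d = Q[0].size();
        const size_t dv = V[0].size();
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));
        Matrix out(n, std::vector<float>(dv, 0.0f));

        for (size_t i = 0; i < n; ++i) {
            // Window of keys: positions [start, i], at most W of them.
            const size_t start = (i + 1 > W) ? (i + 1 - W) : 0;
            std::vector<float> scores(i - start + 1);
            for (size_t j = start; j <= i; ++j) {
                float dot = 0.0f;
                for (size_t k = 0; k < d; ++k) dot += Q[i][k] * K[j][k];
                scores[j - start] = dot * scale;
            }
            const float maxs = *std::max_element(scores.begin(), scores.end());
            float sum = 0.0f;
            for (float& s : scores) { s = std::exp(s - maxs); sum += s; }
            for (float& s : scores) s /= sum;
            for (size_t j = start; j <= i; ++j)
                for (size_t k = 0; k < dv; ++k)
                    out[i][k] += scores[j - start] * V[j][k];
        }
        return out;
    }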

KV Caching Optimizations

A lot of the computation in the attention heads can be avoided using caching of the KV data, especially in the prefill phase of computation. Several types of KV caching are used in LLMs; a basic sketch of the core idea is shown below.
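
As a rough illustration, here is a minimal C++ sketch of a per-layer, per-head KV cache used in autoregressive decoding: the keys and values computed for earlier tokens (during prefill or previous decoding steps) are stored and reused, so each new token only computes its attention against cached K/V rather than recomputing them. The names and structure here are illustrative only, not any particular engine's data layout.

    // Minimal sketch of a per-layer, per-head KV cache for autoregressive decoding.
    // K and V vectors of already-processed tokens are stored once and reused.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<float>;

    struct KVCache {
        std::vector<Vec> keys;    // one cached key vector per token
        std::vector<Vec> values;  // one cached value vector per token
    };

    // Attention for a single new token: append its K/V, then attend over the cache.
    Vec attend_with_cache(KVCache& cache, const Vec& q, const Vec& k_new, const Vec& v_new) {
        cache.keys.push_back(k_new);      // cache this token's key for later steps
        cache.values.push_back(v_new);    // cache this token's value for later steps
        const size_t n = cache.keys.size();
        const size_t d = q.size();
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));

        std::vector<float> scores(n);
        for (size_t j = 0; j < n; ++j) {
            float dot = 0.0f;
            for (size_t k = 0; k < d; ++k) dot += q[k] * cache.keys[j][k];
            scores[j] = dot * scale;
        }
        const float maxs = *std::max_element(scores.begin(), scores.end());
        float sum = 0.0f;
        for (float& s : scores) { s = std::exp(s - maxs); sum += s; }
        for (float& s : scores) s /= sum;

        Vec out(cache.values[0].size(), 0.0f);
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < out.size(); ++k)
                out[k] += scores[j] * cache.values[j][k];
        return out;
    }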

Token Pruning Research

One way to pay attention to fewer tokens is to get rid of some of them, which is called "token pruning." For text inputs, this is like deleting some unimportant words; for image or vision transformers, it is akin to ignoring unimportant parts of the input image. This is an active research area; a toy sketch of the basic idea is shown below.
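
As a toy illustration (not any specific paper's method), here is a C++ sketch of score-based token pruning: given an importance score per token, only the top-k tokens are kept, and the rest are dropped before attention is computed over the shorter sequence. How the importance scores are obtained (attention weights, a small scoring head, etc.) varies between methods.

    // Toy sketch of score-based token pruning: keep the k most important tokens.
    // Importance scores might come from attention weights or a small scoring head.
    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Returns the indices of the k highest-scoring tokens, in original order.
    std::vector<size_t> prune_tokens(const std::vector<float>& importance, size_t k) {
        std::vector<size_t> idx(importance.size());
        std::iota(idx.begin(), idx.end(), 0);
        if (k >= idx.size()) return idx;  // nothing to prune
        // Partially sort so the k largest scores come first.
        std::nth_element(idx.begin(), idx.begin() + k, idx.end(),
                         [&](size_t a, size_t b) { return importance[a] > importance[b]; });
        idx.resize(k);
        std::sort(idx.begin(), idx.end());  // keep surviving tokens in sequence order
        return idx;
    }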

Milestone Research Papers on Attention Algorithms

Various major papers on generative AI attention include:

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, Attention is all you need, 2017, arXiv preprint arXiv:1706.03762. https://arxiv.org/abs/1706.03762 (This is where the trouble all started...)
  • Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit, Sep 2016, A Decomposable Attention Model for Natural Language Inference, https://arxiv.org/abs/1606.01933 ("Decomposable attention" was the precursor to self-attention.)
  • Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai, 23 Dec 2023 (v3), GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, https://arxiv.org/abs/2305.13245 (Group Query Attention)
  • Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150 (Multi-Query Attention)
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
  • Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
  • Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, Oct 2023, Efficient Memory Management for Large Language Model Serving with PagedAttention, SOSP ’23, October 23–26, 2023, Koblenz, Germany, https://dl.acm.org/doi/pdf/10.1145/3600006.3613165 (Paged Attention paper.)
  • Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Tri Dao, September 11, 2023, Medusa: Simple framework for accelerating LLM generation with multiple decoding heads, https://www.together.ai/blog/medusa (Medusa attention paper.)
  • Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention

Memory-Efficient Attention Algorithms

Memory-efficient attention algorithms are an inference optimization that improves the QKV matrix computations by using better memory access patterns. Memory bottlenecks can be reduced by avoiding repeated accesses to the same data and by reducing the amount of temporary data, typically by changing the access order and structure of the computation. The alternative is the class of "linear" or "local" attention algorithms that simply access fewer weights during the computations.

Types of memory-optimized attention include:

  • Multi-Query Attention
  • Group Query Attention
  • Flash attention
  • Paged attention

Multi-Head Attention (MHA)

Multi-head attention is the parallelization of attention introduced in the original 2017 Transformer paper. It's the concept that created multiple "attention heads" rather than one big attention module.

Research papers on MHA:

Multi-Query Attention (MQA)

Multi-Query Attention is an optimization to the attention algorithm that focuses on reducing the cost of the KV cache data. Both the computation cost and the memory storage cost are improved by better handling of the KV cache data.

Research papers on MQA:

Group-Query Attention (GQA)

Group-Query Attention (GQA) extends MQA, further optimizing the handling of the KV cache data in the attention heads of a Transformer architecture.
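
A minimal C++ sketch of the head-sharing arithmetic, assuming H query heads and G key/value heads (the config names here are hypothetical): each group of H/G query heads reads the same cached K/V head, so the KV cache shrinks by a factor of H/G. MQA is the special case G = 1, and standard multi-head attention is G = H.

    // Minimal sketch of grouped-query attention (GQA) head mapping.
    // H query heads share G key/value heads (assumes H % G == 0).
    #include <cstddef>

    struct GQAConfig {
        size_t num_query_heads;  // H
        size_t num_kv_heads;     // G, divides H
    };

    // Which K/V head a given query head reads from.
    size_t kv_head_for_query_head(const GQAConfig& cfg, size_t query_head) {
        const size_t group_size = cfg.num_query_heads / cfg.num_kv_heads;
        return query_head / group_size;
    }

    // KV cache memory per token (in floats) for one layer: only G heads are stored.
    size_t kv_cache_floats_per_token(const GQAConfig& cfg, size_t head_dim) {
        return 2 * cfg.num_kv_heads * head_dim;  // K and V
    }

For example, with H = 32 query heads, G = 8 KV heads, and a head dimension of 128, the KV cache stores 2*8*128 floats per token per layer instead of 2*32*128, a 4x reduction.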

Research papers on GQA:

Linear Attention

Linear attention is a group of LLM inference optimizations that reduce the number of attention matrix computations. These are approximate methods that perform fewer arithmetic multiplication operations. Common types include local attention and sliding window attention.

Linear attention is not necessarily a particular type of attention optimization, although the term is sometimes used to refer to local attention. Rather, linear attention is the general class of optimizations that reduce attention from quadratic complexity in terms of input prompt length down to linear complexity.

Some examples of linear attention algorithms (a kernel-based sketch follows this list):

  • Local attention
  • Sliding window attention
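
For a flavor of how linear complexity can be achieved without a window, here is a C++ sketch of causal kernelized linear attention in the style of "Transformers are RNNs" (Katharopoulos et al., 2020, cited below): a feature map phi is applied to queries and keys, and running sums over phi(k) v^T and phi(k) are maintained, so the explicit n x n attention matrix is never formed. This is a simplified illustration under those assumptions, not the authors' exact code.

    // Sketch of causal kernelized linear attention (Katharopoulos et al., 2020 style).
    // Running sums S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j) replace the
    // explicit n x n attention matrix, giving O(n * d * dv) total work.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Vec = std::vector<float>;
    using Matrix = std::vector<Vec>;

    static float phi(float x) { return x > 0.0f ? x + 1.0f : std::exp(x); }  // elu(x) + 1

    Matrix linear_attention(const Matrix& Q, const Matrix& K, const Matrix& V) {
        const size_t n = Q.size();
        const size_t d = Q[0].size();
        const size_t dv = V[0].size();
        Matrix out(n, Vec(dv, 0.0f));
        Matrix S(d, Vec(dv, 0.0f));  // running sum of phi(k_j) * v_j^T
        Vec z(d, 0.0f);              // running sum of phi(k_j)

        for (size_t i = 0; i < n; ++i) {
            // Update the running state with token i's key and value (causal).
            for (size_t a = 0; a < d; ++a) {
                const float pk = phi(K[i][a]);
                z[a] += pk;
                for (size_t b = 0; b < dv; ++b) S[a][b] += pk * V[i][b];
            }
            // Output for token i: (phi(q_i)^T S) / (phi(q_i)^T z)
            float denom = 1e-6f;  // small epsilon avoids division by zero
            for (size_t a = 0; a < d; ++a) denom += phi(Q[i][a]) * z[a];
            for (size_t b = 0; b < dv; ++b) {
                float num = 0.0f;
                for (size_t a = 0; a < d; ++a) num += phi(Q[i][a]) * S[a][b];
                out[i][b] = num / denom;
            }
        }
        return out;
    }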

Research papers on linear attention methods:

Flash Attention

Flash attention is a type of memory-efficient attention algorithm that improves the speed of LLM inference. Flash attention is a very successful optimization and has been implemented in several inference frameworks. It optimizes attention by paying attention, haha, to the memory access cost of the various QKV operations, thereby reducing the overall memory cost, which in turn reduces the time cost of computing attention.
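
The core numerical trick can be illustrated with a simplified C++ sketch of online (streaming) softmax for a single query: keys and values are processed in blocks, and a running maximum and normalizer are corrected as each block arrives, so the full score vector never needs to be held in memory at once, yet the final result equals ordinary softmax attention. The real Flash Attention kernels add tiling across queries, GPU shared-memory management, and backward-pass recomputation; this sketch only shows the idea.

    // Simplified sketch of the online-softmax accumulation used by
    // FlashAttention-style kernels, for a single query over blocked keys/values.
    // The running max m, normalizer l, and output accumulator are rescaled as
    // each block of keys is processed, so scores are never all stored at once.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <limits>
    #include <vector>

    using Vec = std::vector<float>;
    using Matrix = std::vector<Vec>;

    Vec flash_style_attention_one_query(const Vec& q, const Matrix& K, const Matrix& V,
                                        size_t block_size) {  // assumes block_size > 0
        const size_t n = K.size();
        const size_t d = q.size();
        const size_t dv = V[0].size();
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));

        float m = -std::numeric_limits<float>::infinity();  // running max score
        float l = 0.0f;                                      // running softmax normalizer
        Vec acc(dv, 0.0f);                                   // running (unnormalized) output

        for (size_t start = 0; start < n; start += block_size) {
            const size_t end = std::min(start + block_size, n);
            // Scores for this block of keys.
            Vec scores(end - start);
            float block_max = -std::numeric_limits<float>::infinity();
            for (size_t j = start; j < end; ++j) {
                float dot = 0.0f;
                for (size_t k = 0; k < d; ++k) dot += q[k] * K[j][k];
                scores[j - start] = dot * scale;
                block_max = std::max(block_max, scores[j - start]);
            }
            // Rescale previous accumulators to the new running maximum.
            const float m_new = std::max(m, block_max);
            const float correction = std::exp(m - m_new);
            l *= correction;
            for (float& a : acc) a *= correction;
            // Accumulate this block's contribution.
            for (size_t j = start; j < end; ++j) {
                const float w = std::exp(scores[j - start] - m_new);
                l += w;
                for (size_t k = 0; k < dv; ++k) acc[k] += w * V[j][k];
            }
            m = m_new;
        }
        for (float& a : acc) a /= l;  // final normalization
        return acc;
    }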

Research papers on Flash Attention:

Paged Attention

Paged attention is a memory-efficient attention algorithm for LLM inference. It reduces the memory cost, and thereby the overall cost, of managing the attention data (the KV cache) during LLM inference.

Paged attention is another successful optimization of attention algorithms, also focusing on memory costs. By mimicking the concept of "paging" from operating system memory management, the KV cache is stored in fixed-size blocks rather than one large contiguous region, which greatly reduces memory waste and allows more requests to be served at once.
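
A rough C++ sketch of the bookkeeping idea (not the vLLM implementation itself, and names such as kBlockTokens are purely illustrative): the KV cache is carved into fixed-size physical blocks drawn from a shared pool, and each sequence keeps a block table mapping its logical token positions to those physical blocks, much like a virtual-memory page table.

    // Toy sketch of paged KV-cache bookkeeping: fixed-size blocks from a shared
    // pool, with a per-sequence block table mapping logical positions to blocks.
    #include <cstddef>
    #include <stdexcept>
    #include <utility>
    #include <vector>

    constexpr size_t kBlockTokens = 16;  // tokens per KV block (illustrative value)

    struct BlockPool {
        std::vector<int> free_list;  // indices of free physical blocks
        explicit BlockPool(size_t num_blocks) {
            for (size_t i = 0; i < num_blocks; ++i) free_list.push_back(static_cast<int>(i));
        }
        int allocate() {
            if (free_list.empty()) throw std::runtime_error("KV cache pool exhausted");
            int b = free_list.back();
            free_list.pop_back();
            return b;
        }
    };

    struct SequenceKVCache {
        std::vector<int> block_table;  // logical block index -> physical block id
        size_t num_tokens = 0;

        // Reserve space for one more token, allocating a new block only when needed.
        void append_token(BlockPool& pool) {
            if (num_tokens % kBlockTokens == 0) block_table.push_back(pool.allocate());
            ++num_tokens;
        }
        // Translate a token position into (physical block, offset within block).
        std::pair<int, size_t> locate(size_t token_pos) const {
            return { block_table[token_pos / kBlockTokens], token_pos % kBlockTokens };
        }
    };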

Medusa Attention

Medusa is an optimization to LLM inference that adds multiple extra decoding heads to the model, so that several future tokens can be predicted in parallel and then verified. This reduces the number of sequential decoding steps and thereby speeds up generation.

Research papers on Medusa attention:

Attention-Free LLM Inference

Attention-free optimizations are LLM modifications that avoid using the classic attention module. There are various alternatives to the attention computation that have been tried, but none are widely used.

Since computing attention is so expensive, maybe we can just avoid it? That's the idea in some "attention-free" methods:

  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418

Graph Attention

Graph attention is the use of graph data structures in attention computations. It generalizes classic LLM self-attention, which operates over flat sequences of tokens without any explicit structure, to inputs with graph structure.

The idea of integrating graph representations of knowledge into attention has received some research attention:

  • Lena Sasal, Daniel Busby, Abdenour Hadid, 29 Aug 2024, TempoKGAT: A Novel Graph Attention Network Approach for Temporal Graph Analysis, https://arxiv.org/abs/2408.16391
  • Akshay Kolli, Reza Azadeh, Kshitj Jerath, 27 Aug 2024, Graph Attention Inference of Network Topology in Multi-Agent Systems, https://arxiv.org/abs/2408.15449

Bilinear Attention

Bilinear attention is a variant of LLM attention that scores the interaction between two inputs using a bilinear form (e.g., a score of the form q^T W k), which can capture richer pairwise interactions. This can lead to increased accuracy in LLM inference, but it is not widely used.

A novel bilinear approach for attention:

  • Shaojian Qiu, Huihao Huang, Jianxiang Luo, Yingjie Kuang, Haoyu Luo, 11 Feb 2024, BAFLineDP: Code Bilinear Attention Fusion Framework for Line-Level Defect Prediction, https://arxiv.org/pdf/2402.07132
  • Philipp Froehlich, Heinz Koeppl, 13 Feb 2024 (v2), Graph Structure Inference with BAM: Introducing the Bilinear Attention Mechanism, https://arxiv.org/abs/2402.07735

Attention Steering

Attention steering is a method to "steer" or focus the LLM attention algorithm onto a particular subset of the tokens. This aims for more accurate and faster attention computations. Research papers on attention steering:

Attention Sink

The "attention sink" is the surprising fact that LLMs tend to place a lot of attention on the very first input token (e.g., the first word in a sentence). This effect is not programmed into LLMs, but arises by accident in the way that LLMs are trained. Sometimes this is correct, but it is also often inappropriate to place so much weight on the first token. For example, the first word of the user's query can be important, but the first token in an LLM input is often part of a global instruction prompt or a RAG chunk, so this effect is undesirable in such cases. Various papers examine this weird effect, and try to mitigate it.

Chunked Attention

Chunked attention is the idea of optimizing LLM inference by splitting up attention computations into fixed-size blocks. The attention computation can perform computations at a block level, and can avoid some computations involving entire blocks. This is similar to "chunked prefill" and "batched inference" ideas, in terms of simplifying the scheduling of the computations required.

Mixture-of-Heads (MOH) Attention (or MoA)

MOH is a combination of Mixture-of-Experts (MoE) architecture with the standard Multi-Head Attention (MHA). i.e., MHA + MoE = MOH! It's also been called Mixture-of-Attention (MoA) in some papers.

Block Attention

Block attention is a granular optimization of LLM attention algorithms to the block level. This allows more efficient QKV tensor calculations, and can also involve pruning entire blocks of computations, especially in "block-sparse" attention optimizations.

Block attention is somewhat similar to chunked attention, in that it focuses on blocks within the attention computation. It is also related to block-level quantization and other similar block optimizations.
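
As an illustration of the block-sparse flavor, here is a simplified C++ sketch in which the score matrix is partitioned into fixed-size tiles and a boolean block mask decides which tiles are computed at all; masked-out tiles are skipped entirely, saving whole blocks of multiplications. How the block mask is chosen varies between methods and is simply an input here; the softmax and value multiply are omitted for brevity.

    // Simplified sketch of block-sparse attention scores: the n x n score matrix
    // is tiled into B x B blocks, and whole tiles are skipped when the block mask
    // is false. The mask has ceil(n/B) rows and columns.
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<float>>;
    using BlockMask = std::vector<std::vector<bool>>;  // [query_block][key_block]

    Matrix block_sparse_scores(const Matrix& Q, const Matrix& K,
                               const BlockMask& mask, size_t B) {
        const size_t n = Q.size();
        const size_t d = Q[0].size();
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));
        Matrix scores(n, std::vector<float>(n, -1e30f));  // skipped tiles stay near -inf

        for (size_t qb = 0; qb * B < n; ++qb) {
            for (size_t kb = 0; kb * B < n; ++kb) {
                if (!mask[qb][kb]) continue;  // prune this whole tile of work
                const size_t q_end = std::min((qb + 1) * B, n);
                const size_t k_end = std::min((kb + 1) * B, n);
                for (size_t i = qb * B; i < q_end; ++i)
                    for (size_t j = kb * B; j < k_end; ++j) {
                        float dot = 0.0f;
                        for (size_t k = 0; k < d; ++k) dot += Q[i][k] * K[j][k];
                        scores[i][j] = dot * scale;
                    }
            }
        }
        return scores;
    }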

Research on block attention methods:

Low-Rank Matrix Attention

Low-rank matrix attention is the use of low-rank matrix factorization, or tensor decomposition, for faster computation of LLM attention. The method is faster because there are fewer computations, but it is therefore also an approximation with potential accuracy loss. Low-rank factorization is one of the main ways to approximate attention down to linear complexity.
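
A simplified C++ sketch of the Linformer-style idea (Wang et al., cited in the list below): keys and values are projected down from n rows to a fixed r rows with a projection E, so the score matrix is n x r rather than n x n, making the cost linear in the sequence length. The projection matrix E is just an input here; Linformer learns it during training, and other low-rank methods factorize differently.

    // Simplified sketch of low-rank (Linformer-style) attention: project the n
    // keys and values down to r rows with a projection E (r x n), then run
    // ordinary attention against only r "summary" keys, costing O(n*r) not O(n^2).
    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Matrix = std::vector<std::vector<float>>;

    static Matrix project(const Matrix& E, const Matrix& X) {  // (r x n) * (n x d) -> (r x d)
        const size_t r = E.size(), n = X.size(), d = X[0].size();
        Matrix Y(r, std::vector<float>(d, 0.0f));
        for (size_t i = 0; i < r; ++i)
            for (size_t j = 0; j < n; ++j)
                for (size_t k = 0; k < d; ++k) Y[i][k] += E[i][j] * X[j][k];
        return Y;
    }

    Matrix low_rank_attention(const Matrix& Q, const Matrix& K, const Matrix& V, const Matrix& E) {
        const Matrix Kp = project(E, K);  // r x d  compressed keys
        const Matrix Vp = project(E, V);  // r x dv compressed values
        const size_t n = Q.size(), d = Q[0].size(), r = Kp.size(), dv = Vp[0].size();
        const float scale = 1.0f / std::sqrt(static_cast<float>(d));
        Matrix out(n, std::vector<float>(dv, 0.0f));

        for (size_t i = 0; i < n; ++i) {
            std::vector<float> scores(r);
            for (size_t j = 0; j < r; ++j) {
                float dot = 0.0f;
                for (size_t k = 0; k < d; ++k) dot += Q[i][k] * Kp[j][k];
                scores[j] = dot * scale;
            }
            const float maxs = *std::max_element(scores.begin(), scores.end());
            float sum = 0.0f;
            for (float& s : scores) { s = std::exp(s - maxs); sum += s; }
            for (float& s : scores) s /= sum;
            for (size_t j = 0; j < r; ++j)
                for (size_t k = 0; k < dv; ++k)
                    out[i][k] += scores[j] * Vp[j][k];
        }
        return out;
    }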

Research papers on the use of low-rank matrix factorization and tensor decomposition for attention:

  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin, 8 May 2024, Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, https://arxiv.org/abs/2405.05219 (Attention optimization using multiple low-rank matrices.)
  • Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou, 23 Aug 2024, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 (Training using low-rank matrices to approximate attention.)
  • Josh Alman, Zhao Song, 9 May 2023 (v2), Fast Attention Requires Bounded Entries, https://arxiv.org/abs/2302.13214 (Low-rank matrices in attention for fast inference.)
  • Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
  • Yue Niu, Saurav Prakash, Salman Avestimehr, 1 Mar 2024, ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys, https://arxiv.org/abs/2403.02352
  • S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” CoRR, vol. abs/2006.04768, 2020. https://arxiv.org/abs/2006.04768 (Low-rank approximation of attention.)
  • Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020. https://arxiv.org/abs/2006.16236 (Linear O(N) attention algorithm based on low-rank factorization.)
  • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
  • Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Low-rank matrix attention allows up to 100k context windows.)
  • Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
  • Sneha Mehta, Huzefa Rangwala, Naren Ramakrishnan, 10 Aug 2020 (v2), Low Rank Factorization for Compact Multi-Head Self-Attention, https://arxiv.org/abs/1912.00835
  • Bosheng Qin, Juncheng Li, Siliang Tang, Yueting Zhuang, 24 Nov 2022, DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention, https://arxiv.org/abs/2211.16368

General Research Papers on Attention Optimization

Papers that specifically aim to optimize attention include:

  • Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020. https://arxiv.org/abs/2009.06732 (Broad survey of Transformer efficiency that contains details about attention and fixing its quadratic cost.)
  • Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
  • Jianpeng Cheng, Li Dong, Mirella Lapata, Sep 2016, Long Short-Term Memory-Networks for Machine Reading, https://arxiv.org/abs/1601.06733
  • Minh-Thang Luong, Hieu Pham, Christopher D. Manning, Sep 2016, Effective Approaches to Attention-based Neural Machine Translation, https://arxiv.org/abs/1508.04025
  • Alex Graves, Greg Wayne, Ivo Danihelka, Dec 2014, Neural Turing Machines, https://arxiv.org/abs/1410.5401
  • Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. In The International Conference on Machine Learning (ICML), 2020, https://arxiv.org/abs/2001.04451 (Sparse attention with locality-sensitive hashing.)
  • Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021, https://arxiv.org/abs/2003.05997 (Sparse attention.)
  • Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning, pages 5156–5165. PMLR, 2020. https://arxiv.org/abs/2006.16236 (Linear O(N) attention algorithm based on low-rank factorization.)
  • Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020. https://arxiv.org/abs/2006.04768 (Low-rank matrix O(N) linear complexity method.)
  • Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020. https://arxiv.org/abs/2007.14062 (Sparse linear attention algorithm.)
  • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
  • Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller, Rethinking Attention with Performers, In International Conference on Learning Representations (ICLR), 2020. https://arxiv.org/abs/2009.14794 (Novel method for linear attention complexity.)
  • Iz Beltagy, Matthew E Peters, Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. https://arxiv.org/abs/2004.05150 (Linear attention over long context sequences.)
  • Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. Long range arena: A benchmark for efficient transformers. In International Conference on Learning Representations, 2020. https://arxiv.org/abs/2011.04006, Code: https://github.com/google-research/long-range-arena (Evaluate models over long context sequences.)
  • Giannis Daras, Nikita Kitaev, Augustus Odena, and Alexandros G Dimakis. SMYRF: Efficient Attention using Asymmetric Clustering. Advances in Neural Information Processing Systems, 33:6476–6489, 2020. https://arxiv.org/abs/2010.05315 (Approximation of attention using locality-sensitive hashing.)
  • Yang Li, Liangzhen Lai, Yuan Shangguan, Forrest N. Iandola, Ernie Chang, Yangyang Shi, Vikas Chandra, Sep 2023, Folding Attention: Memory and Power Optimization for On-Device Transformer-based Streaming Speech Recognition, https://arxiv.org/abs/2309.07988
  • X Wang, P Guo, Y Zhang, 2023, Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer, ECML PKDD 2023: Machine Learning and Knowledge Discovery in Databases: Research Track pp 309–325, https://arxiv.org/abs/2201.05887
  • Shiyi Zhu, Jing Ye, Wei Jiang, Qi Zhang, Yifan Wu, Jianguo Li, Sep 2023, Cure the headache of Transformers via Collinear Constrained Attention, arXiv preprint arXiv:2309.08646, https://arxiv.org/abs/2309.08646
  • Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Low-rank matrix attention allows up to 100k context windows.)
  • Y Chen, Y Li, A Xu, Q Sun, X Chen, C Xu, 2023, WAG-NAT: Window Attention and Generator Based Non-Autoregressive Transformer for Time Series Forecasting, ICANN 2023: Artificial Neural Networks and Machine Learning, pp. 293–304, https://link.springer.com/chapter/10.1007/978-3-031-44223-0_24, Code: https://github.com/cybisolated/WAG-NAT
  • Together AI, Jul 28, 2023, Preparing for the era of 32K context: Early learnings and explorations, https://together.ai/blog/llama-2-7b-32k (Uses position interpolation and Flash Attention.)
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with sparse attention, linearized attention, and other attention optimization methods.)
  • Hao Liu, Matei Zaharia, Pieter Abbeel, Oct 2023, Ring Attention with Blockwise Transformers for Near-Infinite Context, https://arxiv.org/abs/2310.01889
  • H Lee, J Kim, J Willette, SJ Hwang, Oct 2023, SEA: Sparse Linear Attention with Estimated Attention Mask, arXiv preprint arXiv:2310.01777, https://browse.arxiv.org/pdf/2310.01777.pdf
  • Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed, Oct 2023, Mistral 7B, https://arxiv.org/abs/2310.06825, Code: https://mistral.ai/news/announcing-mistral-7b/ (Uses grouped-query attention and sliding window attention.)
  • G Xiao, Y Tian, B Chen, S Han, M Lewis, Sep 2023, Efficient Streaming Language Models with Attention Sinks, arXiv preprint arXiv:2309.17453, https://arxiv.org/abs/2309.17453
  • Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture used Multi-Query Attention.)
  • Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu, 19 Mar 2024, DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, https://arxiv.org/abs/2403.13163 (Optimizes a deblurring Transformer with self-attention improvements and other optimizations.)
  • Bohdan Ivaniuk-Skulskyi; Nadiya Shvai; Arcadi Llanza; Amir Nakib, 2024, Towards Lightweight Transformer Architecture: an Analysis on Semantic Segmentation, 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), https://ieeexplore.ieee.org/abstract/document/10467926 (Examines skip-attention and pool-unpool attention methods for faster inference.)
  • Jiing-Ping Wang, Ming-Guang Lin, An-Yeu (Andy) Wu, 11 Apr 2024, LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer, https://arxiv.org/abs/2404.07519 (Approximate 4-bit vector dot product used as an estimate in attention heads, with computation reuse to 8-bit dot products.)
  • Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz GUStavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Frietas, 11 Apr 2024, RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, Google Research, https://arxiv.org/abs/2404.07839 (Local attention.)
  • Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075 (Sparse attention)
  • Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, 10 Apr 2024, Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention, https://arxiv.org/abs/2404.07143
  • Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
  • Seokju Yun, Dongheon Lee, Youngmin Ro, 4 Jun 2024, MetaMixer Is All You Need, https://arxiv.org/abs/2406.02021
  • Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
  • Zhen Qin, Yuxin Mao, Xuyang Shen, Dong Li, Jing Zhang, Yuchao Dai, Yiran Zhong, 31 May 2024, You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet, https://arxiv.org/abs/2405.21022
  • Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan, 17 May 2024, Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers, https://arxiv.org/abs/2405.10480
  • Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
  • Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
  • Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar, 10 May 2024, Linearizing Large Language Models, https://arxiv.org/abs/2405.06640 Code: https://github.com/TRI-ML/linear_open_lm
  • Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar, 7 May 2024, vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention, https://arxiv.org/abs/2405.04437 (Further optimizes paged attention algorithm for KV caching in attention, by storing the KV cache in contiguous memory and using underlying system paging.)
  • Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
  • Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
  • Shashank Verma and Neal Vaidya, Nov 17, 2023 Mastering LLM Techniques: Inference Optimization, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/ (An overview that covers a lot of inference optimization techniques.)
  • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
  • Mackenzie Morehead, Apr 16, 2024, Is Attention All You Need? https://www.mackenziemorehead.com/is-attention-all-you-need/
  • Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Uses local attention versus global attention at different layers.)
  • Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti, Mar 2023, Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers, CVPR 2023, https://arxiv.org/abs/2303.13755 https://openaccess.thecvf.com/content/CVPR2023/papers/Wei_Sparsifiner_Learning_Sparse_Instance-Dependent_Attention_for_Efficient_Vision_Transformers_CVPR_2023_paper.pdf
  • Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu, 19 Mar 2024, DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, https://arxiv.org/abs/2403.13163v1 (Implements "feature fusion" which is analogous to kernel fusion.) (Optimizes a deblurring Transformer with self-attention improvements and other optimizations.)
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber, 14 Dec 2023, SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention, https://arxiv.org/abs/2312.07987 Code: https://github.com/robertcsordas/moe_attention
  • David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le, Jan 2022, Primer: Searching for Efficient Transformers for Language Modeling, https://arxiv.org/abs/2109.08668
  • Datadrifters, Nov 16, 2023, The World’s Fastest LLM Inference: 3x Faster Than vLLM and TGI, https://medium.com/@datadrifters/the-worlds-fastest-llm-inference-engine-3x-faster-than-vllm-and-tgi-a2ed9e33c55f
  • Galina Alperovich, May 16, 2023, The Secret Sauce behind 100K context window in LLMs: all tricks in one place, https://blog.gopenai.com/how-to-speed-up-llms-and-use-100k-context-window-all-tricks-in-one-place-ffd40577b4c
  • Hongyang Li, Yu Liu, Wanli Ouyang, and Xiaogang Wang. 2019. Zoom out-and-in network with map attention decision for region proposal and object detection. International Journal of Computer Vision 127, 3 (2019), 225–238. https://arxiv.org/abs/1709.04347 (Attention mechanism in image processing.)
  • Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2978–2988, 2019, https://arxiv.org/abs/1901.02860 Code: https://github.com/kimiyoung/transformer-xl
  • J Ainslie, T Lei, M de Jong, S Ontañón, 2023, Colt5: Faster long-range transformers with conditional computation, https://arxiv.org/abs/2303.09752
  • H Zheng, 2023, Visual Content Manipulation With Correspondence Learning and Representation Learning, Ph.D. thesis, Department of Computer Science Arts, Sciences and Engineering, Edmund A. Hajim School of Engineering and Applied Sciences, University of Rochester, https://www.proquest.com/openview/2587c48ef34e3f7146132a08cf70e88f/1
  • Johannes von Oswald, Eyvind Niklasson, Maximilian Schlegel, Seijin Kobayashi, Nicolas Zucchet, Nino Scherrer, Nolan Miller, Mark Sandler, Blaise Agüera y Arcas, Max Vladymyrov, Razvan Pascanu, João Sacramento, Sep 2023, Uncovering mesa-optimization algorithms in Transformers, https://arxiv.org/abs/2309.05858
  • Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. 2019. BP-transformer: Modelling long-range context via binary partitioning. arXiv preprint arXiv:1911.04070 https://arxiv.org/abs/1911.04070 Code: https://github.com/yzh119/BPT
  • Y Wang, Y Han, C Wang, S Song, Q Tian, G Huang, 2023, Computation-efficient Deep Learning for Computer Vision: A Survey, arXiv preprint arXiv:2308.13998, https://arxiv.org/abs/2308.13998
  • Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm, 10 Jun 2024, Symmetric Dot-Product Attention for Efficient Training of BERT Language Models, https://arxiv.org/abs/2406.06366
  • Piotr Kluska, Adri´an Castello, Florian Scheidegger, A. Cristiano I. Malossi, 2024, QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers https://openaccess.thecvf.com/content/CVPR2024W/eLVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf
  • Splash: Sparse Flash Attention, 2024, https://github.com/google/jax/blob/main/jax/experimental/pallas/ops/tpu/splash_attention/splash_attention_kernel.py
  • 1 Jun 2023, Faster Causal Attention Over Large Sequences Through Sparse Flash Attention, Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret, https://arxiv.org/abs/2306.01160
  • Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini, 3 Apr 2024, Optimizing the Deployment of Tiny Transformers on Low-Power MCUs, https://arxiv.org/abs/2404.02945 (Uses an approach called "Fused Weight Self-Attention" that fuses some of the QKV matrices and also tiling in multi-head attention, along with 8-bit integer quantization and integerized Softmax.)
  • Moein Heidari, Reza Azad, Sina Ghorbani Kolahi, René Arimond, Leon Niggemeier, Alaa Sulaiman, Afshin Bozorgpour, Ehsan Khodapanah Aghdam, Amirhossein Kazerouni, Ilker Hacihaliloglu, Dorit Merhof, 28 Mar 2024, Enhancing Efficiency in Vision Transformer Networks: Design Techniques and Insights, https://arxiv.org/abs/2403.19882 Project: https://github.com/mindflow-institue/Awesome-Attention-Mechanism-in-Medical-Imaging (Survey of optimization techniques for Vision Transformers, with particular focus on attention optimizations.)
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
  • Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
  • Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai, 18 Feb 2024, In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness, https://arxiv.org/abs/2402.11639
  • Rares Dolga, Marius Cobzarenco, David Barber, 4 Mar 2024 (v2), Latent Attention for Linear Time Transformers, https://arxiv.org/abs/2402.17512
  • Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W.H. Lau, 29 Feb 2024 (v2), RelayAttention for Efficient Large Language Model Serving with Long System Prompts, https://arxiv.org/abs/2402.14808 Code: https://github.com/rayleizhu/vllm-ra (A new attention mechanism called RelayAttention.)
  • Wu, R., Zhu, X., Chen, J. et al. 2024, SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer. J Supercomput (2024). https://doi.org/10.1007/s11227-024-05890-8, https://link.springer.com/article/10.1007/s11227-024-05890-8 (A modified version of Flash Attention on a supercomputer.)
  • Fireworks.ai, Jan 9, 2024, FireAttention — Serving Open Source Models 4x faster than vLLM by quantizing with ~no tradeoffs, https://blog.fireworks.ai/fireattention-serving-open-source-models-4x-faster-than-vllm-by-quantizing-with-no-tradeoffs-a29a85ad28d0
  • 8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2, HippoML Blog, Jan 2024, https://blog.hippoml.com/8bit-hippoattention-up-to-3x-faster-compared-to-flashattentionv2-8f9def90b482
  • Jungmin Yun, Mihyeon Kim, Youngbin Kim, 2023, Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification https://aclanthology.org/2023.findings-emnlp.909.pdf
  • Yuzhen Mao, Martin Ester, Ke Li, 2023, IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs, NeurIPS 2023, https://neurips2023-enlsp.github.io/papers/paper_24.pdf
  • NVIDIA, Developer Guide (CuDNN), https://docs.nvidia.com/deeplearning/cudnn/developer-guide/index.html PDF: https://docs.nvidia.com/deeplearning/cudnn/pdf/cuDNN-Developer-Guide.pdf
  • Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Hal Daumé III and Aarti Singh (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5156–5165. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/katharopoulos20a.html.
  • Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, and Adrian Weller. Rethinking attention with performers. In International Conference on Learning Representations, 2021. https://openreview.net/forum?id=Ua6zuk0WRH.
  • Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pre-training Approach. arXiv:1907.11692 [cs], July 2019. URL http://arxiv.org/abs/1907.11692.
  • Peter Izsak, Moshe Berchansky, and Omer Levy. How to Train BERT with an Academic Budget. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 10644–10652, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.831. https://aclanthology.org/2021.emnlp-main.831
  • Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992–10002, IEEE, 2021. https://arxiv.org/abs/2103.14030 Code: https://github.com/microsoft/Swin-Transformer
  • Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
  • Chen, Yilong ; Zhang, Linhao ; Shang, Junyuan ; Zhang, Zhenyu ; Liu, Tingwen ; Wang, Shuohuan ; Sun, Yu, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
  • Toan Q. Nguyen, Julian Salazar, 2019, Transformers without Tears: Improving the Normalization of Self-Attention, https://arxiv.org/abs/1910.05895
  • Yingruo Fan, Zhaojiang Lin, Jun Saito, Wenping Wang, Taku Komura, Mar 2022, FaceFormer: Speech-Driven 3D Facial Animation with Transformers, https://arxiv.org/abs/2112.05329
  • S Dai, H Genc, R Venkatesan, B Khailany, 2023 Efficient Transformer Inference with Statically Structured Sparse Attention, https://ieeexplore.ieee.org/abstract/document/10247993
  • J Du, J Jiang, J Zheng, H Zhang, D Huang, Y Lu, August 2023, Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs, ACM Transactions on Architecture and Code Optimization, https://dl.acm.org/doi/10.1145/3617689 PDF: https://dl.acm.org/doi/pdf/10.1145/3617689
  • William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, 15 Nov 2023, Striped Attention: Faster Ring Attention for Causal Transformers, https://arxiv.org/abs/2311.09431
  • David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • A Vyas, A Katharopoulos, 2020, Fast transformers with clustered attention, https://proceedings.neurips.cc/paper/2020/file/f6a8dd1c954c8506aadc764cc32b895e-Paper.pdf
  • SC Kao, S Subramanian, G Agrawal, 2023, FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks https://dl.acm.org/doi/pdf/10.1145/3575693.3575747
  • Deyi Xiong, Biao Zhang, and Jinsong Su. 2018. Accelerating neural Transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Long Papers, pages 1789–1798, Melbourne, Australia https://aclweb.org/anthology/P18-1166
  • Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin, 2019, Adaptive attention span in transformers. CoRR, abs/1905.07799, 2019, http://arxiv.org/abs/1905.07799.
  • Hanhwi Jang, Joonsung Kim, Jae-Eon Jo, Jaewon Lee, and Jangwoo Kim. 2019. MnnFast: A fast and scalable system architecture for memory-augmented neural networks. In Proceedings of the 46th International Symposium on Computer Architecture. 250–263. https://doi.org/10.1145/3307650.3322214
  • Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, Yuan Xie, 2022, DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration, ASPLOS '22, February 28 – March 4, 2022, Lausanne, Switzerland, PDF: https://dl.acm.org/doi/pdf/10.1145/3503222.3507738
  • J Ding, S Ma, L Dong, X Zhang, S Huang 2023, Longnet: Scaling transformers to 1,000,000,000 tokens, https://arxiv.org/abs/2307.02486
  • Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu, 30 May 2024, Parrot: Efficient Serving of LLM-based Applications with Semantic Variable, https://arxiv.org/abs/2405.19888 (Uses prefix KV caching and a combined flash attention and paged attention module.)
  • Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 29 May 2024 (v2), DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference, https://arxiv.org/abs/2404.00242 https://openreview.net/forum?id=HqfLHoX8bR
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Markus N. Rabe, Charles Staats, 10 Oct 2022 (v3), Self-attention Does Not Need O(n^2) Memory, https://arxiv.org/abs/2112.05682
  • Shwai He, Guoheng Sun, Zheyu Shen, Ang Li, 22 Jun 2024, What Matters in Transformers? Not All Attention is Needed, https://arxiv.org/abs/2406.15786 https://github.com/Shwai-He/LLM-Drop
  • Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
  • Ruiqing Yan, Xingbo Du, Haoyu Deng, Linghan Zheng, Qiuzhuang Sun, Jifang Hu, Yuhang Shao, Penghao Jiang, Jinrong Jiang, Lian Zhao, 3 Jul 2024 (v2), Unveiling and Controlling Anomalous Attention Distribution in Transformers, https://arxiv.org/abs/2407.01601 (Examination of why the very first token in a sequence always gets more attention than others, including the effect of positional encoding, and its impact on KV cache compression.)
  • Yuxin Liu; Wenxin Yu; Zhiqiang Zhang; Qi Wang; Lu Che, May 2024, Axial Attention Transformer for Fast High-quality Image Style Transfer, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 19-22 May 2024, https://ieeexplore.ieee.org/abstract/document/10558531
  • Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
  • Karl Stratos, 2024, Efficient Attention, https://karlstratos.com/notes/attention.pdf
  • Myung Beom Her, Jisu Jeong, Hojoon Song, Ji-Hyeong Han, 5 Jul 2024, Batch Transformer: Look for Attention in Batch, https://arxiv.org/abs/2407.04218
  • An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Zeyu Cui, Zhenru Zhang, Zhihao Fan, 15 Jul 2024, Qwen2 Technical Report, https://arxiv.org/abs/2407.10671
  • Sai Sena Chinnakonduru, Astarag Mohapatra, 15 Jul 2024, Weighted Grouped Query Attention in Transformers, https://arxiv.org/abs/2407.10855
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Junyoung Park, Myeonggu Kang, Yunki Han, Yanggon Kim, Jaekang Shin, Lee-Sup Kim, 21 Jul 2024, Token-Picker: Accelerating Attention in Text Generation with Minimized Memory Transfer via Probability Estimation, https://arxiv.org/abs/2407.15131
  • Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, Qinghao Hu, Guoteng Wang, Qizhen Weng, Hang Yan, Xingcheng Zhang, Xipeng Qiu, Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Peng Sun, 29 Jul 2024, Efficient Training of Large Language Models on Distributed Infrastructures: A Survey, https://arxiv.org/abs/2407.20018
  • Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
  • Team PyTorch: Horace He, Driss Guessous, Yanbo Liang, Joy Dong, August 07, 2024, FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention, https://pytorch.org/blog/flexattention/
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 28 Jul 2024 (v2), Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003 Project: https://github.com/zcli-charlie/Awesome-KV-Cache
  • Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy, 10 Aug 2024, Eigen Attention: Attention in Low-Rank Space for KV Cache Compression, https://arxiv.org/abs/2408.05646
  • Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li, 11 Aug 2024, Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, https://arxiv.org/abs/2408.05710 (Reduce the attention cost in diffusion models by what is effectively token merging between the Q and K data.)
  • Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Nils Graef, Aarush Gupta, 2024 (accessed), Approximate attention: infinite context length with constant complexity per token [work in progress], https://github.com/OpenMachine-ai/transformer-tricks/blob/main/approximate.pdf
  • Nils Graef, 18 Apr 2024, Transformer tricks: Removing weights for skipless transformers, https://arxiv.org/abs/2404.12362
  • David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
  • https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
  • Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin, 8 May 2024, Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, https://arxiv.org/abs/2405.05219 (Attention optimization using multiple low-rank matrices.)
  • Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
  • Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
  • Rundong Zuo, Guozhong Li, Rui Cao, Byron Choi, Jianliang Xu, and Sourav S Bhowmick. 2024. DARKER: Efficient Transformer with Data-Driven Attention Mechanism for Time Series. Proc. VLDB Endow. 17, 11 (July 2024), 3229–3242. https://doi.org/10.14778/3681954.3681996 https://dl.acm.org/doi/abs/10.14778/3681954.3681996
  • Du Cunxiao, 2024, Towards Faster Inference of Transformers: Strategies for Accelerating Decoding Processes, Ph.D. thesis, Computer Science, School of Computing and Information Systems, Singapore Management University, https://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1611&context=etd_coll (Examines non-autoregressive decoding, speculative decoding and attention optimizations.)
  • The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
  • DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et. al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
  • Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb, 6 Sep 2024, Theory, Analysis, and Best Practices for Sigmoid Self-Attention, https://arxiv.org/abs/2409.04431 https://github.com/apple/ml-sigmoid-attention
  • Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Bo Tang, Feiyu Xiong, Zhiyu Li, 5 Sep 2024, Attention Heads of Large Language Models: A Survey, https://arxiv.org/abs/2409.03752 https://github.com/IAAR-Shanghai/Awesome-Attention-Heads (This survey is about making attention mechanisms more performant, accurate and intelligent, rather than improving efficiency.)
  • Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu, 11 Sep 2024, Gated Slot Attention for Efficient Linear-Time Sequence Modeling, https://arxiv.org/abs/2409.07146 https://github.com/sustcsonglin/flash-linear-attention https://huggingface.co/fla-hub
  • Zhenyu Ning, Jieru Zhao, Qihao Jin, Wenchao Ding, Minyi Guo, 11 Sep 2024, Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU, https://arxiv.org/abs/2409.09086
  • Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 https://github.com/tonyzhang617/nomad-dist
  • Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li, 23 Sep 2024, A-VL: Adaptive Attention for Large Vision-Language Models, https://arxiv.org/abs/2409.14846 (Separate handling of text and image attention modules.)
  • Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 22 Sep 2024, EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models, https://arxiv.org/abs/2409.14595
  • Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
  • Ehsan Kabir, Md. Arafat Kabir, Austin R.J. Downey, Jason D. Bakos, David Andrews, Miaoqing Huang, 21 Sep 2024, FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs, https://arxiv.org/abs/2409.14023
  • Gabriel Mongaras, Trevor Dohm, Eric C. Larson, 27 Sep 2024, Cottention: Linear Transformers With Cosine Attention, https://arxiv.org/abs/2409.18747 (Nearest-neighbor to replace Softmax attention, for near-linear attention.)
  • Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
  • Jintao Zhang, Jia Wei, Pengle Zhang, Jun Zhu, Jianfei Chen, 3 Oct 2024, SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration, https://arxiv.org/abs/2410.02367 (Quantized attention outperforms Flash attention.)
  • Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci, 28 Sep 2024, Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models, https://arxiv.org/abs/2409.19315 (Analog hardware implementation of self-attention with sliding window attention.)
  • Yaniv Leviathan, Matan Kalman, Yossi Matias, 3 Oct 2024, Selective Attention Improves Transformer, https://arxiv.org/abs/2410.02703 (Allowing adjacent tokens to predict whether required for attention.)
  • Josh Alman, Hantao Yu, 5 Oct 2024, Fundamental Limitations on Subquadratic Alternatives to Transformers, https://arxiv.org/abs/2410.04271
  • Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin, 9 Oct 2024, Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions, https://arxiv.org/abs/2410.06577
  • Ravindran Kannan, Chiranjib Bhattacharyya, Praneeth Kacham, David P. Woodruff, 7 Oct 2024, LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions, https://arxiv.org/abs/2410.05462
  • Y Yang, TT Yang, Oct 2024, DAA: Dynamic attention allocation improves large-scale model reasoning, https://doi.org/10.21203/rs.3.rs-5025111/v1 https://assets-eu.researchsquare.com/files/rs-5025111/v1_covered_81d99ea9-82e8-494a-81e0-2ee73cbf1249.pdf?c=1728617590
  • Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han, 14 Oct 2024, DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads, https://arxiv.org/abs/2410.10819 https://github.com/mit-han-lab/duo-attention
  • Bettayeb, M., Halawani, Y., Khan, M.U. et al. Efficient memristor accelerator for transformer self-attention functionality. Sci Rep 14, 24173 (2024). https://doi.org/10.1038/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z.pdf
  • Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou, 15 Oct 2024, Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix, https://arxiv.org/abs/2410.11261
  • Anonymous authors, Oct 2024, Dense Attention: No-Compromise Exact All NxN Interactions Algorithm with O(N) Space and Time Complexity, https://openreview.net/pdf?id=2bIQBDSfRk
  • Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu, 22 Oct 2024, FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs, https://arxiv.org/abs/2410.16663
  • Shawn Tan, Yikang Shen, Songlin Yang, Aaron Courville, Rameswar Panda, 23 Oct 2024, Stick-breaking Attention, https://arxiv.org/abs/2410.17980
  • Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan, 23 Oct 2024, Value Residual Learning For Alleviating Attention Concentration In Transformers, https://arxiv.org/abs/2410.17897
  • Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019, Augmenting self-attention with persistent memory. CoRR, abs/1907.01470, 2019. http://arxiv.org/abs/1907.01470
  • Dr. Ashish Bamania, Oct 27, 2024, Amazing Things Happen When Attention Heads Are Supercharged Using Mixture-Of-Experts: A deep dive into how the Attention mechanism works and how it is being enhanced by the Mixture-of-Experts architecture, resulting in Mixture-of-Head Attention (MoH) that makes our existing LLMs more efficient than ever. https://levelup.gitconnected.com/amazing-things-happen-when-attention-heads-are-supercharged-using-mixture-of-experts-b55a6b9a0ac8
  • Themistoklis Haris, 6 Nov 2024, kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers, https://arxiv.org/abs/2411.04013
  • Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
  • Ampazis, Nicholas, and Flora Sakketou. 2024. "Diversifying Multi-Head Attention in the Transformer Model" Machine Learning and Knowledge Extraction 6, no. 4: 2618-2638. https://doi.org/10.3390/make6040126 https://www.mdpi.com/2504-4990/6/4/126
  • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami, 14 Nov 2024, Squeezed Attention: Accelerating Long Context Length LLM Inference, https://arxiv.org/abs/2411.09688 https://github.com/SqueezeAILab/SqueezedAttention (This is like a combination of semantic caching and prefix KV caching, and close to fused KV caching.)
  • Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, Jianfei Chen, 17 Nov 2024, SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration, https://arxiv.org/abs/2411.10958 https://github.com/thu-ml/SageAttention
  • Zhuohan Gu, Jiayi Yao, Kuntai Du, Junchen Jiang, 21 Nov 2024 (v2), LLMSteer: Improving Long-Context LLM Inference by Steering Attention on Reused Contexts, https://arxiv.org/abs/2411.13009
  • Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in Vision: A Survey. arXiv:2101.01169 https://arxiv.org/abs/2101.01169 (Survey of Vision transformers in 2021.)
  • Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
  • Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu, 20 Nov 2024, MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices, https://arxiv.org/abs/2411.17720
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv, 28 Nov 2024, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs, NVIDIA Research, https://arxiv.org/abs/2411.19146 (This is dynamic NAS on a vast scale in a search space of size 10^138, because the optimization is applied with low granularity to each block in attention and FFN subcomponents of each layer.)
  • LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20
  • Benjamin Marie, Aug 13, 2024, FlexAttention: A Flexible Pytorch API for Implementing Attention Optimizations. It’s going to be easier to optimize attention computation. https://medium.com/@bnjmn_marie/flexattention-a-flexible-pytorch-api-for-implementing-attention-optimizations-bc4fae65eb9d
  • Ginés Carreto Picón, Illia Oleksiienko, Lukas Hedegaard, Arian Bakhtiarnia, Alexandros Iosifidis, 5 Dec 2024 (v2), Continual Low-Rank Scaled Dot-product Attention, https://arxiv.org/abs/2412.03214
  • Rongxiang Wang and Felix Xiaozhu Lin. 2024. Turbocharge Speech Understanding with Pilot Inference. In Proceedings of the 30th Annual International Conference on Mobile Computing and Networking (ACM MobiCom '24). Association for Computing Machinery, New York, NY, USA, 1299–1313. https://doi.org/10.1145/3636534.3690694 https://dl.acm.org/doi/abs/10.1145/3636534.3690694 https://dl.acm.org/doi/pdf/10.1145/3636534.3690694 ("Pilot inference" is a specialized mix of caching, computation reuse, and backtracking in beam search for speech understanding, and is somewhat related to speculative decoding, and similar to continual inference for processing a stream.)

Attention Head Approximation

Much of the research into attention heads has concerned attention head pruning (removing redundant or under-utilized attention head components) or reducing the quadratic cost of attention in terms of sequence length (related to non-autoregressive algorithms). However, there are also some "simple" attention heads that have been proposed as replacements for the original Transformer components.
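
As a concrete illustration, here is a minimal Python/NumPy sketch (single head, no batching, no masking) contrasting standard scaled dot-product attention with a simplified cumulative-average head in the spirit of the average attention network cited below. The function names and shapes are illustrative assumptions, not any particular paper's method.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Standard attention: quadratic in the sequence length n.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                 # (n, n)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def average_attention(V):
        # Simplified head: each position just averages all preceding values
        # (causal), removing the query-key score computation entirely.
        n = V.shape[0]
        cumsum = np.cumsum(V, axis=0)
        counts = np.arange(1, n + 1).reshape(-1, 1)
        return cumsum / counts

    n, d = 8, 16
    rng = np.random.default_rng(0)
    Q, K, V = rng.standard_normal((3, n, d))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (8, 16)
    print(average_attention(V).shape)                   # (8, 16)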

  • Haoneng Luo, Shiliang Zhang, Ming Lei, Lei Xie, Nov 2020, Simplified Self-Attention for Transformer-based End-to-End Speech Recognition, https://arxiv.org/abs/2005.10463
  • Biao Zhang, Deyi Xiong, and Jinsong Su. 2018. Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 1789–1798. Association for Computational Linguistics, https://arxiv.org/abs/1805.00631 (A simpler version of the attention heads.)
  • Biao Zhang, Ivan Titov, Rico Sennrich, Aug 2019, Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention, https://arxiv.org/abs/1908.11365 (Merged simplified attention heads.)
  • Daniel Povey, Hossein Hadian, Pegah Ghahremani, Ke Li, and Sanjeev Khudanpur, “A time-restricted self-attention layer for ASR,” in Proc. ICASSP. IEEE, 2018, pp. 5874–5878. https://ieeexplore.ieee.org/document/8462497
  • Zihang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan. ConvBERT: Improving BERT with Span-based Dynamic Convolution, NeurIPS, 33, 2020. https://arxiv.org/abs/2008.02496 (Replaces quadratic attention heads with linear convolution.)
  • T. J. Ham, S. J. Jung, S. Kim et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. of HPCA. IEEE, 2020, pp. 328–341. https://arxiv.org/abs/2002.10941 (Optimizes both the attention mechanism and exponentiation in softmax.)
  • Forrest Iandola, Albert Shaw, Ravi Krishna, and Kurt Keutzer. 2020. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 124–135. Association for Computational Linguistics. https://arxiv.org/abs/2006.11316 (Replaces self-attention with grouped convolutions for a hybrid method.)
  • Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite Transformer with Long-Short Range Attention. In International Conference on Learning Representations. https://arxiv.org/abs/2004.11886, Code: https://github.com/mit-han-lab/lite-transformer (Hybrid attention method that uses convolutions.)
  • Ruokai Yin, Yuhang Li, Abhishek Moitra, Priyadarshini Panda, Dec 2022, Training Integer-Only Deep Recurrent Neural Networks https://arxiv.org/abs/2212.11791 (Integer-only version of RNNs called iRNN, with integer-only layer normalization, integer-only attention, and piecewise linear approximation for integer-only activation functions such as tanh and sigmoid.)
  • Lucas D. Lingle, Sep 2023, Transformer-VQ: Linear-Time Transformers via Vector Quantization https://arxiv.org/abs/2309.16354 Code: https://github.com/transformer-vq/transformer_vq (Linear attention in an encoder-only Transformer.)
  • Jiing-Ping Wang, Ming-Guang Lin, An-Yeu (Andy) Wu, 11 Apr 2024, LATTE: Low-Precision Approximate Attention with Head-wise Trainable Threshold for Efficient Transformer, https://arxiv.org/abs/2404.07519 (Approximate 4-bit vector dot product used as an estimate in attention heads, with computation reuse to 8-bit dot products.)
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
  • Ruisi Cai1, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
  • Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli, 16 Jul 2024, PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer, https://arxiv.org/abs/2407.11306
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Nils Graef, Aarush Gupta, 2024 (accessed), Approximate attention: infinite context length with constant complexity per token [work in progress], https://github.com/OpenMachine-ai/transformer-tricks/blob/main/approximate.pdf
  • Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
  • Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou, 23 Aug 2024, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 (Training using low-rank matrices to approximate attention.)
  • Josh Alman, Zhao Song, 9 May 2023 (v2), Fast Attention Requires Bounded Entries, https://arxiv.org/abs/2302.13214 (Low-rank matrices in attention for fast inference.)
  • Tymofii Reizin, 2024, Fast Algorithms for Attention Mechanism, Bachelor Thesis, Department of Applied Mathematics, Charles University, Prague, https://dspace.cuni.cz/bitstream/handle/20.500.11956/192084/130390128.pdf?sequence=1
  • David Spuler, March 2024, Attention Head Approximation, in Generative AI in C++, https://www.aussieai.com/book/ch20-attention-head-approximation
  • Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
  • Themistoklis Haris, 6 Nov 2024, kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers, https://arxiv.org/abs/2411.04013

Alternatives to Attention

Although attention has performed very well, there are still various attempts to replace it with something better.

Research papers on what comes next after attention include:

  • Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. In Advances in neural information processing systems (NeurIPS), 2020. https://arxiv.org/abs/2008.07669
  • Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in Neural Information Processing Systems, 34, 2021. https://arxiv.org/abs/2110.13985
  • Weizhe Hua, Zihang Dai, Hanxiao Liu, and Quoc V Le. Transformer quality in linear time. arXiv preprint arXiv:2202.10447, 2022, https://arxiv.org/abs/2202.10447 (Proposes a "gated attention unit" with an overlaying linear approximation.)
  • Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, and Josh Susskind. An attention free transformer. arXiv preprint arXiv:2105.14103, 2021. https://arxiv.org/abs/2105.14103 (A Transformer architecture without dot product self-attention using different processing of keys and values.)
  • Irwan Bello. LambdaNetworks: Modeling long-range interactions without attention. arXiv preprint arXiv:2102.08602, 2021. https://arxiv.org/abs/2102.08602 (An alternative to self-attention using linear transforms.)
  • Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, Sinong Wang, Sep 2023, LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models, https://arxiv.org/abs/2308.16137
  • N Sevim, EO Özyedek, F Şahinuç, A Koç, 2022, Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier Layers arXiv preprint arXiv:2209.12816, https://arxiv.org/abs/2209.12816 (Replaces attention with a Fourier transform method.)
  • Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, Kranthi Kiran GV, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocon, Jiaming Kong, Bartlomiej Koptyra, Hayden Lau, Krishna Sri Ipsit Mantri, Ferdinand Mom, Atsushi Saito, Xiangru Tang, Bolun Wang, Johan S. Wind, Stansilaw Wozniak, Ruichong Zhang, Zhenyuan Zhang, Qihang Zhao, Peng Zhou, Jian Zhu, Rui-Jie Zhu, May 2023, RWKV: Reinventing RNNs for the Transformer Era, https://arxiv.org/pdf/2305.13048.pdf, Code: https://github.com/BlinkDL/RWKV-LM (Hybrid RNN-Transformer that replaces QKV attention with Receptance Weighted Key Value (RWKV), which is similar to Flash Attention).
  • Keyu An, Shiliang Zhang, Sep 2023, Exploring RWKV for Memory Efficient and Low Latency Streaming ASR, arXiv preprint arXiv:2309.14758, https://arxiv.org/pdf/2309.14758.pdf (Analysis of the RWKV Transformer-RNN hybrid architecture.)
  • A Langedijk, H Mohebbi, G Sarti, W Zuidema, J Jumelet, Oct 2023, DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers, https://arxiv.org/abs/2310.03686, https://pure.rug.nl/ws/portalfiles/portal/799424386/2310.03686v1.pdf (Allows the decoder to cross-attend to earlier layers of the encoder, rather than only the final output layer.)
  • J Alman, Z Song, Oct 2023, How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation, arXiv preprint arXiv:2310.04064, https://arxiv.org/abs/2310.04064 (Uses more advanced QKV attention mechanism with even more computations than vanilla Transformer.)
  • Bohdan Ivaniuk-Skulskyi; Nadiya Shvai; Arcadi Llanza; Amir Nakib, 2024, Towards Lightweight Transformer Architecture: an Analysis on Semantic Segmentation, 2024 International Conference on Artificial Intelligence, Computer, Data Sciences and Applications (ACDSA), https://ieeexplore.ieee.org/abstract/document/10467926 (Examines skip-attention and pool-unpool attention methods for faster inference.)
  • Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli, 16 Jul 2024, PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer, https://arxiv.org/abs/2407.11306
  • David Spuler, March 2024, Alternatives to Attention, in Generative AI in C++, https://www.aussieai.com/book/ch20-alternatives-attention
  • Benjamin L. Badger, 2 Sep 2024, Masked Mixers for Language Generation and Retrieval, https://arxiv.org/abs/2409.01482
  • Tianyi Zhang, Jonah Wonkyu Yi, Bowen Yao, Zhaozhuo Xu, Anshumali Shrivastava, 2 Mar 2024, NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention, https://arxiv.org/abs/2403.01273 https://github.com/tonyzhang617/nomad-dist
  • Gabriel Mongaras, Trevor Dohm, Eric C. Larson, 27 Sep 2024, Cottention: Linear Transformers With Cosine Attention, https://arxiv.org/abs/2409.18747 (Nearest-neighbor to replace Softmax attention, for near-linear attention.)

Cross Attention

Cross attention is the attention module in the decoder of an encoder-decoder architecture. It is how the decoder pays attention to the encoder's output.

This part of the architecture doesn't get much research attention anymore, since most modern Transformers are decoder-only architectures (e.g., GPT), with no cross attention module and no encoder at all. The prefill phase that replaces the encoder doesn't really have an equivalent of cross attention.

Note that many papers on "cross attention" are not about this type of cross attention between the encoder and decoder. Rather, they may refer to attention across modalities in multimodal models (e.g., between text and image encodings), between model layers, or between inputs and embedding computations along the length dimension.
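
To make the mechanism concrete, here is a minimal single-head, unbatched Python/NumPy sketch. The key structural point is that the queries come from the decoder states, while the keys and values come from the encoder output; the names and shapes are illustrative assumptions only.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attention(decoder_states, encoder_outputs, Wq, Wk, Wv):
        # Queries come from the decoder; keys and values come from the
        # encoder output, so the decoder "pays attention" to the encoder.
        Q = decoder_states @ Wq       # (n_dec, d)
        K = encoder_outputs @ Wk      # (n_enc, d)
        V = encoder_outputs @ Wv      # (n_enc, d)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n_dec, n_enc)
        return softmax(scores) @ V                 # (n_dec, d)

    d, n_enc, n_dec = 16, 10, 4
    rng = np.random.default_rng(0)
    enc = rng.standard_normal((n_enc, d))
    dec = rng.standard_normal((n_dec, d))
    Wq, Wk, Wv = rng.standard_normal((3, d, d))
    print(cross_attention(dec, enc, Wq, Wk, Wv).shape)  # (4, 16)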

Research papers on encoder-decoder basic cross attention:

  • Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
  • Bin Xiao, Chunan Shi, Xiaonan Nie, Fan Yang, Xiangwei Deng, Lei Su, Weipeng Chen, Bin Cui, 1 May 2024, Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge, https://arxiv.org/abs/2405.00263 (Speculative decoding improvement by extending Medusa to tree attention with a cross attention block.)
  • João Monteiro, Étienne Marcotte, Pierre-André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian, 23 Apr 2024, XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference, https://arxiv.org/abs/2404.15420
  • Shi Luohe, Zhang Hongyi, Yao Yao, Li Zuchao, Zhao Hai, 25 Jul 2024, Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003
  • Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon, 22 Aug 2024, Building and better understanding vision-language models: insights and future directions, https://arxiv.org/abs/2408.12637
  • David Spuler, March 2024, What is Cross Attention? in Generative AI in C++, https://www.aussieai.com/book/ch20-what-is-cross-attention
  • Sebastian Raschka, Nov 03, 2024, Understanding Multimodal LLMs: An introduction to the main techniques and latest models, https://magazine.sebastianraschka.com/p/understanding-multimodal-llms

Sparse Attention

Sparse attention is the use of sparsity to optimize LLM attention algorithms by zeroing some weights and pruning some computations. Sparsity can be applied broadly across a model, and sparse attention specifically applies it to the QKV attention weights, the KV cache values, and/or the activation computations. Research has established that attention weightings are often very sparse, with a few large outliers having the main influence on the final result, and hence sparse attention kernels can be fast.
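
As a simple illustration of the idea, the following Python/NumPy sketch keeps only the top-k scores per query row and masks out the rest. Note that this sketch still computes the full dense score matrix before masking; a real sparse attention kernel would skip the masked computations entirely.

    import numpy as np

    def topk_sparse_attention(Q, K, V, k=4):
        # Keep only the k largest scores in each query row and mask the rest,
        # exploiting the observation that attention weights are usually
        # dominated by a few large outliers.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                          # (n, n), dense here for clarity
        thresh = np.partition(scores, -k, axis=-1)[:, -k:].min(axis=-1, keepdims=True)
        scores = np.where(scores >= thresh, scores, -np.inf)   # sparsify the scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    n, d = 12, 16
    rng = np.random.default_rng(0)
    Q, K, V = rng.standard_normal((3, n, d))
    print(topk_sparse_attention(Q, K, V, k=4).shape)  # (12, 16)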

Research papers on sparse attention:

  • Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
  • Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
  • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
  • Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • M Pagliardini, D Paliotta, M Jaggi, F Fleuret, 2023, Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention, https://openreview.net/pdf?id=UINHuKeWUa
  • Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
  • Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
  • Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
  • 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
  • S Dai, H Genc, R Venkatesan, B Khailany, 2023 Efficient Transformer Inference with Statically Structured Sparse Attention, https://ieeexplore.ieee.org/abstract/document/10247993
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
  • Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang, 14 Jun 2024, HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning, https://arxiv.org/abs/2406.09827 (Sparse attention using the top-k features and a tree-based structure.)
  • Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
  • Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
  • Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
  • Bokyeong Yoon; Ah-Hyun Lee; Jinsung Kim; Gordon Euhyun Mo, 9 July 2024, Exploring Attention Sparsity to Accelerate Transformer Training on GPUs, IEEE Access ( Early Access ), DOI: 10.1109/ACCESS.2024.3425638, https://ieeexplore.ieee.org/document/10589623
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Minh Lenhat, Viet Anh Nguyen, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy, 10 Aug 2024, SAMSA: Efficient Transformer for Many Data Modalities, https://arxiv.org/abs/2408.05391 https://github.com/HySonLab/SAMSA
  • Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
  • Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
  • Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
  • Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang, 21 Feb 2024, Do Efficient Transformers Really Save Computation? https://arxiv.org/abs/2402.13934
  • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
  • Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
  • Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
  • Pierre Marion, Raphaël Berthier, Gérard Biau, Claire Boyer, 2 Oct 2024, Attention layers provably solve single-location regression, https://arxiv.org/abs/2410.01537
  • Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng, 2 Oct 2024, A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts, https://arxiv.org/abs/2410.01485
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Lijie Yang, Zhihao Zhang, Zhuofu Chen, Zikun Li, Zhihao Jia, 7 Oct 2024, TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention, https://arxiv.org/abs/2410.05076
  • Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
  • Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, 14 Oct 2024, HSR-Enhanced Sparse Attention Acceleration, https://arxiv.org/abs/2410.10165
  • Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang, 18 Oct 2024 (v2), SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs, https://arxiv.org/abs/2410.13276
  • Qwen Team, November 15, 2024, Extending the Context Length to 1M Tokens! https://qwenlm.github.io/blog/qwen2.5-turbo/ (Qwen extended to long context via sparse attention.)
  • Shantanu Acharya, Fei Jia, Boris Ginsburg, 26 Nov 2024, Star Attention: Efficient LLM Inference over Long Sequences, https://arxiv.org/abs/2411.17116
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418

Sliding Window Attention

Sliding Window Attention (SWA) is a type of local attention whereby each token only pays attention to its nearby neighbor tokens. This can be very efficient, yielding a type of linear attention, but it can also cause accuracy loss.
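
A minimal Python/NumPy sketch of the banded, causal windowed attention pattern is shown below. For clarity it builds a dense mask, whereas a real sliding-window kernel would compute only the in-window scores, costing O(n*w) rather than O(n^2); shapes and names are illustrative assumptions.

    import numpy as np

    def sliding_window_attention(Q, K, V, window=4):
        # Each token attends only to itself and the previous (window - 1) tokens,
        # giving a banded, causal (local) attention pattern.
        n, d = Q.shape
        scores = Q @ K.T / np.sqrt(d)                  # dense here for clarity only
        idx = np.arange(n)
        gap = idx[:, None] - idx[None, :]              # query index minus key index
        mask = (gap >= 0) & (gap < window)             # causal and within the window
        scores = np.where(mask, scores, -np.inf)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    n, d = 10, 16
    rng = np.random.default_rng(0)
    Q, K, V = rng.standard_normal((3, n, d))
    print(sliding_window_attention(Q, K, V, window=4).shape)  # (10, 16)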

Research papers on sliding window attention:

  • Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
  • Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
  • Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, Dec 2023, Efficient Streaming Language Models with Attention Sinks https://arxiv.org/abs/2309.17453 Code: https://github.com/mit-han-lab/streaming-llm
  • Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, Yulu Jia, Sun He, Hongmin Chen, Zhihao Bai, Qi Hou, Shipeng Yan, Ding Zhou, Yiyao Sheng, Zhuo Jiang, Haohan Xu, Haoran Wei, Zhang Zhang, Pengfei Nie, Leqi Zou, Sida Zhao, Liang Xiang, Zherui Liu, Zhe Li, Xiaoying Jia, Jianxi Ye, Xin Jin, Xin Liu, 23 Feb 2024, MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, https://arxiv.org/abs/2402.15627
  • Omri Mallis, February 25, 2024 , Techniques for KV Cache Optimization in Large Language Models, https://www.omrimallis.com/posts/techniques-for-kv-cache-optimization/
  • Mistral AI, https://github.com/mistralai/mistral-src
  • Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
  • Jeffrey Willette, Heejun Lee, Youngwan Lee, Myeongjae Jeon, Sung Ju Hwang, 24 Jun 2024, Training-Free Exponential Extension of Sliding Window Context with Cascading KV Cache, https://arxiv.org/abs/2406.17808 (Extends the KV cache eviction policy in sliding window attention so that the KV partially looks back further than the window.)
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
  • Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, 4 Jan 2024 (v2), LLM in a flash: Efficient Large Language Model Inference with Limited Memory, https://arxiv.org/abs/2312.11514 (Storing model parameters in flash memory on phones.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
  • Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
  • Nathan Leroux, Paul-Philipp Manea, Chirag Sudarshan, Jan Finkbeiner, Sebastian Siegel, John Paul Strachan, Emre Neftci, 28 Sep 2024, Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Models, https://arxiv.org/abs/2409.19315 (Analog hardware implementation of self-attention with sliding window attention.)
  • Zhihao He, Hang Yu, Zi Gong, Shizhan Liu, Jianguo Li, Weiyao Lin, 9 Oct 2024, Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions, https://arxiv.org/abs/2410.06577
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Leo Donisch, Sigurd Schacht, Carsten Lanquillon, 6 Aug 2024, Inference Optimizations for Large Language Models: Effects, Challenges, and Practical Considerations, https://arxiv.org/abs/2408.03130
  • The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
  • Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
  • AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
  • LM Po, Aug 3, 2024, The Race for Faster Transformers: Innovations in Self-Attention, https://medium.com/@lmpo/the-race-for-faster-transformers-innovations-in-self-attention-e602fb1b5f20

Tree Attention

Tree attention is the use of tree-structured data in LLM attention algorithms. This is not widely used on its own as an attention optimization, but can be used in situations where tree-structured data is already available, including beam search and various types of speculative decoding.
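
As an illustration, the sketch below builds a tree attention mask from parent pointers, so that each candidate token attends only to itself and its ancestors, and sibling branches do not see each other. This is a simplified version of the masks used in tree-based speculative decoding, and the parent-pointer representation is an assumption for the example.

    import numpy as np

    def tree_attention_mask(parents):
        # parents[i] is the index of node i's parent in the token tree,
        # or -1 for the root. Each node may attend to itself and to all
        # of its ancestors, so sibling branches are masked from each other.
        n = len(parents)
        mask = np.zeros((n, n), dtype=bool)
        for i in range(n):
            j = i
            while j != -1:
                mask[i, j] = True
                j = parents[j]
        return mask

    # A small token tree: root 0 with two branches 0 -> 1 -> 3 and 0 -> 2 -> 4.
    parents = [-1, 0, 0, 1, 2]
    print(tree_attention_mask(parents).astype(int))

In a real system, this boolean mask would be applied to the attention scores in the same way as an ordinary causal mask over the committed prefix.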

Research papers on tree attention:

Local Attention

Local attention is an LLM inference optimization whereby each token's attention computations are based only on nearby tokens. This gives a faster attention algorithm with linear complexity, but it can also lose accuracy. A generalization is a hybrid local-global attention algorithm, with interleaved layers of local attention and global attention.

Research papers on local attention:

  • Aleksandar Botev, Soham De, Samuel L Smith, Anushan Fernando, George-Cristian Muraru, Ruba Haroun, Leonard Berrada, Razvan Pascanu, Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Sertan Girgin, Olivier Bachem, Alek Andreev, Kathleen Kenealy, Thomas Mesnard, Cassidy Hardin, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Armand Joulin, Noah Fiedel, Evan Senter, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, David Budden, Arnaud Doucet, Sharad Vikram, Adam Paszke, Trevor Gale, Sebastian Borgeaud, Charlie Chen, Andy Brock, Antonia Paterson, Jenny Brennan, Meg Risdal, Raj Gundluru, Nesh Devanathan, Paul Mooney, Nilay Chauhan, Phil Culliton, Luiz GUStavo Martins, Elisa Bandy, David Huntsperger, Glenn Cameron, Arthur Zucker, Tris Warkentin, Ludovic Peran, Minh Giang, Zoubin Ghahramani, Clément Farabet, Koray Kavukcuoglu, Demis Hassabis, Raia Hadsell, Yee Whye Teh, Nando de Frietas, 11 Apr 2024, RecurrentGemma: Moving Past Transformers for Efficient Open Language Models, Google Research, https://arxiv.org/abs/2404.07839
  • Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Uses local attention versus global attention at different layers.)
  • Mirko Farina, Usman Ahmad, Ahmad Taha, Hussein Younes, Yusuf Mesbah, Xiao Yu, Witold Pedrycz, 2024, Sparsity in transformers: A systematic literature review, Neurocomputing, Volume 582, 14 May 2024, 127468, https://www.sciencedirect.com/science/article/abs/pii/S092523122400239X (General survey of sparsity methods, and techniques that create sparsity.)
  • Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748

Local-Global Attention

Local-global attention is the LLM inference optimization of using a hybrid of both local attention and global attention in alternating layers. Typically, more local layers are used for efficiency, such as a three-to-one ratio or more, whereas adding global layers increases accuracy at some performance cost.

A partial trade-off between faster local attention and more accurate global attention is to combine them. There are various complex ways to do this, or you can simply alternate, with some layers using local attention and others using global attention, as done in the Character.AI inference backend.
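
As a minimal illustration (a hypothetical configuration, not any specific model), the sketch below generates an alternating layer schedule with a three-to-one local-to-global ratio.

    # A sketch of a 3:1 local-to-global layer schedule (hypothetical config).
    def layer_schedule(num_layers, global_every=4):
        # Every 'global_every'-th layer uses full global attention;
        # the rest use cheap local (e.g., sliding window) attention.
        return ["global" if (i + 1) % global_every == 0 else "local"
                for i in range(num_layers)]

    print(layer_schedule(12))
    # ['local', 'local', 'local', 'global', 'local', 'local', 'local', 'global', ...]

Research papers on local-global attention: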

  • Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
  • Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun, 4 Jun 2024, Block Transformer: Global-to-Local Language Modeling for Fast Inference, https://arxiv.org/abs/2406.02657 Code: https://github.com/itsnamgyu/block-transformer (Impressive technique of combining tokens into blocks, then doing inference on the blocks, then unblocking to get tokens.)
  • Chull Hwan Song, Hye Joo Han, Yannis Avrithis, 16 Jul 2021, All the attention you need: Global-local, spatial-channel attention for image retrieval, https://arxiv.org/abs/2107.08000
  • Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, Peter M. Atkinson, 26 Jun 2022 (v4), UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery, https://arxiv.org/abs/2109.08937
  • Yiping Sun, 1 Jul 2024, A Global-Local Attention Mechanism for Relation Classification, 2024 20th International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery (ICNC-FSKD), https://arxiv.org/abs/2407.01424
  • Z. Zheng, G. An, D. Wu and Q. Ruan, Global and Local Knowledge-Aware Attention Network for Action Recognition, IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 1, pp. 334-347, Jan. 2021, doi: 10.1109/TNNLS.2020.2978613, https://ieeexplore.ieee.org/document/9050644 Code:https://github.com/ZhenxingZheng/attention-network
  • Abhishek Srivastava, Sukalpa Chanda, Debesh Jha, Michael A. Riegler, Pål Halvorsen, Dag Johansen, Umapada Pal, 20 Nov 2021, PAANet: Progressive Alternating Attention for Automatic Medical Image Segmentation, https://arxiv.org/abs/2111.10618
  • Jinpeng Li, Yichao Yan, Shengcai Liao, Xiaokang Yang, Ling Shao, 10 Jul 2021, Local-to-Global Self-Attention in Vision Transformers, https://arxiv.org/abs/2107.04735
  • David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
  • Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
  • Fangyuan Xu, Tanya Goyal, Eunsol Choi, 8 Nov 2024, Recycled Attention: Efficient inference for long-context language models, https://arxiv.org/abs/2411.05787
  • AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)

Additive Attention

Additive attention is the use of addition operations for faster LLM attention computations. Various types of "zero-multiplication" model replace multiplications with additions, including some low-bit quantization schemes and other multiplication-free models.
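
For reference, the sketch below shows two illustrative scoring rules in Python/NumPy: the classical additive (Bahdanau-style) score, which combines query and key by addition plus a small learned projection rather than a dot product, and a multiplication-free score based on negative L1 distance in the spirit of adder networks. Both are simplified illustrations with assumed shapes, not any specific cited method.

    import numpy as np

    def bahdanau_scores(q, K, W1, W2, v):
        # Classical additive (Bahdanau-style) scoring: query and key are
        # combined by addition and a small projection, not a dot product.
        return np.tanh(q @ W1 + K @ W2) @ v            # (n,)

    def l1_adder_scores(q, K):
        # A multiplication-free alternative in the spirit of adder networks:
        # similarity is the negative L1 distance (additions/subtractions only).
        return -np.abs(K - q).sum(axis=-1)             # (n,)

    d, h, n = 16, 32, 6
    rng = np.random.default_rng(0)
    q, K = rng.standard_normal(d), rng.standard_normal((n, d))
    W1, W2 = rng.standard_normal((d, h)), rng.standard_normal((d, h))
    v = rng.standard_normal(h)
    print(bahdanau_scores(q, K, W1, W2, v).shape, l1_adder_scores(q, K).shape)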

Research papers on additive attention:

Multiplicative Attention

Multiplicative attention is the idea of using multiplication operations in LLM attention. Standard attention uses matrix multiplications, which cost a cubic number of arithmetic multiplications. One approximation is the Hadamard matrix product, which multiplies matrices elementwise rather than computing full row-by-column dot products.
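
To illustrate the difference in cost, the tiny NumPy comparison below contrasts a full matrix multiplication (as in the query-key score computation) with a Hadamard elementwise product over operands of the same size; it is purely illustrative.

    import numpy as np

    n, d = 8, 16
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, d))
    B = rng.standard_normal((n, d))

    # Full matrix multiplication (as in Q @ K^T): n * n * d multiplications.
    full = A @ B.T           # (n, n)

    # Hadamard (elementwise) product: only n * d multiplications,
    # multiplying corresponding elements with no dot products.
    hadamard = A * B         # (n, d)

    print(full.shape, hadamard.shape)   # (8, 8) (8, 16)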

Research papers on multiplicative attention:

QKV Computations

The attention heads perform calculations on three matrices named Q (Query), K (Key), and V (Value), each of which is computed by multiplying the input activations by a learned projection weight matrix.
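
A minimal single-head, unbatched Python/NumPy sketch of the QKV projections followed by scaled dot-product attention is shown below; the shapes and names are illustrative assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def qkv_attention(X, Wq, Wk, Wv):
        # The three QKV projections are plain matrix multiplications (GEMMs)
        # of the input activations X by learned weight matrices.
        Q = X @ Wq
        K = X @ Wk
        V = X @ Wv
        # Scaled dot-product attention over the resulting Q, K, V.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        return softmax(scores) @ V

    n, d = 8, 16
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, d))
    Wq, Wk, Wv = rng.standard_normal((3, d, d))
    print(qkv_attention(X, Wq, Wk, Wv).shape)   # (8, 16)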

Research papers on QKV computations include:

  • Shaoyuan Chen, Yutong Lin, Mingxing Zhang, Yongwei Wu, 3 May 2024, Efficient and Economic Large Language Model Inference with Attention Offloading, https://arxiv.org/abs/2405.01814 (Separates the process-bound and memory-bound parts of inference for speedup, with focus on prefill, decoding, and the sub-tasks such as QKV and FFN use of GEMM kernels, versus the different pattern of attention computations and the KV cache.)
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 30 Mar 2024, DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference, https://arxiv.org/abs/2404.00242
  • Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini, 3 Apr 2024, Optimizing the Deployment of Tiny Transformers on Low-Power MCUs, https://arxiv.org/abs/2404.02945 (Uses an approach called "Fused Weight Self-Attention" that fuses some of the QKV matrices and also tiling in multi-head attention, along with 8-bit integer quantization and integerized Softmax.)
  • Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura, 12 Jun 2024, Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference, https://arxiv.org/abs/2406.08413
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
  • David Spuler, March 2024, Chapter 20. Attention, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Zeyu Zhang, Haiying Shen, 7 Aug 2024, Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference, https://arxiv.org/abs/2408.04107
  • Yifan Pu, Zhuofan Xia, Jiayi Guo, Dongchen Han, Qixiu Li, Duo Li, Yuhui Yuan, Ji Li, Yizeng Han, Shiji Song, Gao Huang, Xiu Li, 11 Aug 2024, Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators, https://arxiv.org/abs/2408.05710 (Reduce the attention cost in diffusion models by what is effectively token merging between the Q and K data.)
  • Nils Graef, 18 Apr 2024, Transformer tricks: Removing weights for skipless transformers, https://arxiv.org/abs/2404.12362
  • W Byun, J Woo, S Mukhopadhyay, 2024, Hardware-friendly Hessian-driven Row-wise Quantization and FPGA Acceleration for Transformer-based Models, https://dl.acm.org/doi/pdf/10.1145/3665314.3670806
  • Zeyu Zhang, Haiying Shen, 23 Sep 2024, CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts, https://arxiv.org/abs/2409.15104 (Sparse attention and overlapped communication with computation and disaggregates prefill/decoding with chunked prefill, with a novel QKV splitting approach focused on the Q values.)

Fused Head Attention

Fused head attention, or fused attention, is the merging of the weights in the attention heads. This is a type of parameter sharing, and there are variants including MHA, Medusa attention, and other more obscure methods.
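
One simple and widely used form of attention weight fusion is to concatenate the Q, K, and V projection matrices so that a single larger GEMM replaces three separate ones. The sketch below illustrates only that form, not the parameter-sharing variants mentioned above; names and shapes are assumptions for the example.

    import numpy as np

    def fused_qkv(X, Wq, Wk, Wv):
        # Fuse the three projection weight matrices into one, so a single
        # larger matrix multiplication replaces three separate GEMMs.
        W_fused = np.concatenate([Wq, Wk, Wv], axis=1)    # (d, 3d)
        QKV = X @ W_fused                                 # one GEMM
        return np.split(QKV, 3, axis=1)                   # Q, K, V

    n, d = 8, 16
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, d))
    Wq, Wk, Wv = rng.standard_normal((3, d, d))
    Q, K, V = fused_qkv(X, Wq, Wk, Wv)
    print(Q.shape, K.shape, V.shape)   # (8, 16) each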

Research papers on fused attention heads:

FFT Attention

FFT attention is the use of Fast Fourier Transform (FFT) computations in LLM inference. This is an alternative to the traditional computation of attention matrices, but it is not widely used in industry.
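
As an illustration of the general idea, in the style of FNet-like approaches and not any specific paper's method, the sketch below replaces the attention sublayer with a 2D FFT over the sequence and hidden dimensions, keeping only the real part.

    import numpy as np

    def fourier_token_mixing(X):
        # FNet-style token mixing: replace the attention sublayer with a 2D
        # FFT over the sequence and hidden dimensions, keeping the real part.
        # This mixes information across tokens with no learned attention weights.
        return np.real(np.fft.fft2(X))

    n, d = 8, 16
    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, d))
    print(fourier_token_mixing(X).shape)   # (8, 16)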

Research papers on FFT attention heads:

Image Attention Optimization

Image attention optimization is the optimization of the attention mechanism in image models to reduce its computation cost. Techniques include image token pruning or merging, which avoid computation for some portions of an image by either ignoring unimportant patches or merging multiple patches together.
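
As a simplified illustration of patch-token merging (not any specific published algorithm), the sketch below greedily merges the most cosine-similar pairs of patch tokens by averaging them, reducing the number of tokens that subsequent attention layers must process. Real methods use smarter matching and typically track token weights.

    import numpy as np

    def merge_most_similar_tokens(tokens, num_merges=2):
        # Greedily merge the most cosine-similar pair of patch tokens by
        # averaging them, reducing the token count by one per merge.
        tokens = list(tokens)
        for _ in range(num_merges):
            T = np.stack(tokens)
            norm = T / np.linalg.norm(T, axis=1, keepdims=True)
            sim = norm @ norm.T
            np.fill_diagonal(sim, -np.inf)
            i, j = np.unravel_index(np.argmax(sim), sim.shape)
            merged = (tokens[i] + tokens[j]) / 2
            tokens = [t for k, t in enumerate(tokens) if k not in (i, j)]
            tokens.append(merged)
        return np.stack(tokens)

    num_patches, d = 16, 32
    rng = np.random.default_rng(0)
    patches = rng.standard_normal((num_patches, d))
    print(merge_most_similar_tokens(patches, num_merges=4).shape)   # (12, 32)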

Research papers in optimizing image attention:

  • Hanzhou Liu, Binghan Li, Chengkai Liu, Mi Lu, 19 Mar 2024, DeblurDiNAT: A Lightweight and Effective Transformer for Image Deblurring, https://arxiv.org/abs/2403.13163 (Optimizes a deblurring Transformer with self-attention improvements and other optimizations.)
  • Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075 (Sparse attention)
  • Hongyang Li, Yu Liu, Wanli Ouyang, and Xiaogang Wang. 2019. Zoom out-and-in network with map attention decision for region proposal and object detection. International Journal of Computer Vision 127, 3 (2019), 225–238. https://arxiv.org/abs/1709.04347 (Attention mechanism in image processing.)
  • Yuxin Liu; Wenxin Yu; Zhiqiang Zhang; Qi Wang; Lu Che, May 2024, Axial Attention Transformer for Fast High-quality Image Style Transfer, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), 19-22 May 2024, https://ieeexplore.ieee.org/abstract/document/10558531
  • The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
  • Chull Hwan Song, Hye Joo Han, Yannis Avrithis, 16 Jul 2021, All the attention you need: Global-local, spatial-channel attention for image retrieval, https://arxiv.org/abs/2107.08000
  • Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, Peter M. Atkinson, 26 Jun 2022 (v4), UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery, https://arxiv.org/abs/2109.08937
  • Abhishek Srivastava, Sukalpa Chanda, Debesh Jha, Michael A. Riegler, Pål Halvorsen, Dag Johansen, Umapada Pal, 20 Nov 2021, PAANet: Progressive Alternating Attention for Automatic Medical Image Segmentation, https://arxiv.org/abs/2111.10618
  • Z. Wang, Y. Zhao and J. Chen, 2023, "Multi-Scale Fast Fourier Transform Based Attention Network for Remote-Sensing Image Super-Resolution," in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 16, pp. 2728-2740, 2023, doi: 10.1109/JSTARS.2023.3246564, https://ieeexplore.ieee.org/abstract/document/10049097 https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10049097
  • Liu, L., Zhu, Y., Zhang, H., Zhang, W., Peng, H. (2023). Strip-FFT Transformer for Single Image Deblurring. In: Lu, H., et al. Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol 14356. Springer, Cham. https://doi.org/10.1007/978-3-031-46308-2_14 https://link.springer.com/chapter/10.1007/978-3-031-46308-2_14

Vision Attention Optimization

Vision attention optimization is a speed optimization for inference in Vision Transformers (ViTs). It uses techniques similar to image attention optimization, such as pruning or merging patches or sections of the images.

Research on optimizing attention in vision transformers:

More AI Research

Read more about: