Aussie AI
Ring Attention
-
Last Updated 5 January, 2025
-
by David Spuler, Ph.D.
Ring attention is an LLM optimization of the attention module that uses blockwise computation: the long input sequence is split into blocks across multiple devices arranged in a ring, and key-value blocks are passed around the ring while each device computes attention for its own query block. The aim is to speed up the self-attention step, in either training or inference, especially for very long contexts. Ring attention is orthogonal to several other memory-efficient attention algorithms and can be combined with them, such as Flash attention. A minimal illustrative sketch of the blockwise idea appears below.
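Below is a minimal single-process sketch in Python/NumPy that simulates the blockwise idea: each "device" keeps its query block fixed and accumulates partial attention with an online softmax while the key/value blocks rotate around the ring until every device has seen every block. The function name, structure, and the loop-based "ring" are illustrative assumptions for exposition only, not the implementation from the papers or repositories listed below (which also overlap communication with computation across real devices).

```python
# Illustrative single-process sketch of ring attention (not the original implementation).
# Each "device" owns one query block and one key/value block; KV blocks are passed
# around a ring while each device accumulates its output with an online softmax.
import numpy as np

def ring_attention_sketch(Q, K, V, num_devices):
    """Q, K, V: (seq_len, d) arrays; seq_len must be divisible by num_devices."""
    seq_len, d = Q.shape
    block = seq_len // num_devices
    scale = 1.0 / np.sqrt(d)

    # Split the sequence into per-device blocks.
    Qb = [Q[i*block:(i+1)*block] for i in range(num_devices)]
    KVb = [(K[i*block:(i+1)*block], V[i*block:(i+1)*block]) for i in range(num_devices)]

    # Per-device running statistics for the online (streaming) softmax.
    out = [np.zeros((block, d)) for _ in range(num_devices)]
    row_max = [np.full((block, 1), -np.inf) for _ in range(num_devices)]
    row_sum = [np.zeros((block, 1)) for _ in range(num_devices)]

    for step in range(num_devices):
        for dev in range(num_devices):
            Kj, Vj = KVb[dev]                       # KV block currently held by this device
            S = (Qb[dev] @ Kj.T) * scale            # local attention scores for this block pair
            new_max = np.maximum(row_max[dev], S.max(axis=1, keepdims=True))
            correction = np.exp(row_max[dev] - new_max)
            P = np.exp(S - new_max)
            row_sum[dev] = row_sum[dev] * correction + P.sum(axis=1, keepdims=True)
            out[dev] = out[dev] * correction + P @ Vj
            row_max[dev] = new_max
        # "Send" each KV block to the next device in the ring.
        KVb = KVb[-1:] + KVb[:-1]

    # Normalize and reassemble the full output sequence.
    return np.concatenate([o / s for o, s in zip(out, row_sum)], axis=0)

# Sanity check against standard full (non-causal) attention.
rng = np.random.default_rng(0)
Q = rng.standard_normal((16, 8)); K = rng.standard_normal((16, 8)); V = rng.standard_normal((16, 8))
S = (Q @ K.T) / np.sqrt(8)
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = (ref / ref.sum(axis=1, keepdims=True)) @ V
assert np.allclose(ring_attention_sketch(Q, K, V, 4), ref)
```

The online-softmax accumulation is what lets each device process KV blocks one at a time without ever materializing the full attention matrix; the communication pattern (each device forwarding its KV block to its neighbor) is what gives the method its "ring" name.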
Research on Ring Attention
Research papers on ring attention include:
- Hao Liu, Matei Zaharia, Pieter Abbeel, 27 Nov 2023 (v4), Ring Attention with Blockwise Transformers for Near-Infinite Context, https://arxiv.org/abs/2310.01889 https://github.com/lhao499/llm_large_context (Original paper for ring attention.)
- William Brandon, Aniruddha Nrusimha, Kevin Qian, Zachary Ankner, Tian Jin, Zhiye Song, Jonathan Ragan-Kelley, 15 Nov 2023, Striped Attention: Faster Ring Attention for Causal Transformers, https://arxiv.org/abs/2311.09431
- Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jongsoo Park, Jianyu Huang, 4 Nov 2024, Context Parallelism for Scalable Million-Token Inference, https://arxiv.org/abs/2411.01783
- Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang, 29 Dec 2024, TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication, https://arxiv.org/abs/2412.20501 https://github.com/ACA-Lab-SJTU/token-ring (Ring attention with inter-GPU network transmission optimizations.)
- Seongho Hong, Yong-Hoon Choi, 2 Jan 2025, RingFormer: A Neural Vocoder with Ring Attention and Convolution-Augmented Transformer, https://arxiv.org/abs/2501.01182
- zhuzilin, Jan 2025, ring-flash-attention: Ring attention implementation with flash attention, https://github.com/zhuzilin/ring-flash-attention
More Attention Research Topics
Related LLM research areas for long context optimization of the attention methods include:
- Attention optimization (main page)
- Local attention
- Linear attention
- Sparse attention
- Multi-Head Attention (MHA)
- Multi-Query Attention (MQA)
- Grouped-Query Attention (GQA)
- Flash attention
- Paged attention
Other topics in attention research:
- Low-rank matrix attention
- Medusa attention
- Block attention
- Cross attention
- Fused head attention
- Hybrid local-global attention
- FFT attention
- QKV computation optimizations
- Additive attention
- Multiplicative attention
- Graph attention
- Chunked attention
- Attention sink
- Attention steering
- Bilinear attention
- Attention-free methods
- Mixture-of-Heads (MOH) Attention (MoE+MHA)
More AI Research
Read more about: