Aussie AI
Medusa Attention
-
Last Updated 8 February, 2025
-
by David Spuler, Ph.D.
What is Medusa Attention?
Medusa attention is an LLM inference optimization from the Medusa decoding framework: instead of using a separate draft model for speculative decoding, several lightweight extra decoding heads are attached to the base model so that multiple future tokens can be drafted in a single forward pass, and the candidate continuations are then verified in parallel using a tree-structured attention mask. This reduces the number of sequential decoding steps and thereby speeds up inference. A code sketch of the multi-head drafting idea appears below.
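The sketch below is a minimal, illustrative PyTorch example of Medusa-style extra decoding heads, not the reference implementation: the class names (MedusaHeads, ResidualBlock), the layer sizes, and the number of heads are assumptions for the example, and the tree-attention verification step is omitted.

    # Minimal sketch of Medusa-style extra decoding heads (illustrative only).
    # Assumes a decoder-only base model that exposes the final hidden state
    # for the last token position.
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """One feed-forward block with a residual connection (hypothetical head structure)."""
        def __init__(self, hidden_size: int):
            super().__init__()
            self.linear = nn.Linear(hidden_size, hidden_size)
            self.act = nn.SiLU()

        def forward(self, x):
            return x + self.act(self.linear(x))

    class MedusaHeads(nn.Module):
        """K extra heads that each predict a token further ahead (+2, +3, ...)
        from the same last hidden state, so several draft tokens come from one
        forward pass of the base model."""
        def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
            super().__init__()
            self.heads = nn.ModuleList(
                nn.Sequential(ResidualBlock(hidden_size), nn.Linear(hidden_size, vocab_size))
                for _ in range(num_heads)
            )

        def forward(self, last_hidden: torch.Tensor):
            # last_hidden: [batch, hidden_size] for the final position.
            # Returns one logits tensor per look-ahead offset.
            return [head(last_hidden) for head in self.heads]

    # Usage sketch: draft candidate tokens for the next few positions; in the
    # full method these drafts are verified by the base model with tree attention.
    hidden_size, vocab_size = 1024, 32000
    heads = MedusaHeads(hidden_size, vocab_size, num_heads=4)
    last_hidden = torch.randn(1, hidden_size)   # stand-in for the base model's last hidden state
    draft_logits = heads(last_hidden)
    draft_tokens = [logits.argmax(dim=-1) for logits in draft_logits]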
Research on Medusa Attention
Research papers on Medusa attention:
- Zeping Li, Xinlong Yang, Ziheng Gao, Ji Liu, Zhuang Liu, Dong Li, Jinzhang Peng, Lu Tian, Emad Barsoum, 19 Jun 2024, Amphista: Accelerate LLM Inference with Bi-directional Multiple Drafting Heads in a Non-autoregressive Style, https://arxiv.org/abs/2406.13170 (Applying bi-directional decoding to speculative decoding.)
- Together AI, Nov 13, 2023, Announcing Together Inference Engine – the fastest inference available, https://www.together.ai/blog/together-inference-engine-v1
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, Deming Chen, 16 Jun 2024, New Solutions on LLM Acceleration, Optimization, and Application, https://arxiv.org/abs/2406.10903
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Tri Dao, September 11, 2023, Medusa: Simple framework for accelerating LLM generation with multiple decoding heads, https://www.together.ai/blog/medusa
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. arXiv preprint arXiv:2401.10774, 2024 https://arxiv.org/abs/2401.10774
- Zachary Ankner, Rishab Parthasarathy, Aniruddha Nrusimha, Christopher Rinard, Jonathan Ragan-Kelley, and William Brandon. Hydra: Sequentially-dependent draft heads for medusa decoding, 2024. https://arxiv.org/abs/2402.05109
- Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin, 29 May 2024 (v2), DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference, https://arxiv.org/abs/2404.00242 https://openreview.net/forum?id=HqfLHoX8bR
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
- Kaiqi Zhang, Jing Zhao, Rui Chen, 15 Aug 2024, KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning, https://arxiv.org/abs/2408.08146
- Karl Stratos, 2024, Speculative Decoding https://karlstratos.com/notes/speculative.pdf
- Ziqian Zeng, Jiahong Yu, Qianshi Pang, Zihao Wang, Huiping Zhuang, Hongen Shao, Xiaofeng Zou, 18 Apr 2024 (v2), Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens, https://arxiv.org/abs/2402.15758
- Davis Wertheimer, Joshua Rosenkranz, Thomas Parnell, Sahil Suneja, Pavithra Ranganathan, Raghu Ganti, Mudhakar Srivatsa, 6 Jun 2024 (v2), Accelerating Production LLMs with Combined Token/Embedding Speculators, https://arxiv.org/abs/2404.19124 https://github.com/foundation-model-stack/fms-fsdp https://huggingface.co/ibm-fms
- Wei Zhong, Manasa Bharadwaj, 1 Jun 2024 (v2), S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314
- Yael Segal-Feldman, Aviv Shamsian, Aviv Navon, Gill Hetz, Joseph Keshet, 24 Sep 2024, Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR, https://arxiv.org/abs/2409.15869
- Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han, 14 Oct 2024, DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads, https://arxiv.org/abs/2410.10819 https://github.com/mit-han-lab/duo-attention
- Y Cong, 2024, Research for Enhancing Processing and Computational Efficiency in LLM, 2024 2nd International Conference on Image, https://www.atlantis-press.com/article/126004157.pdf
- Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
- NVIDIA, Dec 2024, Speculative Sampling, https://nvidia.github.io/TensorRT-LLM/advanced/speculative-decoding.html
- NVIDIA, Dec 2024, Medusa Decoding, https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/medusa/README.md
- Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, Tri Dao, Dec 2024, Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads, https://sites.google.com/view/medusa-llm
- Haoran Wang, Kai Shu, Jan 2025, Make Every Token Count: A Systematic Survey on Decoding Methods for Foundation Models, https://www.researchgate.net/profile/Haoran-Wang-96/publication/387703971_Make_Every_Token_Count_A_Systematic_Survey_on_Decoding_Methods_for_Foundation_Models/links/67784c8ce74ca64e1f49eb15/Make-Every-Token-Count-A-Systematic-Survey-on-Decoding-Methods-for-Foundation-Models.pdf https://github.com/wang2226/Awesome-LLM-Decoding
- Ashraf Eassa, Brian Slechta, Brian Pharris and Nick Comly, Sep 05, 2024, Low Latency Inference Chapter 1: Up to 1.9x Higher Llama 3.1 Performance with Medusa on NVIDIA HGX H200 with NVLink Switch, https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/
More Attention Research Topics
Related LLM research areas for long-context optimization of attention methods include:
- Attention optimization (main page)
- Local attention
- Linear attention
- Sparse attention
- Multi-Head Attention (MHA)
- Multi-Query Attention (MQA)
- Grouped-Query Attention (GQA)
- Flash attention
- Paged attention
Other topics in attention research:
- Low-rank matrix attention
- Medusa attention
- Block attention
- Cross attention
- Fused head attention
- Hybrid local-global attention
- FFT attention
- QKV computation optimizations
- Additive attention
- Multiplicative attention
- Graph attention
- Chunked attention
- Attention sink
- Attention steering
- Bilinear attention
- Attention-free methods
- Mixture-of-Heads (MOH) Attention (MoE+MHA)
- Star attention
- Ring attention
More AI Research
Read more about: