Aussie AI
Multi-head Latent Attention (MLA)
-
Last Updated 21 March, 2025
-
by David Spuler, Ph.D.
What is Multi-head Latent Attention (MLA)?
Multi-head Latent Attention (MLA) is an LLM attention optimization developed by DeepSeek. It became well known with the release of the DeepSeek R1 reasoning model in early 2025, but had actually been developed earlier for their V2 and V3 non-reasoning models in mid-to-late 2024.
MLA improves upon well-known LLM attention methods such as Multi-Head Attention (MHA) from the original Transformer paper, and the follow-on optimizations of Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Its key idea is to compress the keys and values into a much smaller latent vector, which greatly shrinks the KV cache (see the sketch below). Subsequently, DeepSeek has also released as open source the code for a combination of MLA and Flash Attention called "FlashMLA."
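As a rough illustration of the core idea, the PyTorch sketch below compresses each token's keys and values into a single small latent vector, caches only that latent, and re-expands it to full keys and values at attention time. This is a simplified approximation, not DeepSeek's implementation: it omits the decoupled RoPE positional path and the matrix-absorption optimizations, and all names and dimensions (SimpleMLA, d_latent, W_down_kv, and so on) are illustrative assumptions.

# Minimal sketch of MLA-style key/value compression (illustrative only).
# Not DeepSeek's code: no decoupled RoPE path, no absorbed projections;
# dimensions and parameter names are assumptions for this demo.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMLA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        # Down-projection: compress the hidden state into a small latent vector.
        # Only this latent (d_latent values per token) is stored in the KV cache.
        self.W_down_kv = nn.Linear(d_model, d_latent, bias=False)
        # Up-projections: reconstruct full keys and values from the latent.
        self.W_up_k = nn.Linear(d_latent, d_model, bias=False)
        self.W_up_v = nn.Linear(d_latent, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, latent_cache=None):
        # x: [batch, seq, d_model] -- new token(s) to process.
        B, S, _ = x.shape
        c_kv = self.W_down_kv(x)                      # [B, S, d_latent]
        if latent_cache is not None:
            c_kv = torch.cat([latent_cache, c_kv], dim=1)
        T = c_kv.shape[1]                             # total cached length

        q = self.W_q(x).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        k = self.W_up_k(c_kv).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = self.W_up_v(c_kv).view(B, T, self.n_heads, self.d_head).transpose(1, 2)

        # Standard scaled dot-product attention over the reconstructed K/V.
        # Causal masking only applies during prefill (no cache yet); a single
        # decoded token may attend to the whole cache.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(latent_cache is None))
        out = out.transpose(1, 2).reshape(B, S, -1)
        # Return the latent as the new cache instead of full K and V tensors.
        return self.W_o(out), c_kv

In this toy configuration the cache stores 64 values per token per layer, instead of the 2 x 512 = 1024 values a standard MHA KV cache would store, which is where MLA's memory savings come from.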
Research on MLA
Research papers on MLA include:
- The SGLang Team, Sep 04, 2024 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision, https://lmsys.org/blog/2024-09-04-sglang-v0-3/
- DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. (many additional authors), 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, https://arxiv.org/abs/2405.04434
- Tim Urista, Dec 2024, Dramatically Reduce Inference Costs with DeepSeek-V3: A New Era in Open-Source LLMs, https://ai.gopubby.com/dramatically-reduce-inference-costs-with-deepseek-v3-a-new-era-in-open-source-llms-4f1adf760ee1
- Minhajul Hoque, Jan 4, 2025, DeepSeek V3: How They Achieved Big Results with Small Compute, https://ai.plainenglish.io/deepseek-v3-how-they-achieved-big-results-with-small-compute-fb694606d59a (DeepSeek optimizations included FP8 quantization with outlier handling, attention and KV cache optimization via Multi-Head Latent Attention (MLA), and multi-token decoding.)
- Dr. Ashish Bamania, Feb 2025, Multi-Head Latent Attention Is The Powerful Engine Behind DeepSeek: A deep dive Into DeepSeek’s innovative Attention mechanism that makes its LLMs so good https://levelup.gitconnected.com/multi-head-latent-attention-is-the-powerful-engine-behind-deepseek-0ecfd29e0b04 (MLA versus GQA/MQA attention and how MLA achieves KV cache compression.)
- Fanxu Meng, Zengwei Yao, Muhan Zhang, 13 Feb 2025 (v2), TransMLA: Multi-Head Latent Attention Is All You Need, https://arxiv.org/abs/2502.07864
- Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui, 20 Feb 2025, Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs, https://arxiv.org/abs/2502.14837
- DeepSeek, Feb 2025 (accessed), FlashMLA: FlashMLA is an efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences serving, https://github.com/deepseek-ai/FlashMLA
- Nickie Louise, February 24, 2025, DeepSeek launches FlashMLA: A breakthrough in AI speed and efficiency for NVIDIA GPUs, https://techstartups.com/2025/02/24/deepseek-launches-flashmla-a-breakthrough-in-ai-speed-and-efficiency-for-nvidia-gpus/
- Ashley Goolam, March 4, 2025, DeepSeek Open Source Week: A Complete Summary, https://apidog.com/blog/deepseek-open-source-week/
- Benjamin Spector, Aaryan Singhal, Dan Fu, Chris Ré, March 4, 2025, ThunderMLA: FlashMLA, Faster and Fused-er! https://hazyresearch.stanford.edu/blog/2025-03-04-thundermla https://github.com/HazyResearch/ThunderKittens/blob/mla/kernels/attn/demo/mla_decode/template_mla_decode.cu (Using a single CUDA "megakernel" to perform all jobs and passing it meta-instructions, thereby avoiding launching and shutting down kernels.)
- Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum, 14 Mar 2025, X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression, https://arxiv.org/abs/2503.11132