Aussie AI

Multi-head Latent Attention (MLA)

  • Last Updated 21 March, 2025
  • by David Spuler, Ph.D.

What is Multi-head Latent Attention (MLA)?

Multi-head Latent Attention (MLA) is an LLM attention optimization developed by DeepSeek. It became well-known with the release of the DeepSeek R1 reasoning model in early 2025, but was actually developed earlier for their V2/V3 non-reasoning models in mid-to-late 2024.

MLA improves upon well-known LLM attention optimizations such as Multi-Head Attention (MHA) from the original Transformer paper, and the follow-on advancements of Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Whereas MQA and GQA reduce the KV cache by sharing keys and values across heads, MLA instead compresses the keys and values of all heads into a small shared latent vector per token, caching only the latent and reconstructing per-head keys and values from it. Subsequently, DeepSeek has also released as open-source the code for a combination of MLA and Flash Attention called "FlashMLA."
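The core idea of the latent compression can be sketched in a few lines of numpy. This is a minimal illustrative sketch, not DeepSeek's actual implementation: the dimensions, weight initializations, and variable names are assumptions chosen for clarity, and details such as the decoupled RoPE pathway in the real MLA design are omitted. The key point is that the cache stores only the small latent vector `c_kv` per token, rather than full per-head keys and values.

```python
import numpy as np

# Hypothetical toy dimensions (assumptions for illustration, not DeepSeek's).
d_model, d_latent, n_heads, d_head = 64, 16, 4, 16
seq_len = 8
rng = np.random.default_rng(0)

h = rng.standard_normal((seq_len, d_model))  # per-token hidden states

# Down-projection: compress each token's KV information into a small latent.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
# Up-projections: reconstruct per-head keys and values from the latent.
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_q = rng.standard_normal((d_model, n_heads * d_head)) / np.sqrt(d_model)

c_kv = h @ W_down  # (seq_len, d_latent) -- the ONLY thing the KV cache stores

k = (c_kv @ W_up_k).reshape(seq_len, n_heads, d_head)
v = (c_kv @ W_up_v).reshape(seq_len, n_heads, d_head)
q = (h @ W_q).reshape(seq_len, n_heads, d_head)

# Standard scaled dot-product attention, computed per head.
scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum('hqk,khd->qhd', weights, v).reshape(seq_len, n_heads * d_head)

# Cache size per token: latent (MLA) vs. full K+V across all heads (MHA).
print("MLA cache floats/token:", d_latent)              # 16
print("MHA cache floats/token:", 2 * n_heads * d_head)  # 128
```

In this toy setup the per-token cache shrinks from 128 floats (keys plus values across 4 heads) to a 16-float latent, an 8x reduction; the trade-off is the extra up-projection matrix multiplies at inference time, which in practice can be folded into adjacent projections.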

Research on MLA

Research papers on MLA include:

More AI Research

Read more about: