Aussie AI
Star Attention
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
Star attention is an LLM attention optimization that reduces the cost of attention computations over long token sequences. It uses a "block sparsity" approximation to avoid the quadratic complexity of full LLM attention: the long context is encoded block by block, with each block attending only to itself and a shared "anchor" block, so the context-encoding cost grows roughly linearly with context length (similar in spirit to "linear attention"). Query tokens then attend to the cached blocks and the per-block partial results are merged efficiently, so attention scores never need to be computed between every pair of context tokens.
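To make the block-sparsity idea concrete, below is a minimal single-head NumPy sketch of the two-phase scheme described in the Star Attention paper: phase 1 encodes the context blockwise, with each block attending only to itself plus a shared "anchor" block, and phase 2 lets a query token attend to all cached blocks, merging the per-block partial results with online-softmax (log-sum-exp) statistics. The block size, dimensions, helper names, and the single-head simplification are illustrative assumptions, not NVIDIA's reference implementation.

```python
# Minimal single-head sketch of Star-Attention-style two-phase block-sparse
# attention (illustrative only; block size and shapes are assumptions).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def phase1_context_encoding(q, k, v, block_size):
    """Phase 1: each context block attends only to itself and the first
    ("anchor") block, so cost grows linearly with context length."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(v)
    anchor_k, anchor_v = k[:block_size], v[:block_size]
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qb = q[start:end]
        if start == 0:
            kb, vb = anchor_k, anchor_v  # the anchor block attends to itself
        else:
            kb = np.concatenate([anchor_k, k[start:end]])  # anchor + local block
            vb = np.concatenate([anchor_v, v[start:end]])
        scores = (qb @ kb.T) * scale
        out[start:end] = softmax(scores) @ vb
    return out

def phase2_query_attention(q, k_blocks, v_blocks):
    """Phase 2: a query token attends to every cached KV block; per-block
    partial results are merged with running log-sum-exp (online softmax)
    statistics rather than materializing one giant score matrix."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    merged = np.zeros(d)
    m, s = -np.inf, 0.0  # running max and running softmax denominator
    for kb, vb in zip(k_blocks, v_blocks):
        scores = (q @ kb.T) * scale
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)
        e = np.exp(scores - m_new)
        s = s * correction + e.sum()
        merged = merged * correction + e @ vb
        m = m_new
    return merged / s

# Tiny usage example with random tensors.
rng = np.random.default_rng(0)
n, d, block = 16, 8, 4
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
ctx = phase1_context_encoding(rng.standard_normal((n, d)), k, v, block)
query = rng.standard_normal(d)
k_blocks = [k[i:i + block] for i in range(0, n, block)]
v_blocks = [v[i:i + block] for i in range(0, n, block)]
answer = phase2_query_attention(query, k_blocks, v_blocks)
print(ctx.shape, answer.shape)
```

In a multi-host deployment, each host would hold only its own context blocks, and the phase-2 log-sum-exp merge is what lets the partial attention outputs be combined into the exact global softmax result.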
Related LLM research areas for long context optimization of the attention methods include:
- Attention optimization
- Local attention
- Linear attention
- Sparse attention
- Multi-Head Attention (MHA)
- Multi-Query Attention (MQA)
- Group-Query Attention (GQA)
- Flash attention
- Paged attention
Research on Star Attention
- Shantanu Acharya, Fei Jia, Boris Ginsburg, 26 Nov 2024, Star Attention: Efficient LLM Inference over Long Sequences, https://arxiv.org/abs/2411.17116
- Aswin Ak, November 28, 2024, NVIDIA AI Research Unveils ‘Star Attention’: A Novel AI Algorithm for Efficient LLM Long-Context Inference, https://www.marktechpost.com/2024/11/28/nvidia-ai-research-unveils-star-attention-a-novel-ai-algorithm-for-efficient-llm-long-context-inference/