Aussie AI
Mechanistic Interpretability
-
Last Updated 7 March, 2025
-
by David Spuler, Ph.D.
Mechanistic interpretability is the analysis of an LLM's internal computations to understand or interpret why the model emitted the answers that it did. This involves analyzing the activation signals in the latent space of the embeddings. Mechanistic interpretability was initially a read-only analysis to aid in explainability, but arithmetic modification of the activations is also possible, in methods such as attention steering and activation patching.
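The idea of arithmetically modifying activations can be illustrated with a minimal sketch of activation steering: compute a "steering vector" as the difference between the mean activations of two contrasting prompt sets, then add a scaled copy of it to a layer's activation at inference time. This is a toy illustration on plain Python lists, not a real model; the function names and the 3-dimensional "activations" are hypothetical.

```python
# Toy sketch of activation steering (hypothetical data, not a real LLM).
# The steering vector is the difference between mean activations for two
# contrasting prompt sets; adding a scaled copy to a layer's activation
# nudges the model's output toward one concept.

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(pos_acts, neg_acts):
    """Contrastive steering vector: mean(positive) - mean(negative)."""
    pos = mean_vector(pos_acts)
    neg = mean_vector(neg_acts)
    return [p - q for p, q in zip(pos, neg)]

def apply_steering(activation, steer, alpha=1.0):
    """Add the scaled steering vector to one layer's activation."""
    return [a + alpha * s for a, s in zip(activation, steer)]

# Hypothetical 3-dimensional activations from two sets of prompts.
positive = [[1.0, 0.0, 2.0], [1.0, 2.0, 2.0]]   # e.g. "happy" prompts
negative = [[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # e.g. "sad" prompts

steer = steering_vector(positive, negative)       # [1.0, 0.5, 1.0]
steered = apply_steering([0.5, 0.5, 0.5], steer, alpha=2.0)
```

In a real LLM, the same addition would be applied to the residual stream of a chosen layer during the forward pass (e.g. via a forward hook), and read-only variants of the same machinery underlie activation patching, where an activation from one run is substituted into another.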
Research on Mechanistic Interpretability
Research papers on mechanistic interpretability:
- Daking Rai, Yilun Zhou, Shi Feng, Abulhair Saparov, Ziyu Yao, 2 Jul 2024, A Practical Review of Mechanistic Interpretability for Transformer-Based Language Models, https://arxiv.org/abs/2407.02646
- Mengru Wang, Yunzhi Yao, Ziwen Xu, Shuofei Qiao, Shumin Deng, Peng Wang, Xiang Chen, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang, 4 Dec 2024 (v4), Knowledge Mechanisms in Large Language Models: A Survey and Perspective, https://arxiv.org/abs/2407.15017
- Xintong Wang, Jingheng Pan, Longqin Jiang, Liang Ding, Xingshan Li, Chris Biemann, 23 Oct 2024, CogSteer: Cognition-Inspired Selective Layer Intervention for Efficient Semantic Steering in Large Language Models, https://arxiv.org/abs/2410.17714
- Hesam Hosseini, Ghazal Hosseini Mighan, Amirabbas Afzali, Sajjad Amini, Amir Houmansadr, 15 Nov 2024, ULTra: Unveiling Latent Token Interpretability in Transformer Based Understanding, https://arxiv.org/abs/2411.12589
- Naomi Saphra, Sarah Wiegreffe, 7 Oct 2024, Mechanistic? https://arxiv.org/abs/2410.09087
- Leonard Bereska, Efstratios Gavves, 23 Aug 2024 (v3), Mechanistic Interpretability for AI Safety -- A Review, https://arxiv.org/abs/2404.14082
- Neel Nanda, 8th Jul 2024, An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2, https://www.alignmentforum.org/posts/NfFST5Mio7BCAQHPA/an-extremely-opinionated-annotated-list-of-my-favourite
- Nikhil Anand, Dec 20, 2024, Understanding “steering” in LLMs And how simple math can solve global problems. https://ai.gopubby.com/understanding-steering-in-llms-96faf6e0bee7
- Chashi Mahiul Islam, Samuel Jacob Chacko, Mao Nishino, Xiuwen Liu, 7 Feb 2025, Mechanistic Understandings of Representation Vulnerabilities and Engineering Robust Vision Transformers, https://arxiv.org/abs/2502.04679
- Ala N. Tak, Amin Banayeeanzade, Anahita Bolourani, Mina Kian, Robin Jia, Jonathan Gratch, 8 Feb 2025, Mechanistic Interpretability of Emotion Inference in Large Language Models, https://arxiv.org/abs/2502.05489
- Artem Kirsanov, Chi-Ning Chou, Kyunghyun Cho, SueYeon Chung, 11 Feb 2025, The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models, https://arxiv.org/abs/2502.08009
- Zeping Yu, Yonatan Belinkov, Sophia Ananiadou, 15 Feb 2025, Back Attention: Understanding and Enhancing Multi-Hop Reasoning in Large Language Models, https://arxiv.org/abs/2502.10835
- Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple, 24 Feb 2025, Representation Engineering for Large-Language Models: Survey and Research Challenges, https://arxiv.org/abs/2502.17601
- Samuel Miller, Daking Rai, Ziyu Yao, 20 Feb 2025, Mechanistic Understanding of Language Models in Syntactic Code Completion, https://arxiv.org/abs/2502.18499
- Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin, 27 Feb 2025, Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking, https://arxiv.org/abs/2502.20129
- Kevin Meng, David Bau, Alex Andonian, Yonatan Belinkov, 13 Jan 2023 (v5), Locating and Editing Factual Associations in GPT, https://arxiv.org/abs/2202.05262
- Yingbing Huang, Deming Chen, Abhishek K. Umrawal, 28 Feb 2025, JAM: Controllable and Responsible Text Generation via Causal Reasoning and Latent Vector Manipulation, https://arxiv.org/abs/2502.20684
More Attention Research Topics
Related LLM research areas for long context optimization of the attention methods include:
- Attention optimization (main page)
- Local attention
- Linear attention
- Sparse attention
- Multi-Head Attention (MHA)
- Multi-Query Attention (MQA)
- Group-Query Attention (GQA)
- Flash attention
- Paged attention
Other topics in attention research:
- Low-rank matrix attention
- Medusa attention
- Block attention
- Cross attention
- Fused head attention
- Hybrid local-global attention
- FFT attention
- QKV computation optimizations
- Additive attention
- Multiplicative attention
- Graph attention
- Chunked attention
- Attention sink
- Attention steering
- Bilinear attention
- Attention-free methods
- Mixture-of-Heads (MOH) Attention (MoE+MHA)
- Star attention
- Ring attention
More AI Research
Read more about: