Aussie AI

Transformer Architectures

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

The Transformer was itself a major architectural advance in 2017. Since then, numerous modified Transformer architectures have been tested, and many ways to optimize Transformers have been found. The major architectural changes are discussed below; see also Transformer code optimizations, inference optimization techniques, and the very long list of Transformer optimizations.

Global Transformer Architecture Changes

Since the introduction of the vanilla Transformer in 2017, researchers have been searching for optimizations up and down the Transformer's tech stack.

Global Transformer optimizations: Some of the architectural-level optimizations to Transformer inference engines include:

Depth-wise optimizations (layers):

  • Layer pruning / early exit: cutting short the number of layers executed in the encoder and/or decoder is a successful optimization strategy; see layer pruning and early exit inference (a sketch of early exit appears after this list).
  • Shallow decoder architecture. This idea for modifying the Transformer's decoder architecture uses layer pruning in the decoder to achieve a "deep encoder/shallow decoder" architecture, as reported in several research papers, such as Kasai et al. (2021) and Hsu et al. (2020); see shallow decoder architectures, and also depth pruning.
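
For illustration, here is a minimal Python sketch of early-exit inference; the layer list, exit head, and confidence threshold are hypothetical placeholders rather than any particular engine's API.

    import torch
    import torch.nn.functional as F

    def early_exit_forward(layers, exit_head, hidden, threshold=0.9):
        """Run decoder layers one at a time, exiting early once the
        intermediate prediction is confident enough (illustrative only)."""
        for i, layer in enumerate(layers):
            hidden = layer(hidden)
            logits = exit_head(hidden[:, -1, :])            # last-token prediction
            confidence = F.softmax(logits, dim=-1).max(dim=-1).values
            if bool((confidence > threshold).all()):
                return logits, i + 1                        # exited after i+1 layers
        return logits, len(layers)                          # ran the full depth

In practice, the exit criterion and which layers carry exit heads are design decisions; this sketch just shows the basic control flow of skipping the remaining layers.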

Width-wise optimizations (attention heads):

  • Attention head pruning: not all of the attention heads are important, especially in the decoder (numerous research papers; see head pruning, and the head-masking sketch after this list). There's some irony here when you consider the title of the original 2017 Transformer paper, "Attention Is All You Need"!
  • Flash Attention: Of all the various attention optimizations, Flash Attention (Dao et al., June 2022), and particularly Flash Attention 2 (Dao, July 2023), seems to have emerged as the most popular. See attention optimization methods.
  • Smaller Attention Head Components. It is possible to use simplified attention head components, especially in the decoder, as in Kasai et al. (2021); see approximate attention heads and head pruning. Another method is "weight sharing" for attention heads (or "fused" heads), such as in Zhai et al. (2023).
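
As a rough illustration of head pruning, the sketch below zeroes out the contribution of unwanted heads in a plain multi-head attention implementation; real pruning removes the corresponding weight slices entirely, and the weight matrices and shapes here are illustrative, not any library's API.

    import math
    import torch

    def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads, keep_heads=None):
        """Scaled dot-product attention with optional head masking (illustrative)."""
        batch, seq, dim = x.shape
        head_dim = dim // num_heads

        def split(t):  # (batch, seq, dim) -> (batch, heads, seq, head_dim)
            return t.view(batch, seq, num_heads, head_dim).transpose(1, 2)

        q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)
        scores = (q @ k.transpose(-2, -1)) / math.sqrt(head_dim)
        out = torch.softmax(scores, dim=-1) @ v         # (batch, heads, seq, head_dim)

        if keep_heads is not None:                      # emulate pruning: zero unwanted heads
            mask = torch.zeros(num_heads, device=x.device)
            mask[list(keep_heads)] = 1.0
            out = out * mask.view(1, num_heads, 1, 1)

        out = out.transpose(1, 2).reshape(batch, seq, dim)
        return out @ W_o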

Lengthwise optimizations (input sequences):

Too Much of a Smart Thing: Just like in Highlander, there can be only one. No, wait, that's incorrect: in AI you can have several, which is called "multi-AI" or "ensemble" AI:

  • Ensemble architectures. Most architectures with two or more Transformers aim to achieve more advanced reasoning (usually at the cost of speed), but the "big-small" dual architecture aims to improve inference speed by sending common queries to the smaller model (a routing sketch appears below). See ensemble architectures.
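
A minimal sketch of big-small routing is shown below; the classifier, threshold, and model objects are hypothetical placeholders for whatever difficulty estimator and inference backends are actually used.

    def route_query(query, small_model, large_model, classifier, threshold=0.8):
        """Send 'easy' queries to the small model, hard ones to the large model."""
        if classifier(query) >= threshold:       # probability the small model suffices
            return small_model.generate(query)   # fast path for common queries
        return large_model.generate(query)       # fall back to the big model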

Component-Level Transformer Architecture Changes

Attention heads are addressed above under width pruning, and layers under depth pruning, but various other Transformer components can also be optimized:

Normalization optimizations:

Activation function optimizations:

Decoder algorithms:

  • Faster decoding algorithms. Research in Transformers includes beam search decoding and greedy decoding. There's also aggressive decoding, speculative decoding and collaborative decoding.
  • Speculative decoding (supervised dual decoding). A parallelization method whereby a small model runs ahead and drafts candidate tokens, while a larger model verifies the draft, confirming or vetoing the suggested tokens (which is basically the plot of Terminator 2). If the small model is usually correct, this speeds up the overall process compared to running only the large model. This is similar to big-little architectures, but differs because both models are still running. See speculative decoding (a sketch of one verification round appears after this list).
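
Here is a simplified Python sketch of one round of greedy speculative decoding; the draft and target models are assumed to be callables returning per-position logits, which is an illustrative simplification of real implementations.

    import torch

    def speculative_decode_step(draft_model, target_model, prompt_ids, k=4):
        """One round of greedy speculative decoding (illustrative).
        draft_model / target_model: callables mapping a 1D LongTensor of
        token ids to logits of shape (seq_len, vocab_size)."""
        # 1. The small draft model proposes k tokens greedily.
        draft = prompt_ids.clone()
        for _ in range(k):
            next_tok = draft_model(draft)[-1].argmax()
            draft = torch.cat([draft, next_tok.view(1)])

        # 2. The large target model scores the whole draft in one parallel pass.
        target_preds = target_model(draft).argmax(dim=-1)

        # 3. Accept drafted tokens while they match the target model's own choice.
        accepted = prompt_ids.clone()
        for i in range(len(prompt_ids), len(draft)):
            verified = target_preds[i - 1]       # target's prediction for position i
            if draft[i] == verified:
                accepted = torch.cat([accepted, draft[i].view(1)])
            else:
                accepted = torch.cat([accepted, verified.view(1)])
                break                            # reject the remainder of the draft
        return accepted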

Feed-Forward Network optimizations:

  • FFN Pruning. Simplified decoders, with the FFN removed, as in Kasai et al. (2021), although the benefit may depend on the use case; see the "FFN pruning" section (a sketch appears below).
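
The sketch below shows the basic idea of an FFN-pruned decoder layer in PyTorch: when the FFN sublayer is dropped, only self-attention and normalization remain. The class and parameter names are illustrative, not from any specific codebase.

    import torch.nn as nn

    class PrunableDecoderLayer(nn.Module):
        """Decoder layer whose FFN sublayer can be removed entirely (illustrative)."""

        def __init__(self, dim, num_heads, ffn_dim, use_ffn=True):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.use_ffn = use_ffn
            if use_ffn:
                self.ffn = nn.Sequential(
                    nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
                self.norm2 = nn.LayerNorm(dim)

        def forward(self, x, attn_mask=None):
            a, _ = self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)
            x = self.norm1(x + a)
            if self.use_ffn:
                x = self.norm2(x + self.ffn(x))   # skipped entirely when pruned
            return x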

MatMul optimizations: Also known as GEMM and various other dumb names. It's the matrix multiplications and vector dot products you did in high school.
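
For readers who want the reminder, here is the whole operation in plain Python (a naive sketch; production engines instead use highly tuned, tiled GEMM kernels on the GPU):

    def dot(u, v):
        """Vector dot product: sum of pairwise products."""
        return sum(a * b for a, b in zip(u, v))

    def matmul(A, B):
        """Naive matrix multiply: C[i][j] is the dot product of
        row i of A with column j of B."""
        cols_B = list(zip(*B))                 # columns of B
        return [[dot(row, col) for col in cols_B] for row in A]

    # Example: (2x3) @ (3x2) -> (2x2)
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    assert matmul(A, B) == [[58, 64], [139, 154]]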

Softmax optimizations: Softmax occurs less frequently than MatMul, but it can still be optimized:
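
For reference, the standard numerically stable form of Softmax subtracts the maximum before exponentiating, as in this small Python sketch:

    import math

    def softmax(logits):
        """Numerically stable softmax: subtracting the max prevents exp()
        overflow without changing the mathematical result."""
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        return [e / total for e in exps]

    print(softmax([1.0, 2.0, 3.0]))   # approx [0.090, 0.245, 0.665]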

Positional encoding optimizations: Not usually considered a bottleneck, but even the PE can be optimized:
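
For context, the classic sinusoidal positional encoding from the 2017 paper is cheap to compute and trivially precomputed into a lookup table, as in this Python sketch:

    import math

    def sinusoidal_positional_encoding(seq_len, dim):
        """PE[pos, 2i] = sin(pos / 10000^(2i/dim)), PE[pos, 2i+1] = cos(...);
        precomputing (or caching) this table is one simple optimization."""
        pe = [[0.0] * dim for _ in range(seq_len)]
        for pos in range(seq_len):
            for i in range(0, dim, 2):
                angle = pos / (10000 ** (i / dim))
                pe[pos][i] = math.sin(angle)
                if i + 1 < dim:
                    pe[pos][i + 1] = math.cos(angle)
        return pe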

But wait, there's more. There are many more ways to optimize; refer to the complete list of Transformer optimizations.

Survey Papers on Transformer Architectures

Several papers have surveyed the literature for the latest Transformer ideas:

Decoder-Only Architectures

Decoder-only architectures are the modern form of Transformers, used by models such as GPT. It was discovered that the encoder in the older 2017 encoder-decoder Transformers was not needed for many generative tasks, and was actually a source of inefficiency. Decoder-only models are faster and need fewer weights, because they drop the encoder stack and the cross-attention sublayers.
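
To make the difference concrete, here is a minimal PyTorch sketch of a GPT-style decoder-only block: it has causal self-attention and an FFN, but no cross-attention sublayer, because there is no encoder output to attend to. The dimensions are illustrative defaults, not from any particular model.

    import torch
    import torch.nn as nn

    class DecoderOnlyBlock(nn.Module):
        """One GPT-style block: causal self-attention + FFN, no cross-attention."""

        def __init__(self, dim=512, num_heads=8, ffn_dim=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, x):                      # x: (batch, seq, dim)
            seq = x.size(1)
            causal = torch.triu(                   # True = not allowed to attend
                torch.ones(seq, seq, device=x.device), diagonal=1).bool()
            a, _ = self.attn(x, x, x, attn_mask=causal, need_weights=False)
            x = self.norm1(x + a)                  # masked self-attention only
            return self.norm2(x + self.ffn(x))     # no encoder cross-attention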

Research on the decoder-only transformer architectures:

  • Sathya Krishnan Suresh, Shunmugapriya P, 24 Apr 2024 (v2), Towards smaller, faster decoder-only transformers: Architectural variants and their implications, https://arxiv.org/abs/2404.14462 Code: https://github.com/SkAndMl/gpt-variations (Focuses on three new variants of decoder-only Transformer architectures: ParallelGPT (p-gpt), LinearlyCompressedGPT (lc-gpt), and ConvCompressedGPT (cc-gpt).)
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Jesse Roberts, 2 Feb 2024 (v3), How Powerful are Decoder-Only Transformer Neural Models? https://arxiv.org/abs/2305.17026
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel, Apr 2022, What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? https://arxiv.org/abs/2204.05832
  • Urvashi Khandelwal, Kevin Clark, Dan Jurafsky, Lukasz Kaiser, 21 May 2019, Sample Efficient Text Summarization Using a Single Pre-Trained Transformer, https://arxiv.org/abs/1905.08836
  • Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Łukasz Kaiser, Noam Shazeer, Jan 2018, GENERATING WIKIPEDIA BY SUMMARIZING LONG SEQUENCES, ICLR 2018 https://arxiv.org/pdf/1801.10198.pdf
  • Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by Summarizing Long Sequences. In Proceedings of the 6th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1801.10198
  • Yumo Bai, Feb 3, 2024 Why are most LLMs decoder-only? Dive into the rabbit hole of recent advancement in Large Language Models, https://medium.com/@yumo-bai/why-are-most-llms-decoder-only-590c903e4789
  • M Fujitake, 2023 DTrOCR: Decoder-only Transformer for Optical Character Recognition, https://arxiv.org/pdf/2308.15996.pdf
  • Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi, 26 Feb 2024, Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding, https://arxiv.org/abs/2402.16844
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • Meta, July 23, 2024, Introducing Llama 3.1: Our most capable models to date, https://ai.meta.com/blog/meta-llama-3-1/
  • Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
  • Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825

Encoder-Decoder Architectures

Encoder-decoder Transformers are the older architecture from 2017. Decoder-only architectures have largely superseded this version, but it is still used for some use cases, such as machine translation (foreign language translation).

Research on encoder-decoder architectures:

Encoder-Only Architectures

Encoder-only architectures lack a decoder, and are used only where the output is not a full text sequence. This makes sense for models whose output is an embedding vector, such as BERT-style models used for classification or semantic search. Most modern generative LLMs do not use this architecture.
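
As a small illustration of embedding-style usage, the sketch below mean-pools the hidden states of a BERT-style encoder into a single sentence vector; it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, which are examples rather than recommendations.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Encoder-only models map text to vectors.", return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state          # (1, seq_len, 768)

    # Mean-pool the token vectors into a single sentence embedding.
    mask = inputs["attention_mask"].unsqueeze(-1)
    embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    print(embedding.shape)                                  # torch.Size([1, 768])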

Research on encoder-only architectures:

  • Ting Hu, Christoph Meinel, Haojin Yang, 2024, A flexible BERT model enabling width- and depth-dynamic inference, Computer Speech & Language 4 April 2024, 101646, https://www.sciencedirect.com/science/article/pii/S0885230824000299 (Dual pruning method with layerwise "neural grafting" that gives dynamic width models, and combined with early exit on the depth dimension.)
  • Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
  • Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825

Hybrid Transformer Architectures

Transformer architectures have been merged with aspects of previous neural network theory to create hybrid architectures. Examples include:

  • Vision Transformer (ViT)
  • Transformer-RNN hybrid architectures
  • Transformer-CNN hybrid architectures

Research papers on hybrid transformer architectures:

Innovative New Transformer Architecture Research Papers

Since the original Transformer paper in 2017, and the various other Transformer milestone papers that followed, numerous architectural variations have been proposed to address efficiency or accuracy concerns. Research papers on specific modifications to the Transformer architecture include:

That's a Lot of BERTs!

BERT was an early and significantly innovative Transformer encoder architecture, released in late 2018. Since then, there have been a great many variants of "BERT" (e.g., FastBERT, MobileBERT, DistilBERT, etc.). Research papers on variants of BERT include:

  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, May 2019, https://arxiv.org/abs/1810.04805, Code: https://github.com/google-research/bert (The one BERT to rule them all.)
  • Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020. https://arxiv.org/abs/2004.02984
  • Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, and Qi Ju. FastBERT: a self-distilling BERT with adaptive inference time. arXiv preprint arXiv:2004.02178, 2020. https://arxiv.org/abs/2004.02178
  • Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin. DeeBERT: Dynamic early exiting for accelerating BERT inference. arXiv preprint arXiv:2004.12993, 2020. https://arxiv.org/abs/2004.12993
  • Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019, https://arxiv.org/abs/1910.01108
  • Forrest N Iandola, Albert E Shaw, Ravi Krishna, and Kurt W Keutzer. SqueezeBERT: What can computer vision teach NLP about efficient neural networks? arXiv preprint arXiv:2006.11316, 2020. https://arxiv.org/abs/2006.11316
  • Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, and Jingren Zhou. AdaBERT: Task-adaptive BERT compression with differentiable neural architecture search. arXiv preprint arXiv:2001.04246, 2020. https://arxiv.org/abs/2001.04246
  • Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth. arXiv preprint arXiv:2004.04037, 2020. https://arxiv.org/abs/2004.04037
  • Zejian Liu, Fanrong Li, Gang Li, and Jian Cheng. EBERT: Efficient BERT Inference with Dynamic Structured Pruning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4814– 4823, 2021. https://aclanthology.org/2021.findings-acl.425/
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," CoRR, 2019. https://arxiv.org/abs/1907.11692
  • Z. Jiang, W. Yu, D. Zhou, Y. Chen, J. Feng, and S. Yan, “Convbert: Improving BERT with span-based dynamic convolution,” in NeurIPS, 2020, https://arxiv.org/abs/2008.02496
  • H. Bao, L. Dong, S. Piao, and F. Wei, “BEit: BERT pre-training of image transformers,” in International Conference on Learning Representations, 2022. https://arxiv.org/abs/2106.08254
  • Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu, Oct 2020, TinyBERT: Distilling BERT for Natural Language Understanding, https://arxiv.org/abs/1909.10351
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey

Next-Generation Architectures

What comes after Transformers? Maybe the answer is: more Transformers! Certainly, the newer multi-modal Transformers are gaining momentum, and there are other advanced Transformers:

  • Vision Transformer (ViT)
  • Multimodal transformer
  • Ensemble architectures (multi-AI, such as MoE)
  • Agent architectures (e.g., function calling, autonomous agents)
  • Advanced RAG architectures
  • Tool-Augmented Language Models (TALM)
  • Retrieval-Augmented Language Models (RALM)
  • Compound AI architectures

However, there are some alternatives to Transformers that have been gathering steam. Here are a few newer architectures already being worked on:

  • State Space Models (SSMs)
  • RWKV (Transformer-RNN hybrid)
  • Mamba (a type of SSM)
  • Graph Neural Networks and Knowledge Graph extensions
  • S4 and Hyena architectures
  • Spiking Neural Networks (SNNs) and Spiking Transformers
  • Weightless Neural Networks (WNNs)
  • Liquid Neural Networks (LNNs)
  • Hybrid Transformer-RNN architectures
  • Hybrid Transformer-CNN architectures

Research papers on next-gen architectures:

RWKV Architecture

The RWKV architecture is a hybrid Transformer-RNN architecture. Research papers on RWKV include:

State Space Models (SSMs)

  • Badri Narayana Patro, Vijay Srinivas Agneeswaran, 24 Apr 2024, Mamba-360: Survey of State Space Models as Transformer Alternative for Long Sequence Modelling: Methods, Applications, and Challenges, https://arxiv.org/abs/2404.16112
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Karan Goel, August 27, 2024, The On‑Device Intelligence Update https://cartesia.ai/blog/2024-08-27-on-device (On-device state space models.)
  • Nicolas Stellwag, 2024, Structured State Space Models, https://nicolasstellwag.com/download/structured_SSMs.pdf
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, Shiyu Li, Ziru Li, Mingyuan Ma, Tergel Molom-Ochir, Benjamin Morris, Haoxuan Shan, Jingwei Sun, Yitu Wang, Chiyue Wei, Xueying Wu, Yuhao Wu, Hao Frank Yang, Jingyang Zhang, Junyao Zhang, Qilin Zheng, Guanglei Zhou, Hai (Helen)Li, Yiran Chen, 8 Oct 2024. A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models, https://arxiv.org/abs/2410.07265
  • Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
  • Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah, 26 Nov 2024, Attamba: Attending To Multi-Token States, https://arxiv.org/abs/2411.17685
  • Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Ravi Netravali, Yida Wang, 28 Nov 2024, Marconi: Prefix Caching for the Era of Hybrid LLMs, https://arxiv.org/abs/2411.19379 (Prefix caching applied to hybrid SSM-Transformer LLMs.)

Hyena Architecture

  • Pierre-David Letourneau, Manish Kumar Singh, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, Fatih Porikli, 16 Jul 2024, PADRe: A Unifying Polynomial Attention Drop-in Replacement for Efficient Vision Transformer, https://arxiv.org/abs/2407.11306
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • H Xu, Z Bi, H Tseng, X Song, P Feng, From Transformers to the Future: An In-Depth Exploration of Modern Language Model Architectures, https://osf.io/n8r5j/download

Mamba Architecture

The Mamba architecture is an advanced architecture based on State Space Models (SSMs). Research papers on Mamba include:

Knowledge Graph AI Architectures

Knowledge graphs represent structured information as a graph of entities and relationships, typically a directed labeled graph. This additional structural information can improve LLM results, but it is not easy to integrate graph-structured data into the sequential text sequences expected by an LLM. One particular usage of knowledge graphs is to extend RAG architectures, called a "RAG Graph" (or "Graph RAG") architecture.
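
One simple way to bridge the gap is to "flatten" graph triples into plain sentences that are then injected into the LLM prompt, as in this small Python sketch (the triples and prompt wording are made up for illustration):

    def triples_to_context(triples):
        """Flatten knowledge-graph triples into sentences for an LLM prompt."""
        return "\n".join(f"{s} {p} {o}." for s, p, o in triples)

    triples = [
        ("The Transformer", "was introduced in", "2017"),
        ("GPT", "is a", "decoder-only Transformer"),
    ]
    prompt = (
        "Answer using only the facts below.\n\n"
        + triples_to_context(triples)
        + "\n\nQuestion: When was the Transformer introduced?"
    )
    print(prompt)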

Research papers on Knowledge Graphs in AI include:

Graph Neural Networks

Research papers on Graph Neural Networks (GNNs):

Compound AI Architectures

Compound AI architectures are a new category that generalizes both RAG and multi-AI ensemble architectures. The general idea is that various components can be placed around an LLM, or multiple LLM queries can be chained together, in a variety of ways. RAG is a well-known subcategory in this vein, as are extensions using Knowledge Graphs.

Research on Compound AI architectures:

Research on Efficient Architectures

More AI Research

Read more about: