Aussie AI
Transformer Optimization
-
Last Updated 27 October, 2024
-
by David Spuler, Ph.D.
The Transformer was invented at Google in 2017 and open-sourced by its research group. It became the most widely used AI engine architecture, notably underpinning OpenAI's GPT-3 and ChatGPT. Since then, optimization research has taken off. There are two basic ways to optimize Transformer models:
- Transformer architecture improvements: large-scale changes to the model design.
- Transformer code optimizations: smaller code-level speedups, discussed below.
There are many ways to optimize a Transformer at the code level. Much research has also been conducted on modifications to the Transformer architecture that improve latency and throughput in both inference and training.
Transformer Inference Optimizations
See also these articles for further information on Transformer inference optimization:
- What's Hot in Inference Optimization?
- 500+ Techniques for LLM Inference Optimization
- Long List of LLM Optimization Techniques
- Inference Optimization Research Blog
Transformer Kernel Code Optimizations
Some of the specific kernel optimizations of inference engines include:
- Attention head caching: Precomputing and caching attention head matrices for already-processed tokens (Hugging Face, 2021). This reduces the auto-regression cost when outputting multiple tokens (the usual case). See also attention head pruning.
- KV Caching: Caching the K and V tensors computed by the attention heads during decoding (Intel, 2023), so they are not recomputed for previously generated tokens. This reduces the number of decoder matrix multiplications; a minimal sketch appears after this list. See KV caching research.
- Padding optimizations: Removing padding from the Feed Forward Network tensor/matrix computations (Intel, 2023; also in ByteTransformer by Zhai et al., 2023); see "zero padding removal". This reduces the total number of multiplications (sketched below).
- Attention dimensions: Merging the Q, K, and V matrices (which are of identical size) into a single larger matrix for better matrix multiplication throughput (Zhai et al., 2023); see the sketch after this list.
- Operator fusion and reordering: Reordering reshaping and matmul operations (Intel, 2023). This streamlines some of the arithmetic operations to use more compact low-level libraries. See kernel fusion optimizations.
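To make the KV caching idea concrete, here is a minimal single-head sketch in NumPy. This is illustrative only, not any particular engine's kernel: the softmax helper, the projection matrices Wq, Wk, and Wv, and the single-head shapes are all assumptions for the example. Each decoding step projects only the newest token and appends one row to each cache, instead of re-projecting the whole sequence.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def decode_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
        # x_new: (1, d_model) embedding of the newest token only.
        # k_cache, v_cache: (t, d_head) cached K/V rows for the t earlier tokens.
        q = x_new @ Wq                               # project only the new token
        k_cache = np.vstack([k_cache, x_new @ Wk])   # append one K row to the cache
        v_cache = np.vstack([v_cache, x_new @ Wv])   # append one V row to the cache
        scores = (q @ k_cache.T) / np.sqrt(q.shape[-1])
        out = softmax(scores) @ v_cache              # (1, d_head) attention output
        return out, k_cache, v_cache

Without the cache, every decoding step would re-project K and V for all earlier tokens; with it, each step does one small projection per matrix plus a single attention row.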
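Zero-padding removal can be sketched just as briefly (again a hypothetical example: mask marks the real tokens in a flattened, padded batch, and ffn stands in for any position-wise feed-forward function). Only the non-pad rows are gathered and processed, so the multiplications on padding positions are skipped entirely.

    import numpy as np

    def ffn_skip_padding(X, mask, ffn):
        # X: (tokens, d_model) flattened activations for a padded batch.
        # mask: boolean array marking the real (non-pad) rows.
        # ffn: any position-wise feed-forward function mapping (n, d) -> (n, d).
        idx = np.flatnonzero(mask)   # indices of real tokens only
        out = np.zeros_like(X)
        out[idx] = ffn(X[idx])       # run the FFN only on the non-pad rows
        return out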
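And the QKV merging trick amounts to one concatenation plus one larger GEMM instead of three smaller ones; a single big matrix multiplication generally achieves better hardware utilization. The names Wq, Wk, Wv, and Wqkv below are stand-ins for the example, not a specific library's API.

    import numpy as np

    d_model, d_head, seq = 512, 64, 16
    rng = np.random.default_rng(0)
    X = rng.standard_normal((seq, d_model))
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))

    # Baseline: three separate projections (three GEMMs).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Merged: concatenate the weights once, then do a single larger GEMM.
    Wqkv = np.concatenate([Wq, Wk, Wv], axis=1)      # (d_model, 3 * d_head)
    Q2, K2, V2 = np.split(X @ Wqkv, 3, axis=1)

    assert np.allclose(Q, Q2) and np.allclose(K, K2) and np.allclose(V, V2)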
Kernel Optimization Research Papers
Reference papers on some of the specific code optimizations in Transformer engines:
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (Contains a number of significant optimizations to the original Transformer architecture.)
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
- Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-based Generative Models, Jaewan Choi, Jaehyun Park, Kwanhee Kyung, Nam Sung Kim, and Jung Ho Ahn, IEEE Computer Architecture Letters, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10218731 (Efficient memory storage of K and V vectors in Transformer inference.)
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019, https://arxiv.org/abs/1904.01038, Code: https://github.com/pytorch/fairseq (Includes inference optimizations such as caching model states from previously generated tokens.)
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (This paper avoids zero-padding inputs amongst other optimizations.)
- Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 2022, Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp 1–36 https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737 (Extensive survey that contains a section on "Memoization" which is caching computed values for later reuse.)
See also general research on code optimizations.
Transformer General Optimizations
Some of the general classes of optimization techniques for the Transformer architecture include:
- Hardware-specific optimizations and low-level libraries (various)
- Model compilation (graph compilers / deep learning compilers)
- Transformer architectures
- Kernel optimizations (i.e. inference engine code optimizations such as caching and kernel fusion).
- Inference optimization techniques (numerous methods)
- Caching of entire query results for re-use by other users; this is called an Inference Cache (sketched below).
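As a trivial illustration of the last item, an inference cache can be as simple as a lookup table keyed on the normalized prompt. The sketch below uses a hypothetical generate function as a stand-in for the expensive model call; a production cache would also account for sampling settings, eviction policy, and semantic (near-match) lookups.

    from functools import lru_cache

    def generate(prompt: str) -> str:
        # Placeholder for the real (expensive) LLM inference call.
        return f"answer to: {prompt}"

    @lru_cache(maxsize=10_000)
    def cached_generate(prompt: str) -> str:
        # Identical normalized prompts are served from the cache with no model call.
        return generate(prompt.strip().lower())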
And here is a long list of other possible optimizations:
- Model compression
- Quantization (binary, ternary, logarithmic, 2-bit, 3-bit, 4-bit, 8-bit, FP8, FP16, stochastic, etc.); a minimal INT8 sketch follows this list
- Pruning (length, width, depth, dual, triple, layer, token, and more)
- Distillation
- Weight sharing
- Fusion: layer fusion, kernel operator fusion
- Skipping: including layer skipping, early exit, zero skipping
- Arithmetic: zero-multiplication, conditional computation, logs, approximations
- Decoding: speculative decoding, parallel decoding, aggressive decoding, etc.
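Of the techniques above, quantization is the easiest to show in a few lines. Below is a minimal symmetric per-tensor INT8 weight quantization sketch in NumPy (illustrative only; real schemes add per-channel scales, zero points, calibration data, and integer kernels rather than dequantizing back to floats).

    import numpy as np

    def quantize_int8(w):
        # Symmetric per-tensor INT8 quantization: int8 weights plus one float scale.
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
    q, scale = quantize_int8(w)
    print(np.max(np.abs(w - dequantize(q, scale))))   # small reconstruction error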
For even more, see inference optimizations, Transformer architectural optimizations, and a complete list of Transformer optimizations.
Survey Papers on Transformer Optimization
Review and survey papers on faster Transformer engines:
- Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Full stack optimization of transformer inference: a survey, Feb 2023, arXiv:2302.14017, https://arxiv.org/abs/2302.14017
- Full Stack Optimization of Transformer Inference: a Survey. Part 2 on Transformer Optimization, A Paper Overview, https://www.nebuly.com/blog/full-stack-optimization-of-transformer-inference-a-survey-part-2
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey (v2). arXiv preprint arXiv:2009.06732, 2022, https://arxiv.org/abs/2009.06732
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, A Survey of Techniques for Optimizing Transformer Inference, 2023, arxiv.org July 2023, https://arxiv.org/abs/2307.07982
- L Papa, P Russo, I Amerini, L Zhou, Sep 2023, A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking, arXiv preprint arXiv:2309.02031, 2023, https://arxiv.org/abs/2309.02031
- Efficient Attention: Breaking The Quadratic Transformer Bottleneck, 2023 (accessed 8/12/23), https://gwern.net/note/attention, (A regularly updated bibliography of transformer attention optimization papers)
Tips for Transformer Optimization
Articles and papers with general tips on optimizing a Transformer:
- Fabián Varietti, Rodrigo Gallardo, Ian Spektor, Francisco Kurucz, Facundo Parodi, A guide to optimizing Transformer-based models for faster inference, Nov 29, 2022, https://tryolabs.com/blog/2022/11/24/transformer-based-model-for-faster-inference
- Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
- Model optimization (TensorFlow), https://www.tensorflow.org/lite/performance/model_optimization
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (Contains a number of significant optimizations to the original Transformer architecture.)
- Weng, Lilian. (Jan 2023). Large Transformer Model Inference Optimization. Lil’Log. https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- Philipp Schmid, Accelerate Sentence Transformers with Hugging Face Optimum, August 2, 2022, https://www.philschmid.de/optimize-sentence-transformers
- Michaël Benesty, Hugging Face Transformer Inference Under 1 Millisecond Latency, Nov 5, 2021 https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c
Research on Specific Fast Transformers
These papers are on new faster Transformer architectures tested by researchers:
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (This paper uses zero-padding inputs and fused attention heads with shared parameters)
- Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya, Reformer: The efficient transformer, In International Conference on Learning Representations, 2019, https://arxiv.org/abs/2001.04451
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578
- NVIDIA, NVIDIA FasterTransformer, https://github.com/NVIDIA/FasterTransformer
General Research on Transformer Optimization
These papers review Transformer optimization techniques in general.
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, "Efficiently Scaling Transformer Inference", arXiv:2211.05102v1 [cs.LG], 9 Nov 2022, https://arxiv.org/abs/2211.05102
- Dave Dice, Alex Kogan, Optimizing Inference Performance of Transformers on CPUs, Feb 2021, https://arxiv.org/abs/2102.06621
- Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation. CoRR, abs/2006.10369. https://arxiv.org/abs/2006.10369, Code: https://github.com/jungokasai/deep-shallow (Single-layer decoder architecture, see also shallow decoder Transformer architectures inspired by this paper.)
- Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee, Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, July 2023, https://arxiv.org/abs/2307.05908
- Zining Zhang; Yao Chen; Bingsheng He; Zhenjie Zhang, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 6, June 2023, pp.1982-1995, https://ieeexplore.ieee.org/abstract/document/10107474
- So, D. R., Mańke, W., Liu, H., Dai, Z., Shazeer, N. M., and Le, Q. V., 2021 (updated Jan 2022), Primer: Searching for efficient transformers for language modeling, ArXiv, abs/2109.08668, https://arxiv.org/abs/2109.08668 Code: https://github.com/google-research/google-research/tree/master/primer (Has a different Transformer architecture, but not a common one.)
- Sukhbaatar, S., Grave, E., Bojanowski, P., and Joulin, A., Adaptive attention span in transformers. In Annual Meeting of the Association for Computational Linguistics, Aug 2019, https://arxiv.org/abs/1905.07799 (Self-adaptive context lengths for attention heads.)
- Bapna, A., Arivazhagan, N., and Firat, O., Controlling computation versus quality for neural sequence models. ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106 (Conditionally controls which subunits of the model can execute.)
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
- GI Yu, JS Jeong, GW Kim, S Kim, BG Chun, 2022, Orca: A distributed serving system for Transformer-Based generative models, 16th USENIX Symposium, https://www.usenix.org/conference/osdi22/presentation/yu, PDF: https://www.usenix.org/system/files/osdi22-yu.pdf (Improved parallelization/pipelining with latency reduction from iteration-level scheduling across multiple requests.)
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Interesting review of safety and bias/fairness issues for models optimized by quantization, pruning or distillation.)
- X Li, B Ren, X Shen, Y Wang, 2022, CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework, arXiv preprint arXiv:2206.10620, https://arxiv.org/abs/2206.10620 (Various optimizations including block pruning and deep reuse.)
Kernel Optimizations
- Soroush Ghodrati, Sean Kinzer, Hanyang Xu, Rohan Mahapatra, Yoonsung Kim, Byung Hoon Ahn, Dong Kai Wang, Lavanya Karthikeyan, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Hadi Esmaeilzadeh, April 2024, Tandem processor: Grappling with emerging operators in neural networks, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 1165–1182, https://doi.org/10.1145/3620665.3640365 https://dl.acm.org/doi/abs/10.1145/3620665.3640365 Code: https://actlab-genesys.github.io (Reviews hardware acceleration of all sub-layer kernel operators, with a focus beyond just GEMM/MatMul operators.)
- Make LLM Fine-tuning 2x faster with Unsloth and HF TRL, January 10, 2023, Daniel Han-Chen, https://huggingface.co/blog/unsloth-trl Code: https://github.com/huggingface/blog/blob/main/unsloth-trl.md (Optimizes some PyTorch kernels for back-propagation and reduces memory usage in fine-tuning; currently works with Llama and Mistral architectures.)
- H Shen, H Chang, B Dong, Y Luo, H Meng, Nov 2023, Efficient LLM Inference on CPUs, arXiv preprint arXiv:2311.00502, https://arxiv.org/pdf/2311.00502.pdf Code: https://github.com/intel/intel-extension-for-transformers (INT4 weight quantization with 16-bit activations, and highly optimized kernels with support for AVX2, AVX512, AVX512_VNNI and Advanced Matrix Extensions (AMX), and KV caching, tested on LLMs from 3B to 20B parameters, including Llama2, with 20-80ms latency per token.)
- Piotr Kluska, Adrián Castelló, Florian Scheidegger, A. Cristiano I. Malossi, 2024, QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers https://openaccess.thecvf.com/content/CVPR2024W/eLVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf
- Christian Szegedy et al., 2015, Going Deeper with Convolutions, http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf (The GoogLeNet paper.)
- Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif, 8 Apr 2021 (v2), Kernel Operations on the GPU, with Autodiff, without Memory Overflows, https://arxiv.org/abs/2004.11127 Code: https://www.kernel-operations.io/keops/index.html
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Alejandro Araya-Núñez, Justin Fernández-Badilla, Daniel González-Vargas, Jimena León-Huertas, Erick-Andrés Obregón-Fonseca, Danny Xie-Li, June, 2024, Proposal of an open-source accelerators library for inference of transformer networks in edge devices based on Linux, Tecnología en Marcha. Vol. 37, special issue. IEEE Latin American Electron Devices Conference (LAEDC), pages 118-125, https://doi.org/10.18845/tm.v37i5.7225 PDF: https://revistas.tec.ac.cr/index.php/tec_marcha/article/download/7225/7076
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 5 Jul 2024 (v3), Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041
- Zheming Jin, July 2024, Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL, Oak Ridge National Laboratory, ORNL/TM-2024/3463, https://info.ornl.gov/sites/publications/Files/Pub217394.pdf
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Intel, 2024, Get Started with Intel® oneAPI Math Kernel Library, https://www.intel.com/content/www/us/en/docs/onemkl/get-started-guide/2023-0/overview.html
- T Zhao, 2024, Acceleration of Deep Learning Algorithms with Transformers, https://escholarship.org/uc/item/3419t2z6
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- J. Bi et al., "Efficient and Fast High-performance Library Generation for Deep Learning Accelerators," in IEEE Transactions on Computers, doi: 10.1109/TC.2024.3475575, https://ieeexplore.ieee.org/abstract/document/10707341 (Finding the most efficient kernel.)
- Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
- Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
More AI Research
Read more about:
- List of AI Optimizations.
- Transformer architectures
- Inference Optimizations
- Shallow decoder architecture
- Inference Cache
- Zero-Multiplication Models
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Loop Optimizations
- Code Optimizations