Aussie AI
Transformer Optimization
-
Last Updated 27 October, 2024
-
by David Spuler, Ph.D.
The Transformer was invented at Google in 2017 and open-sourced by its research group. It became the most widely used AI engine architecture, notably underpinning OpenAI's GPT-3 and ChatGPT. Since then, optimization research has taken off. There are two basic ways to optimize Transformer models:
- Transformer architecture improvements: large-scale changes to the model design.
- Transformer code optimizations: smaller code-level speedups, discussed below.
There are many ways to optimize a Transformer at the code level. Much research has also been conducted on modifications to the Transformer architecture that improve latency and throughput in both inference and training.
Transformer Inference Optimizations
See also these articles for further information on Transformer inference optimization:
- What's Hot in Inference Optimization?
- 500+ Techniques for LLM Inference Optimization
- Long List of LLM Optimization Techniques
- Inference Optimization Research Blog
Transformer Kernel Code Optimizations
Some of the specific kernel optimizations of inference engines include:
- Attention head caching: Precomputing and caching attention head matrices for already-processed tokens (Hugging Face, 2021). This reduces the auto-regression cost when outputting multiple tokens (the usual case). See also attention head pruning.
- KV Caching: Caching the K and V tensors computed by the attention heads during decoding (Intel, 2023), so they are not recomputed for previously generated tokens. This reduces the number of decoder matrix multiplications; a minimal sketch appears after this list. See KV caching research.
- Padding optimizations: Removing padding from the Feed Forward Network tensor/matrix computations (Intel, 2023; also in ByteTransformer by Zhai et al., 2023); see "zero padding removal". This reduces the total number of multiplications (sketched below).
- Attention dimensions: Merging the Q, K, and V matrices (which are of identical size) into a single larger matrix for better matrix multiplication throughput (Zhai et al., 2023); see the sketch after this list.
- Operator fusion and reordering: Reordering reshaping and matmul operations (Intel, 2023). This streamlines some of the arithmetic operations to use more compact low-level libraries. See kernel fusion optimizations.
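To make the KV caching idea concrete, here is a minimal single-head sketch in NumPy. This is illustrative only, not any particular engine's kernel: the softmax helper, the projection matrices Wq, Wk, and Wv, and the single-head shapes are all assumptions for the example. Each decoding step projects only the newest token and appends one row to each cache, instead of re-projecting the whole sequence.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def decode_step(x_new, Wq, Wk, Wv, k_cache, v_cache):
        # x_new: (1, d_model) embedding of the newest token only.
        # k_cache, v_cache: (t, d_head) cached K/V rows for the t earlier tokens.
        q = x_new @ Wq                               # project only the new token
        k_cache = np.vstack([k_cache, x_new @ Wk])   # append one K row to the cache
        v_cache = np.vstack([v_cache, x_new @ Wv])   # append one V row to the cache
        scores = (q @ k_cache.T) / np.sqrt(q.shape[-1])
        out = softmax(scores) @ v_cache              # (1, d_head) attention output
        return out, k_cache, v_cache

Without the cache, every decoding step would re-project K and V for all earlier tokens; with it, each step does one small projection per matrix plus a single attention row.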
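Zero-padding removal can be sketched just as briefly (again a hypothetical example: mask marks the real tokens in a flattened, padded batch, and ffn stands in for any position-wise feed-forward function). Only the non-pad rows are gathered and processed, so the multiplications on padding positions are skipped entirely.

    import numpy as np

    def ffn_skip_padding(X, mask, ffn):
        # X: (tokens, d_model) flattened activations for a padded batch.
        # mask: boolean array marking the real (non-pad) rows.
        # ffn: any position-wise feed-forward function mapping (n, d) -> (n, d).
        idx = np.flatnonzero(mask)   # indices of real tokens only
        out = np.zeros_like(X)
        out[idx] = ffn(X[idx])       # run the FFN only on the non-pad rows
        return out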
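And the QKV merging trick amounts to one concatenation plus one larger GEMM instead of three smaller ones; a single big matrix multiplication generally achieves better hardware utilization. The names Wq, Wk, Wv, and Wqkv below are stand-ins for the example, not a specific library's API.

    import numpy as np

    d_model, d_head, seq = 512, 64, 16
    rng = np.random.default_rng(0)
    X = rng.standard_normal((seq, d_model))
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))

    # Baseline: three separate projections (three GEMMs).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv

    # Merged: concatenate the weights once, then do a single larger GEMM.
    Wqkv = np.concatenate([Wq, Wk, Wv], axis=1)      # (d_model, 3 * d_head)
    Q2, K2, V2 = np.split(X @ Wqkv, 3, axis=1)

    assert np.allclose(Q, Q2) and np.allclose(K, K2) and np.allclose(V, V2)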
Kernel Optimization Research Papers
Reference papers on some of the specific code optimizations in Transformer engines:
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (Contains a number of significant optimizations to the original Transformer architecture.)
- Kaixin Wu, Bojie Hu, and Qi Ju. 2021. TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task. In Proceedings of the Sixth Conference on Machine Translation, pages 795–798, Online. Association for Computational Linguistics, https://aclanthology.org/2021.wmt-1.77/, Code: https://github.com/TenTrans/TenTrans
- Unleashing the Potential of PIM: Accelerating Large Batched Inference of Transformer-based Generative Models, Jaewan Choi, Jaehyun Park, Kwanhee Kyung, Nam Sung Kim, and Jung Ho Ahn, IEEE Computer Architecture Letters, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10218731 (Efficient memory storage of K and V vectors in Transformer inference.)
- Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019, https://arxiv.org/abs/1904.01038, Code: https://github.com/pytorch/fairseq (Includes inference optimizations such as caching model states from previously generated tokens.)
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (This paper avoids zero-padding inputs amongst other optimizations.)
- Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 2022, Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp 1–36 https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737 (Extensive survey that contains a section on "Memoization" which is caching computed values for later reuse.)
See also general research on code optimizations.
Transformer General Optimizations
Some of the general classes of optimization techniques for the Transformer architecture include:
- Hardware-specific optimizations and low-level libraries (various)
- Model compilation (graph compilers / deep learning compilers)
- Transformer architectures
- Kernel optimizations (i.e. inference engine code optimizations such as caching and kernel fusion).
- Inference optimization techniques (numerous methods)
- Caching of entire query results for re-use by other users; this is called an Inference Cache (sketched below).
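As a trivial illustration of the last item, an inference cache can be as simple as a lookup table keyed on the normalized prompt. The sketch below uses a hypothetical generate function as a stand-in for the expensive model call; a production cache would also account for sampling settings, eviction policy, and semantic (near-match) lookups.

    from functools import lru_cache

    def generate(prompt: str) -> str:
        # Placeholder for the real (expensive) LLM inference call.
        return f"answer to: {prompt}"

    @lru_cache(maxsize=10_000)
    def cached_generate(prompt: str) -> str:
        # Identical normalized prompts are served from the cache with no model call.
        return generate(prompt.strip().lower())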
And here is a long list of other possible optimizations:
- Model compression
- Quantization (binary, ternary, logarithmic, 2-bit, 3-bit, 4-bit, 8-bit, FP8, FP16, stochastic, etc.); a minimal INT8 sketch follows this list
- Pruning (length, width, depth, dual, triple, layer, token, and more)
- Distillation
- Weight sharing
- Fusion: layer fusion, kernel operator fusion
- Skipping: including layer skipping, early exit, zero skipping
- Arithmetic: zero-multiplication, conditional computation, logs, approximations
- Decoding: speculative decoding, parallel decoding, aggressive decoding, etc.
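Of the techniques above, quantization is the easiest to show in a few lines. Below is a minimal symmetric per-tensor INT8 weight quantization sketch in NumPy (illustrative only; real schemes add per-channel scales, zero points, calibration data, and integer kernels rather than dequantizing back to floats).

    import numpy as np

    def quantize_int8(w):
        # Symmetric per-tensor INT8 quantization: int8 weights plus one float scale.
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    w = np.random.default_rng(0).standard_normal((4, 4)).astype(np.float32)
    q, scale = quantize_int8(w)
    print(np.max(np.abs(w - dequantize(q, scale))))   # small reconstruction error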
For even more, see inference optimizations, Transformer architectural optimizations, and a complete list of Transformer optimizations.
Survey Papers on Transformer Optimization
Review and survey papers on faster Transformer engines:
- Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Full stack optimization of transformer inference: a survey, Feb 2023, arXiv:2302.14017, https://arxiv.org/abs/2302.14017
- Full Stack Optimization of Transformer Inference: a Survey. Part 2 on Transformer Optimization, A Paper Overview, https://www.nebuly.com/blog/full-stack-optimization-of-transformer-inference-a-survey-part-2
- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey (v2). arXiv preprint arXiv:2009.06732, 2022, https://arxiv.org/abs/2009.06732
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, A Survey of Techniques for Optimizing Transformer Inference, 2023, arxiv.org July 2023, https://arxiv.org/abs/2307.07982
- L Papa, P Russo, I Amerini, L Zhou, Sep 2023, A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking, arXiv preprint arXiv:2309.02031, 2023, https://arxiv.org/abs/2309.02031
- Efficient Attention: Breaking The Quadratic Transformer Bottleneck, 2023 (accessed 8/12/23), https://gwern.net/note/attention, (A regularly updated bibliography of transformer attention optimization papers)
Tips for Transformer Optimization
Articles and papers with general tips on optimizing a Transformer:
- Fabián Varietti, Rodrigo Gallardo, Ian Spektor, Francisco Kurucz, Facundo Parodi, A guide to optimizing Transformer-based models for faster inference, Nov 29, 2022, https://tryolabs.com/blog/2022/11/24/transformer-based-model-for-faster-inference
- Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
- Model optimization (TensorFlow), https://www.tensorflow.org/lite/performance/model_optimization
- Hugging Face, How we sped up transformer inference 100x for HF API customers, January 18, 2021, https://huggingface.co/blog/accelerated-inference
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (Contains a number of significant optimizations to the original Transformer architecture.)
- Weng, Lilian. (Jan 2023). Large Transformer Model Inference Optimization. Lil’Log. https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- Philipp Schmid, Accelerate Sentence Transformers with Hugging Face Optimum, August 2, 2022, https://www.philschmid.de/optimize-sentence-transformers
- Michaël Benesty, Hugging Face Transformer Inference Under 1 Millisecond Latency, Nov 5, 2021 https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c
Research on Specific Fast Transformers
These papers are on new faster Transformer architectures tested by researchers:
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (This paper uses zero-padding inputs and fused attention heads with shared parameters)
- Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya, Reformer: The efficient transformer, In International Conference on Learning Representations, 2019, https://arxiv.org/abs/2001.04451
- J Fang, Y Yu, C Zhao, J Zhou, 2021, Turbotransformers: an efficient gpu serving system for transformer models, Proceedings of the 26th ACM SIGPLAN, https://dl.acm.org/doi/abs/10.1145/3437801.3441578, PDF: https://dl.acm.org/doi/pdf/10.1145/3437801.3441578
- NVIDIA, NVIDIA FasterTransformer, https://github.com/NVIDIA/FasterTransformer
General Research on Transformer Optimization
These papers review Transformer optimization techniques in general.
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, "Efficiently Scaling Transformer Inference", arXiv:2211.05102v1 [cs.LG], 9 Nov 2022, https://arxiv.org/abs/2211.05102
- Dave Dice, Alex Kogan, Optimizing Inference Performance of Transformers on CPUs, Feb 2021, https://arxiv.org/abs/2102.06621
- Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation. CoRR, abs/2006.10369. https://arxiv.org/abs/2006.10369, Code: https://github.com/jungokasai/deep-shallow (Single-layer decoder architecture, see also shallow decoder Transformer architectures inspired by this paper.)
- Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee, Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding, July 2023, https://arxiv.org/abs/2307.05908
- Zining Zhang; Yao Chen; Bingsheng He; Zhenjie Zhang, NIOT: A Novel Inference Optimization of Transformers on Modern CPUs, IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 6, June 2023, pp.1982-1995, https://ieeexplore.ieee.org/abstract/document/10107474
- So, D. R., Mańke, W., Liu, H., Dai, Z., Shazeer, N. M., and Le, Q. V., 2021 (updated Jan 2022), Primer: Searching for efficient transformers for language modeling, ArXiv, abs/2109.08668, https://arxiv.org/abs/2109.08668 Code: https://github.com/google-research/google-research/tree/master/primer (Has a different Transformer architecture, but not a common one.)
- Sukhbaatar, S., Grave, E., Bojanowski, P., and Joulin, A., Adaptive attention span in transformers. In Annual Meeting of the Association for Computational Linguistics, Aug 2019, https://arxiv.org/abs/1905.07799 (Self-adaptive context lengths for attention heads.)
- Bapna, A., Arivazhagan, N., and Firat, O., Controlling computation versus quality for neural sequence models. ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106 (Conditionally controls which subunits of the model can execute.)
- Tri Dao, July 2023, FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning, https://arxiv.org/abs/2307.08691, Code: https://github.com/Dao-AILab/flash-attention
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 (The original FlashAttention version 1, now superseded by FlashAttention 2.)
- GI Yu, JS Jeong, GW Kim, S Kim, BG Chun, 2022, Orca: A distributed serving system for Transformer-Based generative models, 16th USENIX Symposium, https://www.usenix.org/conference/osdi22/presentation/yu, PDF: https://www.usenix.org/system/files/osdi22-yu.pdf (Improved parallelization/pipelining with latency reduction from iteration-level scheduling across multiple requests.)
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf (Interesting review of safety and bias/fairness issues for models optimized by quantization, pruning or distillation.)
- X Li, B Ren, X Shen, Y Wang, 2022, CoCoPIE XGen: A Full-Stack AI-Oriented Optimizing Framework, arXiv preprint arXiv:2206.10620, https://arxiv.org/abs/2206.10620 (Various optimizations including block pruning and deep reuse.)
Kernel Optimizations
- Soroush Ghodrati, Sean Kinzer, Hanyang Xu, Rohan Mahapatra, Yoonsung Kim, Byung Hoon Ahn, Dong Kai Wang, Lavanya Karthikeyan, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Hadi Esmaeilzadeh, April 2024, Tandem processor: Grappling with emerging operators in neural networks, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 1165–1182, https://doi.org/10.1145/3620665.3640365 https://dl.acm.org/doi/abs/10.1145/3620665.3640365 Code: https://actlab-genesys.github.io (Reviews hardware acceleration of all sub-layer kernel operators, with a focus beyond just GEMM/MatMul operators.)
- Make LLM Fine-tuning 2x faster with Unsloth and HF TRL, January 10, 2023, Daniel Han-Chen, https://huggingface.co/blog/unsloth-trl Code: https://github.com/huggingface/blog/blob/main/unsloth-trl.md (Optimizes some PyTorch kernels for back-propagation and reduces memory usage in fine-tuning; currently works with Llama and Mistral architectures.)
- H Shen, H Chang, B Dong, Y Luo, H Meng, Nov 2023, Efficient LLM Inference on CPUs, arXiv preprint arXiv:2311.00502, https://arxiv.org/pdf/2311.00502.pdf Code: https://github.com/intel/intel-extension-for-transformers (INT4 weight quantization with 16-bit activations, and highly optimized kernels with support for AVX2, AVX512, AVX512_VNNI and Advanced Matrix Extensions (AMX), and KV caching, tested on LLMs from 3B to 20B parameters, including Llama2, with 20-80ms latency per token.)
- Piotr Kluska, Adrián Castelló, Florian Scheidegger, A. Cristiano I. Malossi, 2024, QAttn: Efficient GPU Kernels for mixed-precision Vision Transformers https://openaccess.thecvf.com/content/CVPR2024W/eLVM/papers/Kluska_QAttn_Efficient_GPU_Kernels_for_Mixed-precision_Vision_Transformers_CVPRW_2024_paper.pdf
- Christian Szegedy et al., 2015, Going Deeper with Convolutions, http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf (The GoogLeNet paper.)
- Benjamin Charlier, Jean Feydy, Joan Alexis Glaunès, François-David Collin, Ghislain Durif, 8 Apr 2021 (v2), Kernel Operations on the GPU, with Autodiff, without Memory Overflows, https://arxiv.org/abs/2004.11127 Code: https://www.kernel-operations.io/keops/index.html
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Alejandro Araya-Núñez, Justin Fernández-Badilla, Daniel González-Vargas, Jimena León-Huertas, Erick-Andrés Obregón-Fonseca, Danny Xie-Li, June, 2024, Proposal of an open-source accelerators library for inference of transformer networks in edge devices based on Linux, Tecnología en Marcha. Vol. 37, special issue. IEEE Latin American Electron Devices Conference (LAEDC), pages 118-125, https://doi.org/10.18845/tm.v37i5.7225 PDF: https://revistas.tec.ac.cr/index.php/tec_marcha/article/download/7225/7076
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 5 Jul 2024 (v3), Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041
- Zheming Jin, July 2024, Evaluating Operators in Deep Neural Networks for Improving Performance Portability of SYCL, Oak Ridge National Laboratory, ORNL/TM-2024/3463, https://info.ornl.gov/sites/publications/Files/Pub217394.pdf
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Intel, 2024, Get Started with Intel® oneAPI Math Kernel Library, https://www.intel.com/content/www/us/en/docs/onemkl/get-started-guide/2023-0/overview.html
- T Zhao, 2024, Acceleration of Deep Learning Algorithms with Transformers, https://escholarship.org/uc/item/3419t2z6
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- J. Bi et al., "Efficient and Fast High-performance Library Generation for Deep Learning Accelerators," in IEEE Transactions on Computers, doi: 10.1109/TC.2024.3475575, https://ieeexplore.ieee.org/abstract/document/10707341 (Finding the most efficient kernel.)
- Wei Zhao, Anand Jayarajan, Gennady Pekhimenko, 9 Oct 2024, Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads, https://arxiv.org/abs/2410.07381 (Interleaved scheduling layer for GPU workloads.)
- Byron (Pin-Lun) Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, 14 Oct 2024, Liger Kernel: Efficient Triton Kernels for LLM Training, https://arxiv.org/abs/2410.10989 http://github.com/linkedin/Liger-Kernel
More AI Research
Read more about:
- List of AI Optimizations.
- Transformer architectures
- Inference Optimizations
- Shallow decoder architecture
- Inference Cache
- Zero-Multiplication Models
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Loop Optimizations
- Code Optimizations