Aussie AI

Transformer Optimization

  • Last Updated 27 October, 2024
  • by David Spuler, Ph.D.

The Transformer was invented at Google in 2017 and open-sourced by its research group. It has become the most widely used AI engine architecture, notably underpinning OpenAI's GPT-3 and ChatGPT. Since then, optimization research has taken off. There are two basic ways to optimize Transformer models:

The first is code optimization: the inference engine itself can be made faster without changing the model. The second is architectural optimization: much research has examined slight modifications to the Transformer architecture to improve latency and throughput in both inference and training.

Transformer Inference Optimizations

See also these articles for further information on Transformer inference optimization:

Transformer Kernel Code Optimizations

Some of the specific kernel optimizations of inference engines include:

  • Attention head caching: Precomputing and caching attention head matrices for already-processed tokens (HuggingFace, 2021). This reduces auto-regression costs when outputting multiple tokens (which is the usual case). See also attention head pruning.
  • KV Caching: Caching the K and V tensors computed by the attention heads for earlier tokens during decoding (Intel, 2023). This reduces the number of decoder matrix multiplications during autoregression; a minimal sketch appears after this list. See KV caching research.
  • Padding byte optimizations: Removing padding in the Feed Forward Network tensor/matrix computations (Intel, 2023; also used in ByteTransformer, Zhai et al., 2023); see "zero padding removal". This reduces the total number of multiplications.
  • Attention dimensions: Merging the Q, K, and V projection matrices (of identical size) into a single large matrix for better matrix multiplication throughput (Zhai et al., 2023); see the second sketch after this list.
  • Operator fusion and reordering: Fusing and reordering reshape and matmul operations (Intel, 2023). This streamlines some of the arithmetic into more compact low-level library calls. See kernel fusion optimizations.
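
To illustrate the KV caching idea mentioned above, here is a minimal C++ sketch of a single decoding step for a toy single-head attention. It is illustrative only: the tiny vector dimensions, the naive dot-product loops, and the KVCache structure are hypothetical simplifications for this article, not the code of any particular engine.

    // Minimal single-head KV caching sketch (illustrative only).
    // Toy dimensions and naive loops; not the code of a real engine.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    using Vec = std::vector<float>;

    struct KVCache {
        std::vector<Vec> keys;    // one cached K vector per already-decoded token
        std::vector<Vec> values;  // one cached V vector per already-decoded token
    };

    static float dot(const Vec& a, const Vec& b) {
        float s = 0.0f;
        for (size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    // One decoding step: only the new token's K and V are computed (by the caller)
    // and appended; the K/V vectors of earlier tokens come from the cache,
    // so they are never recomputed on later steps.
    Vec attend_with_cache(const Vec& q, const Vec& new_k, const Vec& new_v, KVCache& cache) {
        cache.keys.push_back(new_k);
        cache.values.push_back(new_v);

        const size_t n = cache.keys.size();
        const float scale = 1.0f / std::sqrt(static_cast<float>(q.size()));

        // Scaled dot-product scores against all cached keys, then softmax.
        std::vector<float> scores(n);
        float max_score = -1e30f;
        for (size_t t = 0; t < n; ++t) {
            scores[t] = dot(q, cache.keys[t]) * scale;
            if (scores[t] > max_score) max_score = scores[t];
        }
        float denom = 0.0f;
        for (size_t t = 0; t < n; ++t) {
            scores[t] = std::exp(scores[t] - max_score);
            denom += scores[t];
        }

        // Weighted sum of the cached value vectors.
        Vec out(q.size(), 0.0f);
        for (size_t t = 0; t < n; ++t) {
            const float w = scores[t] / denom;
            for (size_t i = 0; i < out.size(); ++i) out[i] += w * cache.values[t][i];
        }
        return out;
    }

    int main() {
        KVCache cache;
        // Decode three tokens; each step reuses the cached K/V of the earlier ones.
        for (int step = 0; step < 3; ++step) {
            Vec q = {0.1f * step, 0.2f, 0.3f};   // toy projections for the new token
            Vec k = {0.2f, 0.1f * step, 0.4f};
            Vec v = {1.0f, 2.0f, 3.0f};
            Vec out = attend_with_cache(q, k, v, cache);
            std::printf("step %d: out[0]=%f (context length %zu)\n", step, out[0], cache.keys.size());
        }
        return 0;
    }

In a real engine the cache holds full tensors per layer and per attention head, and the new token's Q/K/V projections come from the model's weight matrices; the point of the cache is simply that K and V for all earlier tokens are never recomputed during autoregressive decoding.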

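Similarly, the merged Q/K/V idea can be sketched as a single matrix multiplication against a concatenated weight matrix. Again, this is a hedged illustration with made-up dimensions and a naive matmul routine, not the ByteTransformer implementation; the real benefit comes from issuing one large GEMM to the GPU (or BLAS library) instead of three smaller ones.

    // Illustrative sketch of merging the Q, K and V projections into one matmul.
    // Toy dimensions and naive loops; real engines issue a single fused GEMM instead.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Naive row-major matmul: C[m x n] = A[m x k] * B[k x n].
    static void matmul(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C, int m, int k, int n) {
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j) {
                float s = 0.0f;
                for (int p = 0; p < k; ++p) s += A[i * k + p] * B[p * n + j];
                C[i * n + j] = s;
            }
    }

    int main() {
        const int tokens = 4;   // sequence length (toy)
        const int dim = 8;      // model dimension (toy)

        std::vector<float> X(tokens * dim, 0.5f);    // input activations
        std::vector<float> Wq(dim * dim, 0.01f);     // separate projection weights
        std::vector<float> Wk(dim * dim, 0.02f);
        std::vector<float> Wv(dim * dim, 0.03f);

        // Unmerged: three separate matmuls (three kernel launches on a GPU).
        std::vector<float> Q(tokens * dim), K(tokens * dim), V(tokens * dim);
        matmul(X, Wq, Q, tokens, dim, dim);
        matmul(X, Wk, K, tokens, dim, dim);
        matmul(X, Wv, V, tokens, dim, dim);

        // Merged: concatenate [Wq | Wk | Wv] column-wise into one dim x 3*dim matrix,
        // then a single matmul produces Q, K and V side by side in one output buffer.
        std::vector<float> Wqkv(dim * 3 * dim);
        for (int p = 0; p < dim; ++p)
            for (int j = 0; j < dim; ++j) {
                Wqkv[p * 3 * dim + j] = Wq[p * dim + j];
                Wqkv[p * 3 * dim + dim + j] = Wk[p * dim + j];
                Wqkv[p * 3 * dim + 2 * dim + j] = Wv[p * dim + j];
            }
        std::vector<float> QKV(tokens * 3 * dim);
        matmul(X, Wqkv, QKV, tokens, dim, 3 * dim);

        // Sanity check: the merged result matches the three separate results.
        float diff = 0.0f;
        for (int i = 0; i < tokens; ++i)
            for (int j = 0; j < dim; ++j) {
                diff += std::fabs(QKV[i * 3 * dim + j] - Q[i * dim + j]);
                diff += std::fabs(QKV[i * 3 * dim + dim + j] - K[i * dim + j]);
                diff += std::fabs(QKV[i * 3 * dim + 2 * dim + j] - V[i * dim + j]);
            }
        std::printf("total difference between merged and unmerged: %f\n", diff);
        return 0;
    }
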
Kernel Optimization Research Papers

Reference papers on some of the specific code optimizations in Transformer engines:

See also general research on code optimizations.

Transformer General Optimizations

Some of the general classes of optimization techniques for the Transformer architecture include:

And here is a long list of the various other optimizations possible:

For even more, see inference optimizations, Transformer architectural optimizations, and a complete list of Transformer optimizations.

Survey Papers on Transformer Optimization

Review and survey papers on faster Transformer engines:

  • Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami, Full Stack Optimization of Transformer Inference: A Survey, Feb 2023, arXiv:2302.14017, https://arxiv.org/abs/2302.14017
  • Full Stack Optimization of Transformer Inference: A Survey (Part 2 on Transformer Optimization), a paper overview on the Nebuly blog, https://www.nebuly.com/blog/full-stack-optimization-of-transformer-inference-a-survey-part-2
  • Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler, Efficient Transformers: A Survey (v2), 2022, arXiv:2009.06732, https://arxiv.org/abs/2009.06732
  • Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, A Survey of Techniques for Optimizing Transformer Inference, Jul 2023, arXiv:2307.07982, https://arxiv.org/abs/2307.07982
  • L. Papa, P. Russo, I. Amerini, L. Zhou, A Survey on Efficient Vision Transformers: Algorithms, Techniques, and Performance Benchmarking, Sep 2023, arXiv:2309.02031, https://arxiv.org/abs/2309.02031
  • Gwern Branwen, Efficient Attention: Breaking The Quadratic Transformer Bottleneck, 2023 (accessed 8/12/23), https://gwern.net/note/attention (a regularly updated bibliography of Transformer attention optimization papers)

Tips for Transformer Optimization

Articles and papers with general tips on optimizing a Transformer:

Research on Specific Fast Transformers

These papers present new, faster Transformer architectures tested by researchers:

General Research on Transformer Optimization

These papers review Transformer optimization techniques in general:

Kernel Optimizations

More AI Research

Read more about: