Zero Padding Removal

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

One zero-skipping technique for speeding up Transformer inference is to avoid zero padding in the input vectors. The need for padding arises in some architectures because keeping vectors the same size helps with pipelining calculations through the GPU. However, research has shown that padding can also cause inefficiency, since computations are performed on padding values whose results are never used, and various papers have advocated removing the zero padding bytes.
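As a rough illustration of the idea, the sketch below (plain C++, with hypothetical function names) gathers the indices of the non-padding tokens, runs a per-token computation only on those tokens, and scatters the results back into the padded layout. It is a minimal sketch, not production kernel code, and it assumes the padding token id is zero.

    // Minimal sketch of zero padding removal (hypothetical names):
    // gather indices of real tokens, compute only on them, scatter back.
    #include <vector>
    #include <cstddef>

    // Collect the positions of non-padding tokens (padding token id assumed to be 0).
    std::vector<size_t> nonpad_indices(const std::vector<int>& padded_tokens)
    {
        std::vector<size_t> idx;
        idx.reserve(padded_tokens.size());
        for (size_t i = 0; i < padded_tokens.size(); ++i) {
            if (padded_tokens[i] != 0) idx.push_back(i);   // keep only real tokens
        }
        return idx;
    }

    // Run a per-token operation only on the unpadded tokens; padded slots stay zero.
    std::vector<float> process_without_padding(const std::vector<int>& padded_tokens,
                                               float (*per_token_op)(int))
    {
        std::vector<size_t> idx = nonpad_indices(padded_tokens);
        std::vector<float> out(padded_tokens.size(), 0.0f);     // output keeps the padded layout
        for (size_t k = 0; k < idx.size(); ++k) {
            out[idx[k]] = per_token_op(padded_tokens[idx[k]]);  // padding slots are skipped entirely
        }
        return out;
    }

The saving is simply that the inner loop runs over the real tokens rather than the full padded length; in a real engine the same gather/scatter idea is applied to whole rows of the activation matrices before the attention and FFN kernels.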

An alternative approach is to pack input sequences together, which avoids or reduces the padding bytes. This is effective for training sets, or for batching multiple streams of inference queries.
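The sketch below (again with hypothetical names) shows the basic packing idea: variable-length token sequences are greedily packed into fixed-length rows, separated by a separator token, so that far fewer padding zeros are needed than with one sequence per row. A real implementation also needs an attention mask so that packed sequences do not attend to each other.

    // Minimal sketch of sequence packing (hypothetical names): greedily pack
    // variable-length sequences into fixed-length rows with a separator token,
    // padding only the tail of each row with zeros.
    #include <vector>

    std::vector<std::vector<int>> pack_sequences(const std::vector<std::vector<int>>& sequences,
                                                 size_t row_length,
                                                 int sep_token)
    {
        std::vector<std::vector<int>> rows;
        std::vector<int> current;
        for (const std::vector<int>& seq : sequences) {
            // Flush the current row if this sequence (plus its separator) won't fit.
            if (!current.empty() && current.size() + seq.size() + 1 > row_length) {
                current.resize(row_length, 0);   // pad only the leftover tail with zeros
                rows.push_back(current);
                current.clear();
            }
            current.insert(current.end(), seq.begin(), seq.end());
            current.push_back(sep_token);
        }
        if (!current.empty()) {
            current.resize(row_length, 0);       // pad the final partial row
            rows.push_back(current);
        }
        return rows;
    }

Note that in this sketch a single sequence longer than the row length would simply be truncated by the final resize; a real implementation would split or reject overlong sequences.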

And it's worth noting that not all padding bytes are evil. Some of them are quite charismatic if you take them out for a cup of tea. In fact, the need for padding removal in Transformers arose for good reason from well-intentioned optimizations by professional programmers using very nice and hospitable padding zeros. The use of padding is a positive optimization in numerous situations, particularly when GPUs are involved. See Chapter 49 for more about padding byte optimizations.

Research papers on zero padding avoidance:

  1. Intel, 2023, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (One of the optimizations suggested is to avoid computations involving zero padding bytes.)
  2. Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (Removing zero-padding inputs is one of the major optimizations in this paper.)
  3. J Du, J Jiang, J Zheng, H Zhang, D Huang, Y Lu, August 2023, Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs, ACM Transactions on Architecture and Code Optimization, https://dl.acm.org/doi/10.1145/3617689, PDF: https://dl.acm.org/doi/pdf/10.1145/3617689
  4. H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA through Sparse Attention and Dynamic Pipelining, DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference, July 2022, Pages 1135–1140, https://doi.org/10.1145/3489517.3530585, https://dl.acm.org/doi/10.1145/3489517.3530585, PDF: https://arxiv.org/pdf/2208.03646
  5. Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report (Includes an interesting review of practical problems with zero padding in binarized networks, where the weights are only -1 and +1.)
  6. Zhai, Yujia, 2023, Ph.D. thesis, Architectural-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications, Computer Science, University of California, Riverside, https://escholarship.org/content/qt8s28g07q/qt8s28g07q.pdf (Includes examination of padding-free algorithms such as ByteTransformer.)
  7. Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao, 2022, Easy and Efficient Transformer: Scalable Inference Solution For large NLP model, May 2022, https://arxiv.org/abs/2104.12470 (Optimizations include avoiding padding computations in the attention heads.)
  8. Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission used zero-padding removal and also various kernel fusions.)
  9. Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Used packing of sequences in training with a SEP separator token rather than CLS. Note: code uses deprecated nvFuser compiler.)

For more research on zero padding, see also https://www.aussieai.com/research/zero-padding.

 
