Aussie AI
Zero-Padding Removal
-
Last Updated 3 November, 2024
-
by David Spuler, Ph.D.
One technique for speeding up Transformer inference is to avoid using zero padding in the input vectors (see also length pruning). The need for padding arises in some architectures because keeping vectors the same size helps with pipelining calculations through the GPU. However, research has shown that padding can also lead to inefficiency, because computations on the padded positions are redundant and their results are never used, and various papers have advocated removing the zero padding bytes.
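As a rough illustration of the idea (not taken from any of the papers below), here is a minimal C++ sketch of padding removal: the padded batch is scanned, only the real tokens are gathered into a packed buffer, and per-sequence offsets are recorded so that downstream kernels can loop over actual sequence lengths instead of the padded maximum. The PAD_ID value and the flat layout are assumptions for the example.

```cpp
// Minimal sketch of zero-padding removal (illustrative only).
// Assumes a hypothetical PAD_ID sentinel and a flat [batch * max_len] layout.
#include <cstdio>
#include <vector>

constexpr int PAD_ID = 0;  // assumed padding token id

struct PackedBatch {
    std::vector<int> tokens;   // only the real (non-pad) tokens, concatenated
    std::vector<int> offsets;  // offsets[i] = start of sequence i in 'tokens'
};

// Gather non-padding tokens; downstream kernels then loop over
// offsets[i]..offsets[i+1] instead of a fixed max_len, skipping pad slots.
PackedBatch remove_padding(const std::vector<int>& padded, int batch, int max_len) {
    PackedBatch out;
    out.offsets.reserve(batch + 1);
    out.offsets.push_back(0);
    for (int b = 0; b < batch; ++b) {
        for (int t = 0; t < max_len; ++t) {
            int tok = padded[b * max_len + t];
            if (tok != PAD_ID) out.tokens.push_back(tok);
        }
        out.offsets.push_back(static_cast<int>(out.tokens.size()));
    }
    return out;
}

int main() {
    // Two sequences of real lengths 3 and 5, padded to max_len = 6.
    std::vector<int> padded = {
        7, 8, 9, 0, 0, 0,
        4, 5, 6, 7, 8, 0,
    };
    PackedBatch packed = remove_padding(padded, /*batch=*/2, /*max_len=*/6);
    printf("packed tokens: %zu (vs %zu padded slots)\n",
           packed.tokens.size(), padded.size());
    return 0;
}
```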
An alternative approach is to pack multiple input sequences together, which avoids or reduces the number of padding bytes. This is effective for training data sets, or when batching multiple inference queries.
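A minimal sketch of sequence packing follows, again only illustrative: several short sequences share one fixed-length row, separated by an assumed SEP_ID token, so far fewer slots are wasted on padding. The greedy first-fit strategy, SEP_ID, and PAD_ID values are assumptions for the example, not a specific paper's algorithm.

```cpp
// Minimal sketch of sequence packing (assumed greedy first-fit packing).
// Sequences longer than max_len are assumed not to occur here.
#include <cstdio>
#include <vector>

constexpr int SEP_ID = 102;  // assumed separator token id
constexpr int PAD_ID = 0;    // assumed padding token id

// Pack sequences into rows of length max_len, separated by SEP_ID; any
// leftover space at the end of a row is padded (ideally very little).
std::vector<std::vector<int>> pack_sequences(
        const std::vector<std::vector<int>>& seqs, int max_len) {
    std::vector<std::vector<int>> rows;
    for (const auto& s : seqs) {
        // Greedy: append to the first row with enough free space.
        bool placed = false;
        for (auto& row : rows) {
            if (row.size() + s.size() + 1 <= static_cast<size_t>(max_len)) {
                row.insert(row.end(), s.begin(), s.end());
                row.push_back(SEP_ID);
                placed = true;
                break;
            }
        }
        if (!placed) {
            std::vector<int> row(s);
            row.push_back(SEP_ID);
            rows.push_back(row);
        }
    }
    for (auto& row : rows) row.resize(max_len, PAD_ID);  // pad only the tail
    return rows;
}

int main() {
    std::vector<std::vector<int>> seqs = {{1, 2, 3}, {4, 5}, {6, 7, 8, 9}, {10}};
    auto rows = pack_sequences(seqs, /*max_len=*/8);
    printf("packed %zu sequences into %zu rows of length 8\n",
           seqs.size(), rows.size());
    return 0;
}
```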
And it's worth noting that not all padding bytes are evil. Some of them are quite charismatic if you take them out for a cup of tea. In fact, the need for padding removal in Transformers arose for good reason from well-intentioned optimizations by professional programmers using very nice and hospitable padding zeros. The use of padding is a positive optimization in numerous situations, particularly when GPUs are involved. Read more about padding byte optimizations.
Research Papers on Zero Padding Removal
- Intel, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (One of the optimizations suggested is to avoid computations involving zero padding bytes.)
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (Removing zero-padding inputs is one of the major optimizations in this paper.)
- J Du, J Jiang, J Zheng, H Zhang, D Huang, Y Lu, August 2023, Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs, ACM Transactions on Architecture and Code Optimization, https://dl.acm.org/doi/10.1145/3617689, PDF: https://dl.acm.org/doi/pdf/10.1145/3617689
- H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A length adaptive algorithm-hardware co-design of transformer on fpga through sparse attention and dynamic pipelining, DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference, July 2022, Pages 1135–1140, https://doi.org/10.1145/3489517.3530585, https://dl.acm.org/doi/10.1145/3489517.3530585 https://arxiv.org/pdf/2208.03646
- Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report (Includes an interesting review of practical problems with zero padding in binarized networks, where the weights are only -1 and +1.)
- Zhai, Yujia, 2023, Ph.D. thesis, Architectural-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications, Computer Science, University of California, Riverside, https://escholarship.org/content/qt8s28g07q/qt8s28g07q.pdf (Includes examination of padding-free algorithms such as ByteTransformer.)
- Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao, Easy and Efficient Transformer : Scalable Inference Solution For large NLP model, May 2022, https://arxiv.org/abs/2104.12470 (Optimizations include avoiding padding computations in the attention heads.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission used zero-padding removal and also various kernel fusions.)
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Used packing of sequences in training with a SEP separator token rather than CLS. Note: code uses deprecated nvFuser compiler.)
- S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory optimizations toward training trillion parameter models,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–16, https://arxiv.org/abs/1910.02054 Code: part of: https://github.com/microsoft/deepspeed (Zero Redundancy Optimizer (ZeRO) provides memory optimization, improved utilization, and fragmentation avoidance, allowing improved pipelining during training.)
- M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, “Megatron-LM: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019, https://arxiv.org/abs/1909.08053
- David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Xin Tan, Jingzong Li, Jiamin Li, Yitao Yang, Hong Xu, August 2024, Arlo: Serving Transformer-based Language Models with Dynamic Input Lengths, ICPP '24, August 12–15, 2024, Gotland, Sweden, https://doi.org/10.1145/3673038.3673124 https://kanonjz.github.io/academic/share/xin-icpp24.pdf
- Blaise Delattre, Quentin Barthélemy, Alexandre Allauzen, 31 Jan 2024, Spectral Norm of Convolutional Layers with Circular and Zero Paddings, https://arxiv.org/abs/2402.00240
- D. Liu, X. Guo, N. Wang and Q. Wu, 2024, Lightweight Deep Neural Network Model With Padding-Free Downsampling, IEEE Signal Processing Letters, vol. 31, pp. 865-869, 2024, doi: 10.1109/LSP.2024.3374057, https://ieeexplore.ieee.org/abstract/document/10461068
- Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang, 13 May 2024, EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models, https://arxiv.org/abs/2405.07542 https://github.com/niyunsheng/EMS-SD
- Shawn Tan, Yikang Shen, Rameswar Panda, Aaron Courville, 4 Oct 2024 (v2), Scattered Mixture-of-Experts Implementation, https://arxiv.org/abs/2403.08245
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer Importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning