Aussie AI
Zero Skipping
Last Updated 22 November, 2024
by David Spuler, Ph.D.
What is Zero Skipping?
Zero skipping is the avoidance of multiplication by zero weights. At a high level, this can mean skipping multiplication by an entire column of a matrix, or by an entire structure of the model (see also structural model pruning). At a low level, zero skipping means testing a single weight to see whether it is zero, thereby avoiding a wasteful multiplication-by-zero operation.
There are two types of zero skipping:
- Low-level zero skipping (weight level)
- High-level zero skipping (structure level)
Low-level zero skipping is closely related to sparsity and magnitude pruning.
High-level structural zero skipping is the avoidance of operations on whole structures. This is related to structured pruning and dynamic pruning.
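As a rough illustration of the structure-level case, the sketch below skips whole matrix columns whose weights are all zero during a matrix-vector multiplication. The function names (`find_zero_columns`, `matvec_skip_columns`) and the row-major `std::vector<float>` layout are assumptions made for this example only, not taken from any of the papers listed on this page.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: flag each column of a row-major matrix whose
// weights are all zero. This would be done once, ahead of inference.
std::vector<bool> find_zero_columns(const std::vector<float>& w,
                                    std::size_t rows, std::size_t cols)
{
    std::vector<bool> is_zero(cols, true);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            if (w[r * cols + c] != 0.0f) is_zero[c] = false;
        }
    }
    return is_zero;
}

// Computes y = W * x, skipping every column flagged as all-zero.
std::vector<float> matvec_skip_columns(const std::vector<float>& w,
                                       const std::vector<float>& x,
                                       const std::vector<bool>& is_zero,
                                       std::size_t rows, std::size_t cols)
{
    std::vector<float> y(rows, 0.0f);
    for (std::size_t c = 0; c < cols; ++c) {
        if (is_zero[c]) continue;  // one test avoids `rows` multiplications
        for (std::size_t r = 0; r < rows; ++r) {
            y[r] += w[r * cols + c] * x[c];
        }
    }
    return y;
}
```

The single flag test per column is what distinguishes this from weight-level skipping: the cost of the test is amortized over every row of the column.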
Zero skipping is a type of dynamic inference optimization. Related speedup strategies include:
- Conditional computation
- Layer skipping
- Early exiting (a type of dynamic layer pruning)
- Zero-multiplication models (e.g., adder models, bitshift models, etc.)
- Negative skipping (advanced skipping with RELU)
Low-Level Zero Skipping
For low-level zero skipping, an incremental way to avoid multiplication operations is to test each weight against zero first, so that unnecessary multiplications by zero are skipped. Testing a register against zero is much faster than a multiplication, and the multiplication itself doesn't run any faster when an operand is zero, so this is a "simple case first" optimization.
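A minimal sketch of this pre-test in a vector dot product might look like the following. The function name `dot_product_zero_skip` is hypothetical, and the code assumes finite inputs (skipping a zero weight would change the result if the matching input were infinity or NaN).

```cpp
#include <cstddef>

// Hypothetical helper: dot product that tests each weight against zero
// before multiplying, so a zero weight costs a compare, not a multiply.
float dot_product_zero_skip(const float* weights, const float* inputs,
                            std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        if (weights[i] == 0.0f) continue;  // also matches -0.0f
        sum += weights[i] * inputs[i];
    }
    return sum;
}
```

Whether this is a net win depends on how many weights are actually zero and on the cost of the extra branch on the target hardware.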
Note that there's a whole class of research called "sparse matrices" or "sparsification" which aims to cut whole swaths of zero-multiplications, but the research below is lower level than this. Read more about sparsification research.
There aren't many papers on this low-level topic of "zero skipping" of individual weights, specific to inference arithmetic, and even in some of these papers, it's not the central point of the paper. That's probably because hardware acceleration makes pre-testing for zeros on a small scale not worth it, whereas large-scale avoidance of zero-multiplication appears in research on "sparsification".
It seems like a similar pre-test idea could be applied to weights that are "1.0" or "-1.0", but we haven't found papers on that. These optimizations could all be performed in deep learning compilers.
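Purely as an illustration of what such a pre-test could look like, the sketch below extends the zero test to weights of exactly 1.0 and -1.0, replacing the multiplication with an addition or subtraction. The function name is hypothetical, and no claim is made that this is faster on real hardware; it is an untested idea, as noted above.

```cpp
#include <cstddef>

// Hypothetical helper: dot product with "simple case first" tests for
// weights of 0.0, 1.0 and -1.0, falling back to a multiply otherwise.
float dot_product_special_cases(const float* weights, const float* inputs,
                                std::size_t n)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float w = weights[i];
        if (w == 0.0f) {
            continue;               // multiplication by zero: skip
        } else if (w == 1.0f) {
            sum += inputs[i];       // multiplication by one: add only
        } else if (w == -1.0f) {
            sum -= inputs[i];       // multiplication by minus one: subtract
        } else {
            sum += w * inputs[i];   // general case
        }
    }
    return sum;
}
```

The extra comparisons add branches to every iteration, so this form would only pay off if a large share of the weights fall into the special cases.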
Research on Low-Level Zero Skipping
Papers that incorporate the low-level optimization of zero-skipping directly into the inference code include:
- Y. Chen, J. Emer, and V. Sze, 2016, Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks, In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 367–379, https://ieeexplore.ieee.org/document/7551407
- Dongyoung Kim, Junwhan Ahn, and Sungjoo Yoo, 2018, ZeNA: Zero-aware neural network accelerator, IEEE Design & Test 35, 1 (2018), 39–46, https://doi.org/10.1109/MDAT.2017.2741463
- Xinlin Li, Bang Liu, Rui Heng Yang, Vanessa Courville, Chao Xing, Vahid Partovi Nia, DenseShift: Towards Accurate and Transferable Low-Bit Shift Network, Aug 2022, https://arxiv.org/abs/2208.09708
- Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan, 2021, GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator, In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21), 1110–1123, https://doi.org/10.1109/ISCA52012.2021.00090, https://ieeexplore.ieee.org/document/9499915
- S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, Cambricon: An instruction set architecture for neural networks, 2016, In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 393–405, https://ieeexplore.ieee.org/abstract/document/7551409
- Yuxiang Huan, Yifan Qin, Yantian You, Lirong Zheng, and Zhuo Zou. Sep 2016. A multiplication reduction technique with near-zero approximation for embedded learning in IoT devices. 2016 29th IEEE International System-on-Chip Conference (SOCC), 102–107. https://ieeexplore.ieee.org/abstract/document/7905445 (Avoids near-zero low multiplications on small values, by counting the number of prefix zeros in the floating point representation using bitwise arithmetic.)
- Minkyu Kim and Jae Sun Seo. 2021. An energy-efficient deep convolutional neural network accelerator featuring conditional computing and low external memory access. IEEE Journal of Solid-State Circuits 56, 3 (2021), 803–813, https://ieeexplore.ieee.org/document/9229157 (Cascades and zero-skipping.)
- R. J. R. Struharik, B. Z. Vukobratović, A. M. Erdeljan, and D. M. Rakanović, “CoNNa–Hardware accelerator for compressed convolutional neural networks,” Microprocessors Microsyst., vol. 73, Mar. 2020, Art. no. 102991. https://ieeexplore.ieee.org/document/8491841
- Y.-H. Chen, T. Krishna, J. S. Emer and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks", IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127-138, Nov. 2016, https://ieeexplore.ieee.org/document/7738524 (Uses zero-skipping as part of the improvements.)
- R. J. R. Struharik, B. Z. Vukobratović, A. M. Erdeljan and D. M. Rakanović, "CoNNa–Hardware accelerator for compressed convolutional neural networks", Microprocessors Microsyst., vol. 73, Mar. 2020. https://www.sciencedirect.com/science/article/abs/pii/S0141933119300158
- J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N.E. Jerger, A. Moshovos, Cnvlutin: ineffectual-neuron-free deep neural network computing, in: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 1–13. https://ieeexplore.ieee.org/document/7551378
- Y. Lu, C. Wang, L. Gong, X. Zhou, SparseNN: a performance-efficient accelerator for large-scale sparse neural networks, Int. J. Parallel Program. 46 (4) (2018) 648–659. https://arxiv.org/abs/1711.01263
- Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. 2016. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures. CoRR abs/1607.03250, (2016), https://arxiv.org/abs/1607.03250 (Skips entire neurons if the value is expected to be zero.)
- Gil Shomron, Ron Banner, Moran Shkolnik, and Uri Weiser. 2020. Thanks for nothing: Predicting zero-valued activations with lightweight convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV). Springer, 234–250. https://arxiv.org/abs/1909.07636 (Method is "Zero Activation Prediction" to skip zero-valued computations.)
- Weijie Chen, Yuan Zhang, Di Xie, and Shiliang Pu. 2019. A layer decomposition-recomposition framework for neuron pruning towards accurate lightweight networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3355–3362. https://arxiv.org/abs/1812.06611 (Layerwise dynamic structural pruning of unimportant neurons.)
- Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, and Tomoaki Nishimura. 2020. Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error. IJCAI. https://arxiv.org/abs/1808.08558 (A type of structured pruning based on information loss metrics.)
- Hong-Yi Wang, and Tian-Sheuan Chang, 2022, Row-wise Accelerator for Vision Transformer, https://arxiv.org/pdf/2205.03998.pdf
- C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, “DeltaRNN: A power-efficient recurrent neural network accelerator,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 21–30. PDF: https://dl.acm.org/doi/pdf/10.1145/3174243.3174261
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
- Hengyuan Hu. 2016. Papers with Code. Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures. https://paperswithcode.com/paper/network-trimming-a-data-driven-neuron-pruning (2021).
- Xitong Gao. 2019. Papers with Code. Dynamic Channel Pruning: Feature Boosting and Suppression. 2021, https://paperswithcode.com/paper/dynamic-channel-pruning-feature-boosting-and
- R Sharifi, P Shiri, A Baniasadi, 2020, Zero-skipping in CapsNet. Is it worth it? https://5wwwww.easychair.org/publications/download/f99W
- M. P. Véstias, R. P. Duarte, J. T. de Sousa, and H. C. Neto, 2019, “Fast convolutional neural networks in low density FPGAs using zero-skipping and weight pruning,” Electronics, vol. 8, no. 11, p. 1321, Nov. 2019. https://www.mdpi.com/2079-9292/8/11/1321
- David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, 4 Jan 2024 (v2), LLM in a flash: Efficient Large Language Model Inference with Limited Memory, https://arxiv.org/abs/2312.11514 (Storing model parameters in flash memory on phones.)
- S.-J. Lee, T.-H. Kim, 15 January 2024, Latency and accuracy optimization for binary neural network inference with locality-aware operation skipping, https://doi.org/10.1049/ell2.13090 https://ietresearch.onlinelibrary.wiley.com/doi/pdf/10.1049/ell2.13090 https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/ell2.13090
- Yuzong Chen, Jian Meng, Jae-sun Seo, Mohamed S. Abdelfattah, 8 Sep 2024, BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration, https://arxiv.org/abs/2409.05227
High-Level Zero Skipping
High-level zero skipping is closely related to the various types of model pruning. Pruning of LLMs can be done on four dimensions:
- Lengthwise pruning (e.g., token pruning)
- Width pruning (e.g., attention head pruning, channel pruning, filter pruning)
- Depth pruning (e.g., layer pruning, early exiting, layer fusion)
- Embeddings pruning
These four approaches are orthogonal, so they can also be combined as dual pruning, triple pruning, or quadruple pruning.
Research Papers on High-Level Zero Skipping
There are some research papers that examine skipping zeros in more detail than simply pruning an entire structural component of the LLM. Papers on zero skipping at a high level in model structures include:
- C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, “DeltaRNN: A power-efficient recurrent neural network accelerator,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 21–30. PDF: https://dl.acm.org/doi/pdf/10.1145/3174243.3174261 (Refers to zero-skipping at a high-level, skipping an entire column or row.)
- M. P. Véstias, R. P. Duarte, J. T. de Sousa, and H. C. Neto, “Fast convolutional neural networks in low density FPGAs using zero-skipping and weight pruning,” Electronics, vol. 8, no. 11, p. 1321, Nov. 2019. https://www.mdpi.com/2079-9292/8/11/1321 (High-level zero-skipping of activations with zero weights.)
- Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, Federico Corradi, Alejandro Linares-Barranco, Shih-Chii Liu, Tobi Delbruck, "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps", IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644-656, Mar. 2019. https://arxiv.org/abs/1706.01406
- S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, Y. Chen, Cambricon-x: an accelerator for sparse neural networks, in: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, 2016, p. 20. https://ieeexplore.ieee.org/document/7783723
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, EIE: efficient inference engine on compressed deep neural network, in: Proceedings of the 43rd International Symposium on Computer Architecture, Seoul, 2016, pp. 243–254. https://arxiv.org/abs/1602.01528
- A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, W.J. Dally, SCNN: an accelerator for compressed-sparse convolutional neural networks, in: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, 2017, pp. 27–40. https://arxiv.org/abs/1708.04485
- D. Kim, J. Ahn and S. Yoo, "ZeNA: Zero-aware neural network accelerator", IEEE Des. Test, vol. 35, no. 1, pp. 39-46, Feb. 2018. https://ieeexplore.ieee.org/document/8013151
- Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Uses a "greedy interleaving" algorithm for processing sparse matrices to avoid zero multiplications.)
- P. Grigoras, P. Burovskiy, E. Hung, and W. Luk. Accelerating SpMV on FPGAs by compressing nonzero values. In International Symposium on Field Programmable Gate Arrays, pages 64–67, 2015. https://ieeexplore.ieee.org/document/7160041 (Sparse multiplication of non-zero values, skipping zeros.)
- M. Song, J. Zhao, Y. Hu, J. Zhang, and T. Li., Prediction based execution on deep neural networks. In International Symposium on Computer Architecture, pages 752–763, 2018, https://ieeexplore.ieee.org/document/8416870 (Attempts to predict and avoid zero-valued operands for multiplication in hardware.)
- JA Chen, W Niu, B Ren, Y Wang, X Shen, 2023, Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Survey paper covering various data redundancy optimizations such as skipping or reusing computations for similar data.)
- Mingcong Song; Jiechen Zhao; Yang Hu; Jiaqi Zhang; Tao Li, 2018, Prediction based execution on deep neural networks, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/abstract/document/8416870/, https://www.researchgate.net/profile/Mingcong-Song/publication/326566905_Prediction_Based_Execution_on_Deep_Neural_Networks/links/5bd68551a6fdcc3a8dad72ff/Prediction-Based-Execution-on-Deep-Neural-Networks.pdf
- H Park, D Kim, J Ahn, S Yoo, 2016, Zero and data reuse-aware fast convolution for deep neural networks on GPU, 2016 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), https://dl.acm.org/doi/abs/10.1145/2968456.2968476, https://ieeexplore.ieee.org/document/7750981 (Zero-skipping by prediction of the results.)
Negative Skipping (Predictive Dot Product Optimization)
Negative skipping is not the skipping of negative weights, and zeroing out negative values already has a fancier name, "RELU" (see activation functions). Instead, negative skipping is an attempt to predict which vector dot product computations will be negative, and to skip doing them, because the result will be zero anyway once it is sent to RELU. Hence, negative skipping with RELU is a type of zero skipping.
It should be noted that an individual multiplication of two vector elements can be determined to be negative simply by examining both sign bits with an XOR operation (provided that neither operand is zero). This is true for both integer and floating-point computations. However, the full dot product is the sum of many such products, and its sign is less easily predicted (as everyone in AI knows).
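The sign-bit test for a single product can be sketched as follows. The function names are hypothetical; the integer version assumes a two's complement representation (guaranteed from C++20, and near-universal before that), and the floating-point version assumes neither operand is NaN. Both versions treat a zero operand separately, because the product is then zero rather than negative.

```cpp
#include <cmath>

// Hypothetical helper: true if a * b would be strictly negative,
// decided from the sign bits alone, without multiplying.
bool float_product_is_negative(float a, float b)
{
    if (a == 0.0f || b == 0.0f) return false;    // product is zero
    return std::signbit(a) != std::signbit(b);   // XOR of the sign bits
}

// Integer variant: the XOR of two ints is negative exactly when
// their sign bits differ.
bool int_product_is_negative(int a, int b)
{
    if (a == 0 || b == 0) return false;          // product is zero
    return (a ^ b) < 0;
}
```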
Various approaches to make approximate predictions about negative or non-negative dot products have been tried, such as:
- Vector dot product caching and nearest-neighbor lookups (see vector hashing).
- Ordering computations largest to smallest (with approximation and thresholds).
- Examining sign bits first (if a high percentage of pairwise element multiplications are negatives, the overall dot product is more likely to be negative).
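The last idea in the list above, examining sign bits first, could be sketched roughly as below. This is an assumption-laden illustration, not the method of any paper listed here: the function name and the `negative_fraction_threshold` parameter are invented for the example, and the prediction is approximate, so it can wrongly return zero when a few large positive products outweigh many small negative ones.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical sketch: RELU(dot(w, x)) with a cheap sign-bit scan first.
// If the fraction of negative pairwise products exceeds the threshold,
// the dot product is predicted to be negative and skipped (RELU gives 0).
float relu_dot_negative_skip(const float* w, const float* x, std::size_t n,
                             float negative_fraction_threshold)
{
    std::size_t negatives = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const bool nonzero = (w[i] != 0.0f) && (x[i] != 0.0f);
        if (nonzero && std::signbit(w[i]) != std::signbit(x[i])) ++negatives;
    }
    if (static_cast<float>(negatives) >
        negative_fraction_threshold * static_cast<float>(n)) {
        return 0.0f;  // predicted negative: skip all the multiplications
    }
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += w[i] * x[i];
    return sum > 0.0f ? sum : 0.0f;  // RELU
}
```

In software the sign scan is itself a full pass over both vectors, so any benefit is more plausible in hardware, where sign bits can be inspected far more cheaply than a multiplier can run.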
Research papers on negative skipping include:
- Duvindu Piyasena, Rukshan Wickramasinghe, Debdeep Paul, Siew Kei Lam, and Meiqing Wu. 2019. Reducing dynamic power in streaming CNN hardware accelerators by exploiting computational redundancies. Proceedings 29th International Conference on Field-Programmable Logic and Applications, FPL 2019 (9 2019), 354–359, https://ieeexplore.ieee.org/document/8891989 PDF: https://siewkeilam.github.io/ei-research-group/Paper/2019H-Duvindu-FPL.pdf (This is "negative skipping", similar to zero-skipping, where cheap estimates avoid computations that would be negative, which would thereby be reduced to zero by RELU activation.)
- T. Ujiie, M. Hiromoto, and T. Sato. 2016. Approximated Prediction Strategy for Reducing Power Consumption of Convolutional Neural Network Processor. Conf. on Comp. Vision and Pattern Recog. Workshops (CVPRW), 870–876. https://ieeexplore.ieee.org/document/7789603 https://openaccess.thecvf.com/content_cvpr_2016_workshops/w14/papers/Ujiie_Approximated_Prediction_Strategy_CVPR_2016_paper.pdf (Does "negative skipping" by quickly approximating the value of a convolution to skip it entirely if expected to be negative.)
- Vahideh Akhlaghi, Amir Yazdanbakhsh, Kambiz Samadi, Rajesh K. Gupta, and Hadi Esmaeilzadeh. 2018. SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 662–673. https://doi.org/10.1109/ISCA.2018.00061
- Yu Zhang, Dajiang Liu, and Yongkang Xing. 2021. Dynamic Convolution Pruning Using Pooling Characteristic in Convolution Neural Networks. In Neural Information Processing (Communications in Computer and Information Science), Teddy Mantoro, Minho Lee, Media Anugerah Ayu, Kok Wai Wong, and Achmad Nizar Hidayanto (Eds.). Springer International Publishing, Cham, 558–565. https://doi.org/10.1007/978-3-030-92307-5_65
- David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- S.-J. Lee, T.-H. Kim, 15 January 2024, Latency and accuracy optimization for binary neural network inference with locality-aware operation skipping, https://doi.org/10.1049/ell2.13090 https://ietresearch.onlinelibrary.wiley.com/doi/pdf/10.1049/ell2.13090 https://ietresearch.onlinelibrary.wiley.com/doi/full/10.1049/ell2.13090
- Jiho Shin, Hoeseok Yang, Youngmin Yi, 19 Nov 2024, SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference, https://arxiv.org/abs/2411.12692
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer Importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end):
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Research
Read more about:
- Conditional Computation Models
- Zero-Multiplication Models
- Approximate Computing
- Inference Optimizations
- Loop Optimizations
- Code Optimizations