Aussie AI

Zero Skipping

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Zero skipping is a particular type of adaptive inference that avoids multiplications by zero weights. The idea can be applied at a low level or a high level of the inference algorithm.

Low-Level Zero Skipping. At a low level, zero skipping means testing a single weight to see whether it is zero, thereby avoiding a wasteful multiplication-by-zero operation. Testing a register against zero is typically faster than a multiplication, because a multiplication doesn't go any faster when one operand is zero, so this is a “simple case first” optimization.

Note that there's a whole class of research on “sparse matrices” or “sparsification” that aims to cut whole swathes of zero multiplications, but the research below is lower-level than this.

There aren't many papers on this low-level topic of “zero skipping” of individual weights in inference arithmetic, and even in some of these papers it's not the central point. That's probably because hardware acceleration makes pre-testing individual values for zero not worth the cost, whereas large-scale avoidance of zero multiplications appears in the research on “sparsification”.

Research papers on low-level zero skipping:

  1. Y. Chen, J. Emer, and V. Sze, 2016, Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks, In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 367–379, https://ieeexplore.ieee.org/document/7551407
  2. Dongyoung Kim, Junwhan Ahn, and Sungjoo Yoo, 2018, ZeNA: Zero-aware neural network accelerator, IEEE Design & Test 35, 1 (2018), 39–46, https://doi.org/10.1109/MDAT.2017.2741463
  3. Xinlin Li, Bang Liu, Rui Heng Yang, Vanessa Courville, Chao Xing, Vahid Partovi Nia, 2022, DenseShift: Towards Accurate and Transferable Low-Bit Shift Network, Aug 2022, https://arxiv.org/abs/2208.09708
  4. Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan, 2021, GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator, In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21), 1110–1123, https://doi.org/10.1109/ISCA52012.2021.00090, https://ieeexplore.ieee.org/document/9499915
  5. S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, 2016, Cambricon: An instruction set architecture for neural networks, In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16), 393–405, https://ieeexplore.ieee.org/abstract/document/7551409
  6. Yuxiang Huan, Yifan Qin, Yantian You, Lirong Zheng, and Zhuo Zou, 2016, A multiplication reduction technique with near-zero approximation for embedded learning in IoT devices, 2016 29th IEEE International System-on-Chip Conference (SOCC), Sep 2016, 102–107, https://ieeexplore.ieee.org/abstract/document/7905445 (Avoids multiplications on near-zero values by counting the number of leading zeros in the floating-point representation using bitwise arithmetic.)
  7. Minkyu Kim and Jae Sun Seo, 2021, An energy-efficient deep convolutional neural network accelerator featuring conditional computing and low external memory access, IEEE Journal of Solid-State Circuits 56, 3 (2021), 803–813, https://ieeexplore.ieee.org/document/9229157 (Cascades and zero-skipping.)
  8. R. J. R. Struharik, B. Z. Vukobratovic, A. M. Erdeljan, and D. M. Rakanovic, 2020, CoNNa–Hardware accelerator for compressed convolutional neural networks, Microprocessors Microsyst., vol. 73, Mar. 2020, Art. no. 102991. https://ieeexplore.ieee.org/document/8491841
  9. Y.-H. Chen, T. Krishna, J.-S. Emer, and V. Sze, 2016, Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Nov. 2016, https://ieeexplore.ieee.org/document/7738524 (Uses zero-skipping as part of the improvements.)
  10. R. J. R. Struharik, B. Z. Vukobratović, A. M. Erdeljan and D. M. Rakanović, 2020, CoNNa–Hardware accelerator for compressed convolutional neural networks, Microprocessors Microsyst., vol. 73, Mar. 2020. https://www.sciencedirect.com/science/article/abs/pii/S0141933119300158
  11. J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N.E. Jerger, A. Moshovos, 2016, Cnvlutin: ineffectual-neuron-free deep neural network computing, in: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 1–13. https://ieeexplore.ieee.org/document/7551378
  12. Y. Lu, C. Wang, L. Gong, X. Zhou, 2018, SparseNN: a performance-efficient accelerator for large-scale sparse neural networks, Int. J. Parallel Program. 46 (4) (2018) 648–659. https://arxiv.org/abs/1711.01263

High-Level Zero Skipping. At a high level, zero skipping can mean avoiding all of the multiplications for an entire column of a matrix, or for an entire structure of the model (e.g., structural model pruning). Papers on zero skipping at a high level in model structures include:

  1. C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, 2018, DeltaRNN: A power-efficient recurrent neural network accelerator, in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 21–30. PDF: https://dl.acm.org/doi/pdf/10.1145/3174243.3174261 (Refers to zero-skipping at a high level, skipping an entire column or row.)
  2. M. P. Véstias, R. P. Duarte, J. T. de Sousa, and H. C. Neto, 2019, Fast convolutional neural networks in low density FPGAs using zero-skipping and weight pruning, Electronics, vol. 8, no. 11, p. 1321, Nov. 2019. https://www.mdpi.com/2079-9292/8/11/1321 (High-level zero-skipping of activations with zero weights.)
  3. Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, Federico Corradi, Alejandro Linares-Barranco, Shih-Chii Liu, Tobi Delbruck, 2019, NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps, IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644-656, Mar. 2019. https://arxiv.org/abs/1706.01406
  4. S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, Y. Chen, 2016, Cambricon-X: an accelerator for sparse neural networks, in: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, 2016, p. 20. https://ieeexplore.ieee.org/document/7783723
  5. S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, 2016, EIE: efficient inference engine on compressed deep neural network, in: Proceedings of the 43rd International Symposium on Computer Architecture, Seoul, 2016, pp. 243–254. https://arxiv.org/abs/1602.01528
  6. A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, W.J. Dally, 2017, SCNN: an accelerator for compressed-sparse convolutional neural networks, in: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, 2017, pp. 27–40. https://arxiv.org/abs/1708.04485
  7. D. Kim, J. Ahn and S. Yoo, 2018, ZeNA: Zero-aware neural network accelerator, IEEE Des. Test, vol. 35, no. 1, pp. 39-46, Feb. 2018. https://ieeexplore.ieee.org/document/8013151
  8. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Uses a “greedy interleaving” algorithm for processing sparse matrices to avoid zero multiplications.)
  9. P. Grigoras, P. Burovskiy, E. Hung, and W. Luk, 2015, Accelerating SpMV on FPGAs by compressing nonzero values, In International Symposium on Field Programmable Gate Arrays, pages 64–67, 2015. https://ieeexplore.ieee.org/document/7160041 (Sparse multiplication of non-zero values, skipping zeros.)
  10. Mingcong Song, Jiechen Zhao, Yang Hu, Jiaqi Zhang, and Tao Li, 2018, Prediction based execution on deep neural networks, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 752–763, https://ieeexplore.ieee.org/abstract/document/8416870/, https://www.researchgate.net/profile/Mingcong-Song/publication/326566905_Prediction_Based_Execution_on_Deep_Neural_Networks/links/5bd68551a6fdcc3a8dad72ff/Prediction-Based-Execution-on-Deep-Neural-Networks.pdf (Attempts to predict and avoid zero-valued operands for multiplication in hardware.)
  11. JA Chen, W Niu, B Ren, Y Wang, X Shen, 2023, Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Survey paper covering various data redundancy optimizations such as skipping or reusing computations for similar data.)
  12. H Park, D Kim, J Ahn, S Yoo, 2016, Zero and data reuse-aware fast convolution for deep neural networks on GPU, 2016 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), https://dl.acm.org/doi/abs/10.1145/2968456.2968476, https://ieeexplore.ieee.org/document/7750981 (Zero-skipping by prediction of the results.)

For more research on zero skipping, see also https://www.aussieai.com/research/zero-skipping.
