Aussie AI

Zero Skipping

  • Last Updated 10 October, 2025
  • by David Spuler, Ph.D.

What is Zero Skipping?

Zero skipping is the avoidance of multiplication by zero weights. At a high level, this can mean skipping multiplication by an entire column of a matrix, or by an entire structure of the model (see also structural model pruning). At a low level, zero skipping means testing a single weight to see whether it is zero, thereby avoiding a wasteful multiplication-by-zero operation.

There are two types of zero skipping:

  • Low-level zero skipping (weight level)
  • High-level zero skipping (structure level)

Low-level zero skipping is closely related to sparsity and magnitude pruning.

High-level structural zero skipping is the avoidance of operations on whole structures. This is related to structured pruning and dynamic pruning.
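
As a rough illustration (our own sketch, assuming a row-major weight layout and a precomputed per-row flag, not a method from any particular paper), structural zero skipping in a matrix-vector product can skip whole rows that are entirely zero:

    #include <vector>
    #include <cstddef>

    // High-level zero skipping: skip entire all-zero rows of the weight matrix,
    // using a precomputed flag for each row (hypothetical row-major layout).
    void matvec_skip_zero_rows(const std::vector<float>& weights,    // rows*cols, row-major
                               const std::vector<bool>& row_is_zero, // one flag per row
                               const float* input, float* output,
                               std::size_t rows, std::size_t cols)
    {
        for (std::size_t r = 0; r < rows; ++r) {
            if (row_is_zero[r]) {      // structural skip: the whole row is zero
                output[r] = 0.0f;
                continue;
            }
            float sum = 0.0f;
            const float* row = &weights[r * cols];
            for (std::size_t c = 0; c < cols; ++c) {
                sum += row[c] * input[c];
            }
            output[r] = sum;
        }
    }

Pruning such a row away permanently is structured pruning; testing the flag at runtime, as above, is the dynamic form of the same idea.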

Zero skipping is a type of dynamic inference optimization. Related speedup strategies include:

Low-Level Zero Skipping

For low-level zero skipping, an incremental optimization is to test each weight against zero before multiplying, which avoids unnecessary multiplications by zero. Testing a register against zero is much faster than multiplication, and the multiplication hardware doesn't run any faster when an operand happens to be zero, so this is a "simple case first" optimization.
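
As a minimal sketch of this idea (our own illustration in C++, not code from any of the papers below), a zero-skipping version of a basic vector dot product looks something like this:

    #include <cstddef>

    // Dot product with low-level zero skipping: each weight is tested against
    // zero first, so multiplications by zero weights are never performed.
    float dot_product_zero_skip(const float* weights, const float* activations,
                                std::size_t n)
    {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (weights[i] != 0.0f) {          // "simple case first" pre-test
                sum += weights[i] * activations[i];
            }
        }
        return sum;
    }

Whether the pre-test actually wins depends on the fraction of zero weights and on whether the extra branch defeats vectorization; on hardware with fused multiply-accumulate, the comparison can easily cost more than the multiplication it avoids.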

Note that there's a whole class of research called "sparse matrices" or "sparsification" which aims to cut out whole swathes of zero multiplications, but the research below is lower-level than that. Read more about sparsification research.
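
For contrast, here is a minimal sketch of a CSR-style sparse row (a simplified illustration, not any particular library's format), where zero multiplications are avoided simply because the zero weights are never stored:

    #include <vector>
    #include <cstddef>

    // Minimal CSR-style sparse row: only the nonzero weights and their column
    // indices are stored, so there are no zero multiplications to skip.
    struct SparseRow {
        std::vector<float> values;          // nonzero weight values
        std::vector<std::size_t> columns;   // column index of each nonzero value
    };

    float sparse_row_dot(const SparseRow& row, const float* activations)
    {
        float sum = 0.0f;
        for (std::size_t i = 0; i < row.values.size(); ++i) {
            sum += row.values[i] * activations[row.columns[i]];
        }
        return sum;
    }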

There aren't many papers on this low-level topic of "zero skipping" of individual weights in inference arithmetic, and even in some of those papers it's not the central point. That's probably because hardware acceleration makes pre-testing for zeros at a small scale not worthwhile, whereas large-scale avoidance of zero multiplications appears in the research on "sparsification".

It seems like a similar pre-test idea could be applied to weights that are "1.0" or "-1.0", but we haven't found papers on that. These optimizations could all be performed in deep learning compilers.
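
As a speculative sketch of that idea (our own assumption, not drawn from the literature), the same loop could special-case +1.0 and -1.0 weights, replacing those multiplications with an addition or a subtraction:

    #include <cstddef>

    // Hypothetical extension of zero skipping: also special-case +1.0 and -1.0
    // weights, so those multiplications become an addition or a subtraction.
    float dot_product_special_cases(const float* weights, const float* activations,
                                    std::size_t n)
    {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            const float w = weights[i];
            if (w == 0.0f) {
                continue;                      // zero skipping
            } else if (w == 1.0f) {
                sum += activations[i];         // multiply by +1.0 becomes addition
            } else if (w == -1.0f) {
                sum -= activations[i];         // multiply by -1.0 becomes subtraction
            } else {
                sum += w * activations[i];     // general case
            }
        }
        return sum;
    }

On modern hardware the extra branches probably cost more than the multiplications they replace, which is one reason such special-casing is better suited to a deep learning compiler than to runtime inference code.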

Research on Low-Level Zero Skipping

Papers that incorporate the low-level optimization of zero-skipping directly into the inference code include:

High-Level Zero Skipping

High-level zero skipping is closely related to the various types of model pruning. Pruning of LLMs can be done on four dimensions:

And these four approaches are orthogonal, so there's also:

Research Papers on High-Level Zero Skipping

There are some research papers that examine skipping zeros in more detail than simply pruning an entire structural component of the LLM. Papers on zero skipping at a high level in model structures include:

  • C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, “DeltaRNN: A power-efficient recurrent neural network accelerator,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 21–30. PDF: https://dl.acm.org/doi/pdf/10.1145/3174243.3174261 (Refers to zero-skipping at a high-level, skipping an entire column or row.)
  • M. P. Véstias, R. P. Duarte, J. T. de Sousa, and H. C. Neto, “Fast convolutional neural networks in low density FPGAs using zero-skipping and weight pruning,” Electronics, vol. 8, no. 11, p. 1321, Nov. 2019. https://www.mdpi.com/2079-9292/8/11/1321 (High-level zero-skipping of activations with zero weights.)
  • Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, Federico Corradi, Alejandro Linares-Barranco, Shih-Chii Liu, Tobi Delbruck, "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps", IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644-656, Mar. 2019. https://arxiv.org/abs/1706.01406
  • S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, Y. Chen, Cambricon-x: an accelerator for sparse neural networks, in: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, 2016, p. 20. https://ieeexplore.ieee.org/document/7783723
  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, EIE: efficient inference engine on compressed deep neural network, in: Proceedings of the 43rd International Symposium on Computer Architecture, Seoul, 2016, pp. 243–254. https://arxiv.org/abs/1602.01528
  • A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, W.J. Dally, SCNN: an accelerator for compressed-sparse convolutional neural networks, in: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, 2017, pp. 27–40. https://arxiv.org/abs/1708.04485
  • D. Kim, J. Ahn and S. Yoo, "ZeNA: Zero-aware neural network accelerator", IEEE Des. Test, vol. 35, no. 1, pp. 39-46, Feb. 2018. https://ieeexplore.ieee.org/document/8013151
  • Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Uses a "greedy interleaving" algorithm for processing sparse matrices to avoid zero multiplications.)
  • P. Grigoras, P. Burovskiy, E. Hung, and W. Luk. Accelerating SpMV on FPGAs by compressing nonzero values. In International Symposium on Field Programmable Gate Arrays, pages 64–67, 2015. https://ieeexplore.ieee.org/document/7160041 (Sparse multiplication of non-zero values, skipping zeros.)
  • M. Song, J. Zhao, Y. Hu, J. Zhang, and T. Li, Prediction based execution on deep neural networks, in: 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 752–763, 2018, https://ieeexplore.ieee.org/document/8416870, PDF: https://www.researchgate.net/profile/Mingcong-Song/publication/326566905_Prediction_Based_Execution_on_Deep_Neural_Networks/links/5bd68551a6fdcc3a8dad72ff/Prediction-Based-Execution-on-Deep-Neural-Networks.pdf (Attempts to predict and avoid zero-valued operands for multiplication in hardware.)
  • JA Chen, W Niu, B Ren, Y Wang, X Shen, 2023, Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Survey paper covering various data redundancy optimizations such as skipping or reusing computations for similar data.)
  • H Park, D Kim, J Ahn, S Yoo, 2016, Zero and data reuse-aware fast convolution for deep neural networks on GPU, 2016 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), https://dl.acm.org/doi/abs/10.1145/2968456.2968476, https://ieeexplore.ieee.org/document/7750981 (Zero-skipping by prediction of the results.)

Negative Skipping (Predictive Dot Product Optimization)

Negative skipping is not the skipping of negative values (that already has a fancier name: RELU; see activation functions). Instead, negative skipping attempts to predict which vector dot product computations will give a negative result, and skips computing them, because a negative result would become zero anyway once passed through RELU. Hence, negative skipping combined with RELU is a type of zero skipping.

It should be noted that the sign of an individual product of two vector elements can be determined simply by examining both sign bits with an XOR operation. This is true for both integer and floating-point representations. However, the full dot product is the sum of many such products, and its sign is not so easily determined in advance (as everyone in AI knows).
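
As a small illustrative sketch (standard C++ only), the sign of a floating-point product can be checked without multiplying, either with std::signbit or by XORing the raw IEEE 754 sign bits:

    #include <cmath>      // std::signbit
    #include <cstdint>
    #include <cstring>    // std::memcpy

    // The product a*b is negative exactly when the sign bits of a and b differ
    // (ignoring the case where either operand is zero).
    bool product_is_negative(float a, float b)
    {
        return std::signbit(a) != std::signbit(b);    // logical XOR of sign bits
    }

    // Equivalent bit-level version: XOR the raw sign bits directly.
    bool product_is_negative_bits(float a, float b)
    {
        std::uint32_t ua, ub;
        std::memcpy(&ua, &a, sizeof ua);
        std::memcpy(&ub, &b, sizeof ub);
        return ((ua ^ ub) & 0x80000000u) != 0u;       // differing sign bits
    }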

Various approaches to make approximate predictions about negative or non-negative dot products have been tried, such as:

  • Vector dot product caching and nearest-neighbor lookups (see vector hashing).
  • Ordering computations largest to smallest (with approximation and thresholds).
  • Examining sign bits first (if a high percentage of pairwise element multiplications are negative, the overall dot product is more likely to be negative; see the sketch below).
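
As a hedged sketch of the third approach (the threshold and function name below are our own illustrative assumptions, not from any cited paper), a simple predictor might count how many pairwise products would be negative, and skip the full dot product when that fraction is high:

    #include <cmath>      // std::signbit
    #include <cstddef>

    // Heuristic negative-skipping predictor (illustrative only): count how many
    // pairwise products would be negative by comparing sign bits, and predict a
    // negative dot product (which RELU would zero anyway) if most of them are.
    bool predict_negative_dot_product(const float* weights, const float* activations,
                                      std::size_t n,
                                      float negative_fraction_threshold = 0.75f)
    {
        std::size_t negative_count = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (std::signbit(weights[i]) != std::signbit(activations[i])) {
                ++negative_count;              // this pairwise product is negative
            }
        }
        return negative_count >
               static_cast<std::size_t>(negative_fraction_threshold * n);
    }

    // Usage: if the predictor returns true, skip the dot product and output 0.0f,
    // the value that RELU would have produced from a negative result.

Note that this ignores the magnitudes entirely, so it is only an approximate prediction, and a wrong guess changes the output of that neuron.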

Research papers on negative skipping include:

More Research on Pruning Types

AI Books from Aussie AI



The Sweetest Lesson: Your Brain Versus AI: new book on AI intelligence theory:
  • Your brain is 50 times bigger than the best AI engines.
  • Truly intelligent AI will require more compute!
  • Another case of the bitter lesson?
  • Maybe it's the opposite of that: the sweetest lesson.

Get your copy from Amazon: The Sweetest Lesson



RAG Optimization: Accurate and Efficient LLM Applications: new book on RAG architectures:
  • Smarter RAG
  • Faster RAG
  • Cheaper RAG
  • Agentic RAG
  • RAG reasoning

Get your copy from Amazon: RAG Optimization



Generative AI Applications book:
  • Deciding on your AI project
  • Planning for success and safety
  • Designs and LLM architectures
  • Expediting development
  • Implementation and deployment

Get your copy from Amazon: Generative AI Applications



Generative AI in C++: Generative AI programming book:
  • Generative AI coding in C++
  • Transformer engine speedups
  • LLM models
  • Phone and desktop AI
  • Code examples
  • Research citations

Get your copy from Amazon: Generative AI in C++



CUDA C++ Optimization book:
  • Faster CUDA C++ kernels
  • Optimization tools & techniques
  • Compute optimization
  • Memory optimization

Get your copy from Amazon: CUDA C++ Optimization



CUDA C++ Debugging book:
  • Debugging CUDA C++ kernels
  • Tools & techniques
  • Self-testing & reliability
  • Common GPU kernel bugs

Get your copy from Amazon: CUDA C++ Debugging

More AI Research

Read more about: