Aussie AI
Pros and Cons of Pruning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
The goal of pruning is efficiency in both time and space, not improved accuracy. The main downside is a small reduction in model accuracy.
Smaller model. Zero weights needn't be stored in memory or in the model file. Since we don't need them, the model file should be smaller. This is what is meant by “model compression” as a general category, and pruning achieves it by removing the zero weights.
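As a rough sketch of the file-size saving, only the non-zero weights need to be written out, each tagged with its position. This is a hypothetical serializer for illustration, not a real model file format:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical sparse serializer: write only the non-zero weights
    // as (index, value) pairs, so a heavily pruned model file shrinks
    // roughly in proportion to its sparsity.
    void save_sparse_weights(const std::vector<float>& weights, FILE* fp)
    {
        uint32_t count = 0;
        for (float w : weights)
            if (w != 0.0f) ++count;
        fwrite(&count, sizeof(count), 1, fp);  // header: number of non-zeros
        for (uint32_t i = 0; i < weights.size(); ++i) {
            if (weights[i] != 0.0f) {
                fwrite(&i, sizeof(i), 1, fp);               // position
                fwrite(&weights[i], sizeof(float), 1, fp);  // weight value
            }
        }
    }

Note the trade-off: each stored weight now costs an index plus a value (8 bytes rather than 4), so this naive layout only pays off once more than half the weights are zero.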
Faster inference. Inference execution should be faster, because any multiplication by a zero weight can be simply skipped. Hence, latency should be reduced, and throughput increased.
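The arithmetic being skipped is easiest to see in a dot product, the kernel inside matrix-vector multiplication. A minimal sketch (illustrative, not production engine code):

    #include <cstddef>

    // Dot product that skips multiplications by zero weights.
    // Multiplying by 0 gives 0, and adding 0 changes nothing,
    // so pruned (zeroed) weights contribute no work at all.
    float dot_product_skip_zeros(const float* weights, const float* inputs, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (weights[i] != 0.0f) {   // skip zero weights entirely
                sum += weights[i] * inputs[i];
            }
        }
        return sum;
    }

Note that the zero test itself costs a branch per element, which is part of the practical difficulty discussed below.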
Memory efficiency. The model should also be smaller when loaded into memory, which further helps reduce execution cost in memory-bound engines by cutting the time wasted transferring model weights from memory into the CPU or GPU caches.
Reduced model accuracy. Generally, if you remove weights from a model, it loses some capability. Hence, a pruned model is an approximation of the non-pruned model, trading slightly degraded accuracy for faster inference. However, if pruning is done during training, or followed by re-training via fine-tuning, the impact on model accuracy is much smaller.
All about that zero. Pruning rests on the facts that multiplication by zero is zero, and that adding zero can be skipped. Other algebraic optimizations spring to mind, such as multiplication by 1 leaving a value unchanged, so that the multiplication could become a simple assignment. However, pruning is only about zeros, and about converting near-zeros to zero. Various other optimizations of this kind are done by quantization or ML compilers, but that's not pruning.
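To make those algebraic identities concrete, here is a hedged sketch of a vector-scaling routine that special-cases both of them; the function is illustrative, not code from any particular inference engine:

    #include <cstddef>
    #include <cstring>

    // Scale a vector by a constant factor, special-casing the
    // algebraic identities: x * 1 == x (assignment suffices) and
    // x * 0 == 0 (the result is all zeros, no multiplies needed).
    void scale_vector(float* dest, const float* src, float factor, size_t n)
    {
        if (factor == 1.0f) {
            std::memcpy(dest, src, n * sizeof(float));  // multiply by 1 is just a copy
        } else if (factor == 0.0f) {
            std::memset(dest, 0, n * sizeof(float));    // IEEE 754: all-zero bits == 0.0f
        } else {
            for (size_t i = 0; i < n; ++i)
                dest[i] = src[i] * factor;              // general case
        }
    }

Only the zero cases belong to pruning proper; the multiply-by-one case is the kind of optimization handled elsewhere, e.g. by an ML compiler.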
Practical difficulties. As detailed later in this chapter, it's quite involved to actually make a pruned model smaller and faster. For example, storing the model file with run-length encoding or Huffman encoding doesn't reduce its in-memory size, because the model must be decompressed before it can compute anything. As for speed, we can perhaps skip zeros quickly in sequential code, but if we send a vector with a mix of zeros and non-zeros to the GPU, it's just going to process the whole thing anyway. Spoiler alert: permuted arrays are key.
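As a taste of where the chapter is heading, here is a generic sparse-vector sketch in which only the non-zero weights are stored, each with its index, so sequential code never touches a zero and never tests for one. This is plain index-value sparsity, not the permuted-array layout covered later:

    #include <cstddef>
    #include <vector>

    // Sparse weight vector stored as parallel arrays: only the
    // non-zero weights are kept, each paired with its position.
    struct SparseVector {
        std::vector<size_t> index;  // positions of the non-zero weights
        std::vector<float>  value;  // the non-zero weight values
    };

    // Dot product over the sparse weights: no zero tests and no
    // wasted multiplies, because the zeros were never stored.
    float sparse_dot_product(const SparseVector& w, const float* inputs)
    {
        float sum = 0.0f;
        for (size_t k = 0; k < w.value.size(); ++k) {
            sum += w.value[k] * inputs[w.index[k]];  // gather from the dense input
        }
        return sum;
    }

The catch is the irregular gather from the input vector, which is exactly what makes GPU-friendly sparse layouts harder than they look.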