Aussie AI
Pros and Cons of Pruning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
The goal of pruning is efficiency in both time and space, not improved accuracy. The main downside is a small reduction in model accuracy.
Smaller model. Zero weights needn't be stored in memory or in the model file. Since we don't need them, the model file should be smaller. This is what is meant by “model compression” as a general category, and pruning achieves it by removing the zero weights.
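As a rough sketch of the file-size saving, only the non-zero weights need to be written out, each tagged with its position. This is a hypothetical serializer for illustration, not a real model file format:

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical sparse serializer: write only the non-zero weights
    // as (index, value) pairs, so a heavily pruned model file shrinks
    // roughly in proportion to its sparsity.
    void save_sparse_weights(const std::vector<float>& weights, FILE* fp)
    {
        uint32_t count = 0;
        for (float w : weights)
            if (w != 0.0f) ++count;
        fwrite(&count, sizeof(count), 1, fp);  // header: number of non-zeros
        for (uint32_t i = 0; i < weights.size(); ++i) {
            if (weights[i] != 0.0f) {
                fwrite(&i, sizeof(i), 1, fp);               // position
                fwrite(&weights[i], sizeof(float), 1, fp);  // weight value
            }
        }
    }

Note the trade-off: each stored weight now costs an index plus a value (8 bytes rather than 4), so this naive layout only pays off once more than half the weights are zero.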
Faster inference. Inference execution should be faster, because any multiplication by a zero weight can be simply skipped. Hence, latency should be reduced, and throughput increased.
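The arithmetic being skipped is easiest to see in a dot product, the kernel inside matrix-vector multiplication. A minimal sketch (illustrative, not production engine code):

    #include <cstddef>

    // Dot product that skips multiplications by zero weights.
    // Multiplying by 0 gives 0, and adding 0 changes nothing,
    // so pruned (zeroed) weights contribute no work at all.
    float dot_product_skip_zeros(const float* weights, const float* inputs, size_t n)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (weights[i] != 0.0f) {   // skip zero weights entirely
                sum += weights[i] * inputs[i];
            }
        }
        return sum;
    }

Note that the zero test itself costs a branch per element, which is part of the practical difficulty discussed below.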
Memory efficiency. The model should also be smaller when loaded into memory, which further helps reduce execution cost in memory-bound engines by cutting the time wasted transferring model weights from memory into the CPU or GPU caches.
Reduced model accuracy. Generally, if you remove weights from a model, it loses some capability. Hence, a pruned model is an approximation of the non-pruned model, trading slightly degraded accuracy for faster inference. However, if pruning is done during training, or followed by re-training via fine-tuning, the impact on model accuracy is much smaller.
All about that zero. Pruning rests on the facts that multiplication by zero is zero, and that adding zero can be skipped. Other algebraic optimizations spring to mind, such as multiplication by 1 leaving a value unchanged, so that the multiplication could become a simple assignment. However, pruning is only about zeros, and about converting near-zeros to zero. Various other optimizations of this kind are done by quantization or ML compilers, but that's not pruning.
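To make those algebraic identities concrete, here is a hedged sketch of a vector-scaling routine that special-cases both of them; the function is illustrative, not code from any particular inference engine:

    #include <cstddef>
    #include <cstring>

    // Scale a vector by a constant factor, special-casing the
    // algebraic identities: x * 1 == x (assignment suffices) and
    // x * 0 == 0 (the result is all zeros, no multiplies needed).
    void scale_vector(float* dest, const float* src, float factor, size_t n)
    {
        if (factor == 1.0f) {
            std::memcpy(dest, src, n * sizeof(float));  // multiply by 1 is just a copy
        } else if (factor == 0.0f) {
            std::memset(dest, 0, n * sizeof(float));    // IEEE 754: all-zero bits == 0.0f
        } else {
            for (size_t i = 0; i < n; ++i)
                dest[i] = src[i] * factor;              // general case
        }
    }

Only the zero cases belong to pruning proper; the multiply-by-one case is the kind of optimization handled elsewhere, e.g. by an ML compiler.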
Practical difficulties. As detailed later in this chapter, it's quite involved to actually make a pruned model smaller and faster. For example, storing the model file with run-length encoding or Huffman encoding doesn't reduce its in-memory size, because the model must be decompressed before it can compute anything. As for speed, we can perhaps skip zeros quickly in sequential code, but if we send a vector with a mix of zeros and non-zeros to the GPU, it's just going to process the whole thing anyway. Spoiler alert: permuted arrays are key.
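As a taste of where the chapter is heading, here is a generic sparse-vector sketch in which only the non-zero weights are stored, each with its index, so sequential code never touches a zero and never tests for one. This is plain index-value sparsity, not the permuted-array layout covered later:

    #include <cstddef>
    #include <vector>

    // Sparse weight vector stored as parallel arrays: only the
    // non-zero weights are kept, each paired with its position.
    struct SparseVector {
        std::vector<size_t> index;  // positions of the non-zero weights
        std::vector<float>  value;  // the non-zero weight values
    };

    // Dot product over the sparse weights: no zero tests and no
    // wasted multiplies, because the zeros were never stored.
    float sparse_dot_product(const SparseVector& w, const float* inputs)
    {
        float sum = 0.0f;
        for (size_t k = 0; k < w.value.size(); ++k) {
            sum += w.value[k] * inputs[w.index[k]];  // gather from the dense input
        }
        return sum;
    }

The catch is the irregular gather from the input vector, which is exactly what makes GPU-friendly sparse layouts harder than they look.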