Aussie AI

Why Structured Pruning?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Why Structured Pruning?

As we saw in Chapter 33, the downside of unstructured magnitude pruning was that it was inherently unpredictable which weights would be zeroed. This motivates “structured pruning,” where we prune whole structures in the model, such as layers, filters, or channels. In structured pruning, we always know which parts of the model will be zeroed, which makes it a much more controlled type of pruning.

Smaller and Faster. There is no problem with storing a smaller model or running faster with structured pruning. If we remove a whole layer, as in layer pruning, then we simply don't store that entire layer in the model file. The speedup from structured pruning is also relatively easy to achieve, and roughly proportional to how much we've pruned. For example, with layer pruning, we simply don't run the pruned layer at runtime, as shown in the sketch below. The changes to the Transformer's runtime inference algorithm are actually quite minor to implement.
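
As a rough sketch of how minor that runtime change is, here is a simplified C++ inference loop that skips pruned layers. The Layer structure, the apply_layer stub, and the pruned flag are hypothetical illustrations for this sketch, not code from any particular engine:

    #include <vector>

    // Hypothetical per-layer state; a real engine would also store the
    // attention weights, FFN weights, and normalization parameters here.
    struct Layer {
        bool pruned = false;  // set at model-load time for pruned layers
    };

    // Stub standing in for one full Transformer layer (attention + FFN).
    void apply_layer(const Layer& /*layer*/, std::vector<float>& /*activations*/)
    {
        // ... real attention and feed-forward computations go here ...
    }

    // The inference pass: the only change needed for layer pruning is
    // the single test that skips over layers marked as pruned.
    void run_layers(const std::vector<Layer>& layers,
                    std::vector<float>& activations)
    {
        for (const Layer& layer : layers) {
            if (layer.pruned) continue;  // don't run the pruned layer
            apply_layer(layer, activations);
        }
    }

In practice, a pruned layer wouldn't even be stored in the model file, so the loaded layer list could simply be shorter, which removes even this single test.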

Disadvantages. The downside of structured pruning is that it lacks flexibility, and the model cannot always overcome the limitations of pruned structures by re-training (i.e., post-pruning fine-tuning). There are various mitigations whereby we choose not to prune the most important structures. For example, research shows that the first layers of a Transformer are usually more important in doing the main analysis, whereas the final few layers do the finessing. However, even with such mitigations, the model inherently has fewer options to re-train itself around the removed weights. Hence, any type of structured pruning may make the model smaller and faster, but also less accurate. Nevertheless, various types of structured pruning in research papers have achieved impressive speedups with minimal accuracy degradation.
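
To make one such mitigation concrete, here is a hedged sketch of a simple layer-selection policy: protect the first layers that do the main analysis and the final few that do the finessing, pruning only from the middle of the stack. The function name, the default counts, and the back-to-front pruning order are assumptions for illustration; real selections would usually be guided by measured layer importance:

    #include <cstddef>
    #include <vector>

    // Illustrative policy only: never prune the first 'protect_front'
    // layers or the last 'protect_back' layers; prune up to
    // 'num_to_prune' layers from the unprotected middle, back first.
    std::vector<bool> choose_layers_to_prune(std::size_t num_layers,
                                             std::size_t num_to_prune,
                                             std::size_t protect_front = 4,
                                             std::size_t protect_back = 2)
    {
        std::vector<bool> prune(num_layers, false);
        if (num_layers <= protect_front + protect_back)
            return prune;  // nothing safe to prune
        std::size_t pruned = 0;
        for (std::size_t i = num_layers - protect_back;
             i-- > protect_front && pruned < num_to_prune; ) {
            prune[i] = true;  // mark this middle layer for pruning
            ++pruned;
        }
        return prune;
    }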

Combined structured and unstructured pruning. It is theoretically possible to combine both structured and unstructured pruning, since they are mostly orthogonal techniques. Structured pruning removes whatever structure is being pruned (with a dozen options!), and the remaining weights in the non-pruned structures could then also be subjected to unstructured magnitude pruning, simply by zeroing any tiny fractional weights (see the sketch below). I really don't recall having seen this yet in the research papers, but it may have been done, or someone might want to try it.
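
As a sketch of the unstructured half of that combination, the weights that survive structured pruning could be thresholded as below. The magnitude_prune function and the example threshold of 0.01 are illustrative assumptions:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Unstructured magnitude pruning over whatever weights survived the
    // structured pruning: zero out any weight with a tiny magnitude.
    std::size_t magnitude_prune(std::vector<float>& weights, float threshold)
    {
        std::size_t zeroed = 0;
        for (float& w : weights) {
            if (std::fabs(w) < threshold) {
                w = 0.0f;  // near-zero weight contributes little
                ++zeroed;
            }
        }
        return zeroed;  // how many weights were zeroed
    }

For example, magnitude_prune(weights, 0.01f) would zero every remaining weight with magnitude below 0.01, on top of whatever structures were already removed.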

