Aussie AI
Dynamic Structured Pruning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Dynamic pruning refers to pruning network weights, links, or entire layers at runtime during inference. This differs from “static pruning,” which is done offline, either during training or as a post-training optimization, to create a modified model. The types of dynamic pruning may include:
- Dynamic depth pruning: Skipping inference of entire layers of the model via an “early exit” from the inference loop (see the first sketch after this list). See also depth pruning, layer pruning, layer skipping, layer fusion, and shallow decoders.
- Dynamic width pruning: Dynamically reducing the “width” of the model based on the input (see the second sketch after this list). See also width pruning, attention head pruning, channel pruning, and filter pruning.
- Dynamic length pruning: Adapting internal dimensions related to tokens, embeddings, etc., to the input (see the third sketch after this list). See also length pruning, token pruning, embeddings pruning, and autoregressive algorithms.
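For concreteness, below is a minimal C++ sketch of dynamic depth pruning via early exit. The toy Layer type and the should_exit_early() confidence test are illustrative assumptions, not code from this book; a real criterion would examine the intermediate activations or logits of the actual model.

    // Minimal sketch of dynamic depth pruning via early exit.
    // Layer and should_exit_early() are illustrative stand-ins.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Layer {
        // Toy "layer": scales every activation (stand-in for a real Transformer layer).
        float scale = 0.9f;
        void forward(std::vector<float>& activations) const {
            for (float& a : activations) a *= scale;
        }
    };

    // Hypothetical exit criterion: exit once the largest activation magnitude
    // falls below a threshold (a stand-in for a real confidence measure).
    static bool should_exit_early(const std::vector<float>& activations, float threshold) {
        float max_abs = 0.0f;
        for (float a : activations) max_abs = std::max(max_abs, std::fabs(a));
        return max_abs < threshold;
    }

    int main() {
        std::vector<Layer> layers(12);       // e.g., a 12-layer stack
        std::vector<float> activations = { 1.0f, -0.5f, 0.25f };
        const float exit_threshold = 0.4f;

        int layers_run = 0;
        for (const Layer& layer : layers) {
            layer.forward(activations);
            ++layers_run;
            if (should_exit_early(activations, exit_threshold)) {
                break;                       // dynamic depth pruning: skip the remaining layers
            }
        }
        std::printf("Ran %d of %zu layers\n", layers_run, layers.size());
        return 0;
    }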
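In the same spirit, here is a minimal sketch of dynamic width pruning, skipping attention heads whose importance score for the current input falls below a threshold. Both run_head() and head_importance() are hypothetical stand-ins for a real per-head computation and scoring method.

    // Minimal sketch of dynamic width pruning: skipping attention heads per input.
    // The per-head importance score and threshold are illustrative, not from this book.
    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Toy per-head computation (stand-in for a full attention head).
    static float run_head(const std::vector<float>& input, int head) {
        float sum = 0.0f;
        for (float x : input) sum += x * static_cast<float>(head + 1);
        return sum;
    }

    // Illustrative importance score for a head on this particular input.
    static float head_importance(const std::vector<float>& input, int head) {
        float sum = 0.0f;
        for (float x : input) sum += std::fabs(x);
        return sum / static_cast<float>(input.size() * (head + 1));
    }

    int main() {
        const int num_heads = 8;
        const float importance_threshold = 0.1f;
        std::vector<float> input = { 0.6f, -0.2f, 0.9f, 0.1f };

        int heads_run = 0;
        float output = 0.0f;
        for (int h = 0; h < num_heads; ++h) {
            // The decision logic costs a little per head; skipping a head saves much more.
            if (head_importance(input, h) < importance_threshold) {
                continue;                    // dynamic width pruning: skip this head
            }
            output += run_head(input, h);
            ++heads_run;
        }
        std::printf("Ran %d of %d heads, output=%f\n", heads_run, num_heads, output);
        return 0;
    }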
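Finally, a minimal sketch of dynamic length pruning as token pruning: tokens whose importance score is too low are dropped so that later layers process a shorter sequence. The Token struct and its embedding_norm score are illustrative assumptions only.

    // Minimal sketch of dynamic length pruning via token pruning.
    // The importance signal (embedding_norm) is an illustrative stand-in.
    #include <cstdio>
    #include <vector>

    struct Token {
        int id;
        float embedding_norm;   // stand-in for a real importance signal (e.g., attention mass)
    };

    // Keep only tokens whose score reaches the threshold.
    static std::vector<Token> prune_tokens(const std::vector<Token>& tokens, float threshold) {
        std::vector<Token> kept;
        for (const Token& t : tokens) {
            if (t.embedding_norm >= threshold) kept.push_back(t);
        }
        return kept;
    }

    int main() {
        std::vector<Token> tokens = {
            { 101, 0.9f }, { 2023, 0.2f }, { 2003, 0.05f }, { 1037, 0.7f }, { 102, 0.8f }
        };
        const float keep_threshold = 0.3f;

        std::vector<Token> pruned = prune_tokens(tokens, keep_threshold);
        std::printf("Kept %zu of %zu tokens\n", pruned.size(), tokens.size());
        // Later layers now run over the shorter sequence in "pruned".
        return 0;
    }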
Note that all types of dynamic pruning incur some extra inference cost from the calculations that decide whether or not to prune. The hope is that the benefit of pruning will exceed the cost of the decision logic. For example, evaluating an “early exit” criterion for layer pruning requires extra computation at each layer, which is hopefully recouped by skipping layers often enough.
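As a rough back-of-the-envelope illustration of that trade-off, the snippet below compares running every layer against an early-exit run that pays a per-layer decision cost but stops partway through the stack. All cost numbers are arbitrary illustrative units, not measurements.

    // Rough cost model of the early-exit trade-off; all numbers are illustrative.
    #include <cstdio>

    int main() {
        const double layer_cost = 1000.0;    // cost of running one layer (arbitrary units)
        const double decision_cost = 50.0;   // extra cost of the exit check at each layer
        const int num_layers = 12;
        const double exit_after = 8.0;       // average layer at which the model exits

        // Baseline: run every layer with no decision logic.
        double baseline = num_layers * layer_cost;

        // Early exit: pay the decision cost at each executed layer, skip the rest.
        double with_exit = exit_after * (layer_cost + decision_cost);

        std::printf("baseline=%.0f, early exit=%.0f, saving=%.0f\n",
                    baseline, with_exit, baseline - with_exit);
        return 0;
    }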