Types of Structured Pruning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Pick a structure, any structure. Open up the standard vanilla Transformer research paper (Vaswani et al., 2017) and find the diagram of the architecture. Close your eyes, and poke your finger somewhere in that diagram. Open your eyes again. I can show you research papers on pruning of whatever structure you're pointing at, and sometimes hundreds of papers (e.g. early exit).
There's an odd thing, though: none of those types of structured pruning have gone mainstream. The vast majority of pruning capabilities in open source frameworks are simply for training-based unstructured pruning, such as magnitude pruning or movement pruning. I find this surprising since several of the structured pruning techniques show significant efficiency gains with modest loss of model accuracy.
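For comparison, here is a minimal sketch of what magnitude pruning does in its simplest form: every weight whose absolute value falls below a threshold is zeroed out. The function name and the flat weight array are placeholders for illustration, not the API of any particular framework, and real frameworks typically prune during or after training and track a sparsity mask rather than overwriting the weights in place.

    #include <cmath>
    #include <cstddef>

    // Minimal sketch of unstructured magnitude pruning:
    // zero out every weight whose magnitude is below a threshold.
    void magnitude_prune(float* weights, std::size_t count, float threshold)
    {
        for (std::size_t i = 0; i < count; ++i) {
            if (std::fabs(weights[i]) < threshold) {
                weights[i] = 0.0f;   // pruned weight
            }
        }
    }

Structured pruning differs in that it removes whole structures (layers, heads, channels, and so on) rather than individual weights scattered throughout the matrices.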
The main types of structured pruning with significant research papers are:
- Layer pruning
- Early exit (i.e., dynamic layer pruning; see the sketch after this list)
- Attention head pruning
- Channel pruning
- Filter pruning
- Token pruning
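To give a flavor of dynamic layer pruning, below is a minimal sketch of early exit. It assumes a hypothetical Layer type with a forward method and a caller-supplied confidence estimate computed from the intermediate hidden state; all the names are illustrative placeholders, not any particular framework's API.

    #include <functional>
    #include <vector>

    // Hypothetical Transformer layer; forward() updates the hidden state in place.
    struct Layer {
        void forward(std::vector<float>& hidden_state) const { /* layer math here */ }
    };

    // Early exit (dynamic layer pruning): run the layer stack in order, but stop
    // as soon as a confidence estimate on the intermediate result clears a
    // threshold, skipping the remaining layers for this input.
    void run_with_early_exit(
            const std::vector<Layer>& layers,
            std::vector<float>& hidden_state,
            const std::function<float(const std::vector<float>&)>& exit_confidence,
            float confidence_threshold)
    {
        for (const Layer& layer : layers) {
            layer.forward(hidden_state);
            if (exit_confidence(hidden_state) >= confidence_threshold) {
                break;  // confident enough: skip the rest of the layers
            }
        }
    }

Static layer pruning removes the tail layers permanently; early exit makes the same decision dynamically, per input.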
Some of the less commonly pruned Transformer components include:
- Bias pruning
- Embeddings pruning
- FFN pruning
- Normalization pruning
- Softmax pruning
- Positional encoding pruning
Did I miss any?
There are also some other notable techniques that share the same goal of reducing the total number of weights and bear some similarity to pruning:
- Parameter sharing and layer fusion
- Low-rank matrices
Smaller matrices have fewer weights, so another technique is to cut weights by using smaller matrices. Advanced matrix algebra can be used to factorize the large matrices into smaller “low-rank” matrices, with fewer rows and columns (hence, fewer weights). This idea applied to tensors is called “tensor decomposition.”
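To see why low-rank factorization saves weights, consider replacing a single dense d_out x d_in weight matrix W with the product of two smaller matrices, U (d_out x r) and V (r x d_in), so that W*x is approximated by U*(V*x). The weight count drops from d_out*d_in to r*(d_out + d_in). The little program below just counts the weights; the dimensions and the chosen rank are illustrative, not taken from any particular model.

    #include <cstdio>

    int main()
    {
        // Illustrative dimensions: one dense d_out x d_in matrix versus
        // its rank-r factorization into U (d_out x r) and V (r x d_in).
        const long d_in = 4096;
        const long d_out = 4096;
        const long r = 256;  // chosen rank

        const long full_weights = d_out * d_in;            // original matrix
        const long low_rank_weights = r * (d_out + d_in);  // two smaller matrices

        std::printf("Full matrix:     %ld weights\n", full_weights);
        std::printf("Rank-%ld factors: %ld weights\n", r, low_rank_weights);
        // With these numbers: 16,777,216 versus 2,097,152 weights (an 8x reduction).
        return 0;
    }

The saving holds whenever the rank r is below d_out*d_in / (d_out + d_in); for square matrices, that means a rank below half the dimension.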