What is Width Pruning?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Width pruning is a type of structured pruning that reduces the “width” of the internal layers of a neural network. In early neural networks, the “width” was the number of internal neurons in the hidden layers. In modern Transformer architectures, width pruning usually refers to reducing the number of attention heads.

Like all types of pruning, the goal is both smaller models and faster inference. Model compression by width pruning reduces the size of the model, and thereby the memory needed to store it. It also reduces inference time by avoiding the computation associated with the pruned structures and the cost of memory-to-cache transfers of their data.
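
As a rough worked example of the size reduction, consider the Q, K, V and output projection matrices of a single attention layer, whose parameter counts scale with the number of heads. Below is a minimal sketch in C++; the dimensions are illustrative only and not taken from any particular model.

    #include <cstdio>

    int main() {
        // Illustrative dimensions for one attention layer (hypothetical model).
        const long d_model = 1024;   // embedding dimension
        const long d_head  = 64;     // dimension per attention head
        const long n_heads = 16;     // original number of heads
        const long n_kept  = 12;     // heads remaining after width pruning

        // The Q, K, V and output projections each hold d_model * (n_heads * d_head)
        // weights, so each head accounts for 4 * d_model * d_head parameters here.
        long per_head = 4 * d_model * d_head;
        long before   = per_head * n_heads;
        long after    = per_head * n_kept;

        printf("Attention parameters before pruning: %ld\n", before);
        printf("Attention parameters after pruning:  %ld\n", after);
        printf("Reduction: %.1f%%\n", 100.0 * (before - after) / before);
        return 0;
    }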

Width pruning can mean several different things that all affect the fan-out of data as it propagates through the hidden layers. The types and names include:

  • Attention head pruning (in Transformers)
  • Thin networks (static)
  • “Slimmable” neural networks (dynamic)
  • Channel pruning
  • Filter pruning

Width pruning techniques are orthogonal to “layer pruning” (depth pruning), in which some of the model's internal layers are removed entirely. Another orthogonal type of pruning is “length pruning,” such as token pruning and embeddings pruning.

Like all pruning methods, width pruning can be static or dynamic. Static width pruning removes channels or attention heads from the model during training or shortly afterwards. It is conceptually similar to choosing a smaller width hyper-parameter for the model as part of Neural Architecture Search (NAS).
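
One simple way to choose which structures to remove statically is magnitude-based ranking: score each channel (or head) by the L1 norm of its weights and drop the lowest-scoring ones. The sketch below is a minimal illustration of that idea, not code from the book.

    #include <vector>
    #include <numeric>
    #include <algorithm>
    #include <cmath>

    // Score each channel by the L1 norm of its weight vector and return the
    // indices of the channels to keep (the highest-magnitude ones).
    std::vector<int> select_channels_to_keep(
            const std::vector<std::vector<float>>& channel_weights,
            int num_to_keep)
    {
        std::vector<int> order(channel_weights.size());
        std::iota(order.begin(), order.end(), 0);

        auto l1_norm = [&](int c) {
            float sum = 0.0f;
            for (float w : channel_weights[c]) sum += std::fabs(w);
            return sum;
        };

        // Sort channel indices by descending L1 norm of their weights.
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return l1_norm(a) > l1_norm(b); });

        int keep = std::min<int>(num_to_keep, (int)order.size());
        order.resize(keep);                    // keep only the top-scoring channels
        std::sort(order.begin(), order.end()); // restore original channel order
        return order;
    }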

Dynamic width pruning creates a selective inference path by removing width at runtime, blocking or bypassing some elements of the model, usually depending on the input sequence. In dynamic attention head pruning, a Transformer chooses at runtime which attention heads to activate or suppress. It is also possible to dynamically prune the width of a non-Transformer model by deactivating channels or filters during inference.
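
Here is a minimal sketch of the dynamic case, assuming a hypothetical per-head scoring function supplied by the engine: each head receives an input-dependent importance score, and heads below a threshold are skipped entirely, so their computation is avoided for that inference call.

    #include <vector>
    #include <functional>

    // Hypothetical per-head attention and scoring functions; these stand in
    // for whatever the real inference engine provides.
    using HeadFn  = std::function<std::vector<float>(const std::vector<float>&)>;
    using ScoreFn = std::function<float(int /*head*/, const std::vector<float>&)>;

    // Run only the heads whose input-dependent score clears a threshold.
    // Each HeadFn is assumed to already apply its slice of the output
    // projection, so active head outputs can simply be summed.
    std::vector<float> dynamic_head_pruned_attention(
            const std::vector<float>& input,
            const std::vector<HeadFn>& heads,
            const ScoreFn& head_score,
            float threshold)
    {
        std::vector<float> output(input.size(), 0.0f);
        for (int h = 0; h < (int)heads.size(); ++h) {
            if (head_score(h, input) < threshold)
                continue;                        // dynamically prune this head
            std::vector<float> head_out = heads[h](input);
            for (size_t i = 0; i < output.size(); ++i)
                output[i] += head_out[i];        // combine outputs of active heads
        }
        return output;
    }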

 
