Aussie AI

What is Model Pruning?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

What is Model Pruning?

Model pruning, or simply “pruning,” is a type of model compression for Large Language Models (LLMs) and other neural networks. Conceptually, this involves removal of weights within the model, which reduces the size of the model and the total computation required for model inference. The model runs smaller and faster to generate a response.

It's called “pruning” because it involves cutting the connections between two neurons (i.e. the arcs between two nodes). In a modern Transformer, this is done by setting values to zero in matrices or tensors. Recall that each weight in a tensor is just the value of an arc between two nodes in different layers.

Pruning is one of the main long-standing optimizations of AI models. It is one of the “big three” methods of “model compression” along with quantization and knowledge distillation.

The basic idea with pruning is to get rid of the less important weights, such as those with a tiny fractional value. We do this by rounding these near-zero weights down to exactly zero, because a zero or near-zero weight contributes almost nothing to a matrix multiplication or vector dot product. Obviously, any weight that's already zero is left as zero. A zero weight has no opinion on whether the next word should be “dog” or “cat.” Hence, any weights that have been pruned to zero can be skipped in the computations.

The main type of pruning simply changes very tiny weights to zero. This is the type of pruning that's easiest to understand, and it's called “magnitude pruning.” Small fractional weights are not adding much to any probabilities, so they might as well be zero.

This type of pruning is done during training: weights are examined during the training phases and dropped to zero if they're tiny and insignificant. Magnitude pruning usually works much better during training because it gives the model time to learn new paths around the pruned connections.
