Aussie AI

33. Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“There’s no problem that can’t be ignored
if we really put our minds to it.”

King Ralph, 1991.

 

 

What is Model Pruning?

Model pruning, or simply “pruning,” is a type of model compression for Large Language Models (LLMs) and other neural networks. Conceptually, this involves removal of weights within the model, which reduces the size of the model and the total computation required for model inference. The model runs smaller and faster to generate a response.

It's called “pruning” because it involves cutting the connections between two neurons (i.e. the arcs between two nodes). In a modern Transformer, this is done by setting values to zero in matrices or tensors. Recall that each weight in a tensor is just the value of an arc between two nodes in different layers.

Pruning is one of the main long-standing optimizations of AI models. It is one of the “big three” methods of “model compression” along with quantization and knowledge distillation.

The basic idea with pruning is to get rid of the less important weights, such as those with a tiny fractional value. We do this by making these near-zero weights into exactly zero, because a zero or near-zero weight in a matrix multiplication or vector dot product adds nothing much. Obviously any weight that's already zero is left as zero. A zero weight has no opinion on whether the next word should be “dog” or “cat.” Hence, any weights that have been pruned to zero can be skipped in the computations.

The main type of pruning generally just changes very tiny weights to zero. This is the type of pruning that's easiest to understand, and it's called “magnitude pruning.” Small fractional weights are not adding much to any probabilities, so they might as well be zero.

This type of pruning is often done during training: some of the weights are examined during the training phases, and dropped to zero if they're tiny and insignificant. Magnitude pruning usually works much better during training because it gives the model time to learn new paths.

Pros and Cons of Pruning

Efficiency in terms of both time and space is the goal of pruning, rather than improved accuracy. The main downside is a small reduction in model accuracy.

Smaller model. Zero weights needn't be stored in memory or the model file. We don't need them, so the model file should be smaller. This is what is meant by “model compression” as a general category, and pruning achieves this by removing the zero weights.

Faster inference. Inference execution should be faster, because any multiplication by a zero weight can be simply skipped. Hence, latency should be reduced, and throughput increased.

Memory efficiency. The model should also be smaller when loaded into memory, which also further helps with reducing execution cost in memory-bound engines. This reduces the time overhead wasted in transferring model weights from memory into the CPU or GPU caches.

Reduced model accuracy. Generally, if you remove weights from the model, it will lose some capabilities. Hence, pruning is an approximation of the non-pruned model with slightly degraded accuracy as a trade-off for faster execution of inference. However, if we do pruning during training, or do re-training by fine-tuning after pruning, then the impact to model accuracy is much less.

All about that zero. Pruning is all about the fact that multiplication by zero is zero, and adding zero can be skipped. There are other algebraic optimizations that spring to mind, such as multiplication by 1 is the unchanged value, so multiplication could become simply assignment. However, pruning is only about zeros and converting near-zeros to zero. Various other types of optimizations are done by quantization or ML Compilers, but that's not pruning.

Practical difficulties. As detailed later in this chapter, it's quite involved to make a pruned model smaller and faster. For example, storing the model file using run-length encoding or Huffman encoding doesn't reduce in-memory size because the model must be uncompressed to calculate results. In regard to speed, we can maybe skip zeros quickly in sequential code, but if we send a vector to the GPU with a mix of zeros and non-zeros, it's just going to process the whole thing anyway. Spoiler alert: permuted arrays are key.

Unstructured and Structured Pruning

The main general categories of pruning are according to the strategy of which weights to prune:

  • Unstructured pruning. Remove low weights regardless of where they are in the model.
  • Structured pruning. Everything else, which means removing weight groups in a specific part of the structure of the model (e.g. all weights in a layer, channel, filter, etc.)

You can do both. It's called “hybrid pruning” or “semi-structured pruning.” For example, a hybrid strategy would be to do unstructured magnitude pruning of all the weights in one structure, rather than removing an entire structural component (e.g. layer-specific magnitude pruning).

Weight pruning or “magnitude pruning” is unstructured pruning that removes very low weights, including small negative weights. Technically, weight pruning is the general class of pruning decisions made about a single weight at a time, whereas magnitude pruning is the specific case of pruning based on cutting small near-zero absolute-value weights (using some threshold). Magnitude pruning also obviously removes zero weights, or the nebulous oddity called “negative-zero” if that is found.

Unstructured pruning will remove small weights anywhere it finds matching candidates. We don't know in which structure of the model there will be pruning. It becomes somewhat random, and at runtime, we don't know in any vector which elements will be zero.

The underlying basis of the pruning idea is that weights with a low magnitude are not contributing much to the overall inference, and hence skipping these calculations should not greatly affect the overall accuracy of the model. Theory says that this works quite well, and magnitude pruning is the simplest and most longstanding type of pruning, with a large body of research.

An important practical aspect of unstructured pruning is that it has no latency benefit unless multiplication by zeros in vector dot products can be efficiently avoided. For example, any benefit from zeroing some weights in a vector that is all still sent to an accelerator (i.e. with some zeroed elements) depends on the characteristics of the hardware, or on the algorithms used by the deep learning compiler (or graph compiler), and the latency improvement may be limited.

Instead of hoping that training gives us lots of small weights to prune, there are ways to cause that intentionally. For example, quantization techniques can also increase the number of zero weights throughout the quantized model. Quantization may round low-magnitude weights down to zero, with a smaller number of discrete weights possible. However, it depends on the level of quantization, and the choices made about rounding versus truncation in the quantization method. Additional research has been done on various other techniques to increase the number of zero weights or low weights, thereby making the weight matrices sparser. Research on such issues is addressed under “sparsification” and “sparse matrix” theory.
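
For example, here's a minimal sketch (my own illustration, assuming a simple symmetric INT8 scheme with round-to-nearest) of how quantization naturally rounds low-magnitude weights down to zero:

    // Sketch: symmetric INT8 quantization rounds near-zero weights to exactly zero.
    // Any weight with |w| < scale/2 quantizes to 0 (assumes round-to-nearest).
    #include <cmath>
    #include <cstdint>

    int8_t quantize_int8(float w, float scale)
    {
        long q = std::lround(w / scale);
        if (q > 127) q = 127;      // clamp to the INT8 range
        if (q < -128) q = -128;
        return (int8_t)q;
    }

With a per-tensor scale of 0.01 (i.e. a maximum weight magnitude around 1.27), every weight with magnitude below 0.005 becomes a zero in the quantized model.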

Types of Unstructured Pruning

There are many variations of unstructured pruning algorithms. As discussed further below, magnitude pruning is a first-order method because it considers only the current weight value, whereas “movement pruning” is a second-order method because it considers the rate of change during training. Also possible are third-order methods based on the second derivative, which can converge faster, but are computationally expensive.

All of these unstructured pruning methods have meta-parameters, which are threshold values that alter the pruning characteristics. Magnitude pruning could also have different thresholds for positive and negative values. The magnitude pruning threshold is often set indirectly, via a particular target sparsity level, rather than as a fixed magnitude cutoff. These meta-parameters may also change over time: training starts with a “stabilization” or “warm-up” phase of one or two epochs, then pruning begins with a low sparsity target that is gradually increased over time. This is called Gradual Magnitude Pruning (GMP).

The decision of how often to prune is also a meta-parameter of these algorithms. Theoretically, we could prune at every weight update, but that seems inefficient and the weights could quickly collapse down to zero. Magnitude pruning is typically done using Gradual Magnitude Pruning (GMP), which prunes at intervals such as once per epoch, rather than one-shot magnitude pruning. Pruning only once ever, at the end of the training phase, is not really training-based pruning, but is effectively post-training pruning.
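
As a rough sketch of one GMP pruning step, here's how a magnitude threshold might be derived from the current sparsity target (the use of std::nth_element and the exact tie-handling are my own illustrative choices):

    // Sketch: prune the smallest-magnitude weights to reach a target sparsity level.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    void prune_to_sparsity(std::vector<float>& weights, float sparsity)  // sparsity in 0..1
    {
        size_t k = (size_t)(sparsity * weights.size());  // number of weights to zero
        if (k == 0) return;
        std::vector<float> mags(weights.size());
        for (size_t i = 0; i < weights.size(); i++)
            mags[i] = std::fabs(weights[i]);
        std::nth_element(mags.begin(), mags.begin() + (k - 1), mags.end());
        float threshold = mags[k - 1];                   // k-th smallest magnitude
        for (float& w : weights)
            if (std::fabs(w) <= threshold)               // ties may prune slightly more than k
                w = 0.0f;                                // magnitude pruning step
    }

In a GMP schedule, the sparsity argument would start low after the warm-up epochs and ramp up towards the final target at each pruning step.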

Magnitude Pruning

Magnitude pruning is zeroing weights that have a small magnitude, which means they have a small numeric absolute value. Equivalently, it is the removal of near-zero weights (positive and negative).

Magnitude pruning is the simplest type of unstructured pruning. In its pure form, any of the weights in the whole model may be pruned, regardless of what structure they are in. This can be combined with structured pruning by limiting to particular structural units of the model.

Magnitude pruning can be performed as part of training, or after training. Post-training magnitude pruning is conceptually similar to quantization, in that a new model with changed weights is created. Post-pruning re-training (fine-tuning) is sometimes required, although it can often be avoided.
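
As a minimal sketch (the function name and threshold value are illustrative, not from any particular library), post-training magnitude pruning over a flat array of weights is just a thresholded pass:

    // Sketch: post-training magnitude pruning with a fixed magnitude threshold.
    #include <cmath>

    void prune_magnitude(float weights[], int n, float threshold)
    {
        for (int i = 0; i < n; i++) {
            if (std::fabs(weights[i]) < threshold) {
                weights[i] = 0.0f;   // zero out the near-zero weight
            }
        }
    }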

Movement Pruning

Another type of unstructured pruning is “movement pruning”. This is a pruning of weights during training, based on how they change during the training process. Weights that are “moving” towards zero as training progresses are removed (pruned), whereas weights “moving away from zero” are retained.

Movement pruning can also have meta-parameters for the sparsity targets or multiple thresholds for the magnitude of values to consider as candidates for pruning, and the rate-of-change thresholds for deciding whether to prune.

Note that this pruning criterion applies to both positive and negative weights (with opposite movement directions). It's also possible to combine a movement threshold and an absolute magnitude threshold, so that important weights with large absolute values are not removed simply because they change a tiny amount towards zero. Overall, this means movement pruning focuses on the changes to weights as they are trained, rather than on their absolute value at the end of training.

First-Order and Second-Order Pruning

Magnitude pruning is the simplest type of unstructured pruning, which only considers the current value of the weight during training. It is called a “first-order” pruning method because it doesn't consider the prior values or how they are changing.

Movement pruning is a generalization of magnitude pruning, which considers the “rate of change” of the weight, as it changes with each training step. This is called a “second-order” pruning method because it is considering the “gradient” or “derivative” in the mathematical sense.

The slope of the curve helps decide whether to prune a weight. For example, if a weight has a magnitude below a threshold and its rate of change is decreasing so that it is moving towards zero, then it is pruned to zero. In this way, tiny weights that are still increasing in magnitude, whether positive or negative, are not yet pruned.
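
As a hedged sketch of that combined decision rule (the threshold parameters and the definition of “movement” as the weight's recent change during training are my own illustrative assumptions):

    // Sketch: prune only if the weight is small AND moving towards zero.
    #include <cmath>

    bool should_prune(float weight, float movement,   // movement = recent change in the weight
                      float mag_threshold, float move_threshold)
    {
        bool is_small = std::fabs(weight) < mag_threshold;
        // Moving towards zero: a positive weight decreasing, or a negative weight increasing.
        bool towards_zero = (weight > 0.0f && movement < -move_threshold)
                         || (weight < 0.0f && movement > move_threshold);
        return is_small && towards_zero;
    }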

Making Magnitude Pruning Effective

Magnitude pruning all sounds wonderful, since we just throw away small weights and this makes the model smaller and inference faster. It's a little bit less accurate, but that's largely fixed by doing re-training after each pruning step.

Let's assume we pruned 20% of our weights down to zero using magnitude pruning. This is called “20% sparsity.” Hence, our model file should be 20% smaller, and our inference engine 20% faster. Let's see if that's true.

How is it smaller? Consider the implementation of model weights in memory. How do we make it smaller with 20% zeros? The answer is to use some type of data representation or data structure that tracks only the non-zero values. Here are some candidates:

  • Bit masks (bit vectors)
  • Data compression (of vectors)
  • Index offsets (permutation arrays)

Bit vectors. One way would be bit vectors that hold bit masks recording which elements are zero or non-zero. These bit vectors need one bit per element, so each one is the vector length in bits (i.e. the vector size divided by 32, in 32-bit words). However, our data still needs to store all those zeros, so this would actually add to the model's memory usage because of the extra bit vectors. So much for compressing the model.
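
For concreteness, here's a sketch of building one such bit mask (one bit per weight); note that it only adds storage on top of the full vector of weights:

    // Sketch: build a bit mask marking which elements of a vector are non-zero.
    #include <cstdint>
    #include <vector>

    std::vector<uint32_t> build_nonzero_mask(const float v[], int n)
    {
        std::vector<uint32_t> mask((n + 31) / 32, 0u);   // one bit per element
        for (int i = 0; i < n; i++) {
            if (v[i] != 0.0f) {
                mask[i / 32] |= (1u << (i % 32));        // set the bit for a non-zero weight
            }
        }
        return mask;   // extra n/32 words of memory, and the zeros are still stored
    }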

Data compression. Alternatively, we could try run-length encoding, Huffman encoding or other data compression algorithms. The sequence of weights would be shorter and smaller. But then we'll need to de-compress this data for vectors that we send to the GPU. The GPUs are vectorized machines and they need vector data. Everything's running so fast that we don't want to make the GPU wait while we decompress the compressed form into a vector. Nor should we really decompress everything before we start inference, because then we haven't made our model any smaller in memory.

Permutation arrays of indices. Another way we could store the data about which weights are zero or non-zero is to store the offset indices of only the non-zero data. However, these indices also use up space in memory. If our model has vectors of length 4096, then we need 12 bits per index offset (i.e. 0..4095), so it's two indices per 32-bit integer, and that's a lot of memory overhead. If we pack them tighter, then it's more computation to unpack them during inference. For pruning at a 20% level, it actually takes more memory to store just the non-zero weights and their indices than to store the original vector.

Nothing works! Everything we tried resulted in increased storage, not less. Overall, we've tried bit vectors, data compression and indices. It's tough to find a way to store our pruned model so that it's smaller in memory, but still makes the data available fast enough for the GPU.

Is it faster? Our 20% pruned model should be 20% faster by avoiding all those zero weights. The basic idea is “zero skipping” in our computation. Where the weights are zero, this multiplication by a zero should be changed to a “no-op” in the engine.

Let's assume that we've solved the above memory problems with the stored model somehow, and we have a vector from our pruned model that is stored as normal, in a contiguous memory array (e.g. maybe we decompressed it to be stored that way).

Zero skipping. The basic idea is to skip zeros by doing a simple arithmetic test, which should be faster than a multiplication. Here's a vector dot product in C++ with our zeros getting skipped.

    float aussie_vecdot_zero_skipping(const float v1[], const float vweights[], int n)
    {
        // Vector dot product with simple zero-skipping test
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            if (vweights[i] != 0.0f) {  // Skip zero weights
                sum += v1[i] * vweights[i];
            }
        }
        return sum;
    }

The idea is that a test for zero should be faster than a multiplication. And note that the zero test (i.e. “!=0.0f”) can be implemented in some much faster ways using bitwise tricks, which will definitely beat a multiplication operation. Hence, hopefully this code can be optimized so that the pruned model works 20% faster overall. This zero skipping code is therefore a win for simple sequential C++ execution on a CPU. Estimated CPU speedup from our 20% pruned model using zero skipping: up to 20%.
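
One such bitwise trick is to test the raw bits of the float directly; this is a sketch assuming IEEE 754 32-bit floats, and it treats both +0.0 and -0.0 as zero, just like the “!=0.0f” comparison:

    // Sketch: bitwise zero test for a 32-bit IEEE 754 float (matches +0.0 and -0.0).
    #include <cstdint>
    #include <cstring>

    inline bool is_float_zero(float f)
    {
        uint32_t u;
        std::memcpy(&u, &f, sizeof(u));   // copy the float's bit pattern (avoids aliasing issues)
        return (u << 1) == 0u;            // shift out the sign bit; all-zero means +/-0.0
    }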

Slugs. In theory this is faster, but is it really? There are still some problems:

  • All memory is still getting accessed.
  • The code is sequential, not parallel.

Memory usage. The first problem is memory. Our zero skipping test still has to load every zero weight from memory, if only to test whether it's zero. It's still loading every element of “vweights” from memory. This undermines our goal of having fewer memory accesses in our compressed model. AI engines are notorious for being memory-bound, and we've done nothing to improve that with zero skipping, even in our theoretically-faster CPU speedup method.

Vectorization. Worse, though, is that this code cannot be easily vectorized. The first point is that, like the CPU algorithm, we still have to send the entire vector, including the zeros, to the GPU. Hence, it's loading the entire vector from memory into the GPU cache.

Secondly, the GPU doesn't know ahead of time which elements are zero. How can we do zero skipping in parallel on the GPU? There are some capabilities we could use, such as an extra parallelized test for zeros, a bit mask, or something similar. There are various options, but these ideas are actually likely to be slower because they add extra vectorized steps. The fastest way is simply to send the entire vector to the normal parallelized dot product computation, using the vectorized multiplication or fused-multiply-accumulate calculations.

Yes, the zeros in the vector will be a little faster than the non-zeros, and light up the silicon circuits a bit less, and create a little less heat, but the whole vectorized operation still has to wait for the other non-zero multiplications to finish, so the speed is simply the normal speed of non-zero multiplications. Estimated GPU speedup from our 20% pruned model: 0%.

In summary, it's a total failure. Our model with 20% pruning and 80% non-zero weights is not very effective. It's not smaller in memory and it's not faster on a GPU. There might be better solutions than the ones I've offered, but overall, it's not a great big win at a 20% pruning level.

High sparsity is required. So, in a roundabout way, we've got to an important point about pruning: high sparsity is required. Sparsity is the percentage of zeros. Even 50% zeros isn't likely to be very helpful with memory space or speedup on a GPU. For model compression and inference speedup, we're talking about needing 80% or 90% sparsity. This means only a small percentage of non-zero weights left in the model. And here's the funny thing:

It still works!

There is so much redundancy built into a typical AI model that it can still function with only 10-20% of the weights set. Amazingly, models can still have reasonable accuracy with so few weights. It's not as good as the original non-pruned model, but it is often still effective. Training while pruning, or fine-tuning afterwards, allows the model to adjust.

With 80-90% sparsity, it's not hard to make it smaller and faster, because there are very few non-zero weights to manage. If we have high sparsity, then the permutation array of indices method of storage is a good option. This small subset of weights can be stored in a “sparse tensor,” which is typically stored as the two parallel arrays of non-zero weights and their permutation indices. We can store it in this indexed tuple format, and also keep it in memory in the same shrunken format, thereby reducing memory.

For example, consider this original vector of 16 elements:

    0 0 0 0 0.117 0 0 0 2.1 0 0 0 3.5 0 0 0

It has three non-zero values out of 16 (18.75%), so its sparsity level is 81.25%. The optimization is to store it using [index,value] tuples, giving this more complicated format, with indices starting at zero:

    [4, 0.117] [8, 2.1] [12, 3.5]
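
Here's a sketch of building that representation as two parallel arrays (indices and non-zero values); the function is illustrative, not a specific sparse-tensor library format:

    // Sketch: convert a dense vector into parallel [index, value] arrays.
    #include <vector>

    void dense_to_sparse(const float v[], int n,
                         std::vector<int>& indices, std::vector<float>& values)
    {
        indices.clear();
        values.clear();
        for (int i = 0; i < n; i++) {
            if (v[i] != 0.0f) {
                indices.push_back(i);     // offset of the non-zero weight
                values.push_back(v[i]);   // the non-zero weight itself
            }
        }
    }

Running this on the 16-element example above gives indices {4, 8, 12} and values {0.117, 2.1, 3.5}.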

In terms of space, this is approximately twice 18.75%, giving 37.5%, although we could reduce this by using fewer bits for the index values. We could also use quantization to reduce the size of the non-zero weight values (e.g. from 32-bit floating-point to 16-bit floating-point), in which case we're doing two model optimizations combined: pruning with quantization.

For speedup, there is hardware support for parallelized operations on indexed tuple arrays (or lookup tables), so we can multiply permutations of two sparse vectors in parallel quickly. This is faster because the total number of non-zero multiplications is also much less (e.g. 18.75% here). Hence, we get faster inference because the dot product of two “sparse vectors” can be computed fast, using a fraction of the calculations.
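
Even in plain sequential C++, the sparse-times-dense dot product only touches the non-zero entries, as in this sketch (a real engine would use the hardware's vectorized gather and indexed-multiply support instead):

    // Sketch: dot product of a sparse vector (index/value arrays) with a dense vector.
    float sparse_vecdot(const int indices[], const float values[], int num_nonzero,
                        const float vdense[])
    {
        float sum = 0.0f;
        for (int i = 0; i < num_nonzero; i++) {
            sum += values[i] * vdense[indices[i]];   // only the non-zero weights are multiplied
        }
        return sum;
    }

For the example vector above, that's 3 multiply-adds instead of 16.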

Overall, magnitude pruning done to the point of high sparsity means we get the promise of model compression and faster inference. It's smaller when stored using the permutation index method, and it's faster because there is hardware parallelization support for indexed array calculations.

Static vs Dynamic Pruning

Another top-level classification involves whether pruning is done during training, after training, or during inference.

  • Static pruning. Changing the model during or after training, but not during inference.
  • Dynamic pruning. Inference decisions that effectively “skip” parts of the network for a given query.

If we wanted consistency between the pruning and quantization research literature, we really should call these Pruning-Aware Training (PAT) and Post-Training Pruning (PTP), but hey, who needs consistency? These terms are rarely used in pruning research papers, so I'm leading you astray.

Static pruning creates a fixed model, with lots of zeros, that doesn't change during inference. The model should be smaller in memory, and run faster.

Dynamic pruning doesn't really cut anything out of the model data structure, but “prunes” parts of the model at runtime by ignoring some of the weights. It effectively removes some weights by selectively skipping small weights or whole structures, depending on the user's input query. The weights are not permanently zeroed in dynamic pruning, and may be used again for the next user's query, which might skip different parts of the model. Hence, dynamic pruning helps runtime efficiency with fewer computations and fewer memory-to-cache data transfers, but does not reduce the size of a model in memory.

Note that static pruning can be unstructured or structured, and most techniques could be used. More specifically, unstructured pruning is usually done during training, but structured pruning could be done during training or after training, with or without additional fine-tuning. For example, static pruning could involve pre-inference removal of random weights (unstructured magnitude pruning) or of entire layers (structured layer pruning).

However, dynamic pruning doesn't make sense for unstructured pruning, because the model weights and other parameters are static during inference. I guess it's theoretically possible to do “dynamic unstructured pruning” by turning some of the weights on and off, such as by setting a magnitude threshold below which we don't use the weights in vector dot products, or using some bit vectors to mask different sets of weights, but I don't think I've seen a paper on it.

Dynamic pruning during inference always refers more to structured pruning, where we avoid processing an entire structure, such as a layer or an attention head. Dynamic layer pruning is where whole model layers are skipped dynamically, which is usually called “early exit” or “layer skipping” or “depth pruning” in the papers. Also possible is dynamic pruning along the width dimension, such as where whole attention heads are skipped, known as “attention head pruning” or “width pruning” in the literature.
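
As a rough sketch of the early-exit form of dynamic layer pruning (the Layer type and the confidence measure here are hypothetical placeholders, not any specific engine's API):

    // Sketch: dynamic layer pruning ("early exit") during inference.
    #include <vector>

    struct Layer {
        // Placeholder for a real Transformer layer's computation.
        void forward(std::vector<float>& activations) const { /* ... */ }
    };

    // Illustrative confidence measure: the largest activation value.
    float confidence(const std::vector<float>& activations)
    {
        float best = 0.0f;
        for (float a : activations)
            if (a > best) best = a;
        return best;
    }

    void run_with_early_exit(const std::vector<Layer>& layers,
                             std::vector<float>& activations, float exit_threshold)
    {
        for (const Layer& layer : layers) {
            layer.forward(activations);
            if (confidence(activations) >= exit_threshold) {
                break;   // dynamically "prune" the remaining layers for this query
            }
        }
    }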

 

