Making Magnitude Pruning Effective

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Magnitude pruning all sounds wonderful, since we just throw away small weights and this makes the model smaller and inference faster. It's a little bit less accurate, but that's largely fixed by doing re-training after each pruning step.

Let's assume we pruned 20% of our weights down to zero using magnitude pruning. This is called “20% sparsity.” Hence, our model file should be 20% smaller, and our inference engine 20% faster. Let's see if that's true.

How is it smaller? Consider the implementation of model weights in memory. How do we make it smaller with 20% zeros? The answer is some type of data representation or data structure that tracks only the non-zero values. Here are some candidates:

  • Bit masks (bit vectors)
  • Data compression (of vectors)
  • Index offsets (permutation arrays)

Bit vectors. One way would be bit vectors holding bit masks of which elements are zero or non-zero. Each bit vector needs one bit per element, so its size is the vector length in bits (i.e. the vector size divided by 32, measured in 32-bit words). However, our data still needs to store all those zeros, so the extra bit vectors would actually add to the model's memory usage. So much for compressing the model.
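
As a rough sketch of what such a bit mask might look like (the function name and layout here are my own illustration, not a fixed format), we could pack one bit per weight into 32-bit words:

    #include <cstdint>
    #include <vector>

    // Sketch only: build a bit mask marking the non-zero elements of a
    // weight vector, one bit per element, packed into 32-bit words.
    std::vector<uint32_t> build_nonzero_bitmask(const float vweights[], int n)
    {
        std::vector<uint32_t> mask((n + 31) / 32, 0u);
        for (int i = 0; i < n; i++) {
            if (vweights[i] != 0.0f) {
                mask[i / 32] |= (1u << (i % 32));  // Set bit i for a non-zero weight
            }
        }
        return mask;
    }

For a 4096-element vector this adds 128 extra 32-bit words on top of the 4096 floats that are all still stored, which is why the bit mask only increases memory usage.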

Data compression. Alternatively, we could try run-length encoding, Huffman encoding, or other data compression algorithms. The sequence of weights would be shorter and smaller. But then we'll need to decompress this data for the vectors that we send to the GPU. GPUs are vectorized machines and they need vector data. Everything's running so fast that we don't want to make the GPU wait while we decompress the compressed form into a vector. Nor should we really decompress everything before we start inference, because then we haven't made our model any smaller in memory.
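
For illustration only, here is a minimal run-length style encoding of the zero runs (a toy format of my own, not a production compression scheme); note that the full vector still has to be rebuilt by the decoder before a GPU can use it:

    #include <vector>

    // Sketch only: run-length encode the zeros in a weight vector.
    // Each entry stores a count of zeros followed by one non-zero value;
    // a trailing run of zeros is stored with value 0.0f as a sentinel.
    struct ZeroRun {
        int zeros;     // Number of consecutive zeros before 'value'
        float value;   // The non-zero value that follows (0.0f if none)
    };

    std::vector<ZeroRun> rle_encode_zeros(const float v[], int n)
    {
        std::vector<ZeroRun> out;
        int zeros = 0;
        for (int i = 0; i < n; i++) {
            if (v[i] == 0.0f) {
                zeros++;
            } else {
                out.push_back({ zeros, v[i] });
                zeros = 0;
            }
        }
        if (zeros > 0) out.push_back({ zeros, 0.0f });  // Trailing zeros
        return out;
    }

    void rle_decode_zeros(const std::vector<ZeroRun>& enc, float v[], int n)
    {
        int i = 0;
        for (const ZeroRun& r : enc) {
            for (int z = 0; z < r.zeros && i < n; z++) v[i++] = 0.0f;
            if (r.value != 0.0f && i < n) v[i++] = r.value;
        }
        while (i < n) v[i++] = 0.0f;  // Pad any remainder with zeros
    }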

Permutation arrays of indices. Another way we could store the data about which weights are zero or non-zero is to store the offset indices of only the non-zero data. However, these indices also use up space in memory. If our model has vectors of length 4096, then we need 12 bits per index offset (i.e. 0..4095), so it's two indices per 32-bit integer, and that's a lot of memory overhead. If we pack them tighter, then it's more computation to unpack them during inference. For pruning at a 20% level, it actually takes more memory to store just the non-zero weights and their indices.
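
A quick back-of-envelope check in code shows the problem at 20% sparsity; this sketch assumes the simpler case of 16-bit indices rather than packed 12-bit offsets:

    #include <cstdint>
    #include <cstdio>

    // Sketch only: compare memory for a dense 4096-float vector versus
    // storing just the non-zero weights plus 16-bit index offsets.
    int main()
    {
        const int n = 4096;
        const double sparsity = 0.20;                   // 20% of weights are zero
        const int nonzeros = (int)(n * (1.0 - sparsity));  // ~3,276 non-zero weights
        size_t dense_bytes = n * sizeof(float);                              // 16,384 bytes
        size_t sparse_bytes = nonzeros * (sizeof(float) + sizeof(uint16_t)); // ~19,656 bytes
        printf("Dense: %zu bytes, index-offset format: %zu bytes\n",
               dense_bytes, sparse_bytes);
        return 0;
    }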

Nothing works! Everything we tried resulted in more storage, not less. Overall, we've tried bit vectors, data compression, and indices. It's tough to find a way to store our pruned model so that it's smaller in memory, but still makes the data available fast enough for the GPU.

Is it faster? Our 20% pruned model should be 20% faster by avoiding all those zero weights. The basic idea is “zero skipping” in our computation. Wherever a weight is zero, the multiplication by that zero should become a “no-op” in the engine.

Let's assume that we've solved the above memory problems with the stored model somehow, and we have a vector from our pruned model that is stored as normal, in a contiguous memory array (e.g. maybe we decompressed it to be stored that way).

Zero skipping. The basic idea is to skip zeros by doing a simple arithmetic test, which should be faster than a multiplication. Here's a vector dot product in C++ with our zeros getting skipped.

    float aussie_vecdot_zero_skipping(const float v1[], const float vweights[], int n)
    {
        // Vector dot product with simple zero skipping test
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            if (vweights[i] != 0.0f) {  // Skip zero weights
                sum += v1[i] * vweights[i];
            }
        }
        return sum;
    }

The idea is that a test for zero should be faster than a multiplication. And note that the zero test (i.e. “!=0.0f”) can be implemented in much faster ways using bitwise tricks, which can beat a multiplication operation. Hence, hopefully this code can be optimized so that the pruned model runs 20% faster overall. This zero skipping code is therefore a win for simple sequential C++ execution on a CPU. Estimated CPU speedup from our 20% pruned model using zero skipping: up to 20%.
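
As an example of the kind of bitwise trick meant here (a sketch only; whether it actually beats a hardware floating-point compare depends on the compiler and CPU), an IEEE 754 float is zero exactly when all bits other than the sign bit are clear:

    #include <cstdint>
    #include <cstring>

    // Sketch only: test a float for zero using its bit pattern.
    // Catches both +0.0f and -0.0f without a floating-point comparison.
    inline bool is_float_zero_bits(float f)
    {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));  // Type-pun without undefined behavior
        return (bits & 0x7FFFFFFFu) == 0;      // Mask off the sign bit
    }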

Slugs. In theory this is faster, but is it really? There are still some problems:

  • All memory is still getting accessed.
  • The code is sequential, not parallel.

Memory usage. The first problem is memory. Our zero skipping test still has to load the zero weights from memory, if only to test whether each one is zero. It still reads every element of “vweights” from memory. This undermines our goal of having fewer memory accesses in our compressed model. AI engines are notorious for being memory-bound, and we've done nothing to improve that with zero skipping, even in our theoretically-faster CPU method.

Vectorization. Worse, though, is that this code cannot be easily vectorized. The first point is that, like the CPU algorithm, we still have to send the entire vector, including the zeros, to the GPU. Hence, it's loading the entire vector from memory into the GPU cache.

Secondly, the GPU doesn't know ahead of time which elements are zero. How can we do zero skipping in parallel on the GPU? There are some options, such as an extra parallelized test for zeros, or a bit mask, or something similar, but these are actually likely to be slower because they add extra vectorized steps. The fastest way is simply to send the entire vector to the normal parallelized dot product computation, using vectorized multiplication or fused multiply-accumulate calculations.
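
To see why, here is a sketch using x86 AVX intrinsics as a stand-in for GPU-style SIMD (my own illustration, not GPU code): even with an explicit zero mask, every lane is still loaded and multiplied, so the mask is extra work rather than a saving.

    #include <immintrin.h>   // AVX intrinsics (x86)

    // Sketch only: a SIMD dot product with an explicit "skip zeros" mask.
    // All 8 lanes are still loaded and multiplied on every iteration; the
    // mask of non-zero weights just adds an extra comparison step.
    float aussie_vecdot_avx_masked(const float v1[], const float vweights[], int n)
    {
        __m256 sums = _mm256_setzero_ps();
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 a = _mm256_loadu_ps(&v1[i]);         // Load all 8 lanes
            __m256 w = _mm256_loadu_ps(&vweights[i]);   // Zeros are loaded too
            __m256 mask = _mm256_cmp_ps(w, _mm256_setzero_ps(), _CMP_NEQ_OQ);
            __m256 prod = _mm256_and_ps(_mm256_mul_ps(a, w), mask);
            sums = _mm256_add_ps(sums, prod);
        }
        float partial[8];
        _mm256_storeu_ps(partial, sums);
        float sum = partial[0] + partial[1] + partial[2] + partial[3]
                  + partial[4] + partial[5] + partial[6] + partial[7];
        for (; i < n; i++) sum += v1[i] * vweights[i];  // Leftover elements
        return sum;
    }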

Yes, the zeros in the vector will be a little faster than the non-zeros, and light up the silicon circuits a bit less, and create a little less heat, but the whole vectorized operation still has to wait for the other non-zero multiplications to finish, so the speed is simply the normal speed of non-zero multiplications. Estimated GPU speedup from our 20% pruned model: 0%.

In summary, it's a total failure. Our model with 20% pruning and 80% non-zero weights is not very effective. It's not smaller in memory and it's not faster on a GPU. There might be better solutions than the ones I've offered, but overall, it's not a great big win at a 20% pruning level.

High sparsity is required. So, in a roundabout way, we've got to an important point about pruning: high sparsity is required. Sparsity is the percentage of zeros. Even 50% zeros isn't likely to be very helpful with memory space or speedup on a GPU. For model compression and inference speedup, we're talking about needing 80% or 90% sparsity. This means only a small percentage of non-zero weights remains in the model. And here's the funny thing:

It still works!

There is so much redundancy built into a typical AI model that it can still function with only 10-20% of the weights set. Amazingly, models can still have reasonable accuracy with so few weights. It's not as good as the original non-pruned model, but it is often still effective. Training while pruning, or fine-tuning afterwards, allows the model to adjust.

With 80-90% sparsity, it's not hard to make the model smaller and faster, because there are very few non-zero weights to manage. If we have high sparsity, then the permutation array of indices method of storage is a good option. This small subset of weights can be stored in a “sparse tensor,” which is typically stored as two parallel arrays of the non-zero weights and their permutation indices. We can store it in this indexed tuple format, and also keep it in memory in the same shrunken format, thereby reducing memory usage.

For example, consider this original vector of 16 elements:

    0 0 0 0 0.117 0 0 0 2.1 0 0 0 3.5 0 0 0

It has three non-zero values out of 16 (18.75%), so its sparsity level is 81.25%. The optimization is to store it using [index,value] tuples, giving this more complicated format, with indices starting at zero:

    [4, 0.117] [8, 2.1] [12, 3.5]

In terms of space, this is approximately twice 18.75%, giving 37.5%, although we could reduce this by using fewer bits for the index values. We could also use quantization to reduce the size of the non-zero weight values (e.g. from 32-bit floating-point to 16-bit floating-point), in which case we're doing two model optimizations combined: pruning with quantization.
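
Here's a minimal sketch of that indexed tuple storage in C++ (the struct layout and names are my own choice, just for illustration):

    #include <vector>

    // Sketch only: a sparse vector stored as two parallel arrays
    // (indices of the non-zero elements, and their values).
    struct SparseVector {
        std::vector<int> index;     // e.g. { 4, 8, 12 } for the example above
        std::vector<float> value;   // e.g. { 0.117f, 2.1f, 3.5f }
    };

    SparseVector make_sparse(const float v[], int n)
    {
        SparseVector s;
        for (int i = 0; i < n; i++) {
            if (v[i] != 0.0f) {
                s.index.push_back(i);
                s.value.push_back(v[i]);
            }
        }
        return s;
    }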

For speedup, there is hardware support for parallelized operations on indexed tuple arrays (or lookup tables), so we can multiply the matching elements of two sparse vectors quickly in parallel. This is faster because the total number of non-zero multiplications is also much lower (e.g. only 18.75% of the original here). Hence, we get faster inference because the dot product of two “sparse vectors” can be computed fast, using a fraction of the calculations.
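
As a scalar sketch of a dot product over that index-and-value format (real engines would use vectorized gather instructions or dedicated sparse kernels; the function here is my own illustration):

    // Sketch only: dot product of a dense activation vector with a sparse
    // weight vector stored as parallel arrays of indices and non-zero values.
    // Only the non-zero weights (e.g. 18.75% of them here) get multiplied.
    float sparse_vecdot(const float v1[],    // Dense vector
                        const int idx[],     // Indices of the non-zero weights
                        const float wval[],  // The non-zero weight values
                        int num_nonzero)
    {
        float sum = 0.0f;
        for (int k = 0; k < num_nonzero; k++) {
            sum += v1[idx[k]] * wval[k];   // Gather-style indexed access
        }
        return sum;
    }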

Overall, magnitude pruning done to the point of high sparsity means we get the promise of model compression and faster inference. It's smaller when stored using the permutation index method, and it's faster because there is hardware parallelization support for indexed array calculations.

 
