Aussie AI Blog

Weight Clustering Needs a Refresh

  • October 26, 2024
  • by David Spuler, Ph.D.

Weight Clustering Has Disappeared?

I was thinking the other day about how little research I'd seen on "weight clustering" recently. It seems to have all but disappeared from the lexicon, replaced by integer quantization (e.g., INT4, INT8, etc.).

But why? It has some advantages over basic integer quantization.

What is Weight Clustering?

Weight clustering, also sometimes called "cluster-based quantization," is similar to quantization, but not the same. Let's use 4-bit quantization as an example.

In 4-bit quantization (INT4), every weight is mapped to an integer 0..15 (i.e., 16 values), and the arithmetic is done in 4-bit integer kernels. Very small, very fast.
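
To make the comparison concrete, here is a minimal C++ sketch of symmetric 4-bit quantization. The per-tensor scale and simple rounding here are simplifying assumptions; real INT4 kernels use calibration, per-channel scales, and integer arithmetic throughout.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // Minimal sketch: map FP32 weights to 4-bit integers 0..15 with one
    // per-tensor scale (an assumption; production kernels are fancier).
    void quantize_int4(const float* w, uint8_t* q, int n, float& scale) {
        float maxabs = 0.0f;
        for (int i = 0; i < n; i++) maxabs = std::max(maxabs, std::fabs(w[i]));
        scale = (maxabs > 0.0f) ? (maxabs / 7.0f) : 1.0f;  // signed range -8..7
        for (int i = 0; i < n; i++) {
            int v = (int)std::round(w[i] / scale);       // nearest integer in -8..7
            q[i] = (uint8_t)(std::clamp(v, -8, 7) + 8);  // shift to 0..15 for storage
        }
    }

    // Reconstruct an approximate FP32 weight: w = (q - 8) * scale
    float dequantize_int4(uint8_t q, float scale) {
        return ((int)q - 8) * scale;
    }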

Weight clustering with 4 bits also maps all of the weights onto 16 different values, but not the numbers 0..15. Instead, the 16 values can be any 16 distinct weight values, stored in a lookup table (LUT) that maps each index 0..15 to its actual weight. Hence, the weights are stored in 4 bits, just like INT4 quantization, but the arithmetic is:

  • Get the stored value 0..15 (it's an index, not really a weight),
  • Look up the actual weight in the LUT,
  • Use this weight value in the arithmetic.

Note that the "weight value" at the end can be any precision you wish. It could even be FP32 for greater accuracy than integer quantization.
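
Here is a minimal C++ sketch of those three steps inside a dot product, assuming a 16-entry FP32 LUT and (for clarity) one 4-bit index stored per byte; a real kernel would pack two indices per byte and vectorize the loop.

    #include <cstddef>
    #include <cstdint>

    // Dot product with 4-bit weight clustering: 'idx' holds a cluster index
    // (0..15) per weight, and 'lut' maps each index to its actual FP32 weight.
    float clustered_dot_product(const uint8_t* idx, const float lut[16],
                                const float* x, size_t n) {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i++) {
            float w = lut[idx[i] & 0x0F];  // steps 1-2: get the index, look up the weight
            sum += w * x[i];               // step 3: use the weight in the arithmetic
        }
        return sum;
    }

The LUT here is only 64 bytes, so it easily stays cache-resident; the cost is the extra indirection inside the inner loop, which comes up again in the cons below.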

Weight clustering is a "model compression" technique, just like INT4 quantization. It will reduce the size of the whole LLM by compressing FP32 weights into 4 bits each, plus a small extra table of the permuted weight values. Hence, the space usage of INT4 quantization and INT4 weight clustering are effectively identical.
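
As a rough worked example with hypothetical numbers: a 7B-parameter model stored in FP32 is about 28GB of weights, whereas at 4 bits per weight it is about 3.5GB, and a 16-entry FP32 LUT adds only 64 bytes (or a few kilobytes in total if there is one LUT per layer). Either way, the LUT overhead is negligible.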

Pros and Cons of Weight Clustering

Continuing our examination of 4-bit quantization versus 4-bit weight clustering, the pros of weight clustering include:

  • Higher-precision arithmetic can be used (e.g., FP32 if you like), because the precision of the weights in the LUT is not tied to the bit-width of the index value (the permutation index). Hence, there is more flexibility in the trade-off between space utilization and accuracy.
  • More accurate distribution of weights, because the LUT values can follow the actual weight distribution (whereas quantization to a few bits is almost forced to be uniform); see the sketch after this list.
  • Granular weight values for different sub-parts of the model are possible.
  • Memory usage is only a tiny bit larger than INT4 quantization (i.e., the permutation mapping LUT).
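
On the second point, the usual way to choose the 16 LUT values is to cluster the weights themselves, e.g. with a simple 1-D k-means, so that the centroids follow the actual weight distribution. Here is a minimal C++ sketch, assuming a fixed iteration count and a naive uniform initialization (production code would do better on both):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Build a 16-entry LUT by 1-D k-means over the weights, and assign each
    // weight the index of its nearest centroid.
    void cluster_weights(const float* w, size_t n,
                         float lut[16], std::vector<uint8_t>& idx,
                         int iters = 20) {
        float lo = *std::min_element(w, w + n);
        float hi = *std::max_element(w, w + n);
        for (int k = 0; k < 16; k++)  // naive init: spread centroids over [lo, hi]
            lut[k] = lo + (hi - lo) * (k + 0.5f) / 16.0f;

        idx.resize(n);
        for (int it = 0; it < iters; it++) {
            // Assignment step: nearest centroid for each weight.
            for (size_t i = 0; i < n; i++) {
                int best = 0;
                float bestd = std::fabs(w[i] - lut[0]);
                for (int k = 1; k < 16; k++) {
                    float d = std::fabs(w[i] - lut[k]);
                    if (d < bestd) { bestd = d; best = k; }
                }
                idx[i] = (uint8_t)best;
            }
            // Update step: move each centroid to the mean of its assigned weights.
            double sum[16] = {0.0};
            size_t cnt[16] = {0};
            for (size_t i = 0; i < n; i++) { sum[idx[i]] += w[i]; cnt[idx[i]]++; }
            for (int k = 0; k < 16; k++)
                if (cnt[k] > 0) lut[k] = (float)(sum[k] / cnt[k]);
        }
    }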

The cons of weight clustering, where INT4 quantization is better, include:

  • Higher-precision arithmetic (it can be either a con or a pro!).
  • Extra step in arithmetic (first a permutation index LUT lookup, and then the dot product or multiplicative arithmetic).
  • Non-contiguous accesses to weight values via the LUT lead to inefficient memory access patterns. On the other hand, a small-to-medium sized LUT could be preloaded into GPU registers, so this issue may be solvable.

It's really that last point that has dropped weight clustering off the map. The efficiency is too low, and needs to be improved. Can it be?

Advanced Weight Clustering Research

I feel like the research in this area is unfinished. Here are some topics that need more work:

  • Efficient parallelized kernel implementations using permutation lookups.
  • Different data types for the actual weight value (e.g., FP32, FP16, INT32, INT16, etc.)
  • Comparison of weight clustering with non-uniform quantization methods. Is it similar?
  • Accuracy comparisons of weight clustering versus integer quantization with the same bit counts.
  • Granular weight clustering. This is clustering with different LUTs, containing different weights, used at varying levels (e.g., layer-level weight clustering, block-level clustering, etc.), as sketched after this list. There is a lot of similarity here with granular quantization methods, such as layer quantization, block quantization, etc.
  • Mixed-precision weight clustering. Just as with mixed-precision quantization, clustering could use a different number of bits, and a different-sized LUT, at various levels of granularity.
  • KV cache weight clustering. Does weight clustering offer any advantages when used for KV cache data, instead of KV cache quantization?
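
For the granular case, here is a hypothetical data layout (a sketch only, not from any particular framework) where each layer carries its own LUT, so the centroids can track that layer's weight distribution:

    #include <cstdint>
    #include <vector>

    // Hypothetical per-layer layout for granular 4-bit weight clustering.
    struct ClusteredLayer {
        float lut[16];             // this layer's centroids (could be FP16 instead)
        std::vector<uint8_t> idx;  // one 4-bit index per weight (one byte each here)
    };

    struct ClusteredModel {
        std::vector<ClusteredLayer> layers;  // layer-level granularity; block-level
                                             // clustering would use a finer subdivision
    };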

There are a few extra super-advanced wrinkles to consider in advanced usage of weight clustering, such as using hashing instead of permutation lookups to map the weights. But we'll leave that for another day.
