Aussie AI
Accuracy-Degrading Optimizations
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
The question is not only how to make an AI model faster, but how to optimize it while keeping it just as smart. You want to have your cake and eat it too, I guess. But first, let's look at some of the cake-eating optimizations that degrade the accuracy of the model.
Many types of AI optimization result in a less capable model, trading accuracy (perplexity) for speed (latency or throughput). Here are some of the optimizations that can lower accuracy:
- Model compression — quantization, pruning, distillation, weight sharing, weight clustering, etc.
- Sparsity or low-rank matrix optimizations — LoRA/QLoRA.
- Adaptive inference — early exit, dynamic pruning, layer fusion, etc. (see the early-exit sketch after this list).
- Semantic caching — nearest-neighbor query caching with a vector database.
- Approximations — of activation functions, matrix multiplication, arithmetic multiplication, etc.
- Decoding algorithm optimizations — streamlined with approximation.
- Linearized attention algorithms — local attention, sparse attention, and other attention approximations.
- Ensemble multi-model methods — mixture-of-experts, big-little, speculative decoding, cascades, wisdom of committees, swarms.
- Stochastic optimization algorithms — intentional randomness.
- Zero-multiplication models — e.g. bitshift, max-plus, or adder models; research-only.
- Integer-only arithmetic models — usually quantization-related; research-only.
- Advanced number system models — various obscure types; research-only.
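To make adaptive inference concrete, here is a minimal C++ sketch of early exit, assuming a simple layer loop that stops once an intermediate result looks confident enough. The Layer type, the max-softmax confidence proxy, and the 0.9 threshold are all illustrative placeholders, not code from the book.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Placeholder layer: a real transformer layer would go here.
    struct Layer {
        void forward(std::vector<float>& h) const {
            for (float& x : h) x = std::tanh(x);  // stand-in computation
        }
    };

    // Confidence proxy: the largest softmax probability of the hidden state.
    float exit_confidence(const std::vector<float>& h) {
        float mx = *std::max_element(h.begin(), h.end());
        float sum = 0.0f;
        for (float x : h) sum += std::exp(x - mx);
        return 1.0f / sum;  // exp(mx - mx) / sum is the top probability
    }

    std::vector<float> run_with_early_exit(const std::vector<Layer>& layers,
                                           std::vector<float> hidden,
                                           float threshold = 0.9f) {
        for (const Layer& layer : layers) {
            layer.forward(hidden);
            // Stopping here saves the remaining layers' compute,
            // at the risk of a less accurate final prediction.
            if (exit_confidence(hidden) >= threshold) break;
        }
        return hidden;
    }

The accuracy trade-off is controlled entirely by the threshold: a lower threshold exits earlier and runs faster, but accepts less confident, and potentially wrong, predictions.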
The level of accuracy loss varies greatly among these methods. For example, 4-bit integer quantization loses considerably more accuracy than quantization to FP16, the 16-bit “half precision” floating-point type, which in turn stays very close in accuracy to the default “single precision” 32-bit float.
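To see where quantization loses accuracy, here is a minimal C++ sketch of symmetric 8-bit integer quantization; the rounding step is the entire source of the error, and it grows as the bit width shrinks (4-bit has only 16 levels versus 256 for 8-bit). The names and the simple absmax scaling scheme are illustrative assumptions, not a specific library's API.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct QuantizedWeights {
        std::vector<int8_t> q;  // quantized weights
        float scale;            // multiply by this to dequantize
    };

    // Symmetric quantization: map [-maxabs, +maxabs] onto [-127, +127].
    QuantizedWeights quantize_int8(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        QuantizedWeights out;
        out.scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;
        out.q.reserve(w.size());
        for (float x : w) {
            long q = std::lround(x / out.scale);  // rounding = the accuracy loss
            out.q.push_back(static_cast<int8_t>(std::clamp(q, -127L, 127L)));
        }
        return out;
    }

    float dequantize(const QuantizedWeights& qw, size_t i) {
        return qw.q[i] * qw.scale;  // recovered value differs by rounding error
    }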
Where these methods are approximate, much depends on the degree of precision in the approximation. For example, if you change the expf function in Softmax to use a 24-bit lookup table (LUT), the precision is still high and the error is only in the low-order digits (i.e., tiny fractional errors), so the loss of model accuracy may be negligible.
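Here is a minimal C++ sketch of that kind of table-based expf, assuming the table is indexed by the top 24 bits of the float's IEEE 754 bit pattern, so only the bottom 8 mantissa bits of the argument are discarded. The function names and the 64MB table size are illustrative choices.

    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    static std::vector<float> g_exp_lut;  // 2^24 entries (~64MB of floats)

    void init_exp_lut() {
        const size_t n = size_t(1) << 24;
        g_exp_lut.resize(n);
        for (size_t i = 0; i < n; ++i) {
            uint32_t bits = static_cast<uint32_t>(i) << 8;  // restore bit position
            float x;
            std::memcpy(&x, &bits, sizeof(x));  // reinterpret bits as float
            g_exp_lut[i] = expf(x);
        }
    }

    inline float expf_lut(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));
        return g_exp_lut[bits >> 8];  // index by the top 24 bits
    }

    // Softmax using the LUT exponential (with max-subtraction for stability).
    void softmax_lut(float* v, int n) {
        float mx = v[0];
        for (int i = 1; i < n; ++i) if (v[i] > mx) mx = v[i];
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) { v[i] = expf_lut(v[i] - mx); sum += v[i]; }
        for (int i = 0; i < n; ++i) v[i] /= sum;
    }

Dropping the bottom 8 mantissa bits perturbs the argument by a relative error of roughly one part in 2^16, so for the modest argument magnitudes seen in Softmax the table output agrees with expf to within tiny fractional errors, consistent with the negligible accuracy loss described above.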
The new AI programming book by Aussie AI co-founders:
Get your copy from Amazon: Generative AI in C++