Accuracy-Degrading Optimizations

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The question is not only about speed, but about how to optimize an AI model while still having it be just as smart. You want to have your cake and eat it, too, I guess. But first let's look at some of the cake-eating optimizations that degrade the accuracy of the model.

Many types of AI optimization techniques result in a less capable model, trading off speed (latency and throughput) against accuracy (e.g., perplexity). Here are some of the optimizations that will result in lower accuracy:

  • Model compression — quantization, pruning, distillation, weight sharing, weight clustering, etc. (a minimal quantization sketch follows this list).
  • Sparsity or low-rank matrix optimizations — LoRA/QLoRA.
  • Adaptive inference — early exit, dynamic pruning, layer fusion, etc.
  • Semantic caching — nearest-neighbor query caching with a vector database.
  • Approximations — of activation functions, matrix multiplication, arithmetic multiplication, etc.
  • Decoding algorithm optimizations — streamlined or approximate decoding algorithms.
  • Linearized attention algorithms — e.g., local attention.
  • Ensemble multi-model methods — mixture-of-experts, big-little, speculative decoding, cascades, wisdom of committees, swarms.
  • Stochastic optimization algorithms — intentional randomness.
  • Zero-multiplication models — e.g. bitshift, max-plus, or adder models; research-only.
  • Integer-only arithmetic models — usually quantization-related; research-only.
  • Advanced number system models — various obscure types; research-only.
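
For a concrete illustration of the first item, here is a minimal C++ sketch of symmetric per-tensor INT8 quantization (the names QuantizedTensor and quantize_int8 are illustrative, not from any library). The rounding step is where the accuracy loss comes from: each dequantized weight can differ from the original by up to half a quantization step.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Symmetric per-tensor INT8 quantization (illustrative sketch).
    struct QuantizedTensor {
        std::vector<int8_t> q;  // quantized weights in [-127, 127]
        float scale;            // multiply by this to dequantize
    };

    QuantizedTensor quantize_int8(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        QuantizedTensor t;
        t.scale = (maxabs == 0.0f) ? 1.0f : maxabs / 127.0f;
        t.q.reserve(w.size());
        for (float x : w) {
            // Rounding is the lossy step: the dequantized weight
            // differs from the original by up to scale/2.
            t.q.push_back(static_cast<int8_t>(std::lround(x / t.scale)));
        }
        return t;
    }

    inline float dequantize_int8(int8_t q, float scale) {
        return q * scale;  // approximate reconstruction of the weight
    }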

The level of accuracy loss varies greatly amongst these methods. For example, 4-bit integer quantization degrades accuracy far more than FP16 quantization with 16-bit “half precision” floating-point values, which stays very close in accuracy to the default “single precision” 32-bit float numbers. Where these methods are approximate, a lot depends on the degree of precision involved in the approximation. For example, if you change the expf function in Softmax to use a 24-bit LUT, the precision is still high and the error is only in the lower-order digits (i.e., tiny fractional errors), so the loss of model accuracy might be negligible.
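
As a concrete sketch of that LUT idea (illustrative code, not the book's implementation), one way to approximate expf in C++ is to precompute a table indexed by the top 24 bits of the IEEE 754 float bit pattern, so that only the low 8 mantissa bits are discarded. Note that the table holds 2^24 floats, roughly 64MB.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Table of exp values indexed by the top 24 bits of a float.
    static std::vector<float> g_exp_table;

    void precompute_exp_table() {
        const uint32_t n = 1u << 24;
        g_exp_table.resize(n);
        for (uint32_t i = 0; i < n; ++i) {
            uint32_t bits = i << 8;  // low 8 mantissa bits are zero
            float x;
            std::memcpy(&x, &bits, sizeof x);
            g_exp_table[i] = std::exp(x);  // float overload of std::exp
        }
    }

    inline float expf_lut24(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof bits);
        // Dropping the low 8 mantissa bits changes x only in its
        // lowest-order digits, so the result has a tiny fractional error.
        return g_exp_table[bits >> 8];
    }

    // Softmax using the approximated exponential.
    void softmax_approx(const float* logits, float* out, int n) {
        float mx = logits[0];
        for (int i = 1; i < n; ++i) mx = std::max(mx, logits[i]);
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            out[i] = expf_lut24(logits[i] - mx);  // subtract max for stability
            sum += out[i];
        }
        for (int i = 0; i < n; ++i) out[i] /= sum;
    }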

 
