Aussie AI

Hybrid AI Engine Optimizations

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


Hybrid optimization approaches combine two or more optimization techniques. Many of the optimization techniques can be combined freely because they don't interfere with each other; in research-speak, we say that they are “orthogonal” to each other.

Common hybrid approaches use the two major techniques of quantization and pruning together. You can use either unstructured weight pruning (magnitude pruning) or structural pruning, and then quantize the model. Or you can quantize the model first, and then do dynamic pruning at runtime (e.g. early exit of layers). Another hybrid optimization is to use distillation to train a smaller student model from a bigger teacher model, and then apply quantization and/or pruning to the resulting distilled model.
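A minimal sketch of the first hybrid, prune-then-quantize, might look like the code below. This is hypothetical standalone code, not the engine's actual routines: it zeroes small-magnitude weights (unstructured magnitude pruning), then applies simple symmetric int8 quantization to the pruned weights.

```cpp
// Sketch: unstructured magnitude pruning followed by int8 quantization.
// Hypothetical example code; function names and thresholds are illustrative.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Zero out any weight whose magnitude falls below the threshold.
std::vector<float> magnitude_prune(std::vector<float> w, float threshold) {
    for (float& x : w)
        if (std::fabs(x) < threshold) x = 0.0f;
    return w;
}

// Symmetric linear quantization to int8: q = round(w / scale),
// where scale = max|w| / 127 so the largest weight maps to +/-127.
std::vector<int8_t> quantize_int8(const std::vector<float>& w, float& scale_out) {
    float maxabs = 0.0f;
    for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
    scale_out = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;
    std::vector<int8_t> q;
    q.reserve(w.size());
    for (float x : w)
        q.push_back(static_cast<int8_t>(std::lround(x / scale_out)));
    return q;
}
```

Note that pruning first is convenient here: the zeroed weights quantize exactly to integer zero, so sparsity survives quantization.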

As another example, it is possible to do structural pruning on three orthogonal dimensions: depth, width, and length. Depth pruning is layer-wise pruning such as early exit, layer skipping, or layer fusion. Width pruning reduces the “width” of the model, via techniques such as attention head pruning or channel/filter pruning. Length pruning shortens the “length” of the input stream of tokens, with optimizations such as prompt compression, token pruning, or embedding pruning. Combining any two is “dual pruning”, such as depth and width pruning combined. The combination of all three in “triple pruning” is not something that I've seen in a research paper yet, but there's no theoretical obstacle that blocks it from being done.
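Of the three dimensions, depth pruning is the easiest to sketch in a few lines. The code below is a hypothetical illustration of dynamic early exit: the layer stack is walked in order, and the loop exits once a confidence proxy crosses a threshold, skipping the remaining layers. The `Layer` type, the scalar “hidden state”, and the sigmoid confidence proxy are all simplifying assumptions, not the engine's real types.

```cpp
// Sketch of dynamic depth pruning via early exit (hypothetical example).
// Each "layer" is a stub transform on a scalar hidden value; a real engine
// would operate on full activation tensors and a trained exit classifier.
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

using Layer = std::function<float(float)>;

// Run layers in order, exiting early once confidence >= threshold.
// Returns the number of layers actually executed.
int run_with_early_exit(const std::vector<Layer>& layers, float& hidden,
                        float threshold) {
    int executed = 0;
    for (const Layer& layer : layers) {
        hidden = layer(hidden);
        ++executed;
        // Confidence proxy: squash the hidden value into (0,1) with a sigmoid.
        float confidence = 1.0f / (1.0f + std::exp(-hidden));
        if (confidence >= threshold) break;  // early exit: skip remaining layers
    }
    return executed;
}
```

The same loop structure composes naturally with a quantized model, which is exactly the quantize-first, dynamically-prune-later hybrid described above.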

A lot of the basic C++ speedups discussed in other chapters neither depend upon nor affect other optimizations. Similarly, faster C++ code for a Transformer component (e.g. MatMul, Softmax, or activation functions), where it isn't an approximation, offers a genuine improvement without any other impacts on the model or the engine.
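As one sketch of such an exact, non-approximate component speedup, the Softmax below subtracts the maximum for numerical stability and then replaces N per-element divisions with a single reciprocal and N multiplications. This is illustrative standalone code under those assumptions; it computes the same values as a naive Softmax (up to floating-point rounding), so it composes freely with any model-level optimization.

```cpp
// Sketch: exact Softmax speedup (hypothetical standalone example).
// Same mathematical result as the naive version; only the arithmetic
// is restructured, so this is not an approximation.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

void softmax_fast(std::vector<float>& v) {
    if (v.empty()) return;
    // Pass 1: find the maximum, so that exp() never overflows.
    float mx = *std::max_element(v.begin(), v.end());
    // Pass 2: exponentiate (shifted by the max) and accumulate the sum.
    float sum = 0.0f;
    for (float& x : v) {
        x = std::exp(x - mx);
        sum += x;
    }
    // Pass 3: one division to get the reciprocal, then N cheap multiplies.
    float inv = 1.0f / sum;
    for (float& x : v) x *= inv;
}
```

Trading N divisions for one division plus N multiplications is a classic strength reduction, and it changes nothing about the model's outputs beyond rounding at the last bit.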

 
