Aussie AI

Uncommon Optimization Techniques

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

It's hard to list all the possible research areas. However, as a kind of “showcase” of the more interesting ones, here are some of the more theoretical and lesser-known techniques:

  • Weight clustering. This technique involves merging weights that have similar magnitude into “clusters” that all use exactly the same weight value instead. It is similar to a combination of quantization and pruning, and can augment either technique. A simple clustering sketch appears after this list.
  • Approximate matrix multiplication. There are various algorithms for multiplying matrices without actually using numeric multiplication operations. This is an area of active research at the crossover between high-end mathematics and computer algorithms. Several techniques show promise of fast calculation with an acceptable loss of accuracy.
  • Logarithmic bitshift quantization (power-of-two). The coding optimization of using bit-shift operators to replace integer multiplication is well-known. Conceptually, the idea is a second-order model quantization method involving two main steps: (a) convert floating-point model weights to integers (for integer multiplications), and (b) further convert those integer weights logarithmically to the nearest power-of-two. This allows the integer multiplications in vector and matrix operations to be replaced with integer bit-shift operations. However, the trade-off is a greater loss of model accuracy than basic integer quantization. See the bit-shift dot product sketch after this list.
  • Additive and zero-multiplication neural networks. There are various approaches to removing the multiplication bottleneck by replacing the dreaded star with other arithmetic operators, including “adder” networks, bit-shifting, and other low-level optimizations. An adder-style sketch also appears after this list.
  • Low-rank optimization. Approximate large, high-rank matrices or tensors with products of smaller, lower-rank factors using matrix factorization/decomposition. This is conceptually similar to a type of large-scale pruning based on matrix algebra. See the low-rank sketch after this list.
  • Faster multiplication algorithms. There are ways to do arithmetic multiplication faster, although this research is mainly used by chip designers nowadays.
  • Approximate multiplication. This is a low-level optimization of the multiplication operation itself, usually a more mathematical method than using bitshifts. Some of these methods have been used in quantization.
  • Matrix multiplication algorithms. A lot of research has been performed on faster matrix multiplication algorithms to speed up your MatMul/GEMM kernel. It's a very advanced area, but also rather fun. A loop-reordering example appears after this list.
  • Advanced numeric representations. Various non-standard methods of storing numeric weights that make use of all the bits in a byte, going beyond the standard floating-point or integer bit patterns.
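
Here is a minimal weight-clustering sketch in C++. It runs a few rounds of simple one-dimensional k-means over a weight vector and then replaces each weight with its cluster centroid, so that many weights share exactly the same value. The function names, the choice of k-means, and the parameters are illustrative assumptions, not a prescribed method.

    // Weight clustering sketch: 1-D k-means over a weight vector, then each
    // weight is replaced by its cluster centroid so that many weights share
    // exactly the same value.
    #include <cstdio>
    #include <cmath>
    #include <vector>

    // Cluster 'weights' into 'k' shared values using a few rounds of 1-D k-means.
    void cluster_weights(std::vector<float>& weights, int k, int iterations = 10)
    {
        // Initialize centroids evenly across the range of the weights.
        float lo = weights[0], hi = weights[0];
        for (float w : weights) { if (w < lo) lo = w; if (w > hi) hi = w; }
        std::vector<float> centroid(k);
        for (int c = 0; c < k; ++c)
            centroid[c] = lo + (hi - lo) * (c + 0.5f) / k;

        std::vector<int> assign(weights.size(), 0);
        for (int iter = 0; iter < iterations; ++iter) {
            // Assignment step: nearest centroid for each weight.
            for (size_t i = 0; i < weights.size(); ++i) {
                int best = 0;
                float bestdist = std::fabs(weights[i] - centroid[0]);
                for (int c = 1; c < k; ++c) {
                    float d = std::fabs(weights[i] - centroid[c]);
                    if (d < bestdist) { bestdist = d; best = c; }
                }
                assign[i] = best;
            }
            // Update step: each centroid becomes the mean of its assigned weights.
            std::vector<float> sum(k, 0.0f);
            std::vector<int> count(k, 0);
            for (size_t i = 0; i < weights.size(); ++i) {
                sum[assign[i]] += weights[i];
                ++count[assign[i]];
            }
            for (int c = 0; c < k; ++c)
                if (count[c] > 0) centroid[c] = sum[c] / count[c];
        }
        // Replace every weight with its shared cluster value.
        for (size_t i = 0; i < weights.size(); ++i)
            weights[i] = centroid[assign[i]];
    }

    int main()
    {
        std::vector<float> w = { 0.11f, 0.09f, -0.52f, 0.48f, -0.49f, 0.10f, 0.51f, -0.50f };
        cluster_weights(w, 3);  // 8 weights collapse onto 3 shared values
        for (float x : w) printf("%.3f ", x);
        printf("\n");
        return 0;
    }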
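
Next is a sketch of logarithmic power-of-two quantization. For simplicity it quantizes each floating-point weight directly to a sign plus the nearest power-of-two exponent (rather than via an intermediate integer step), so that each multiplication in the dot product becomes an integer bit-shift. The struct and function names are illustrative.

    // Power-of-two quantization sketch: each weight is stored as a sign and a
    // shift count (its nearest power of two), so the "multiply" in a dot
    // product becomes an integer bit-shift.
    #include <cstdio>
    #include <cmath>
    #include <vector>

    struct Pow2Weight {
        int sign;   // +1 or -1
        int shift;  // weight magnitude approximated as 2^shift
    };

    // Quantize a floating-point weight to (approximately) the nearest power of two.
    Pow2Weight quantize_pow2(float w)
    {
        Pow2Weight q;
        q.sign = (w < 0.0f) ? -1 : +1;
        float mag = std::fabs(w);
        if (mag < 1e-8f) { q.shift = -30; return q; }  // treat as (almost) zero
        q.shift = (int)std::lround(std::log2(mag));    // nearest exponent in log space
        return q;
    }

    // Dot product of integer activations against power-of-two weights.
    // Each multiplication is replaced by a left or right bit-shift.
    long dot_pow2(const std::vector<int>& x, const std::vector<Pow2Weight>& w)
    {
        long sum = 0;
        for (size_t i = 0; i < x.size(); ++i) {
            long term = (w[i].shift >= 0)
                ? ((long)x[i] << w[i].shift)    // multiply by 2^shift
                : ((long)x[i] >> -w[i].shift);  // divide by 2^(-shift)
            sum += (w[i].sign > 0) ? term : -term;  // conditional negation, no multiply
        }
        return sum;
    }

    int main()
    {
        std::vector<float> fw = { 0.9f, -2.1f, 4.3f, -0.24f };
        std::vector<Pow2Weight> w;
        for (float f : fw) w.push_back(quantize_pow2(f));  // approx 1, -2, 4, -0.25
        std::vector<int> x = { 100, 50, 25, 200 };
        printf("approx dot = %ld\n", dot_pow2(x, w));
        return 0;
    }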
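
The following sketch illustrates the zero-multiplication idea in the style of an “adder” network: the multiply-accumulate dot product is replaced by a negated sum of absolute differences, using only additions, subtractions, and absolute values. It is a simplified illustration of the general approach, not any particular published architecture.

    // Zero-multiplication "adder" sketch: a neuron scores its input by a
    // negated L1 distance to the weight vector instead of a dot product.
    #include <cstdio>
    #include <cmath>
    #include <vector>

    // Classic dot product: one multiplication per weight.
    float dot_product(const std::vector<float>& x, const std::vector<float>& w)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < x.size(); ++i) sum += x[i] * w[i];
        return sum;
    }

    // Adder-style "similarity": negated sum of absolute differences, no multiplies.
    float adder_similarity(const std::vector<float>& x, const std::vector<float>& w)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < x.size(); ++i) sum += std::fabs(x[i] - w[i]);
        return -sum;  // closer to zero means x is more similar to w
    }

    int main()
    {
        std::vector<float> x = { 0.5f, -0.25f, 1.0f };
        std::vector<float> w = { 0.4f, -0.30f, 0.9f };
        printf("dot   = %f\n", dot_product(x, w));
        printf("adder = %f\n", adder_similarity(x, w));
        return 0;
    }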
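
This low-rank sketch assumes the factors A and B have already been produced offline by some matrix decomposition (e.g. a truncated SVD), and only shows the inference-side saving: computing y = A*(B*x) needs roughly r*(m+n) multiplications instead of the m*n needed for the full weight matrix.

    // Low-rank sketch: an m-by-n weight matrix W approximated as A*B, with A
    // being m-by-r and B being r-by-n for small rank r. The matrix-vector
    // product is then two small products instead of one big one.
    #include <cstdio>
    #include <vector>

    // y = M * x for a row-major (rows x cols) matrix M.
    std::vector<float> matvec(const std::vector<float>& M, int rows, int cols,
                              const std::vector<float>& x)
    {
        std::vector<float> y(rows, 0.0f);
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                y[i] += M[i * cols + j] * x[j];
        return y;
    }

    int main()
    {
        const int m = 4, n = 4, r = 1;   // tiny rank-1 example

        // A is m-by-r, B is r-by-n; W = A*B is the (approximated) weight matrix.
        std::vector<float> A = { 1.0f, 2.0f, 3.0f, 4.0f };    // column vector
        std::vector<float> B = { 0.5f, -1.0f, 0.25f, 2.0f };  // row vector
        std::vector<float> x = { 1.0f, 1.0f, 1.0f, 1.0f };

        // Low-rank path: y = A * (B * x), i.e. two small matrix-vector products.
        std::vector<float> Bx = matvec(B, r, n, x);   // r*n multiplications
        std::vector<float> y  = matvec(A, m, r, Bx);  // m*r more multiplications

        for (int i = 0; i < m; ++i) printf("%.2f ", y[i]);
        printf("   (%d mults instead of %d)\n", r * (m + n), m * n);
        return 0;
    }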
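
Finally, here is one tiny example of MatMul kernel optimization: the classic “ikj” loop reordering, which keeps the inner loop walking memory sequentially and is therefore more cache-friendly than the textbook “ijk” ordering. It is only one of many GEMM techniques in the literature, shown purely as an illustration.

    // MatMul kernel sketch: the same C = A*B computation, but with the "ikj"
    // loop ordering so the inner loop accesses both B and C row-wise.
    #include <cstdio>
    #include <vector>

    // C (n x n) += A (n x n) * B (n x n), all row-major.
    void matmul_ikj(const std::vector<float>& A, const std::vector<float>& B,
                    std::vector<float>& C, int n)
    {
        for (int i = 0; i < n; ++i) {
            for (int k = 0; k < n; ++k) {
                float a = A[i * n + k];          // loaded once per inner loop
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += a * B[k * n + j];  // sequential memory access
            }
        }
    }

    int main()
    {
        const int n = 2;
        std::vector<float> A = { 1, 2, 3, 4 };
        std::vector<float> B = { 5, 6, 7, 8 };
        std::vector<float> C(n * n, 0.0f);
        matmul_ikj(A, B, C, n);
        printf("%.0f %.0f\n%.0f %.0f\n", C[0], C[1], C[2], C[3]);  // 19 22 / 43 50
        return 0;
    }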

Want more? Just pick two random methods and combine them. Hybrid optimization strategies are possible in many ways. For example, a model can be quantized to lower precision, and then the inference engine can also employ various dynamic pruning strategies. Some strategies, such as approximation algorithms or advanced matrix decompositions, apply across both the training and inference phases, combining approaches in yet another way. The co-design of hardware and software architectures also typically crosses both training and inference execution.
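
As one illustration of such a hybrid, the sketch below combines INT8 weight quantization with a simple magnitude-based dynamic pruning test inside the dot product loop. The scale factor, pruning threshold, and function names are illustrative assumptions, not a recommended configuration, and activation quantization is ignored for simplicity.

    // Hybrid sketch: weights quantized to INT8 offline, plus a dynamic
    // pruning test at inference time that skips near-zero weights.
    #include <cstdio>
    #include <cstdlib>
    #include <cmath>
    #include <vector>

    const float kScale = 127.0f;      // symmetric INT8 scale for weights in [-1, 1]
    const int   kPruneThreshold = 2;  // skip quantized weights with |w| <= 2

    // Offline: quantize float weights in [-1, 1] to INT8.
    std::vector<signed char> quantize_int8(const std::vector<float>& w)
    {
        std::vector<signed char> q(w.size());
        for (size_t i = 0; i < w.size(); ++i)
            q[i] = (signed char)std::lround(w[i] * kScale);
        return q;
    }

    // Inference: integer dot product that dynamically prunes tiny weights.
    float dot_quantized_pruned(const std::vector<int>& x,
                               const std::vector<signed char>& qw)
    {
        long sum = 0;
        for (size_t i = 0; i < x.size(); ++i) {
            if (std::abs((int)qw[i]) <= kPruneThreshold) continue;  // prune
            sum += (long)x[i] * qw[i];                              // integer multiply
        }
        return (float)sum / kScale;  // dequantize the accumulated result
    }

    int main()
    {
        std::vector<float> w = { 0.8f, -0.01f, 0.5f, 0.003f };
        std::vector<signed char> qw = quantize_int8(w);   // 102, -1, 64, 0
        std::vector<int> x = { 3, 100, 2, 100 };
        printf("approx dot = %f\n", dot_quantized_pruned(x, qw));  // ~3.4
        return 0;
    }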

 
