Aussie AI
Accuracy-Retaining Optimizations
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Which AI engine optimizations are only about speed? We are looking for “pure speedups” that are only about the code and are therefore “smartness-retaining optimizations.” We want either shorter latency or greater throughput overall, but without any degradation in artificial braininess. Which techniques simply find faster ways to compute the exact same results?
Here's my short list of over-arching ideas:
- Hardware optimizations (i.e. more and faster GPUs, and the CPUs, too).
- Parallelization (multi-threading, multi-core, multi-CPU, multi-GPU, etc.).
- Vectorization (offloading chunks of computation to the GPU or to CPU SIMD hardware intrinsics).
- Pipelining optimizations (pipelined hardware or faster software scheduling optimizations; partitioning, marshaling, and dataflow optimizations).
- Transformer component optimizations (non-approximate algorithm improvements or C++ speedups).
- Memory access optimizations (e.g. contiguous memory, alignment, data locality, cache management, prefetching, offloading).
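As a small illustration of the parallelization idea above, here's a hypothetical sketch (the function name `parallel_dot` and the thread count are my own choices, not from any particular engine) that splits a dot product across `std::thread` workers. Each worker sums a contiguous slice; with these double-precision accumulators the result matches the serial loop exactly for the inputs shown, though in general reassociating floating-point sums can change the last bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical sketch: split a dot product across worker threads.
// Each thread sums one contiguous slice; partial sums are combined at the end.
double parallel_dot(const std::vector<double>& a,
                    const std::vector<double>& b,
                    unsigned num_threads = 4) {
    const std::size_t n = a.size();
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(n, lo + chunk);
            double sum = 0.0;
            for (std::size_t i = lo; i < hi; ++i) sum += a[i] * b[i];
            partial[t] = sum;  // no sharing: each thread writes its own slot
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

The same structure generalizes to any elementwise or reduction kernel: partition the index range, give each thread a private accumulator, and merge once at the end.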
And here are some component-level methods:
- Kernel optimizations (e.g. tiling to reorder memory accesses, kernel fission for vectorization, operator reordering, C++ coding improvements).
- Kernel fusion (merging two sequential operations into one combined C++ kernel, saving a pass over memory).
- MatMul/GEMM kernel optimizations (i.e. the same multiplication computations, reordered to be faster).
- KV caching (avoiding a well-known computation redundancy).
- Zero padding avoidance and zero skipping (don't do redundant calculations on zeros).
- Precalculation of exact results (non-approximate).
- Logarithmic number system models (research-only).
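To show why kernel fusion is accuracy-retaining, here's a hypothetical sketch (the name `fused_bias_relu` is mine) that merges a bias-add kernel and a ReLU kernel into one loop. The arithmetic is identical to running the two kernels back-to-back; only the memory traffic drops, because the vector is swept once instead of twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of kernel fusion: bias-add and ReLU in a single pass.
// Unfused, this would be two loops (and two full sweeps over v's memory).
void fused_bias_relu(std::vector<float>& v, const std::vector<float>& bias) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        float x = v[i] + bias[i];      // kernel 1: bias add
        v[i] = x > 0.0f ? x : 0.0f;    // kernel 2: ReLU, fused into the same loop
    }
}
```

Because every element goes through exactly the same operations in the same order, the fused version is bit-identical to the unfused pair.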
The AI engine isn't the only slow-down. Other ways to speed up the overall architecture include:
- Deployment architecture optimizations (e.g. web server, application logic server, hosting boxes, etc.)
- Inference cache (identical results for identical queries).
- Scheduling optimizations (across multiple AI engine servers).
- ML compilers (accuracy-retaining depending on how you use them, e.g. with no pruning enabled).
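The inference cache idea can be sketched in a few lines. This is a hypothetical exact-match cache (the class name `InferenceCache` and the injected `run_model` callback are illustrative, not any real engine's API): identical query strings return the stored response without re-running the model, which is trivially accuracy-retaining because the answer is literally the same.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical sketch of an exact-match inference cache.
// run_model_ stands in for the expensive engine call.
class InferenceCache {
public:
    explicit InferenceCache(std::function<std::string(const std::string&)> run_model)
        : run_model_(std::move(run_model)) {}

    const std::string& query(const std::string& prompt) {
        auto it = cache_.find(prompt);
        if (it != cache_.end()) { ++hits_; return it->second; }  // cache hit: no model call
        ++misses_;
        return cache_.emplace(prompt, run_model_(prompt)).first->second;
    }

    int hits() const { return hits_; }
    int misses() const { return misses_; }

private:
    std::function<std::string(const std::string&)> run_model_;
    std::unordered_map<std::string, std::string> cache_;
    int hits_ = 0, misses_ = 0;
};
```

Real deployments also need eviction and, for sampled (non-greedy) decoding, a policy decision about whether repeating a cached answer is acceptable; this sketch shows only the exact-match core.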
General coding improvements can be used for faster AI:
- C++ intrinsics with underlying hardware support (e.g. AVX).
- Assembly language (i.e. deep inside C++ kernels).
- Floating-point calculation speedups (e.g. FTZ/DAZ CPU modes).
- General C++ code optimizations (e.g. C++ tricks, loop optimizations, LUTs, etc.).
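As an example of the LUT trick above, combined with exact precalculation: when inputs are int8-quantized there are only 256 possible values, so a table can hold the exact activation of each one. This is a hypothetical sketch (the `SigmoidLUT` name and the dequantization scale are my own assumptions); the table entries are the very same floats the direct computation would return, so no accuracy is lost.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: precompute sigmoid for all 256 int8-quantized inputs.
// Lookup replaces a per-element exp() call with a single array index.
struct SigmoidLUT {
    std::array<float, 256> table{};
    SigmoidLUT() {
        for (int q = 0; q < 256; ++q) {
            // Assumed dequantization: map codes 0..255 to roughly [-8, 8).
            float x = (q - 128) * (8.0f / 128.0f);
            table[q] = 1.0f / (1.0f + std::exp(-x));
        }
    }
    float operator()(std::uint8_t q) const { return table[q]; }
};
```

The same pattern works for GELU, tanh, or any other elementwise function of a quantized input: one table build at load time, then pure lookups during inference.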
Some of the items above may be miscategorized, or there may be sub-variants that retain or lose accuracy. I've tried to categorize them to the best of my ability, but it's hard to know all the cases from the many research papers (literally thousands!).