Aussie AI
Accuracy-Retaining Optimizations
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Which AI engine optimizations are only about speed? We are looking for “pure speedups” that are only about the code and are therefore “smartness-retaining optimizations.” We want either shorter latency or greater throughput overall, but without any degradation in artificial braininess. Which techniques simply find faster ways to compute the exact same results?
Here's my short list of over-arching ideas:
- Hardware optimizations (i.e. more and faster GPUs, and the CPUs, too).
- Parallelization (multi-threading, multi-core, multi-CPU, multi-GPU, etc.).
- Vectorization (offloading chunks of computation to the GPU or to CPU SIMD hardware intrinsics).
- Pipelining optimizations (pipelined hardware or faster software scheduling optimizations; partitioning, marshaling, and dataflow optimizations).
- Transformer component optimizations (non-approximate algorithm improvements or C++ speedups).
- Memory access optimizations (e.g. contiguous memory, alignment, data locality, cache management, prefetching, offloading).
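As a small illustration of the parallelization idea above, here's a hypothetical sketch (the function name `parallel_dot` and the thread count are my own choices, not from any particular engine) that splits a dot product across `std::thread` workers. Each worker sums a contiguous slice; with these double-precision accumulators the result matches the serial loop exactly for the inputs shown, though in general reassociating floating-point sums can change the last bits.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical sketch: split a dot product across worker threads.
// Each thread sums one contiguous slice; partial sums are combined at the end.
double parallel_dot(const std::vector<double>& a,
                    const std::vector<double>& b,
                    unsigned num_threads = 4) {
    const std::size_t n = a.size();
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t lo = t * chunk;
            const std::size_t hi = std::min(n, lo + chunk);
            double sum = 0.0;
            for (std::size_t i = lo; i < hi; ++i) sum += a[i] * b[i];
            partial[t] = sum;  // no sharing: each thread writes its own slot
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

The same structure generalizes to any elementwise or reduction kernel: partition the index range, give each thread a private accumulator, and merge once at the end.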
And here are some component-level methods:
- Kernel optimizations (e.g. tiling to reorder memory accesses, kernel fission for vectorization, operator reordering, C++ coding improvements).
- Kernel fusion (merging two sequential operations into one combined C++ kernel, saving a pass over memory).
- MatMul/GEMM kernel optimizations (i.e. the same multiplication computations, reordered to be faster).
- KV caching (avoiding a well-known computation redundancy).
- Zero padding avoidance and zero skipping (don't do redundant calculations on zeros).
- Precalculation of exact results (non-approximate).
- Logarithmic number system models (research-only).
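To show why kernel fusion is accuracy-retaining, here's a hypothetical sketch (the name `fused_bias_relu` is mine) that merges a bias-add kernel and a ReLU kernel into one loop. The arithmetic is identical to running the two kernels back-to-back; only the memory traffic drops, because the vector is swept once instead of twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of kernel fusion: bias-add and ReLU in a single pass.
// Unfused, this would be two loops (and two full sweeps over v's memory).
void fused_bias_relu(std::vector<float>& v, const std::vector<float>& bias) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        float x = v[i] + bias[i];      // kernel 1: bias add
        v[i] = x > 0.0f ? x : 0.0f;    // kernel 2: ReLU, fused into the same loop
    }
}
```

Because every element goes through exactly the same operations in the same order, the fused version is bit-identical to the unfused pair.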
The AI engine isn't the only slow-down. Other ways to speed up the overall architecture include:
- Deployment architecture optimizations (e.g. web server, application logic server, hosting boxes, etc.)
- Inference cache (identical results for identical queries).
- Scheduling optimizations (across multiple AI engine servers).
- ML compilers (accuracy-retaining depending on how you use them, e.g. with no pruning enabled).
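The inference cache idea can be sketched in a few lines. This is a hypothetical exact-match cache (the class name `InferenceCache` and the injected `run_model` callback are illustrative, not any real engine's API): identical query strings return the stored response without re-running the model, which is trivially accuracy-retaining because the answer is literally the same.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical sketch of an exact-match inference cache.
// run_model_ stands in for the expensive engine call.
class InferenceCache {
public:
    explicit InferenceCache(std::function<std::string(const std::string&)> run_model)
        : run_model_(std::move(run_model)) {}

    const std::string& query(const std::string& prompt) {
        auto it = cache_.find(prompt);
        if (it != cache_.end()) { ++hits_; return it->second; }  // cache hit: no model call
        ++misses_;
        return cache_.emplace(prompt, run_model_(prompt)).first->second;
    }

    int hits() const { return hits_; }
    int misses() const { return misses_; }

private:
    std::function<std::string(const std::string&)> run_model_;
    std::unordered_map<std::string, std::string> cache_;
    int hits_ = 0, misses_ = 0;
};
```

Real deployments also need eviction and, for sampled (non-greedy) decoding, a policy decision about whether repeating a cached answer is acceptable; this sketch shows only the exact-match core.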
General coding improvements can be used for faster AI:
- C++ intrinsics with underlying hardware support (e.g. AVX).
- Assembly language (i.e. deep inside C++ kernels).
- Floating-point calculation speedups (e.g. FTZ/DAZ CPU modes).
- General C++ code optimizations (e.g. C++ tricks, loop optimizations, LUTs, etc.).
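As an example of the LUT trick above, combined with exact precalculation: when inputs are int8-quantized there are only 256 possible values, so a table can hold the exact activation of each one. This is a hypothetical sketch (the `SigmoidLUT` name and the dequantization scale are my own assumptions); the table entries are the very same floats the direct computation would return, so no accuracy is lost.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical sketch: precompute sigmoid for all 256 int8-quantized inputs.
// Lookup replaces a per-element exp() call with a single array index.
struct SigmoidLUT {
    std::array<float, 256> table{};
    SigmoidLUT() {
        for (int q = 0; q < 256; ++q) {
            // Assumed dequantization: map codes 0..255 to roughly [-8, 8).
            float x = (q - 128) * (8.0f / 128.0f);
            table[q] = 1.0f / (1.0f + std::exp(-x));
        }
    }
    float operator()(std::uint8_t q) const { return table[q]; }
};
```

The same pattern works for GELU, tanh, or any other elementwise function of a quantized input: one table build at load time, then pure lookups during inference.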
Some of the items above may be miscategorized, or there may be sub-variants that retain or lose accuracy. I've tried to categorize them to the best of my ability, but it's hard to know all the cases from the many research papers (literally thousands!).