Aussie AI
AI Memory Reduction
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
AI Memory Reduction
Memory reduction optimizations involve using less memory during model inference. Inference then requires fewer resources, and processing time can also drop because less data is swapped in and out of memory. Memory optimization can refer to optimizing CPU memory (RAM), GPU memory (VRAM), or both.
Much research reports that model inference is memory-bound rather than CPU-bound or GPU-bound, with the CPU and GPU often underutilized because they are waiting to receive data from memory (or should I say they're “weighting”?). In such cases, memory management is the key to improving latency and throughput.
On the other hand, researchers have also examined increasing memory usage to save time, via caching and computation reuse. In fact, you can do both: a cache at the front end intercepts the common queries, while the harder queries that get through are handled by an engine that uses memory management techniques to run faster.
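To illustrate the front-end caching idea, here is a minimal sketch of an exact-match response cache in C++. The class name and the engine callback are hypothetical placeholders, not code from the book; a production cache would also need normalization, eviction, and size limits.

#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

// Minimal exact-match response cache: common queries are answered
// from memory, and only cache misses reach the (expensive) engine.
class InferenceCache {
public:
    explicit InferenceCache(std::function<std::string(const std::string&)> engine)
        : engine_(std::move(engine)) {}

    std::string query(const std::string& prompt) {
        auto it = cache_.find(prompt);
        if (it != cache_.end()) {
            return it->second;                    // cache hit: no engine call
        }
        std::string response = engine_(prompt);   // cache miss: run the model
        cache_[prompt] = response;
        return response;
    }

private:
    std::function<std::string(const std::string&)> engine_;
    std::unordered_map<std::string, std::string> cache_;
};

int main() {
    InferenceCache cache([](const std::string& prompt) {
        return "response to: " + prompt;  // stand-in for real model inference
    });
    std::cout << cache.query("What is AI?") << "\n";  // engine call
    std::cout << cache.query("What is AI?") << "\n";  // served from cache
}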
Model Compression Techniques: The main class of optimizations that reduces memory requirements is called “model compression.” These methods shrink the model, reducing both its memory size and its computation time. The “big three” well-known and widely used model compression methods are quantization, pruning, and knowledge distillation. However, there are several other types of model compression that also produce a lightweight model: weight sharing, layer fusion, sparsity (sparse matrices), low-rank matrices, and weight clustering.
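As a concrete example of one of these techniques, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain C++. The scaling scheme shown is one simple variant chosen for illustration, not the book's specific implementation.

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

// Symmetric per-tensor INT8 quantization: store weights as int8_t plus
// one float scale, cutting weight memory roughly 4x versus float32.
struct QuantizedTensor {
    std::vector<int8_t> data;
    float scale;  // dequantize: w ~= data[i] * scale
};

QuantizedTensor quantize_int8(const std::vector<float>& weights) {
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    QuantizedTensor q;
    q.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    q.data.reserve(weights.size());
    for (float w : weights) {
        int v = static_cast<int>(std::lround(w / q.scale));  // round to nearest level
        v = std::clamp(v, -127, 127);                        // stay in symmetric range
        q.data.push_back(static_cast<int8_t>(v));
    }
    return q;
}

int main() {
    std::vector<float> weights = {0.12f, -0.5f, 0.33f, -0.07f};
    QuantizedTensor q = quantize_int8(weights);
    for (size_t i = 0; i < q.data.size(); ++i)
        std::cout << weights[i] << " -> " << q.data[i] * q.scale << "\n";
}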
Recomputation: Trading Time for Space: On memory-constrained devices, it is possible to reduce space requirements at the cost of extra processor time. This is called “recomputation,” or sometimes in research papers it is called “rematerialization” or “checkpointing.” When this is used to optimize training of a model that is too large to fit inside GPU memory, it is called “gradient checkpointing.” The portion of this algorithm that involves swapping tensors off the GPU back to the CPU is often called “offloading.”
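Below is a minimal sketch of the recomputation idea in plain C++: only every k-th layer's activation is stored as a checkpoint, and any other activation is recomputed from the nearest earlier checkpoint. The layer function and checkpoint interval are illustrative assumptions, not the specific algorithm from the book.

#include <cmath>
#include <iostream>
#include <vector>

using Tensor = std::vector<float>;

// Stand-in for a real layer's forward pass (illustrative only).
Tensor layer_forward(int layer, const Tensor& input) {
    Tensor out(input.size());
    for (size_t i = 0; i < input.size(); ++i)
        out[i] = std::tanh(input[i] + 0.01f * layer);
    return out;
}

// Forward pass that stores only every k-th activation ("checkpoints").
// Anything not stored is recomputed on demand: less memory, more arithmetic.
struct CheckpointedForward {
    int num_layers;
    int k;
    std::vector<Tensor> checkpoints;  // activations entering layers 0, k, 2k, ...

    Tensor run(const Tensor& input) {
        Tensor x = input;
        checkpoints.clear();
        for (int layer = 0; layer < num_layers; ++layer) {
            if (layer % k == 0) checkpoints.push_back(x);  // save sparse checkpoints
            x = layer_forward(layer, x);
        }
        return x;
    }

    // Recompute the activation feeding a given layer from the last checkpoint.
    Tensor activation_before(int layer) {
        int cp = layer / k;            // nearest stored checkpoint at or before 'layer'
        Tensor x = checkpoints[cp];
        for (int l = cp * k; l < layer; ++l)
            x = layer_forward(l, x);   // redo the skipped layers
        return x;
    }
};

int main() {
    CheckpointedForward fwd{12, 4, {}};
    Tensor out = fwd.run(Tensor(8, 0.5f));
    Tensor act7 = fwd.activation_before(7);  // recomputed, never stored
    std::cout << "output[0]=" << out[0] << " act7[0]=" << act7[0] << "\n";
}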
Model Pre-loading: The discussion of recomputation points to the best way to optimize memory accesses: avoid them entirely on the second query by preloading the entire model into GPU memory. The GPU then already has the whole model available for any follow-up inference queries, and indeed for the next token computation in an autoregressive algorithm. Ideally, the GPU makes zero accesses to main CPU RAM during inference.
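A minimal sketch of preloading with the standard CUDA runtime API is shown below. The weight vector and helper function are hypothetical, but cudaMalloc and cudaMemcpy are the usual calls for copying data into GPU memory once, up front, so that later inference steps reuse the resident copy.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Preload all model weights into GPU memory once at startup, so that
// subsequent inference calls (and each autoregressive token step)
// never copy weights from host RAM again.
float* preload_weights(const std::vector<float>& host_weights) {
    float* device_weights = nullptr;
    size_t bytes = host_weights.size() * sizeof(float);
    cudaMalloc(&device_weights, bytes);                 // one-time GPU allocation
    cudaMemcpy(device_weights, host_weights.data(),
               bytes, cudaMemcpyHostToDevice);          // one-time host-to-device copy
    return device_weights;  // reused by every later inference kernel launch
}

int main() {
    std::vector<float> weights(1 << 20, 0.5f);  // placeholder for real model weights
    float* d_weights = preload_weights(weights);
    // ... launch inference kernels repeatedly against d_weights ...
    cudaFree(d_weights);
    std::printf("weights preloaded and freed\n");
    return 0;
}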
For a large model, it isn't possible to store everything on a single GPU, but a lot can still be done to optimize a multi-GPU architecture where part of the model is stored in each GPU's memory. Much depends on the particular CPU and GPU architectures and their capabilities. Judicious management of the various levels of the memory cache hierarchy can help reduce the lag time of a memory-bound algorithm.
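As a rough illustration of splitting a model across devices, here is a sketch using the CUDA runtime's cudaSetDevice to place each layer's weights on one of the available GPUs. The round-robin placement and layer sizes are illustrative assumptions; a real engine would balance placement by memory size and communication cost.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Shard layer weights across multiple GPUs: each layer's weights are
// allocated and copied onto one device, so no single GPU must hold
// the whole model.
struct LayerShard {
    int device;        // which GPU holds this layer
    float* d_weights;  // device pointer on that GPU
};

std::vector<LayerShard> shard_layers(const std::vector<std::vector<float>>& layers) {
    int num_gpus = 0;
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus < 1) num_gpus = 1;   // fall back gracefully if no GPU is visible
    std::vector<LayerShard> shards;
    for (size_t i = 0; i < layers.size(); ++i) {
        int device = static_cast<int>(i) % num_gpus;  // simple round-robin placement
        cudaSetDevice(device);
        LayerShard s{device, nullptr};
        size_t bytes = layers[i].size() * sizeof(float);
        cudaMalloc(&s.d_weights, bytes);
        cudaMemcpy(s.d_weights, layers[i].data(), bytes, cudaMemcpyHostToDevice);
        shards.push_back(s);
    }
    return shards;
}

int main() {
    std::vector<std::vector<float>> layers(8, std::vector<float>(1 << 18, 0.1f));
    std::vector<LayerShard> shards = shard_layers(layers);
    std::printf("placed %zu layers across GPUs\n", shards.size());
    for (const LayerShard& s : shards) {
        cudaSetDevice(s.device);
        cudaFree(s.d_weights);
    }
}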