Aussie AI
36. AI Memory Optimizations
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“The array is stable.”
— Star Trek, 1996.
Why Optimize Memory?
C++ programmers are not especially used to optimizing for memory rather than speed. However, AI engines are one application where memory optimization is a significant part of the whole efficiency puzzle because of the sheer volume of data in the model (“weights”) and the interim calculations (“activations”). There are two problems with today's AI architectures that together create the need for memory optimization:
(a) models are too big, and
(b) GPUs are too fast.
Obviously, it's great to have fast GPUs: their amazing speed has led to the revolution in AI capabilities, and indeed is what allows us to have such huge models. However, the problem is that memory chips haven't kept up with the speed increases in GPU technologies, so AI engines are often memory-bound, rather than CPU-bound (or rather, GPU-bound).
What this means is that the GPU is often sitting there doing nothing, waiting for data to be uploaded from memory. This is sometimes called a “bubble” in the GPU pipeline. And there are various software techniques to pop the bubbles, which is what this chapter is about.
At the top-level, there are two fundamental ways to improve memory efficiency in an AI application:
- Smaller models — model compression
- Memory management — advanced engine algorithms (“kernels”)
And at the bottom-level, there's always the C++ code. It's important to get the fundamentals of memory management right at all levels.
Elements of Memory Optimization
Before we delve into complex AI memory management optimizations like model compression or kernel tiling, let's look at the fundamental building blocks. How do we represent data in memory inside an AI engine to promote memory efficiency?
Some of the lower-level types of memory optimizations include:
- Contiguous memory blocks
- Linearizing multidimensional matrices/tensors
- C++ memory management optimizations
For the last category, there are also a variety of basic C++ programming techniques to minimize CPU memory usage overall. See Chapter 14 for information on C++ memory reduction.
Contiguous Memory Blocks
AI frameworks typically require that tensors are stored in contiguous memory, whether in CPU RAM or GPU VRAM. A critical part of optimizing AI engines is to manage the storage of data in contiguous memory blocks, so that it occupies a sequential address space with high data locality.
Processing chunks of data in parallel is the main optimization used in both GPU and CPU SIMD acceleration. All of the vectors, matrices, and tensors need their underlying data in a block. This applies to both pre-computed static weights and dynamic interim activation results in the AI engine's computations.
Processing data that is in adjacent addresses is much faster than jumping all over the place. Vectors should obviously be stored in a simple contiguous array of memory. Less obviously, similar comments apply to the memory storage of matrices and tensors. The use of contiguous memory is an important optimization for both sequential and parallel algorithms. The reasons that memory blocks are more efficient include:
- Data locality (cache hits)
- Data block GPU uploads (model weights from memory-to-cache)
- Predictive cache pipelining (in CPU sequential accesses)
Data locality refers to using data in the same or similar address locations. This is helpful for the cache hit rate because data that is already in the cache is much faster to access than a non-cached RAM memory address.
GPU uploads from CPU RAM to the GPU's RAM (VRAM) are done in blocks. Obviously, we don't want to be uploading random bits of data from different parts of the RAM.
Non-GPU architectures also benefit from the use of contiguous memory. This is obviously true of CPU SIMD instructions (e.g. AVX on x86), but even in sequential execution, the CPU has its own RAM caching methods and often has other optimizations of memory accesses. Predictive cache pipelining is where the CPU attempts to predict the next memory location and loads it ahead of time in a pipelined manner, before being asked. This pipelining of memory accesses is much faster than waiting for each address lookup to complete before starting the next.
Typically, predictive cache pipelining uses the simple heuristic that the next memory address is the most likely next request, which assumes that data is being processed in order of the addresses. Hence, scanning an array in reverse order can defeat this prefetching on such CPUs. Similarly, jumping around to different memory addresses, such as scanning the column of a matrix using a large “stride,” is also inefficient.
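As a small illustration (not from the book's engine code), summing a matrix row-by-row touches memory sequentially, whereas the column-by-column version strides through memory and defeats the cache, even though both loops do identical arithmetic:
#include <cstddef>

const size_t ROWS = 1024, COLS = 1024;
static float m[ROWS][COLS];   // stored row-major: one contiguous block

float sum_row_major()         // sequential addresses: cache-friendly
{
    float sum = 0.0f;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

float sum_column_major()      // stride of COLS floats per access: cache-hostile
{
    float sum = 0.0f;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}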
Fast Memory Block Operations
The slow way to do things in arrays is one element at a time. The faster way is to use the standard memory block functions on the whole array. There are a number of standard functions that operate on array data or memory blocks and they are very fast. Memory block operations in the standard C++ libraries are implemented using fast assembly language behind the scenes. The main functions in the standard C++ library that operate on binary bytes in a memory block are:
- memset — set bytes to a value, or clear bytes to zero.
- memcpy — copy bytes in a block (non-overlapping).
- memmove — copy bytes, allowing overlapping blocks.
- memcmp — compare two blocks of bytes.
- memchr — search for a byte in a block.
Note that unlike the standard string functions (such as strlen), these functions do not assume a block is null-terminated by a zero byte. Zero is simply a binary value, and these functions don't stop at a zero byte. All of these functions operate on a block of memory with a fixed byte length.
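For example, a block of floats usually contains many zero bytes (every 0.0f is four zero bytes), but the block functions ignore them and process the full length they are given. A tiny sketch with a hypothetical helper name:
#include <cstring>

// Byte-wise equality test of two float vectors of length n.
// memcmp compares all n * sizeof(float) bytes; unlike strcmp, it does not
// stop at the zero bytes inside 0.0f values.
bool vectors_equal_bytes(const float* v1, const float* v2, int n)
{
    return memcmp(v1, v2, n * sizeof(float)) == 0;
}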
Each compiler environment typically offers some extra non-standard byte-wise functions that are also fast. Some of the less standardized C++ intrinsics that operate on memory blocks include:
- _memccpy — copy bytes up to a specified sentinel byte.
- memicmp/_memicmp — compare bytes ignoring letter case.
- bcopy — copy a block of bytes.
- bzero — clear a block of bytes to zero.
- bcmp — compare bytes in two blocks.
- _byteswap_uint64 — swap the bytes of an integer (Microsoft).
- __builtin_bswap16 — swap the bytes in an integer (GCC). There are versions for 32-bit and 64-bit.
Initialize memory blocks with memset
The memset function sets all of a memory block to a byte value. It is widely used as a fast way to initialize a block of memory to all zeros.
memset(&x, 0, sizeof(x));
Almost all usages of memset will be for the zero byte. The only other usage I've seen is to fill memory with a dummy non-zero byte as a form of mutation testing to catch uses of uninitialized memory.
memset(&x, 0x55, sizeof(x));
memset sizeof problem.
Here's a common glitch in using memset inside functions:
void zero_array(int arr[10])
{
    memset(arr, 0, sizeof(arr));  // Bug
}
The problem is not memset, but the sizeof operator on function parameters. An array parameter in a function is like a hologram and isn't really there. It's not really an array, but a pointer, so sizeof applied to the parameter gives sizeof(int*) rather than sizeof(int[10]). Hence, sizeof(arr) is probably only 4 or 8, rather than 40 or 80, leaving most of the array uninitialized.
Personally, I recommend a memset debug wrapper function to catch this kind of problem at runtime, or maybe a tricky preprocessor macro can detect it at compile-time with a static_assert somehow.
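As one possible approach, here is a minimal sketch of such a runtime wrapper (the names, the DEBUG_MEMSET flag, and the pointer-sized threshold are all assumptions for illustration): it flags any call whose byte count is no larger than a pointer, which is the usual symptom of sizeof applied to an array parameter.
#include <cstddef>
#include <cstring>
#include <cstdio>

inline void* memset_debug(void* dest, int ch, size_t count,
                          const char* file, int line)
{
    if (count <= sizeof(void*)) {   // 4 or 8 bytes: probably sizeof(pointer)
        fprintf(stderr, "MEMSET WARNING: only %zu bytes (%s:%d)\n",
                count, file, line);
    }
    return memset(dest, ch, count);
}

// In debug builds, route plain memset calls through the wrapper.
// (Redefining a standard library name is a hack; use it only in debug code.)
#ifdef DEBUG_MEMSET
#define memset(dst, ch, n) memset_debug((dst), (ch), (n), __FILE__, __LINE__)
#endif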
memset portability issue.
Even though it's a fast zeroing method, the use of memset to zero bytes has an obscure portability problem on any architecture where the all-bytes-zero representation is not the zero value for every data type. However, on most standard platforms, all-bytes-zero is correct for all types: integer zero (regardless of endianness), floating-point zero (positive zero is all bits zero), and the null pointer. Zero portability is covered in more detail in Chapter 38.
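On a typical platform you can sanity-check this assumption with a throwaway test like the following (a sketch only; the asserts hold on mainstream IEEE 754 hardware, but the null pointer case is not guaranteed by the C++ standard):
#include <cstring>
#include <cassert>

void check_all_bytes_zero()
{
    float f;
    memset(&f, 0, sizeof(f));   // all bytes zero
    assert(f == 0.0f);          // positive zero on IEEE 754 platforms

    int* p;
    memset(&p, 0, sizeof(p));   // all bytes zero
    assert(p == nullptr);       // true on mainstream platforms, not guaranteed
}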
Fast memory block copying with memcpy
The fast way to copy an entire memory block is with memcpy. This applies to copying vectors, and to matrices and tensors that are linearized into a contiguous array. Rather than copy each element of an array, one at a time, in a loop, the memcpy standard library function can be used to copy the entire block in one statement:
memcpy(destarr, srcarr, sizeof(srcarr));  // srcarr must be a real array here, not a pointer
Note that this is a bitwise copy of the array intended for simple data types. For example, it won't run copy constructors if applied to an array of objects.
The memcpy function does a very fast memory block copy. It is like strcpy in that the destination is the first parameter, but memcpy will copy everything, even null bytes and hidden padding bytes, and will always copy a fixed number of bytes. memcpy is a super-fast byte copy, but it is unsafe, because it does not have well-defined behavior if the source and destination blocks overlap.
memcpy overlapping blocks error:
The only downside with memcpy is that it can fail with overlapping ranges for the source and destination blocks, so if you are shuffling arrays up or down one element using memcpy, then you have to be careful, because the results on overlapping ranges are undefined.
Here's a buggy example of using memcpy to remove the first character of a string in place:
memcpy(s, s+1, strlen(s+1)+1); // Bug
The problem is that the blocks starting at “s” and “s+1” overlap. The behavior is undefined, so whether this code happens to work depends on the platform's memcpy implementation.
The fix is simply to use memmove, which always works correctly for overlaps:
memmove(s, s+1, strlen(s+1)+1); // Correct
The memmove function is a safer version of memcpy, which also works correctly if the memory blocks overlap. If the source and destination blocks don't overlap, it's the same as memcpy, except probably slightly slower. If they do overlap, then memmove conceptually will copy the source to a temporary area, and then copy it to the destination block.
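The same rule applies to the “shuffling arrays up or down one element” case mentioned above. A minimal sketch (hypothetical helper, for illustration): deleting the first element of a float array in place copies overlapping ranges, so memmove is the right tool.
#include <cstring>

// Remove the first element of arr (length n), shifting the rest down by one.
// The source (arr + 1) and destination (arr) overlap, so memmove is required.
void remove_first(float arr[], int n)
{
    if (n > 1) {
        memmove(arr, arr + 1, (n - 1) * sizeof(float));
    }
}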
memcmp byte comparisons
The memcmp function does a byte-wise comparison of a memory block. Its return value is like strcmp, returning 0 for equality, and a negative or positive value otherwise. However, unlike strcmp, memcmp will not stop when it finds a zero byte.
memcmp return value.
A pitfall with memcmp is that you cannot assume that it returns 1 or -1, but must compare the return result to zero (like the strcmp function).
if (memcmp(&a, &b, sizeof(a)) == 1)  // Bug
if (memcmp(&a, &b, sizeof(a)) > 0)   // Correct
memcmp object equality testing.
Looking at the memcmp function, you might think of it as an opportunity to do a fast equality/inequality test on large objects by simply doing a byte-wise test. You would not be the first to think that. Unfortunately, there are several obscure pitfalls with this approach. There are multiple ways in C++ that the data in two memory blocks might be logically identical, but the bytes different:
- Padding bytes (alignment)
- Two types of floating-point zero
- Multiple types of floating-point NaN
- Bitfield data members (unset padding bits)
You can only use memcmp for memory block comparisons (e.g. tensor equality) if you're sure that these situations cannot occur, such as a contiguous array of primitive data types. Not recommended otherwise.
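Here is a small illustration of the padding and signed-zero pitfalls, using a hypothetical struct (the names are made up for this example):
#include <cstring>

struct Item {
    char tag;      // 1 byte, typically followed by 3 padding bytes
    float value;   // 0.0f and -0.0f compare equal but have different bytes
};

bool equal_memberwise(const Item& a, const Item& b)
{
    return a.tag == b.tag && a.value == b.value;   // reliable
}

bool equal_bytewise(const Item& a, const Item& b)
{
    // Risky: padding bytes are indeterminate, and distinct byte patterns
    // (negative zero, different NaNs) can represent logically equal values.
    return memcmp(&a, &b, sizeof(Item)) == 0;
}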
Linearized Multi-Dimensional Arrays
Contiguous memory blocks are efficient and easy to vectorize, as we found out in the previous sections. Further gains are possible if we can represent the higher-order logic of the AI model in matrices and tensors, but retain the power of sequential address spaces underneath. Rather than thinking of a matrix as rows and columns, or higher dimensions for tensors, we should still think of them as one big chunk of data.
Using dynamic memory to store multi-dimensional arrays is a difficult problem to solve. One idea is to allocate a one-dimensional array, and then “map” a two-dimensional structure onto it by using arithmetic with the index offsets. This is what the compiler does behind the scenes for the built-in C++ multi-dimensional arrays.
The main difficulty with multi-dimensional arrays of primitive types (e.g. float) is that “new[]” or malloc/calloc return blocks of memory with no structure, and it is difficult to impose the structure of a multi-dimensional array. It is only legal to use one dimension of indexing on the pointer to this block of memory. For example, if we wish to declare an array similar to arr[5][3] on the heap, we cannot allocate 15 integers, pass the address to a pointer, and then use two levels of brackets on the pointer variable such as ptr[i][j]. Instead, it is necessary to map the two indices, x and y, to a single index, i. For the array arr[5][3] the mapping is:
i = x * 3 + y;
val = ptr[i];
Note that this indexing uses pointer arithmetic in units of the array's element type, not bytes.
More generally, a macro for defining the mapping for two-dimensional arrays, using the indices and the number of rows and columns, is shown below:
#define MAP2(x1, x2, size1, size2)  ((x1) * (size2) + (x2))
Note that the size1 macro parameter is not actually used, but is left there to avoid confusion (or maybe to create it). The size of the first dimension is not needed in any linearized index calculations.
The MAP2 macro can be used for multi-dimensional dynamic arrays, as below. Unfortunately, the MAP2 macro must be used for every array reference. The code fragment below allocates a two-dimensional dynamic array and then uses the MAP2 macro to set all elements to zero.
int num_rows = 5, num_columns = 3;
int *p = (int*)calloc(num_rows * num_columns, sizeof(int));
for (int i = 0; i < num_rows; i++)
    for (int j = 0; j < num_columns; j++)
        p[MAP2(i, j, num_rows, num_columns)] = 0;  // p[i][j] = 0
Similarly, macros can be defined for mapping three-dimensional and higher dimensional arrays to the one-dimensional array index. The macros for dimensions three and four are shown below, and macros for higher dimensions can be devised by following the pattern.
#define MAP3(x1, x2, x3, size1, size2, size3) \
    ((x1) * (size2) * (size3) + (x2) * (size3) + (x3))
#define MAP4(x1, x2, x3, x4, size1, size2, size3, size4) \
    ((x1) * (size2) * (size3) * (size4) + (x2) * (size3) * (size4) \
     + (x3) * (size4) + (x4))
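If the macro style feels fragile, an inline function computes the same linearized offsets with type checking and optional bounds assertions (one possible formulation, not the book's required approach):
#include <cassert>

// Linearized offset for a conceptual arr[size1][size2] stored in one block.
// As with MAP2, the size of the first dimension is only needed for checking.
inline int map2(int x1, int x2, int size1, int size2)
{
    assert(x1 >= 0 && x1 < size1 && x2 >= 0 && x2 < size2);
    return x1 * size2 + x2;
}

// Same idea for a three-dimensional tensor: (x1*size2 + x2)*size3 + x3.
inline int map3(int x1, int x2, int x3, int size1, int size2, int size3)
{
    assert(x1 < size1 && x2 < size2 && x3 < size3);
    return (x1 * size2 + x2) * size3 + x3;
}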
Model Compression
Model compression is the general class of optimizations that “compress” a model down to a smaller size. The goal is usually both memory and speed optimization via a smaller model that requires fewer operations and/or lower-precision arithmetic. Some techniques work after training, and some model compression methods require a brief re-training or fine-tuning followup.
One key point about reducing model size is that the number of weights is usually directly correlated with: (a) the number of arithmetic operations at runtime, and (b) the total bytes of memory-to-cache data transfers needed. Hence, shrinking a model can proportionally reduce time cost, which is not true of all space optimizations.
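A rough back-of-the-envelope sketch (illustrative figures only, assuming a hypothetical 3-billion-parameter model) shows why fewer or smaller weights cut memory traffic proportionally:
#include <cstdio>

int main()
{
    const double params = 3e9;                     // e.g. a 3B-parameter model
    printf("FP32: %.0f GB\n", params * 4 / 1e9);   // ~12 GB of weights
    printf("FP16: %.0f GB\n", params * 2 / 1e9);   // ~6 GB with 16-bit floats
    printf("INT8: %.0f GB\n", params * 1 / 1e9);   // ~3 GB with 8-bit quantization
    return 0;
}
Every gigabyte saved is a gigabyte that never has to be streamed from RAM into the cache or uploaded into GPU VRAM.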
We've already examined a lot of the possible ways to make a smaller model in earlier chapters. The most popular types of model compression are:
- Quantization
- Pruning
- Distillation
There are also a number of other less well-known types of model compression:
- Weight sharing (parameter sharing)
- Layer fusion
- Weight clustering
- Sparsity (not only via pruning)
- Low-rank matrices
There are also several ensemble multi-model architectures that offer memory efficiency by having at least one small model in the mix:
- Mixture-of-experts
- Big-little models
- Speculative decoding
All of these model compression techniques are discussed in separate chapters. Whenever fewer computations are required, there is also the additional benefit that fewer memory transfers will be required for the data at runtime.
GPU Memory Management
The GPU has its own memory, and a good GPU for AI algorithms has many gigabytes. One of the important aspects of optimizing an AI engine is how it handles not only the LLM model in CPU RAM, but also its algorithm for uploading the model into GPU VRAM, and managing the back-and-forth between the two types of RAM.
The GPU also has its own caches: a small, fast L1 cache close to each group of GPU compute units, and a larger shared L2 cache sitting between the compute units and the GPU's VRAM. Hence, cache management issues apply at multiple levels, including the CPU RAM cache and the multiple GPU cache levels.
There are various ways to improve the memory efficiency of an AI model by changing the algorithms of the Transformer engines in which it runs. Many of these options are algorithm changes to the underlying C++ kernels, which run in the CPU or GPU, depending on the platform.
Memory optimization has many facets. Some of the possible techniques for general management of the model data in CPU RAM, and when sent to the GPU, include:
- Pipelining and data marshaling algorithms
- Data locality optimizations (e.g. tiled kernels)
- Multi-level cache management
- Prefetching
- Dataflow optimizations
- Partitioning
- Paging/swapping algorithms
- Multi-GPU scheduling optimizations
- Offloading
- Recomputation
An important point is that the VRAM is closely aligned to “CUDA cores” (GPU computation units). So, it would be a mistake to think that to optimize your 3B model you could just copy 12GB of contiguous data into 12GB of VRAM and exploit the GPU in a meaningful way. Rather, each “core” is likely to have control of a portion of the VRAM. The 12GB is going to need to be split up and populated with respect to the GPU primitives and parallelization models native to the card, and the low-level algorithm will need to be aware of that, too.
Pipelining and Data Marshaling: One of the fundamental tasks of the AI engine, or an ML compiler, is to ensure a continual stream of data is uploaded to the GPU VRAM, ready for computation. This is easier said than done!
The GPU is super-fast, and it's hard to keep up. There are many research papers on the various data marshaling algorithms to upload data to keep the GPU pipeline full.
This is also complicated by the fact that the GPU VRAM may not be large enough to contain an entire LLM (if it's a big one), in which case there is an issue with uploading model data to the GPU, and then sometimes doing it again. And the whole area is further complicated in several ways:
(a) multi-query optimizations for the same model on one GPU,
(b) multi-GPU optimizations that parallelize single-query model inference over multiple GPUs, and
(c) all of the above (multi-query over multi-GPU).
Data Locality and Tiling: Data locality is an optimization that we have already examined indirectly in various chapters. Note that data locality is a general technique and can apply to either CPU RAM or GPU VRAM, albeit at different access speeds. The gains from better data locality apply broadly to any cache.
The primary goal is to speed up data processing by using memory addresses that are already in the cache, whether it is the CPU RAM cache or the GPU VRAM cache. Hence, this method speeds up the execution by reducing memory access costs. Some of the ways to improve data locality include:
- Contiguous memory blocks — see prior sections.
- Tensor layouts — see Chapter 23.
- Tiled loop optimizations — see Chapter 15 and the sketch after this list.
- Tiled MatMul/GEMM — see Chapter 34.
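As a concrete illustration of tiled loops, here is a minimal sketch of a tiled matrix transpose (assuming a square matrix whose size is a multiple of the tile size): the tiled version keeps both the reads and the strided writes inside a small block that fits in cache.
const int N = 1024;
const int TILE = 32;   // tile size chosen to fit comfortably in cache

// Naive transpose: the writes stride through memory, roughly one cache miss each.
void transpose_naive(const float* in, float* out)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[j * N + i] = in[i * N + j];
}

// Tiled transpose: all accesses stay within a TILE x TILE block, so cache
// lines are reused many times before being evicted.
void transpose_tiled(const float* in, float* out)
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    out[j * N + i] = in[i * N + j];
}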
Multi-GPU Scheduling Optimizations: One way to address a GPU waiting for memory uploads is to multiplex different inference jobs across multiple GPUs. This has the benefit of increasing throughput of multiple inference calculations in each GPU, with improved cost efficiency by avoiding wasted GPU cycles. However, the latency of a single inference job is not improved by this method.
Offloading and Recomputation: The term “offloading” in Computer Science theory generally refers to a low-power device offloading computation to a separate more powerful device. In the AI context, this would seem to refer to a CPU offloading processing to a GPU. However, the term “offloading” is not really used with this meaning in AI research. Other terms are used such as “hardware acceleration,” “parallelization,” and “vectorization.”
In much AI research, the term “offloading” actually refers to the opposite situation: moving data or computation from the GPU back to the CPU (i.e. from a high-power device down to a low-power device). This type of offloading is done to save memory space, not to speed up parallel processing (although indirectly it does this, too). The “offloading” occurs in the reverse direction and is typically combined with recomputation (also called “rematerialization”).
The recomputation optimization method involves not storing results of a computation that you might need later, but instead waiting until later, and then recomputing them all over again. Hence, recomputation trades time for space and is effectively the opposite of caching and data reuse optimizations, which trade space for time. Recomputation involves doing calculations a second time, which is redundant computation. Hence, this is ideally not something you want to do often, since it involves a lot of extra CPU or GPU time. But it is a technique that can be considered when memory is at a premium, and is sometimes done as a GPU optimization that enables maintaining a large model in the GPU cache.
The overall goal of “offloading-with-recomputation” is to optimize the use of scarce GPU memory: the data (e.g. a tensor) is offloaded out of the GPU to save GPU RAM space, and then later re-sent to the GPU (or recomputed) if it's needed again. When this recomputation optimization is used during training, it is often called “checkpointing” or “gradient checkpointing.”
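A minimal sketch of the recomputation trade-off in C++ (hypothetical types, not a real framework API): an activation can either stay resident or be dropped after use and rebuilt on demand.
#include <vector>
#include <functional>

struct Activation {
    std::vector<float> data;                        // resident copy (may be empty)
    std::function<std::vector<float>()> recompute;  // how to rebuild it if dropped

    const std::vector<float>& get()
    {
        if (data.empty())
            data = recompute();       // rematerialize: pay time to save space
        return data;
    }

    void release()                    // free the memory; keep only the recipe
    {
        std::vector<float>().swap(data);
    }
};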
Transformer Component Memory Optimization
Some of the memory management improvements that are specific to Transformer components and low-level programming include:
- Memory-efficient Transformer components (e.g. Flash Attention)
- Kernel fusion
- KV cache management
- Shared KV caches
KV Cache Management: We have examined the speedup available from KV caching in Chapter 29 (Caching). However, this internal cache also has specific characteristics with regard to its memory requirements. Unlike many other parts of the Transformer, the KV cache's memory requirements can grow and shrink dynamically. There are various research papers on how to optimally manage the KV cache and its memory needs. The KV cache can also be shared across queries and across different computations for the same prompt.
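A minimal sketch of one common KV cache layout (hypothetical structure, assuming a known maximum sequence length): the keys and values are preallocated as contiguous blocks so that appending a token never reallocates or fragments memory.
#include <cstddef>
#include <vector>
#include <algorithm>

struct KVCache {
    int max_tokens, dim;              // dim = num_heads * head_dim
    int used = 0;                     // tokens stored so far
    std::vector<float> keys, values;  // contiguous [token][dim] blocks

    KVCache(int max_tokens, int dim)
        : max_tokens(max_tokens), dim(dim),
          keys(std::size_t(max_tokens) * dim),
          values(std::size_t(max_tokens) * dim) {}

    void append(const float* k, const float* v)   // one new token's K and V
    {
        if (used >= max_tokens) return;           // full: caller must evict or stop
        std::copy(k, k + dim, &keys[std::size_t(used) * dim]);
        std::copy(v, v + dim, &values[std::size_t(used) * dim]);
        ++used;
    }
};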
Kernel Fusion: Kernel fusion is the merging of two sequential operations into one combined operation. Although this is usually done to improve speed, it has the secondary memory benefit of avoiding the need to store the interim calculations after the first operator for re-loading by the second operator. See Chapter 31 for more about kernel fusion.