Aussie AI
36. AI Memory Optimizations
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“The array is stable.”
— Star Trek, 1996.
Why Optimize Memory?
C++ programmers are not especially used to optimizing for memory rather than speed. However, AI engines are one application where memory optimization is a significant part of the whole efficiency puzzle because of the sheer volume of data in the model (“weights”) and the interim calculations (“activations”). There are two problems with today's AI architectures that together create the need for memory optimization:
(a) models are too big, and
(b) GPUs are too fast.
Obviously, it's great to have fast GPUs: their amazing speed has led to the revolution in AI capabilities, and indeed is what allows us to have such huge models. However, the problem is that memory chips haven't kept up with the speed increases in GPU technologies, so AI engines are often memory-bound, rather than CPU-bound (or rather, GPU-bound).
What this means is that the GPU is often sitting there doing nothing, waiting for data to be uploaded from memory. This is sometimes called a “bubble” in the GPU pipeline. And there are various software techniques to pop the bubbles, which is what this chapter is about.
At the top-level, there are two fundamental ways to improve memory efficiency in an AI application:
- Smaller models — model compression
- Memory management — advanced engine algorithms (“kernels”)
And at the bottom-level, there's always the C++ code. It's important to get the fundamentals of memory management right at all levels.
Elements of Memory Optimization
Before we delve into complex AI memory management optimizations like model compression or kernel tiling, let's look at the fundamental building blocks. How do we represent data in memory inside an AI engine to promote memory efficiency?
Some of the lower-level types of memory optimizations include:
- Contiguous memory blocks
- Linearizing multidimensional matrices/tensors
- C++ memory management optimizations
For the last category, there are also a variety of basic C++ programming techniques to minimize CPU memory usage overall. See Chapter 14 for information on C++ memory reduction.
Contiguous Memory Blocks
AI frameworks typically require that tensors are stored in contiguous memory, whether in CPU RAM or GPU VRAM. A critical part of optimizing AI engines is to manage the storage of data in contiguous memory blocks, so that it occupies a sequential address space with high data locality.
Processing chunks of data in parallel is the main optimization used in both GPU and CPU SIMD acceleration. All of the vectors, matrices, and tensors need their underlying data in a block. This applies to both pre-computed static weights and dynamic interim activation results in the AI engine's computations.
Processing data that is in adjacent addresses is much faster than jumping all over the place. Vectors should obviously be stored in a simple contiguous array of memory. Less obviously, similar comments apply to the memory storage of matrices and tensors. The use of contiguous memory is an important optimization for both sequential and parallel algorithms. The reasons that memory blocks are more efficient include:
- Data locality (cache hits)
- Data block GPU uploads (model weights from memory-to-cache)
- Predictive cache pipelining (in CPU sequential accesses)
Data locality refers to using data in the same or similar address locations. This is helpful for the cache hit rate because data that is already in the cache is much faster to access than a non-cached RAM memory address.
GPU uploads from CPU RAM to the GPU's RAM (VRAM) are done in blocks. Obviously, we don't want to be uploading random bits of data from different parts of the RAM.
Non-GPU architectures also benefit from the use of contiguous memory. This is obviously true of CPU SIMD instructions (e.g. AVX on x86), but even in sequential execution, the CPU has its own RAM caching methods and often has other optimizations of memory accesses. Predictive cache pipelining is where the CPU attempts to predict the next memory location and loads it ahead of time in a pipelined manner, before being asked. This pipelining of memory accesses is much faster than waiting for each address lookup to complete before starting the next.
Typically, predictive cache pipelining uses the simple heuristic that the next memory address is the most likely next request, which assumes that data is being processed in order of the addresses. Hence, scanning an array in reverse order can defeat this prefetching on such CPUs. Similarly, jumping around to different memory addresses, such as scanning the column of a matrix using a large “stride,” is also inefficient.
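As a small illustration (not from the book's engine code), summing a matrix row-by-row touches memory sequentially, whereas the column-by-column version strides through memory and defeats the cache, even though both loops do identical arithmetic:
#include <cstddef>

const size_t ROWS = 1024, COLS = 1024;
static float m[ROWS][COLS];   // stored row-major: one contiguous block

float sum_row_major()         // sequential addresses: cache-friendly
{
    float sum = 0.0f;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            sum += m[i][j];
    return sum;
}

float sum_column_major()      // stride of COLS floats per access: cache-hostile
{
    float sum = 0.0f;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            sum += m[i][j];
    return sum;
}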
Fast Memory Block Operations
The slow way to do things in arrays is one element at a time. The faster way is to use the standard memory block functions on the whole array. There are a number of standard functions that operate on array data or memory blocks and they are very fast. Memory block operations in the standard C++ libraries are implemented using fast assembly language behind the scenes. The main functions in the standard C++ library that operate on binary bytes in a memory block are:
- memset — set bytes to a value, or clear bytes to zero.
- memcpy — copy bytes in a block (non-overlapping).
- memmove — copy bytes, allowing overlapping blocks.
- memcmp — compare two blocks of bytes.
- memchr — search for a byte in a block.
Note that unlike the standard string functions (such as strlen), these functions do not assume a block is null-terminated by a zero byte. Zero is simply a binary value, and these functions don't stop at a zero byte. All of these functions operate on a block of memory with a fixed byte length.
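For example, a block of floats usually contains many zero bytes (every 0.0f is four zero bytes), but the block functions ignore them and process the full length they are given. A tiny sketch with a hypothetical helper name:
#include <cstring>

// Byte-wise equality test of two float vectors of length n.
// memcmp compares all n * sizeof(float) bytes; unlike strcmp, it does not
// stop at the zero bytes inside 0.0f values.
bool vectors_equal_bytes(const float* v1, const float* v2, int n)
{
    return memcmp(v1, v2, n * sizeof(float)) == 0;
}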
Each compiler environment typically offers some extra non-standard byte-wise functions that are also fast. Some of the less standardized C++ intrinsics that operate on memory blocks include:
- _memccpy — copy bytes up to a specified sentinel byte.
- memicmp/_memicmp — compare bytes ignoring letter case.
- bcopy — copy a block of bytes.
- bzero — clear a block of bytes to zero.
- bcmp — compare bytes in two blocks.
- _byteswap_uint64 — swap the bytes of an integer (Microsoft).
- __builtin_bswap16 — swap the bytes in an integer (GCC). There are versions for 32-bit and 64-bit.
Initialize memory blocks with memset
The memset function sets all of a memory block to a byte value. It is widely used as a fast way to initialize a block of memory to all zeros.
memset(&x, 0, sizeof(x));
Almost all usages of memset will be for the zero byte. The only other usage I've seen is to fill memory with a dummy non-zero byte as a form of mutation testing to catch uses of uninitialized memory.
memset(&x, 0x55, sizeof(x));
memset sizeof problem.
Here's a common glitch in using memset inside functions:
void zero_array(int arr[10])
{
    memset(arr, 0, sizeof(arr));  // Bug
}
The problem is not memset, but the sizeof operator on function parameters. An array parameter in a function is like a hologram and isn't really there. It's not really an array, but a pointer, so sizeof applied to the parameter gives sizeof(int*) rather than sizeof(int[10]). Hence, sizeof(arr) is probably only 4 or 8, rather than 40 or 80, leaving most of the array uninitialized.
Personally, I recommend a memset debug wrapper function to catch this kind of problem at runtime, or maybe a tricky preprocessor macro can detect it at compile-time with a static_assert somehow.
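As one possible approach, here is a minimal sketch of such a runtime wrapper (the names, the DEBUG_MEMSET flag, and the pointer-sized threshold are all assumptions for illustration): it flags any call whose byte count is no larger than a pointer, which is the usual symptom of sizeof applied to an array parameter.
#include <cstddef>
#include <cstring>
#include <cstdio>

inline void* memset_debug(void* dest, int ch, size_t count,
                          const char* file, int line)
{
    if (count <= sizeof(void*)) {   // 4 or 8 bytes: probably sizeof(pointer)
        fprintf(stderr, "MEMSET WARNING: only %zu bytes (%s:%d)\n",
                count, file, line);
    }
    return memset(dest, ch, count);
}

// In debug builds, route plain memset calls through the wrapper.
// (Redefining a standard library name is a hack; use it only in debug code.)
#ifdef DEBUG_MEMSET
#define memset(dst, ch, n) memset_debug((dst), (ch), (n), __FILE__, __LINE__)
#endif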
memset portability issue.
Even though it's a fast zeroing method, the use of memset to zero bytes has an obscure portability problem on any architecture where the all-bytes-zero representation is not the zero value for every data type. However, on most standard platforms, all-bytes-zero is correct for all types: integer zero (regardless of endianness), floating-point zero (positive zero is all bits zero), and the null pointer. Zero portability is covered in more detail in Chapter 38.
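On a typical platform you can sanity-check this assumption with a throwaway test like the following (a sketch only; the asserts hold on mainstream IEEE 754 hardware, but the null pointer case is not guaranteed by the C++ standard):
#include <cstring>
#include <cassert>

void check_all_bytes_zero()
{
    float f;
    memset(&f, 0, sizeof(f));   // all bytes zero
    assert(f == 0.0f);          // positive zero on IEEE 754 platforms

    int* p;
    memset(&p, 0, sizeof(p));   // all bytes zero
    assert(p == nullptr);       // true on mainstream platforms, not guaranteed
}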
Fast memory block copying with memcpy
The fast way to copy an entire memory block is with memcpy. This applies to copying vectors, and to matrices and tensors that are linearized into a contiguous array. Rather than copy each element of an array, one at a time, in a loop, the memcpy standard library function can be used to copy the entire block in one statement:
memcpy(destarr, srcarr, sizeof(srcarr));  // srcarr must be a real array here, not a pointer
Note that this is a bitwise copy of the array intended for simple data types. For example, it won't run copy constructors if applied to an array of objects.
The memcpy function does a very fast memory block copy. It is like strcpy in that the destination is the first parameter, but memcpy will copy everything, even null bytes and hidden padding bytes, and will always copy a fixed number of bytes. memcpy is a super-fast byte copy, but it is unsafe, because it does not have well-defined behavior if the source and destination blocks overlap.
memcpy overlapping blocks error:
The only downside with memcpy is that it can fail with overlapping ranges for the source and destination blocks, so if you are shuffling arrays up or down one element using memcpy, then you have to be careful, because the results on overlapping ranges are undefined.
Here's a buggy example of using memcpy to remove the first character of a string in place:
memcpy(s, s+1, strlen(s+1)+1); // Bug
The problem is that the blocks starting at “s” and “s+1” overlap. The behavior is undefined, so whether this code happens to work depends on the platform's memcpy implementation.
The fix is simply to use memmove, which always works correctly for overlaps:
memmove(s, s+1, strlen(s+1)+1); // Correct
The memmove function is a safer version of memcpy, which also works correctly if the memory blocks overlap. If the source and destination blocks don't overlap, it's the same as memcpy, except probably slightly slower. If they do overlap, then memmove conceptually will copy the source to a temporary area, and then copy it to the destination block.
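The same rule applies to the “shuffling arrays up or down one element” case mentioned above. A minimal sketch (hypothetical helper, for illustration): deleting the first element of a float array in place copies overlapping ranges, so memmove is the right tool.
#include <cstring>

// Remove the first element of arr (length n), shifting the rest down by one.
// The source (arr + 1) and destination (arr) overlap, so memmove is required.
void remove_first(float arr[], int n)
{
    if (n > 1) {
        memmove(arr, arr + 1, (n - 1) * sizeof(float));
    }
}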
memcmp byte comparisons
The memcmp function does a byte-wise comparison of a memory block. Its return value is like strcmp, returning 0 for equality, and a negative or positive value otherwise. However, unlike strcmp, memcmp will not stop when it finds a zero byte.
memcmp return value.
A pitfall with memcmp is that you cannot assume that it returns 1 or -1, but must compare the return result to zero (like the strcmp function).
if (memcmp(&a, &b, sizeof(a)) == 1)  // Bug
if (memcmp(&a, &b, sizeof(a)) > 0)   // Correct
memcmp object equality testing.
Looking at the memcmp function, you might think of it as an opportunity to do a fast equality/inequality test on large objects by simply doing a byte-wise test. You would not be the first to think that. Unfortunately, there are several obscure pitfalls with this approach. There are multiple ways in C++ that the data in two memory blocks might be logically identical, but the bytes different:
- Padding bytes (alignment)
- Two types of floating-point zero
- Multiple types of floating-point NaN
- Bitfield data members (unset padding bits)
You can only use memcmp for memory block comparisons (e.g. tensor equality) if you're sure that these situations cannot occur, such as a contiguous array of primitive data types. Not recommended otherwise.
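Here is a small illustration of the padding and signed-zero pitfalls, using a hypothetical struct (the names are made up for this example):
#include <cstring>

struct Item {
    char tag;      // 1 byte, typically followed by 3 padding bytes
    float value;   // 0.0f and -0.0f compare equal but have different bytes
};

bool equal_memberwise(const Item& a, const Item& b)
{
    return a.tag == b.tag && a.value == b.value;   // reliable
}

bool equal_bytewise(const Item& a, const Item& b)
{
    // Risky: padding bytes are indeterminate, and distinct byte patterns
    // (negative zero, different NaNs) can represent logically equal values.
    return memcmp(&a, &b, sizeof(Item)) == 0;
}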
Linearized Multi-Dimensional Arrays
Contiguous memory blocks are efficient and easy to vectorize, as we found out in the previous sections. Further gains are possible if we can represent the higher-order logic of the AI model in matrices and tensors, but retain the power of sequential address spaces underneath. Rather than thinking of a matrix as rows and columns, or higher dimensions for tensors, we should still think of them as one big chunk of data.
Using dynamic memory to store multi-dimensional arrays is a difficult problem to solve. One idea is to allocate a one-dimensional array, and then “map” a two-dimensional structure onto it by using arithmetic with the index offsets. This is what the compiler does behind the scenes for the built-in C++ multi-dimensional arrays.
The main difficulty with multi-dimensional arrays of primitive types (e.g. float) is that “new[]” or malloc/calloc return blocks of memory with no structure, and it is difficult to impose the structure of a multi-dimensional array. It is only legal to use one dimension of indexing on the pointer to this block of memory. For example, if we wish to declare an array similar to arr[5][3] on the heap, we cannot allocate 15 integers, pass the address to a pointer, and then use two levels of brackets on the pointer variable such as ptr[i][j]. Instead, it is necessary to map the two indices, x and y, to a single index, i. For the array arr[5][3] the mapping is:
i = x * 3 + y;
val = ptr[i];
Note that this indexing uses pointer arithmetic in units of the array's element type, not bytes.
More generally, a macro for defining the mapping for two-dimensional arrays, using the indices and the number of rows and columns, is shown below:
#define MAP2(x1, x2, size1, size2)  ((x1) * (size2) + (x2))
Note that the size1 macro parameter is not actually used, but is left there to avoid confusion (or maybe to create it). The size of the first dimension is not needed in any linearized index calculations.
The MAP2 macro can be used for multi-dimensional dynamic arrays, as below. Unfortunately, the MAP2 macro must be used for every array reference. The code fragment below allocates a two-dimensional dynamic array and then uses the MAP2 macro to set all elements to zero.
int num_rows = 5, num_columns = 3;
int *p = (int*)calloc(num_rows * num_columns, sizeof(int));
for (int i = 0; i < num_rows; i++)
    for (int j = 0; j < num_columns; j++)
        p[MAP2(i, j, num_rows, num_columns)] = 0;  // p[i][j] = 0
Similarly, macros can be defined for mapping three-dimensional and higher dimensional arrays to the one-dimensional array index. The macros for dimensions three and four are shown below, and macros for higher dimensions can be devised by following the pattern.
#define MAP3(x1, x2, x3, size1, size2, size3) \
    ((x1) * (size2) * (size3) + (x2) * (size3) + (x3))
#define MAP4(x1, x2, x3, x4, size1, size2, size3, size4) \
    ((x1) * (size2) * (size3) * (size4) + (x2) * (size3) * (size4) \
     + (x3) * (size4) + (x4))
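If the macro style feels fragile, an inline function computes the same linearized offsets with type checking and optional bounds assertions (one possible formulation, not the book's required approach):
#include <cassert>

// Linearized offset for a conceptual arr[size1][size2] stored in one block.
// As with MAP2, the size of the first dimension is only needed for checking.
inline int map2(int x1, int x2, int size1, int size2)
{
    assert(x1 >= 0 && x1 < size1 && x2 >= 0 && x2 < size2);
    return x1 * size2 + x2;
}

// Same idea for a three-dimensional tensor: (x1*size2 + x2)*size3 + x3.
inline int map3(int x1, int x2, int x3, int size1, int size2, int size3)
{
    assert(x1 < size1 && x2 < size2 && x3 < size3);
    return (x1 * size2 + x2) * size3 + x3;
}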
Model Compression
Model compression is the general class of optimizations that “compress” a model down to a smaller size. The goal is usually both memory and speed optimization via a smaller model that requires fewer operations and/or lower-precision arithmetic. Some techniques work after training, and some model compression methods require a brief re-training or fine-tuning followup.
One key point about reducing model size is that the number of weights is usually directly correlated with: (a) the number of arithmetic operations at runtime, and (b) the total bytes of memory-to-cache data transfers needed. Hence, shrinking a model can proportionally reduce time cost, which is not true of all space optimizations.
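A rough back-of-the-envelope sketch (illustrative figures only, assuming a hypothetical 3-billion-parameter model) shows why fewer or smaller weights cut memory traffic proportionally:
#include <cstdio>

int main()
{
    const double params = 3e9;                     // e.g. a 3B-parameter model
    printf("FP32: %.0f GB\n", params * 4 / 1e9);   // ~12 GB of weights
    printf("FP16: %.0f GB\n", params * 2 / 1e9);   // ~6 GB with 16-bit floats
    printf("INT8: %.0f GB\n", params * 1 / 1e9);   // ~3 GB with 8-bit quantization
    return 0;
}
Every gigabyte saved is a gigabyte that never has to be streamed from RAM into the cache or uploaded into GPU VRAM.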
We've already examined a lot of the possible ways to make a smaller model in earlier chapters. The most popular types of model compression are:
- Quantization
- Pruning
- Distillation
There are also a number of other less well-known types of model compression:
- Weight sharing (parameter sharing)
- Layer fusion
- Weight clustering
- Sparsity (not only via pruning)
- Low-rank matrices
There are also several ensemble multi-model architectures that offer memory efficiency by having at least one small model in the mix:
- Mixture-of-experts
- Big-little models
- Speculative decoding
All of these model compression techniques are discussed in separate chapters. Whenever fewer computations are required, there is also the additional benefit that fewer memory transfers will be required for the data at runtime.
GPU Memory Management
The GPU has its own memory, and a good GPU for AI algorithms has many gigabytes. One of the important aspects of optimizing an AI engine is how it handles not only the LLM model in CPU RAM, but also its algorithm for uploading the model into GPU VRAM, and managing the back-and-forth between the two types of RAM.
The GPU also has its own caches: a small, fast L1 cache close to each group of GPU compute units, and a larger shared L2 cache sitting between the compute units and the GPU's VRAM. Hence, cache management issues apply at multiple levels, including the CPU RAM cache and the multiple GPU cache levels.
There are various ways to improve the memory efficiency of an AI model by changing the algorithms of the Transformer engines in which it runs. Many of these options are algorithm changes to the underlying C++ kernels, which run in the CPU or GPU, depending on the platform.
Memory optimization has many facets. Some of the possible techniques for general management of the model data in CPU RAM, and when sent to the GPU, include:
- Pipelining and data marshaling algorithms
- Data locality optimizations (e.g. tiled kernels)
- Multi-level cache management
- Prefetching
- Dataflow optimizations
- Partitioning
- Paging/swapping algorithms
- Multi-GPU scheduling optimizations
- Offloading
- Recomputation
An important point is that the VRAM is closely aligned to “CUDA cores” (GPU computation units). So, it would be a mistake to think that to optimize your 3B model you could just copy 12GB of contiguous data into 12GB of VRAM and exploit the GPU in a meaningful way. Rather, each “core” is likely to have control of a portion of the VRAM. The 12GB is going to need to be split up and populated with respect to the GPU primitives and parallelization models native to the card, and the low-level algorithm will need to be aware of that, too.
Pipelining and Data Marshaling: One of the fundamental tasks of the AI engine, or an ML compiler, is to ensure a continual stream of data is uploaded to the GPU VRAM, ready for computation. This is easier said than done!
The GPU is super-fast, and it's hard to keep up. There are many research papers on the various data marshaling algorithms to upload data to keep the GPU pipeline full.
This is also complicated by the fact that the GPU VRAM may not be large enough to contain an entire LLM (if it's a big one), in which case there is an issue with uploading model data to the GPU, and then sometimes doing it again. And the whole area is further complicated in several ways:
(a) multi-query optimizations for the same model on one GPU,
(b) multi-GPU optimizations that parallelize single-query model inference over multiple GPUs, and
(c) all of the above (multi-query over multi-GPU).
Data Locality and Tiling: Data locality is an optimization that we have already examined indirectly in various chapters. Note that data locality is a general technique and can apply to either CPU RAM or GPU VRAM, albeit at different access speeds. The gains from better data locality apply broadly to any cache.
The primary goal is to speed up data processing by using memory addresses that are already in the cache, whether it is the CPU RAM cache or the GPU VRAM cache. Hence, this method speeds up the execution by reducing memory access costs. Some of the ways to improve data locality include:
- Contiguous memory blocks — see prior sections.
- Tensor layouts — see Chapter 23.
- Tiled loop optimizations — see Chapter 15 and the sketch after this list.
- Tiled MatMul/GEMM — see Chapter 34.
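As a concrete illustration of tiled loops, here is a minimal sketch of a tiled matrix transpose (assuming a square matrix whose size is a multiple of the tile size): the tiled version keeps both the reads and the strided writes inside a small block that fits in cache.
const int N = 1024;
const int TILE = 32;   // tile size chosen to fit comfortably in cache

// Naive transpose: the writes stride through memory, roughly one cache miss each.
void transpose_naive(const float* in, float* out)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            out[j * N + i] = in[i * N + j];
}

// Tiled transpose: all accesses stay within a TILE x TILE block, so cache
// lines are reused many times before being evicted.
void transpose_tiled(const float* in, float* out)
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    out[j * N + i] = in[i * N + j];
}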
Multi-GPU Scheduling Optimizations: One way to address a GPU waiting for memory uploads is to multiplex different inference jobs across multiple GPUs. This has the benefit of increasing throughput of multiple inference calculations in each GPU, with improved cost efficiency by avoiding wasted GPU cycles. However, the latency of a single inference job is not improved by this method.
Offloading and Recomputation: The term “offloading” in Computer Science theory generally refers to a low-power device offloading computation to a separate more powerful device. In the AI context, this would seem to refer to a CPU offloading processing to a GPU. However, the term “offloading” is not really used with this meaning in AI research. Other terms are used such as “hardware acceleration,” “parallelization,” and “vectorization.”
In much AI research, the term “offloading” actually refers to the opposite situation: moving data or computation from the GPU back to the CPU (i.e. from a high-power device down to a low-power device). This type of offloading is done to save memory space, not to speed up parallel processing (although indirectly it does this, too). The “offloading” occurs in the reverse direction and is typically combined with recomputation (also called “rematerialization”).
The recomputation optimization method involves not storing results of a computation that you might need later, but instead waiting until later, and then recomputing them all over again. Hence, recomputation trades time for space and is effectively the opposite of caching and data reuse optimizations, which trade space for time. Recomputation involves doing calculations a second time, which is redundant computation. Hence, this is ideally not something you want to do often, since it involves a lot of extra CPU or GPU time. But it is a technique that can be considered when memory is at a premium, and is sometimes done as a GPU optimization that enables maintaining a large model in the GPU cache.
The overall goal of “offloading-with-recomputation” is to optimize the use of scarce GPU memory: the data (e.g. a tensor) is offloaded out of the GPU to save GPU RAM space, and then later re-sent to the GPU (or recomputed) if it's needed again. When this recomputation optimization is used during training, it is often called “checkpointing” or “gradient checkpointing.”
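A minimal sketch of the recomputation trade-off in C++ (hypothetical types, not a real framework API): an activation can either stay resident or be dropped after use and rebuilt on demand.
#include <vector>
#include <functional>

struct Activation {
    std::vector<float> data;                        // resident copy (may be empty)
    std::function<std::vector<float>()> recompute;  // how to rebuild it if dropped

    const std::vector<float>& get()
    {
        if (data.empty())
            data = recompute();       // rematerialize: pay time to save space
        return data;
    }

    void release()                    // free the memory; keep only the recipe
    {
        std::vector<float>().swap(data);
    }
};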
Transformer Component Memory Optimization
Some of the memory management improvements that are specific to Transformer components and low-level programming include:
- Memory-efficient Transformer components (e.g. Flash Attention)
- Kernel fusion
- KV cache management
- Shared KV caches
KV Cache Management: We have examined the speedup available from KV caching in Chapter 29 (Caching). However, this internal cache also has specific characteristics with regard to its memory requirements. Unlike many other parts of the Transformer, the KV cache's memory requirements can grow and shrink dynamically. There are various research papers on how to optimally manage the KV cache and its memory needs. The KV cache can also be shared across queries and across different computations for the same prompt.
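A minimal sketch of one common KV cache layout (hypothetical structure, assuming a known maximum sequence length): the keys and values are preallocated as contiguous blocks so that appending a token never reallocates or fragments memory.
#include <cstddef>
#include <vector>
#include <algorithm>

struct KVCache {
    int max_tokens, dim;              // dim = num_heads * head_dim
    int used = 0;                     // tokens stored so far
    std::vector<float> keys, values;  // contiguous [token][dim] blocks

    KVCache(int max_tokens, int dim)
        : max_tokens(max_tokens), dim(dim),
          keys(std::size_t(max_tokens) * dim),
          values(std::size_t(max_tokens) * dim) {}

    void append(const float* k, const float* v)   // one new token's K and V
    {
        if (used >= max_tokens) return;           // full: caller must evict or stop
        std::copy(k, k + dim, &keys[std::size_t(used) * dim]);
        std::copy(v, v + dim, &values[std::size_t(used) * dim]);
        ++used;
    }
};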
Kernel Fusion: Kernel fusion is the merging of two sequential operations into one combined operation. Although this is usually done to improve speed, it has the secondary memory benefit of avoiding the need to store the interim calculations after the first operator for re-loading by the second operator. See Chapter 31 for more about kernel fusion.