GPU Memory Management
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
The GPU has its own memory, and a good GPU for AI workloads has many gigabytes of it. An important aspect of optimizing an AI engine is how it manages not only the LLM model in CPU RAM, but also the uploading of that model into GPU VRAM, and the back-and-forth between the two types of memory.
The GPU also has its own caches: a small, fast L1 cache on each compute unit, and a larger L2 cache shared across the whole GPU, which sits between the cores and the off-chip VRAM. Hence, cache management issues apply at multiple levels, including the CPU's caches over main RAM and the GPU's cache levels over VRAM.
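As a quick way to see these levels on a given card, the CUDA runtime can report the sizes of VRAM, the L2 cache, and the per-block shared memory (which lives at the L1 level). This is a minimal sketch assuming the CUDA toolkit is installed; it queries only device 0 and does minimal error handling.

    // Sketch: query a GPU's memory hierarchy sizes via the CUDA runtime API.
    // Assumes a CUDA-capable device 0 and the CUDA toolkit.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, /*device=*/0) != cudaSuccess) {
            std::fprintf(stderr, "No CUDA device found\n");
            return 1;
        }
        std::printf("GPU: %s\n", prop.name);
        std::printf("VRAM (global memory): %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
        std::printf("L2 cache: %d KB\n", prop.l2CacheSize / 1024);
        std::printf("Shared memory per block (L1 level): %zu KB\n", prop.sharedMemPerBlock / 1024);
        return 0;
    }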
There are various ways to improve the memory efficiency of an AI model by changing the algorithms of the Transformer engines in which it runs. Many of these options are algorithm changes to the underlying C++ kernels, which run in the CPU or GPU, depending on the platform.
Memory optimization has many facets. Some of the possible techniques for general management of the model data in CPU RAM, and when sent to the GPU, include:
- Pipelining and data marshaling algorithms
- Data locality optimizations (e.g. tiled kernels)
- Multi-level cache management
- Prefetching
- Dataflow optimizations
- Partitioning
- Paging/swapping algorithms
- Multi-GPU scheduling optimizations
- Offloading
- Recomputation
An important point is that VRAM usage is closely tied to the “CUDA cores” (GPU computation units) that consume it. So, it would be a mistake to think that to optimize your 3B model you could just copy 12GB of contiguous data into 12GB of VRAM and exploit the GPU in a meaningful way. Although VRAM is global memory visible to all of the cores, the 12GB is going to need to be split up and laid out with respect to the GPU primitives and parallelization model native to the card, and the low-level algorithm will need to be aware of that layout too.
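As a rough illustration of splitting the upload (one possible approach, not a specific engine design, and assuming the standard CUDA runtime API), the model can be copied layer by layer into separate device buffers that the kernels will later consume, rather than as one monolithic copy. The LayerWeights struct here is a hypothetical stand-in for a real weight layout.

    // Sketch: upload model weights layer by layer into separate GPU buffers,
    // so the VRAM layout matches how the kernels will consume the data.
    // The LayerWeights struct is hypothetical; error handling is minimal.
    #include <cstddef>
    #include <vector>
    #include <cuda_runtime.h>

    struct LayerWeights {
        const float* host_data;             // this layer's weights in CPU RAM
        size_t       count;                 // number of floats in this layer
        float*       device_data = nullptr; // destination buffer in GPU VRAM
    };

    bool upload_model(std::vector<LayerWeights>& layers) {
        for (auto& layer : layers) {
            size_t bytes = layer.count * sizeof(float);
            if (cudaMalloc(reinterpret_cast<void**>(&layer.device_data), bytes) != cudaSuccess)
                return false;
            if (cudaMemcpy(layer.device_data, layer.host_data, bytes,
                           cudaMemcpyHostToDevice) != cudaSuccess)
                return false;
        }
        return true;  // each kernel launch can now be pointed at its own layer buffer
    }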
Pipelining and Data Marshaling: One of the fundamental tasks of the AI engine, or an ML compiler, is to ensure a continual stream of data is uploaded to the GPU VRAM, ready for computation. This is easier said than done!
The GPU is super-fast, and it's hard to keep up. There are many research papers on the various data marshaling algorithms to upload data to keep the GPU pipeline full.
This is also complicated by the fact that the GPU VRAM may not be large enough to hold an entire LLM (if it's a big one), in which case model data must be uploaded to the GPU in pieces, and sometimes re-uploaded later. And the whole area is further complicated in several ways:
(a) multi-query optimizations for the same model on one GPU,
(b) multi-GPU optimizations that parallelize single-query model inference over multiple GPUs, and
(c) all of the above (multi-query over multi-GPU).
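To make the basic pipelining idea concrete, here is a minimal double-buffering sketch, assuming the standard CUDA runtime API: weight chunks are staged into pinned (page-locked) host memory and copied asynchronously on two alternating streams, so the upload of the next chunk can overlap with compute on the current one. The chunk size, the device buffers, and the commented-out kernel launch are placeholders.

    // Sketch: double-buffered chunk uploads with pinned host memory and two streams,
    // overlapping the copy of the next chunk with compute on the current one.
    #include <cstring>
    #include <cuda_runtime.h>

    const size_t kChunkBytes = 64 * 1024 * 1024;  // 64MB chunks (example size)

    void stream_chunks(const char* model_data, size_t total_bytes,
                       void* device_buf[2], cudaStream_t stream[2]) {
        char* pinned[2];   // pinned (page-locked) staging buffers for true async copies
        cudaMallocHost(reinterpret_cast<void**>(&pinned[0]), kChunkBytes);
        cudaMallocHost(reinterpret_cast<void**>(&pinned[1]), kChunkBytes);

        size_t offset = 0;
        int buf = 0;
        while (offset < total_bytes) {
            size_t n = (total_bytes - offset < kChunkBytes) ? (total_bytes - offset) : kChunkBytes;
            std::memcpy(pinned[buf], model_data + offset, n);      // stage into pinned RAM
            cudaMemcpyAsync(device_buf[buf], pinned[buf], n,
                            cudaMemcpyHostToDevice, stream[buf]);  // async upload
            // launch_kernel_on(device_buf[buf], n, stream[buf]);  // hypothetical compute step
            buf = 1 - buf;                       // switch to the other buffer/stream;
            cudaStreamSynchronize(stream[buf]);  // its previous work must finish before reuse
            offset += n;
        }
        cudaStreamSynchronize(stream[0]);
        cudaStreamSynchronize(stream[1]);
        cudaFreeHost(pinned[0]);
        cudaFreeHost(pinned[1]);
    }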
Data Locality and Tiling: Data locality is an optimization that we have already examined indirectly in various chapters. Note that data locality is a general technique and can apply to either CPU RAM or GPU VRAM, albeit at different access speeds. The gains from better data locality apply broadly to any cache.
The primary goal is to speed up data processing by accessing memory addresses that are already in a cache, whether that is the CPU's cache over main RAM or the GPU's caches over VRAM. Hence, this method speeds up execution by reducing memory access costs. Some of the ways to improve data locality include:
- Contiguous memory blocks — see prior sections.
- Tensor layouts — see Chapter 23.
- Tiled loop optimizations — see Chapter 15.
- Tiled MatMul/GEMM — see Chapter 34.
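As a simple illustration of the tiled loop and tiled MatMul items above, here is a plain C++ sketch of a blocked matrix multiply on the CPU. A GPU tiled GEMM kernel applies the same idea using shared memory, but the CPU version shows the data locality effect on its own. The tile size is an assumption to be tuned for the target cache.

    // Sketch: tiled (blocked) matrix multiply, C += A * B, all row-major n x n.
    // Working on TILE x TILE sub-blocks keeps the active data small enough to
    // stay in cache, instead of streaming whole rows and columns on every pass.
    #include <cstddef>

    const size_t TILE = 64;  // example tile size; tune for the target cache

    void matmul_tiled(const float* A, const float* B, float* C, size_t n) {
        // assumes n is a multiple of TILE, and C is zero-initialized by the caller
        for (size_t i0 = 0; i0 < n; i0 += TILE)
            for (size_t k0 = 0; k0 < n; k0 += TILE)
                for (size_t j0 = 0; j0 < n; j0 += TILE)
                    for (size_t i = i0; i < i0 + TILE; ++i)
                        for (size_t k = k0; k < k0 + TILE; ++k) {
                            float a = A[i * n + k];
                            for (size_t j = j0; j < j0 + TILE; ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }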
Multi-GPU Scheduling Optimizations: One way to address a GPU stalling while waiting for memory uploads is to multiplex different inference jobs across multiple GPUs. This has the benefit of increasing the overall throughput of multiple inference calculations, with improved cost efficiency by avoiding wasted GPU cycles. However, the latency of a single inference job is not improved by this method.
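A minimal sketch of the scheduling idea, assuming the CUDA runtime API: independent inference jobs are handed out to GPUs in round-robin order. The InferenceJob struct and the run_inference_on_current_device() stub are hypothetical placeholders for the engine's real per-request path, which would normally run on worker threads rather than a single loop.

    // Sketch: round-robin dispatch of independent inference jobs across GPUs.
    // InferenceJob and run_inference_on_current_device() are hypothetical stubs.
    #include <vector>
    #include <cuda_runtime.h>

    struct InferenceJob { /* prompt, output buffer, etc. */ };

    void run_inference_on_current_device(const InferenceJob& job) {
        (void)job;  // placeholder for the engine's per-request inference path
    }

    void dispatch_round_robin(const std::vector<InferenceJob>& jobs) {
        int num_gpus = 0;
        cudaGetDeviceCount(&num_gpus);
        if (num_gpus <= 0) return;
        for (size_t i = 0; i < jobs.size(); ++i) {
            cudaSetDevice(static_cast<int>(i % num_gpus));  // next GPU in the rotation
            run_inference_on_current_device(jobs[i]);       // normally on a worker thread
        }
    }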
Offloading and Recomputation: The term “offloading” in Computer Science theory generally refers to a low-power device offloading computation to a separate more powerful device. In the AI context, this would seem to refer to a CPU offloading processing to a GPU. However, the term “offloading” is not really used with this meaning in AI research. Other terms are used such as “hardware acceleration,” “parallelization,” and “vectorization.”
In much AI research, the term “offloading” actually refers to the opposite situation: moving data or computation from the GPU back to the CPU (i.e. from a high-power device down to a low-power device). This type of offloading is done to save memory space, not to speed up parallel processing (although indirectly it can do that, too). The offloading occurs in the reverse direction and is often combined with recomputation (also called “rematerialization”).
The recomputation optimization involves not storing the results of a computation that you might need later, but instead waiting and recomputing them when they are actually needed. Hence, recomputation trades time for space, and is effectively the opposite of caching and data reuse optimizations, which trade space for time. Recomputation means doing calculations a second time, which is redundant computation. Hence, it is ideally not something you want to do often, since it costs a lot of extra CPU or GPU time. But it is a technique worth considering when memory is at a premium, and it is sometimes used as a GPU optimization that allows a large model to stay resident in GPU VRAM.
The overall goal of “offloading-with-recomputation” is to cope with limited GPU memory: data (e.g. a tensor) is offloaded out of the GPU to free VRAM space, and then later re-sent to the GPU (or recomputed) if it is needed again. When this recomputation optimization is used during training, it is often called “checkpointing” or “gradient checkpointing.”
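A minimal sketch of the recomputation trade-off in plain C++ (C++17): a value such as an activation tensor is either kept cached in memory or released and recomputed on demand. The compute function is a stand-in for a real layer forward pass.

    // Sketch: trading time for space. The cached value (e.g. an activation tensor)
    // can be released to free memory, and get() will recompute it when needed again.
    #include <functional>
    #include <optional>
    #include <utility>
    #include <vector>

    class RecomputableTensor {
    public:
        explicit RecomputableTensor(std::function<std::vector<float>()> compute)
            : compute_(std::move(compute)) {}

        const std::vector<float>& get() {
            if (!cached_) {
                cached_ = compute_();   // compute for the first time, or recompute
            }
            return *cached_;
        }

        void release() {                // free the memory now; pay recompute time later
            cached_.reset();
        }

    private:
        std::function<std::vector<float>()> compute_;
        std::optional<std::vector<float>> cached_;
    };

In a GPU engine, release() would correspond to freeing or offloading the device-side tensor, and get() to re-uploading or recomputing it when the data is needed again.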