Aussie AI

Memory Optimization

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Memory optimization involves using less memory during model inference. This means that inference requires fewer resources, and it can also reduce CPU usage because less data is swapped in and out of memory. Memory optimization can refer to either CPU memory or GPU memory.

Some research reports that model inference is memory-bound rather than CPU-bound. In such cases, memory management is key to improving latency and throughput. On the other hand, researchers have also examined increasing memory usage to save time by caching and computation reuse.

Model Compression Techniques

The main class of optimizations that reduce memory requirements by making the model smaller is called "model compression". Model compression includes sub-strategies such as:

  • Quantization
  • Pruning
  • Knowledge distillation
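
As a minimal illustration of the memory savings from one of these techniques, the following sketch applies symmetric per-tensor INT8 post-training quantization to a single weight matrix in PyTorch (the tensor shape and names are illustrative, not taken from any particular model):

    import torch

    def quantize_int8(weights: torch.Tensor):
        # Symmetric per-tensor INT8 quantization: map the largest magnitude to 127.
        scale = weights.abs().max() / 127.0
        q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
        return q, scale

    def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Approximate reconstruction of the original FP32 weights.
        return q.to(torch.float32) * scale

    w = torch.randn(4096, 4096)        # hypothetical FP32 weight matrix: 64 MB
    q, scale = quantize_int8(w)        # INT8 storage: 16 MB, a 4x memory reduction
    w_approx = dequantize_int8(q, scale)
    print("max abs error:", (w - w_approx).abs().max().item())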

Recomputation: Trading Time for Space

On memory-constrained devices, it is possible to reduce space requirements at the cost of extra processor time. This is called "recomputation", or sometimes in research papers it is called "rematerialization" or "checkpointing". When this technique is used to train a model that is too large to fit inside GPU memory, it is called "gradient checkpointing." The portion of this algorithm that involves swapping tensors off the GPU back to the CPU is often called "offloading."
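
As a minimal sketch of gradient checkpointing, the following example uses PyTorch's torch.utils.checkpoint to recompute each layer's activations during the backward pass instead of storing them (the layer stack and sizes are arbitrary placeholders):

    import torch
    from torch.utils.checkpoint import checkpoint

    # A stack of layers whose intermediate activations would normally all be kept
    # alive for the backward pass; checkpointing keeps only each segment's input
    # and recomputes the activations during backpropagation.
    layers = torch.nn.ModuleList(
        [torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
         for _ in range(8)]
    )

    def forward_with_checkpointing(x: torch.Tensor) -> torch.Tensor:
        for layer in layers:
            x = checkpoint(layer, x, use_reentrant=False)  # recomputed in backward
        return x

    x = torch.randn(32, 1024, requires_grad=True)
    loss = forward_with_checkpointing(x).sum()
    loss.backward()   # activations inside each layer are recomputed here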

The recomputation optimization method involves not storing the results of a computation that will be needed later, but instead recomputing them from scratch at the point where they are needed. Hence, recomputation trades time for space and is effectively the opposite of caching and data reuse optimizations, which trade space for time.

Recomputation means doing calculations a second time, which is redundant computation. This is not something you want to do often, since it costs extra CPU or GPU time, but it is a technique worth considering when memory is at a premium, and it is sometimes used as a GPU optimization.
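
The related offloading technique mentioned above can be sketched in PyTorch as simply moving a large tensor to CPU memory and bringing it back only when needed. This is a simplified illustration (it assumes a CUDA device when available), not a production offloading scheme:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # A large intermediate tensor that is not needed again until much later.
    activations = torch.randn(4, 2048, 2048, device=device)   # ~64 MB in FP32

    # Offload: copy it to CPU memory and free the GPU copy for other work.
    cpu_copy = activations.to("cpu")
    del activations
    if device == "cuda":
        torch.cuda.empty_cache()

    # ... other GPU work runs here with the freed memory ...

    # Reload onto the GPU only when the tensor is actually needed again.
    activations = cpu_copy.to(device)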

Research on Recomputation: Papers on the recomputation memory optimization technique include:

Research on Memory Optimization

For model compression and its popular subtypes, see research paper lists on the individual pages (e.g. quantization, pruning). Other research that is specifically on memory management and reducing memory includes:

Memory-Bound versus CPU-Bound

Surprisingly, researchers discovered that LLM inference was not CPU-bound (or GPU-bound), but was memory-bound, with the cost of accessing all those tensors full of weights (and activations) being the main efficiency bottleneck.

Subsequently, the picture was found to be more nuanced in decoder-only transformer architectures (e.g., GPT), so that:

  • Prefill phase: CPU-bound
  • Decoding phase: memory-bound

The prefill phase is the initial phase of "prompt processing," where every token in the prompt is processed (in parallel) to build the overall KV caches. This phase saturates the CPU, or rather, the GPU, with computation. Prefill is a busy time, but it also takes a long time, and it is the cause of the initial delay before an LLM starts answering your question.
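
The size of the KV cache built during prefill can be estimated with a back-of-envelope calculation; the configuration below is illustrative of a 7B-class model (32 layers, 32 KV heads, head dimension 128) with FP16 values:

    # KV cache size per sequence: 2 (K and V) x layers x KV heads x head dim
    # x sequence length x bytes per element.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

    # Illustrative 7B-class configuration, FP16 values (bytes_per_elem=2).
    size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
    print(f"{size / 2**30:.2f} GiB per sequence")   # 2.00 GiB at 4096 tokens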

The decoding phase comes next, whereby the autoregressive algorithm emits one token at a time. Because it cannot be fully parallelized, this phase tends not to fill the GPU pipeline, but it continually accesses the entire model, one layer at a time. Hence, it is memory-bound.
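
A rough roofline-style calculation illustrates the difference between the two phases: prefill multiplies each weight matrix against many token vectors at once, while decoding multiplies it against a single vector, so the weights are reloaded from memory for very little compute. The numbers below are illustrative only:

    # Rough arithmetic intensity (FLOPs per byte) of multiplying one (d x d) weight
    # matrix against n_tokens input vectors, with FP16 weights and activations.
    def arithmetic_intensity(n_tokens, d, bytes_per_elem=2):
        flops = 2 * n_tokens * d * d                               # multiply-accumulates
        bytes_moved = (d * d + 2 * n_tokens * d) * bytes_per_elem  # weights + in/out vectors
        return flops / bytes_moved

    d = 4096
    print("prefill, 2048 tokens:", round(arithmetic_intensity(2048, d)), "FLOPs/byte")
    print("decode, 1 token:     ", round(arithmetic_intensity(1, d), 2), "FLOPs/byte")
    # With a hardware ridge point of very roughly 100-300 FLOPs/byte (illustrative),
    # prefill lands in compute-bound territory while single-token decoding sits
    # deep in memory-bound territory.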

Research papers on the memory-bound versus CPU-bound nature of transformers:

  • Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, 25 March 2024, AI and Memory Wall, IEEE Micro (Early Access), pp. 1-5, https://ieeexplore.ieee.org/abstract/document/10477550
  • Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
  • Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
