Aussie AI

Why Optimize Memory?

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Why Optimize Memory?

C++ programmers are not especially used to optimizing for memory rather than speed. However, AI engines are one application where memory optimization is a significant part of the whole efficiency puzzle because of the sheer volume of data in the model (“weights”) and the interim calculations (“activations”). There are two problems with today's AI architectures that together create the need for memory optimization:

    (a) models are too big, and

    (b) GPUs are too fast.

Obviously, it's great to have a fast GPU: their amazing speed has driven the revolution in AI capabilities, and is indeed what allows us to run such huge models. The problem is that memory chips haven't kept pace with the speed increases in GPU technology, so AI engines are often memory-bound rather than CPU-bound (or rather, GPU-bound).

What this means is that the GPU is often sitting there doing nothing, waiting for data to be uploaded from memory. This is sometimes called a “bubble” in the GPU pipeline. And there are various software techniques to pop the bubbles, which is what this chapter is about.

At the top-level, there are two fundamental ways to improve memory efficiency in an AI application:

  • Smaller models — model compression
  • Memory management — advanced engine algorithms (“kernels”)

And at the bottom-level, there's always the C++ code. It's important to get the fundamentals of memory management right at all levels.

 
