GPU Specs

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Sadly, you will need to rent some GPUs for your rapacious AI engines, and this will send your hosting bills skyrocketing. There are several important considerations related to GPUs:

  • GPU choice
  • GPU RAM (VRAM)
  • GPU billing methods

GPU Choice. Which brand of GPU should you choose? For backend execution of AI inference or training in a data center, the leader of the pack is currently NVIDIA. Alternatively, as your second choice, you could try a GPU made by NVIDIA. Or if you can't afford that, then you really should pass the hat around and save up for a GPU from NVIDIA. Your basic data center GPU options from NVIDIA, sorted by GPU RAM (and cost), include:

  • P100 — 12GB or 16GB
  • V100 — 16GB or 32GB
  • A100 — 40GB or 80GB
  • H100 — 80GB

Okay, yes, there are some other options you can consider: the Google TPU, and data center GPUs from AMD and Intel.

If you're not in the data center, such as when running a smaller model on a high-end PC (e.g. a “gaming” desktop), then there are more options, and many more GPU RAM sizes to consider. You can choose between the various NVIDIA RTX series cards, AMD Radeons, and several other GPU vendors.

GPU RAM. The amount of RAM on a GPU is important and directly impacts the performance of AI inference. This is sometimes called “VRAM” for “Video RAM” in a somewhat outdated reference to using GPUs for video games, but it's often just called “GPU RAM” when used for AI engines. The “G” in “GPU” used to stand for “Graphics,” but now it just means “Gigantic” or “Generative” or something.

How much GPU RAM is needed? Short answer: at least 12GB for smaller models (e.g. 3B), ideally 24GB to run a 7B model or a 13B model. Quantization is also helpful in making models small enough to fit in a GPU.
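
The arithmetic is simple enough to sanity-check yourself: weight memory is just the parameter count multiplied by the bytes per weight, which is exactly where quantization earns its keep. Here's a minimal C++ sketch; the bytes-per-weight figures are the usual FP32/FP16/4-bit sizes, and the function name is ours, not any library's API.

    #include <cstdio>

    // Rough GPU RAM needed for the model weights alone (not activations).
    // bytes_per_weight: 4.0 = FP32, 2.0 = FP16, 0.5 = 4-bit quantization.
    double model_weight_gb(double params_billions, double bytes_per_weight) {
        return params_billions * bytes_per_weight;  // 1e9 params * bytes / 1e9 = GB
    }

    int main() {
        printf("7B  at FP16:  %.1f GB\n", model_weight_gb(7.0, 2.0));   // ~14 GB
        printf("7B  at 4-bit: %.1f GB\n", model_weight_gb(7.0, 0.5));   // ~3.5 GB
        printf("13B at FP16:  %.1f GB\n", model_weight_gb(13.0, 2.0));  // ~26 GB
        printf("70B at 4-bit: %.1f GB\n", model_weight_gb(70.0, 0.5));  // ~35 GB
        return 0;
    }

A 7B model at FP16 wants about 14GB before you've stored a single activation, which is why 12GB cards are already a squeeze.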

Typically, for open source models, you want the entire model to fit inside a single GPU's RAM. However, this also affects how many instances of the AI engine can run on a single server with one GPU. The GPU needs only a static copy of the model (loaded once), but it must store the interim activation calculations separately for each instance. If you're running an open source 7B model, multiple copies fit inside a decent GPU. Less so for 13B models, and trying to run a 70B model in an 80GB GPU gets a touch more difficult. Quantized models are your friend.
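
To put rough numbers on that, here's a hedged sketch of the instance arithmetic: one static copy of the weights, plus a per-instance activation budget. The 2GB-per-instance activation figure is purely an illustrative assumption; measure your own engine, since KV cache size varies wildly with context length.

    #include <cstdio>

    // Instances that fit on one GPU: weights are loaded once, but each
    // instance needs its own activation/KV-cache workspace.
    int instances_per_gpu(double gpu_ram_gb, double weights_gb,
                          double per_instance_gb) {
        double leftover_gb = gpu_ram_gb - weights_gb;
        if (leftover_gb <= 0.0) return 0;  // the model doesn't fit at all
        return (int)(leftover_gb / per_instance_gb);
    }

    int main() {
        // Assumes ~2 GB of activations per instance (illustrative only).
        printf("7B FP16 on 24GB:   %d\n", instances_per_gpu(24.0, 14.0, 2.0));  // 5
        printf("13B FP16 on 24GB:  %d\n", instances_per_gpu(24.0, 26.0, 2.0));  // 0
        printf("70B 4-bit on 80GB: %d\n", instances_per_gpu(80.0, 35.0, 2.0));  // 22
        return 0;
    }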

The NVIDIA 3080 Ti has 12GB and works for 3B/7B models, mainly for POC development and researchy-type stuff. The NVIDIA 3090 has 24GB and works well for 3B/7B, and you can toy around with 13B if you're careful. The NVIDIA 4070 Ti (12GB) is similar to a 3080 Ti; the 4080 has 16GB and the 4090 has 24GB. For bigger models requiring more GPU RAM, you're looking at a V100, A100, or H100.

The above discussion mainly relates to small and medium-size open source models. Running a big commercial mega-model isn't really possible with only a single GPU. The big companies are running H100s by the bucketload with complex multi-GPU scheduling algorithms. If you want to know about that stuff, send them a resume (or read research papers).

Note that there's not really any concept of “virtual memory,” “context switching,” or “paging” for GPU RAM. The operating system isn't going to help you here. Managing GPU RAM is a low-level programming task: you basically cram the entire model into the GPU and hope to never unload it again.
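
In practice, that means checking free GPU memory yourself before loading anything, because nothing is going to page it out for you. Here's a minimal sketch using the standard CUDA runtime calls cudaMemGetInfo and cudaMalloc, with most error handling trimmed for brevity:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);  // this GPU's memory status
        printf("GPU RAM: %.1f GB free of %.1f GB total\n",
               free_bytes / 1e9, total_bytes / 1e9);

        size_t model_bytes = (size_t)14e9;  // e.g. a 7B FP16 model, ~14GB
        if (model_bytes > free_bytes) {
            fprintf(stderr, "Model won't fit: quantize it or buy a bigger GPU\n");
            return 1;
        }
        void* d_weights = nullptr;
        if (cudaMalloc(&d_weights, model_bytes) != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed\n");
            return 1;
        }
        // ... cudaMemcpy the weights across once, then never unload them ...
        cudaFree(d_weights);
        return 0;
    }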

You will need more than one box with a GPU, even for smaller models, assuming multiple model instances per GPU. To get a decent response time, you want a model instance on a GPU to be immediately available to run a user's query. How many total instances you need depends on your user load, and on whether your users like watching blue spinning circles.
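
A rough way to size the fleet is Little's law: the number of simultaneously busy instances is the request arrival rate multiplied by the average generation time. The load numbers below are illustrative assumptions, not benchmarks.

    #include <cmath>
    #include <cstdio>

    // Little's law: busy instances = arrival rate * service time.
    // Headroom above 1.0 buys spare capacity (and fewer spinning circles).
    int instances_needed(double requests_per_sec, double secs_per_request,
                         double headroom) {
        return (int)std::ceil(requests_per_sec * secs_per_request * headroom);
    }

    int main() {
        // Assumed load: 4 queries/sec, 3 seconds per generation, 50% headroom.
        printf("Instances needed: %d\n", instances_needed(4.0, 3.0, 1.5));  // 18
        return 0;
    }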

GPU Billing. There are various billing methods for GPUs, and you have to compare providers. A lot of GPU power is billed on an hourly basis, with monthly options, and managing this expense can make a big difference overall. The load profile differs for inference versus training, and also obviously depends on user patterns in the production application.
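
The comparison itself is simple arithmetic once you know your load profile. Here's a sketch with made-up rates; real prices vary by provider and change constantly, so plug in current quotes.

    #include <cstdio>

    // On-demand hourly billing versus a flat monthly reservation.
    // All rates here are made-up illustrations, not real provider prices.
    int main() {
        double hourly_rate = 2.00;      // $/GPU-hour, on-demand (assumed)
        double monthly_flat = 1000.00;  // $/GPU-month, reserved (assumed)
        double hours_used = 400.0;      // your measured monthly GPU usage

        double on_demand_cost = hourly_rate * hours_used;
        printf("On-demand: $%.2f vs reserved: $%.2f -> %s is cheaper\n",
               on_demand_cost, monthly_flat,
               on_demand_cost < monthly_flat ? "on-demand" : "reserved");
        return 0;
    }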

 

