Aussie AI
16. Hardware Acceleration
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“You're gonna need a bigger boat.”
— Jaws, 1975.
Why Hardware Acceleration?
Hardware acceleration has come a long way since the Intel 8087 floating-point coprocessor in 1980. Every CPU now comes with builtin floating-point operations, and even opcode instructions that perform complex mathematics like exponentials and logarithms in hardware.
Parallelizing computations is where the action is now in AI, which needs lots of vectors and matrices running in parallel (i.e. tensor computations). The most powerful parallel hardware is the GPU, which can chomp through a continuous stream of data in parallel.
GPUs are not the only type of hardware acceleration. Even without GPUs, typical CPUs have multi-core and multi-thread parallelism. You can even do small-vector parallel instructions in the CPUs using special SIMD opcode instructions. For example, x86 CPUs have SIMD accessible via C++ AVX intrinsic functions, and Apple M1/M2/M3 chips support Arm Neon for parallelism.
Types of Hardware Acceleration
There are lots of different types of silicon chips available for your AI engine. The basic types of hardware chips are:
- Central Processing Unit (CPU)
- Graphics Processing Unit (GPU)
- Tensor Processing Unit (TPU)
- Application-Specific Integrated Circuit (ASIC)
- Field-Programmable Gate Array (FPGA)
If you want to build your own hardware, and there are plenty of research papers that do, then use an FPGA or ASIC. Even prior to the AI hype, ASICs proved their value in the Bitcoin mining boom, and FPGAs were commonly behind Azure, AWS and GCP, particularly around security/data protection.
If you're not a hardware designer, you're more likely to want the main CPU and GPU options. CPU parallelism is via AVX or Arm Neon SIMD instructions. For GPUs, you're most likely looking at an NVIDIA chip, from the P100 at the low end to the H100 at the top end (with V100 or A100 in the middle). Alternatively, the TPU is a special custom AI chip created by Google, and is in the same vein as other GPU chips.
CPU Hardware Acceleration
Many of the major CPU chips offer builtin hardware acceleration.
- x86/x64 (Intel/AMD) — AVX SIMD instructions (including AVX-2, AVX-512, and AVX-10)
- ARM — Neon SIMD instructions (e.g. on phones)
- Apple M1/M2/M3 — ARM Neon, Apple AMX instructions, or Apple Neural Engine (ANE).
AVX intrinsics are the topic of the next chapter. These can be used on x86/x64 platforms with Microsoft MSVS or GCC/Clang C++ compilers to run CPU data crunching in parallel.
ARM Neon is a SIMD hardware acceleration extension to the ARM architecture. ARM-based architectures can run the Neon acceleration opcodes, which are 128-bit SIMD instructions that can parallelize both integer and floating-point computations. At the time of writing, the current version is based on Armv8. Notably, the Apple iPhone platform is based on ARM silicon and has Neon acceleration capabilities.
Apple M1/M2/M3 chips are based on ARM, so the ARM Neon acceleration works. There are also some additional Apple-specific hardware accelerations such as Apple AMX and Apple Neural Engine (ANE).
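As a quick illustration of what Neon parallelism looks like from C++, here is a minimal sketch using the Neon intrinsics in <arm_neon.h> (assumptions: an ARM build target with Neon enabled, an array length that is a multiple of 4, and a function name chosen purely for illustration):

#include <arm_neon.h>   // ARM Neon intrinsics (ARM targets only)

// Minimal sketch: add two float arrays, 4 elements at a time.
// Assumes n is a multiple of 4.
void vector_add_neon(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);   // load 4 floats
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vaddq_f32(va, vb);  // 4 additions in one instruction
        vst1q_f32(out + i, vc);              // store 4 results
    }
}

The intrinsics used here (vld1q_f32, vaddq_f32, vst1q_f32) are the standard Neon load/add/store wrappers, each mapping to a single 128-bit SIMD instruction.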
Detecting CPU Acceleration in C++
It is tricky to check what CPU or GPU support is available to your C++ program. There are different methods for Microsoft Visual Studio, GCC, and Apple.
Preprocessor macros. The first point is that you can only use preprocessor macros if the “single platform” assumption is true. In other words, if you're building on the single platform that you're running in production, or you're a developer toying with an engine on your own single PC.
In such cases, you can detect the current build environment using preprocessor macros. For example, if you're on a Windows box with Microsoft Visual Studio, you might try this:
#if __AVX2__
// ... supports AVX2
#endif
This works fine if you are running C++ on your developer desktop machine, and don't plan to run it anywhere else. But this doesn't check runtime availability of AVX2 on your user's machine. It's only testing whether you've got the AVX2 architecture flag enabled in your compiler on your build machine. Hence, it's misleading, and although you can do a #if or #ifdef test for whatever macro you like, it isn't very helpful for multi-platform programming.
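For completeness, here is a slightly broader compile-time sketch. The macros below are the ones normally defined by GCC, Clang, and MSVC when the corresponding architecture flags are enabled, and MY_SIMD_LEVEL is just a hypothetical name for this illustration; again, this reflects the build machine, not the user's CPU:

// Compile-time detection sketch: these macros reflect the compiler flags
// on the build machine (e.g. /arch:AVX2 or -mavx2), not the user's CPU.
// MY_SIMD_LEVEL is a hypothetical macro name used only for this sketch.
#if defined(__AVX512F__)
#define MY_SIMD_LEVEL "AVX-512"
#elif defined(__AVX2__)
#define MY_SIMD_LEVEL "AVX2"
#elif defined(__ARM_NEON)
#define MY_SIMD_LEVEL "ARM Neon"
#else
#define MY_SIMD_LEVEL "no SIMD detected at compile-time"
#endif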
Run-time platform testing. The #if method can check the major platforms that you're compiling on (e.g. Windows vs Linux vs Apple), but you cannot check what exact CPU you are running on, or what capabilities it has. The preprocessor macros are processed at compile-time, and can only detect what machine it's building on. This isn't very useful in determining if your user is running the code on a CPU that supports SIMD instructions, or if their box has a GPU on it.
Instead, you need to call C++ intrinsics to detect CPU capabilities at runtime (a minimal sketch is shown after this list). On the x86/x64 architecture these intrinsics use the “CPUID” opcode. The C++ intrinsic calls differ by compile platform:
- MSVS: __cpuid or __cpuidex (superseding __isa_available in <isa_availability.h>)
- GCC/Clang: __builtin_cpu_supports or __builtin_cpu_is functions.
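Putting those together, a runtime AVX2 check might look something like this minimal sketch (assumptions: an x86/x64 target; on MSVC the check reads bit 5 of EBX from CPUID leaf 7, which is where the AVX2 flag lives):

#include <cstdio>
#if defined(_MSC_VER)
#include <intrin.h>       // __cpuid / __cpuidex on MSVC
#endif

// Minimal sketch: runtime check for AVX2 support on x86/x64.
bool cpu_has_avx2()
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_cpu_supports("avx2");    // GCC/Clang builtin
#elif defined(_MSC_VER)
    int info[4] = { 0 };
    __cpuid(info, 0);                         // highest supported leaf
    if (info[0] < 7) return false;
    __cpuidex(info, 7, 0);                    // leaf 7, sub-leaf 0
    return (info[1] & (1 << 5)) != 0;         // EBX bit 5 = AVX2
#else
    return false;                             // unknown compiler: assume no
#endif
}

int main()
{
    printf("AVX2 is %savailable\n", cpu_has_avx2() ? "" : "not ");
    return 0;
}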
GPU Hardware Acceleration
For the sticklers, AI GPU chips are not really a “GPU” because that stands for “Graphics Processing Unit,” and they aren't used for “Graphics” in an AI architecture (even when creating an image). In fact, they're really a General-Purpose GPU (GPGPU), but nothing other than AI matters in the tech industry, so we stole the acronym from the gamers.
GPUs are great big SIMD processors. There is a huge range of vectorized opcodes available for any given GPU. Each GPU isn't just one vectorized stack, but has lots of separate “cores” that process AI workloads (e.g. FMA) in parallel. Each core runs a SIMD operation such as a small matrix multiply or FMA in a single GPU clock cycle. For example, a V100 “Tensor Core” can do a 4x4x4 half-precision (16-bit) matrix/tensor multiply in a cycle, which is a lot more advanced than a typical vectorized operation. Hence, it's a parallel-of-parallel architecture with:
(a) all the GPU cores running in parallel, and
(b) each core doing vectorized SIMD operations.
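As a back-of-the-envelope check against NVIDIA's published numbers: a 4x4x4 matrix multiply-accumulate is 64 fused multiply-adds, or 128 floating-point operations, per Tensor Core per cycle. The V100 has 640 Tensor Cores with a boost clock around 1.5GHz, so roughly 640 × 128 × 1.5 billion ≈ 125 teraFLOPs of half-precision throughput, which matches the headline Tensor Core figure quoted for the V100.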
The main hardware specifications to compare between GPUs are:
- FLOPs throughput
- Cores
- RAM
- Clock speed
- Memory bandwidth rate
- Cooling systems (they run hot!)
GPU Pricing. If you're looking at renting a data center GPU, NVIDIA is top of the list for AI computations. The choice between a P100, V100, A100, or H100 is examined further in the AI deployment chapter. To run a version of Meta Llama2, a V100 is workable, but you won't fit many instances per box. As of writing, pricing for a V100 runs below a buck an hour and there are 730 hours in a month, so you can do the math (pricing varies with vendors anyway). You can get an A100 for more than a buck an hour, and an H100 for roughly double that (for now). On the horizon, NVIDIA has an H200 coming mid-2024 with about 141GB RAM (versus the H100's 80GB), and also the B100 in late 2024 for even higher performance than an H200.
You can also buy a GPU chip outright from your private jet using your diamond-encrusted phone. Okay, so that's a bit of an exaggeration. Pricing changes, but as of writing, you're looking at around ten grand for a V100 by itself, but pricing is higher if it's part of a “system” on a motherboard or a box (and this confuses ChatGPT if you ask it about GPU pricing).
Another option is used GPUs, which are cheaper, but might have spent their prior life in a Bitcoin-mining forced labor camp. GPUs do have a limited lifetime and can overheat with partial or total failure.
Detecting GPU Support in C++
Detecting GPU capabilities that are available at runtime in C++ is even more problematic than detecting CPU accelerators or SIMD instructions. The available options for GPU detection include:
- NVIDIA CUDA C++ compiler (nvcc)
- AMD ROCm
- Microsoft DirectML (DirectX)
- Apple Metal
- Vulkan API (e.g. vkEnumeratePhysicalDevices, vkGetPhysicalDeviceProperties)
- Low-level GPU shader APIs
NVIDIA requires CUDA code to be compiled with their nvcc compiler, and the compiler itself has builtin mechanisms for testing the GPU capabilities. The results of that output can be used to set #define options within the C++ code too. The compiler also comes with some builtin defines.
GPU detection is not just determining if a GPU is available. More detail will typically be required, down to “is feature X available” or “which implementation of feature X is available.” For example, NVIDIA has a “GPU Architecture” and a “GPU Feature List” to test for capabilities.
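As a rough sketch of what runtime GPU detection looks like with the CUDA runtime API (assumptions: the CUDA toolkit is installed and the file is compiled with nvcc), a basic capability probe enumerates the devices and reports their compute capability:

#include <cstdio>
#include <cuda_runtime.h>   // CUDA runtime API (requires the CUDA toolkit)

// Minimal sketch: list CUDA GPUs with compute capability and memory size.
int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("No CUDA GPU detected\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, compute capability %d.%d, %zu MB RAM\n",
               i, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024));
    }
    return 0;
}

The compute capability numbers (prop.major and prop.minor) are what you would then map to "is feature X available" decisions, since features like Tensor Cores are tied to particular architecture generations.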
AI Meta-Compilers. The alternative to trying to test GPU capabilities at runtime in C++ is to write code higher up the chain. There are also various ways to write cross-platform code for GPU platforms at a higher level than C++ code, such as:
- OpenCL
- OpenMP
- SYCL
- OpenACC
These methods are all designed to make your code portable to different hardware environments. Typically, you write C++-like code, which is then pre-compiled into an internal form that is managed by the wrapper code, and instantiated on the particular platform on which it is currently running.
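As one concrete flavor of this, here is a minimal OpenMP sketch (assumption: a compiler with OpenMP enabled, e.g. the -fopenmp flag on GCC/Clang or /openmp on MSVC). If OpenMP is unavailable, the pragma is simply ignored and the loop runs serially, which is part of the portability appeal:

#include <vector>

// Minimal sketch: parallel vector addition via an OpenMP pragma.
// The loop iterations are split across CPU threads by the OpenMP runtime.
void vector_add(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>& out)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)out.size(); i++) {
        out[i] = a[i] + b[i];
    }
}

Newer OpenMP versions also support "target" directives for offloading loops to a GPU, which is where this approach starts to overlap with OpenACC and SYCL.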
Assembly Language versus Intrinsics
Assembly language, or “assembler”, is the low-level language for CPU machine instructions. Like C++, it is still a symbolic human-readable language, but unlike C++, it translates mostly one-to-one to machine code instructions. The syntax for assembler is much simpler than C++, and more obscure, but it's also very, very fast.
When to use assembly language. The first question to ask yourself before writing assembler in C++ is whether you need to. The use of assembler should only be considered for the most bottlenecking parts of the code, like deep inside the inner loops of a GEMM kernel. Otherwise, you're probably micro-optimizing something that's not that critical.
Another question is whether to use “intrinsics” instead of assembler. Each C++ compiler has literally hundreds of builtin low-level functions called “intrinsics” that are very fast, because the compiler typically maps them directly to one or a few machine instructions. There are lots of intrinsics to use for GPU operations and CPU SIMD extensions such as AVX-512, and there are also intrinsics that map one-to-one to x86 CPU instruction codes on that platform. Look through the long list of C++ intrinsics for your compiler platform to see if there's one that does what you need. The use of intrinsics is via a standard C++ function call syntax, so you don't need to learn assembler to take advantage of them.
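As a taste of the next chapter, here is a minimal AVX intrinsics sketch (assumptions: an x86/x64 CPU with AVX, the -mavx or /arch:AVX compiler flag, and a function name chosen purely for illustration):

#include <immintrin.h>   // AVX intrinsics

// Minimal sketch: add two float arrays, 8 elements at a time.
// Assumes n is a multiple of 8.
void vector_add_avx(const float* a, const float* b, float* out, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);   // load 8 floats (unaligned)
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_add_ps(va, vb);    // 8 additions in one instruction
        _mm256_storeu_ps(out + i, vc);        // store 8 results
    }
}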
Assembly language syntax: Here are some of the basics of assembly language coding:
- Assembly code filenames usually have a suffix of “.S”, “.s” or “.asm” (but don't need to).
- Inline assembly inside C++ could be via asm("string"), __asm__("string"), or asm { tokens }, depending on the compiler.
, depending on the compiler. - Comments start with a semicolon (but you can also use C++ comments for inline assembly).
- One line per assembly statement.
- Jump or branch labels need a suffix colon and should start a line (either their own line or before a statement).
Disadvantages of Assembly Language: The reason that the C language came into being was to overcome some of the low-level problems of programming in assembly or machine code. There are various downsides to using assembly language:
- Non-portable — assembly is specific to the CPU and many features depend on CPU sub-releases.
- Pitfalls — and you thought C++ had troubles.
- Maintainability — few programmers know assembly.
- Complexity — everything's harder at the low-level.
To summarize, there are only two reasons to use assembly language: speed and security (of your job).
Inline Assembly Language
Most C++ compilers support features allowing you to specify assembly language sequences in the middle of a C++ program, which is called “inline assembly language.” You don't need to put assembler into a separate code file, because you can use assembly language directives inside C++ sequences.
The directive to use to introduce an assembly language statement into C++ is somewhat compiler-dependent, but the whole concept of assembly language is platform-dependent anyway!
The “asm” expression is the official C++ standard version. This is like a function call with a semicolon ending it. The asm statement contains the assembly language statements inside a large string constant, ending with a newline escape (i.e. “\n”), inside round brackets. Multiple assembly commands can be merged by putting two string literals on subsequent lines and using the adjacent string literal concatenation feature of C++.

asm (
    " ; ... instructions\n"        // C++ Comment
    " ; ... more instructions\n"
);
The Microsoft style is different, with a code block rather than an expression. You don't need to put the assembly statements inside a string literal, and you don't need the “\n” newline escapes, either. The basic syntax looks like this:

__asm {
    ; ... instructions    // C++ comment
}
This is the GNU and Clang style, with “__asm__” as a C++ function-like expression (similar to “asm”):

__asm__ (
    " ; ... instructions\n"    // C++ Comment
);
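For a concrete (if trivial) example, here is a minimal sketch of the GCC/Clang extended inline assembly syntax with operand constraints (assumptions: an x86/x64 target, and a function name add_asm chosen purely for illustration):

// Minimal sketch: add two ints via inline assembly (x86/x64, GCC/Clang only).
int add_asm(int a, int b)
{
    __asm__("addl %1, %0"    // a += b
            : "+r"(a)        // operand 0: a, read/write, in a register
            : "r"(b));       // operand 1: b, input, in a register
    return a;
}

The "+r" and "r" operand constraints are how the extended syntax ties C++ variables to CPU registers, which is exactly the register bookkeeping discussed next.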
Mixing C++ and assembly language is not something recommended just for fun. Not only do you need to know the assembly statements and all about the CPU registers, but you'll need to know about function calling conventions (e.g. __cdecl vs __stdcall vs __thiscall) and name mangling in C++. Which actually sounds kind of fun.