Aussie AI

Hardware Acceleration

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

It all started with the "math coprocessor" chips back in the 1980s. The modern-day version is the Graphics Processing Unit (GPU). As the name suggests, GPUs were originally intended to handle graphics calculations, and are certainly still used for the floating point calculations in gaming boxes that display the amazingly fast 3D first-person views found in games such as Fortnite and Minecraft. However, the role of GPUs has broadened into that of a general mathematical calculation engine, which has found extensive use in two other massive trends: cryptographic calculations (e.g. Bitcoin mining) and the matrix calculations inherent to neural networks and Transformer engines for AI. Such chips are more accurately called "General Purpose GPUs" or GPGPUs, but lately they are all simply called GPUs.

Hardware acceleration is by far the most successful method of optimization for AI engines to date. As the number of floating point operations used by AI models has grown into the billions, the fastest GPU chips have kept pace through numerous improvements in hardware acceleration techniques. The primary advancements have included raw on-chip speed increases to reduce response time, increased on-chip memory size and bandwidth, and the use of parallelization and pipelining methods for improved throughput.

Types of AI Hardware Acceleration

There are various types of hardware accelerators that can make a model run faster:

  • Graphics Processing Unit (GPU)
  • Application-Specific Integrated Circuit (ASIC)
  • Field-Programmable Gate Array (FPGA)
  • Central Processing Unit (CPU)
  • Neural Processing Unit (NPU)

Specific hardware acceleration architectural techniques include:

  • General Purpose GPUs (GPGPUs)
  • Caches (on-chip memory caching)
  • Multi-core CPUs
  • Multi-threaded CPUs (see the dot product sketch after this list)
  • Single-Instruction Multiple Data (SIMD)
  • Non-Uniform Memory Access (NUMA)
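
To make the multi-core and multi-threaded CPU items above more concrete, here is a minimal C++ sketch that splits a dot product across hardware threads using std::thread. The vector size, the float data type, and the even partitioning of work across threads are illustrative assumptions only, not tuned recommendations for any particular chip.

    // Minimal sketch: parallel dot product across CPU cores using std::thread.
    // Thread count comes from the hardware; vector sizes are illustrative only.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const size_t n = 1000000;
        std::vector<float> a(n, 1.0f), b(n, 2.0f);

        unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
        std::vector<double> partial(num_threads, 0.0);
        std::vector<std::thread> workers;

        for (unsigned t = 0; t < num_threads; ++t) {
            workers.emplace_back([&, t]() {
                // Each thread handles one contiguous slice of the vectors.
                size_t begin = n * t / num_threads;
                size_t end = n * (t + 1) / num_threads;
                double sum = 0.0;
                for (size_t i = begin; i < end; ++i) {
                    sum += static_cast<double>(a[i]) * b[i];
                }
                partial[t] = sum;  // one write per thread at the end
            });
        }
        for (auto& w : workers) w.join();

        double dot = std::accumulate(partial.begin(), partial.end(), 0.0);
        std::printf("dot = %.1f\n", dot);  // expected 2000000.0
        return 0;
    }

Each thread computes a local sum over its own contiguous slice and writes a single result at the end, keeping the threads independent until the final reduction.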

Software Integrations to Hardware Accelerators

Software interfaces to hardware acceleration include:

  • BLAS (Basic Linear Algebra Subprograms; see the GEMM sketch after this list)
  • CUDA (NVIDIA's proprietary Compute Unified Device Architecture)
  • AVX (Advanced Vector Extensions; also AVX2, AVX-512 and AVX10)
  • OpenCL
  • cuBLAS (NVIDIA's BLAS implementation for CUDA GPUs)
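
To make the BLAS entry above concrete, here is a small sketch of a call to the standard CBLAS SGEMM routine (single-precision general matrix multiply), which is the same GEMM operation that cuBLAS exposes for NVIDIA GPUs. It assumes a CBLAS implementation such as OpenBLAS is installed and linked; the tiny matrix sizes are purely illustrative.

    // Minimal sketch: C = alpha*A*B + beta*C via the standard CBLAS interface.
    // Assumes a CBLAS library such as OpenBLAS is installed (link with e.g. -lopenblas).
    #include <cblas.h>
    #include <cstdio>

    int main() {
        const int M = 2, K = 3, N = 2;
        // Row-major storage: A is MxK, B is KxN, C is MxN.
        float A[M * K] = {1, 2, 3,
                          4, 5, 6};
        float B[K * N] = {7,  8,
                          9, 10,
                         11, 12};
        float C[M * N] = {0, 0,
                          0, 0};

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K,
                    1.0f,      // alpha
                    A, K,      // leading dimension of row-major A
                    B, N,      // leading dimension of row-major B
                    0.0f,      // beta
                    C, N);     // leading dimension of row-major C

        // Expected result: [[58, 64], [139, 154]]
        for (int i = 0; i < M; ++i) {
            for (int j = 0; j < N; ++j) {
                std::printf("%6.1f", C[i * N + j]);
            }
            std::printf("\n");
        }
        return 0;
    }

The same GEMM call pattern sits underneath most of the matrix multiplications in each Transformer layer, whether routed to a CPU BLAS library or to cuBLAS on a GPU.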

Software Strategies for Hardware Acceleration

General software acceleration strategies for maximizing the benefits from hardware-accelerated computation:

  • Pipelining. This refers to keeping the GPU busy with a continuous stream of data to chomp through, avoiding "bubbles" in the pipeline, which are periods when the GPU has nothing to do.
  • Partitioning and dataflow management. This is the software technique of organizing data so it is ready to send to the GPU quickly, usually in contiguous memory.
  • Cache management. Judicious use of the various levels of cache memory can help with pipelining efficiency (see the tiled matrix multiply sketch after this list).
  • Parallelizing. It's all parallel, isn't it? This point refers to writing the overarching algorithms in a parallelism-friendly manner, ensuring that no computation is left waiting on another.
  • Deep learning compilers. The full software stack that compiles model computations down to optimized kernels for the underlying hardware.
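
As a sketch of the partitioning and cache-management strategies above, here is a simple loop-tiled ("cache-blocked") matrix multiply in C++. The data is kept in contiguous row-major arrays and processed in small square tiles so the working set stays in cache; the tile size of 32 is an illustrative assumption that would normally be tuned per cache level and platform.

    // Minimal sketch: cache-blocked (tiled) matrix multiply on contiguous row-major data.
    // The tile size is an illustrative assumption, normally tuned per cache level.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    void matmul_tiled(const std::vector<float>& A,  // n x n, row-major
                      const std::vector<float>& B,  // n x n, row-major
                      std::vector<float>& C,        // n x n, row-major, pre-zeroed
                      int n, int tile = 32) {
        for (int ii = 0; ii < n; ii += tile) {
            for (int kk = 0; kk < n; kk += tile) {
                for (int jj = 0; jj < n; jj += tile) {
                    // Process one small block at a time so it stays resident in cache.
                    int i_end = std::min(ii + tile, n);
                    int k_end = std::min(kk + tile, n);
                    int j_end = std::min(jj + tile, n);
                    for (int i = ii; i < i_end; ++i) {
                        for (int k = kk; k < k_end; ++k) {
                            float a = A[i * n + k];
                            for (int j = jj; j < j_end; ++j) {
                                C[i * n + j] += a * B[k * n + j];  // contiguous inner loop
                            }
                        }
                    }
                }
            }
        }
    }

    int main() {
        const int n = 64;
        std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
        matmul_tiled(A, B, C, n);
        std::printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * n);
        return 0;
    }

The same blocking idea, applied at the level of GPU shared memory and registers rather than CPU caches, is how high-performance GEMM kernels keep the arithmetic units fed.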

Other software acceleration issues that are closely related to hardware efficiency:

For many other optimization strategies that are orthogonal to hardware acceleration, and can be used to further optimize a model, see the complete list of AI acceleration techniques.

Survey Papers on AI Hardware Accelerators

Papers that review hardware acceleration frameworks:

AI Announcements from Hardware Vendors

Hardware-Acceleration Research

Various papers on hardware acceleration, out of thousands, include:

GPU Research

Research papers on various GPU issues:

Multi-GPU Research

Research papers on various multi-GPU inference and scheduling issues:

GPU Software Platforms

The main GPU software acceleration frameworks include:

  • CUDA (NVIDIA)
  • ROCm (AMD)
  • Triton (open source, originally from OpenAI)
  • oneAPI (Intel)
  • Vulkan
  • SYCL

CPU Execution of AI Workloads

Although GPUs are the mainstay of LLM execution, there is increasing focus on using CPUs for inference. This arises from the need to run on-device inference on AI phones and AI PCs, some of which have an NPU, while others may have only limited SIMD capabilities such as x86 AVX intrinsics.
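
As a small illustration of the SIMD capability mentioned above, here is a sketch of a dot product written with x86 AVX intrinsics, processing 8 floats per 256-bit register with a scalar loop for the leftover tail. It assumes an x86-64 CPU with AVX support and a compiler flag such as -mavx; the vector length and data values are illustrative only.

    // Minimal sketch: dot product using x86 AVX intrinsics (8 floats per 256-bit register).
    // Assumes an x86-64 CPU with AVX; compile with e.g. -mavx (GCC/Clang).
    #include <immintrin.h>
    #include <cstdio>
    #include <vector>

    float dot_avx(const float* a, const float* b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        // Vectorized main loop: 8 multiplies and 8 adds per iteration.
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
        }
        // Horizontal reduction of the 8 partial sums in the accumulator register.
        float lanes[8];
        _mm256_storeu_ps(lanes, acc);
        float sum = 0.0f;
        for (int k = 0; k < 8; ++k) sum += lanes[k];
        // Scalar tail loop for any leftover elements.
        for (; i < n; ++i) sum += a[i] * b[i];
        return sum;
    }

    int main() {
        const size_t n = 1003;  // deliberately not a multiple of 8
        std::vector<float> a(n, 1.0f), b(n, 2.0f);
        std::printf("dot = %.1f (expected %.1f)\n", dot_avx(a.data(), b.data(), n), 2.0f * n);
        return 0;
    }

Wider variants go further: AVX-512 doubles the register width to 512 bits, and fused multiply-add (FMA) instructions combine the multiply and add into a single operation.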

Research on CPU execution of LLMs:

Neural Processing Unit (NPU)

An NPU is a hardware component designed specifically for AI workloads. The NPU is typically integrated into the CPU or provided as an add-on hardware component, but is inherently much less capable than a full GPU. Nevertheless, the NPU is the basis for hardware acceleration on AI phones and also on some AI PCs.

More AI Research

Read more about: