Aussie AI

Vectorization of Lookup Tables

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


The use of lookup-tables is already a powerful speed optimization, but we can double down by adding vectorization. The AVX SIMD instruction sets include a variety of “gather” intrinsics that perform parallel array lookups from a vector of integer indices, using a base address.

The basic algorithm we're going to use for AVX SIMD optimizations of a LUT precalculation of some mathematical function is as follows:

  • Offline: Precalculate a big LUT for 24 bits with 2^24 elements using non-AVX basic C++ methods.
  • Input: vector of 4 float values (128-bit registers) or 8 float values (256-bit AVX-2 registers).
  • Use a cast to treat these float arrays as arrays of integers.
  • Load these “int” arrays into an AVX register.
  • AVX shift right by 8 with the “_mm_srli_epi32” intrinsic (128-bit) or its 256-bit AVX-2 counterpart “_mm256_srli_epi32”, which shift right and fill with zero bits, so that the values are now 24-bit numbers in 32 bits, with a zero sign bit (hence, all indices are positive integers).
  • AVX “gather” with base on the LUT array, and scale of 4 (i.e., float byte size).
  • Store the AVX register results back into an array of float values.
  • Output: vector of 4/8 float values with the LUT-calculated function.

Note that we can use a smaller (or bigger) LUT than 24 bits simply by modifying the bitshift counts.

LUTs with AVX Shuffle. Another way to implement a LUT in AVX is to use “shuffle” operations on another register. This only works for small lookup tables with few enough elements to fit inside an AVX register. In other words, this can be fast, but only for around 16 or 32 elements in the LUT with AVX-2, or more with AVX-512. This optimization is unlikely to be relevant to the massive 16-bit or 24-bit LUTs that we need for AI mathematical functions.

AVX SIMD Pointer Dereferences. A corollary to the AVX LUT “gather” functionality is that it can possibly be used to vectorize arrays of pointers, where the pointers aim directly at the data without any intervening lookup table. For example, suppose we have an array of pointers to float (i.e., rather than an array of integer indices), and we want to dereference these addresses to generate the corresponding array of float values. This is analogous to using a lookup table, but with a base address of zero. Hence, we could potentially use AVX “gather” intrinsics with a zero base address and the integer offsets equal to the addresses (i.e., the pointers converted to integers). The x86-64 platform has 64-bit pointers, so 64-bit integer index offsets are required in the “gather” intrinsic. For example, the AVX-2 “_mm256_i64gather_epi32” and “_mm256_i64gather_ps” intrinsics seem to be along these lines with 64-bit indices. I haven't actually tested this approach to check if it works.

 
