Precomputed Lookup Tables
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
One of the simplest methods to speed up activation functions is to use a precomputed table lookup. If you are quantizing to 16 bits, whether FP16 or INT16, then the input to the function is 16 bits wide, and there are only 2^16 = 65,536 different possible input values. Hence, your precomputed table of activation function results is only 65,536 x 2 bytes = 128KB for outputs in 16-bit precision (FP16/INT16 outputs are 2 bytes each), or 65,536 x 4 bytes = 256KB for 32-bit precision outputs (FP32/INT32). However, it's not nearly as practical for non-quantized 32-bit types (FP32 or INT32), since 2^32 is about 4 billion possible input values, each ideally storing a 4-byte result, which works out to roughly 4 billion x 4 bytes = 16 Gigabytes of RAM for that precomputed table.
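As a quick sanity check on that arithmetic, here is a minimal standalone sketch that simply prints the table sizes for each case (nothing is allocated; the numbers just restate the calculation above):

#include <cstdio>

int main() {
    unsigned long long entries16 = 1ull << 16;   // 65,536 possible 16-bit inputs
    unsigned long long entries32 = 1ull << 32;   // ~4 billion possible 32-bit inputs
    printf("16-bit inputs, 2-byte outputs: %llu bytes (~128KB)\n", entries16 * 2);
    printf("16-bit inputs, 4-byte outputs: %llu bytes (~256KB)\n", entries16 * 4);
    printf("32-bit inputs, 4-byte outputs: %llu bytes (~16GB)\n", entries32 * 4);
    return 0;
}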
The simplest way to build the precomputed table for an activation function is to do it dynamically at initialization time. Simply call your non-optimized activation function code, get the results, and store them in the precomputed table. The idea is:
for (int i = 0; i < 65536; i++) {   // INT16 only
    g_global_GELU_table[i] = aussie_GELU_slow(i);
}
When you access the precomputed table, it's simply an indexed access into a global array. So, the fast, precomputed version of GELU is a table lookup like this:
// INT16 only, doesn't work for FP16!
gelu_value = g_global_GELU_table[x];
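To make those fragments concrete, here is a self-contained sketch of the INT16 version. The tanh-based GELU approximation, the setup function name, and the cast to an unsigned index (so that negative 16-bit inputs map to valid array positions) are illustrative assumptions, not the exact library code:

#include <cmath>
#include <cstdint>

static float g_global_GELU_table[65536];   // One entry per possible 16-bit input

// Slow reference GELU (common tanh approximation), called once per table entry
static float aussie_GELU_slow(float x) {
    return 0.5f * x * (1.0f + tanhf(0.7978845608f * (x + 0.044715f * x * x * x)));
}

void aussie_GELU_setup_table_INT16() {
    for (int i = 0; i < 65536; i++) {
        int16_t x = (int16_t)i;   // Reinterpret the index as a signed 16-bit value
        g_global_GELU_table[i] = aussie_GELU_slow((float)x);
    }
}

// Fast lookup: the 16-bit input, viewed as an unsigned index, selects its precomputed result
inline float gelu_fast_INT16(int16_t x) {
    return g_global_GELU_table[(uint16_t)x];
}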
It all sounds very easy, but there are some weird wrinkles.
Firstly, C++ doesn't have a portable, standardized native 16-bit float type, so you have to mess around with either declaring your own type (as a class), trying to use platform-specific extensions (e.g. “__fp16”, “_Float16”, or C++23's “std::float16_t”), or hacking it by using the “short int” 16-bit integer type.
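For illustration, the kind of conditional typedef this leads to might look like the sketch below; the feature-test macros are an assumption about typical C++23 and GCC/Clang behavior rather than a portable guarantee, and the final branch is the integer-type hack:

#if defined(__STDCPP_FLOAT16_T__)        // C++23 fixed-width floating-point support
  #include <stdfloat>
  typedef std::float16_t my_fp16_t;
#elif defined(__FLT16_MANT_DIG__)        // Illustrative check for the GCC/Clang _Float16 extension
  typedef _Float16 my_fp16_t;
#else
  typedef short int my_fp16_t;           // Fallback hack: a 16-bit integer holding the raw bits
#endif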
Secondly, the above code examples require integers and don't really work in practice for FP16. Making them work requires a lot of bit-hacking to convert back and forth between unsigned integers and floating-point numbers. Here's the idea for 32-bit float variables (i.e. non-quantized FP32):
float g_global_GELU_table_FP32[1ull << 32];   // ~4 billion entries (16GB!)
// ...
void aussie_GELU_setup_table_FP32()   // Initialize GELU precomputed table
{
    unsigned long long i64 = 0;   // Has to be 64-bit!
    yassert(sizeof(i64) > 4);
    for (; i64 < (1ull << 32); i64++) {
        unsigned int i32 = (unsigned int)i64;   // 64-bit down to 32-bit
        float f32 = *(float*)&i32;              // Type-cast trick
        g_global_GELU_table_FP32[i32] = aussie_GELU_basic2(f32);   // FP32
    }
}
And the fast GELU lookup table version for FP32 becomes:
float gelu_fast_FP32(float f)   // Table lookup GELU
{
    unsigned int i32 = *(unsigned int*)&f;   // Reinterpret the float's bits as an index
    return g_global_GELU_table_FP32[i32];    // FP32 version
}
In this FP32 case, the loop iterator has to be a 64-bit integer, otherwise the loop will be infinite, because a 32-bit counter will overflow and wrap around to zero without ever failing the loop test. In reality, the above code doesn't actually work on a low-end laptop with the Microsoft Visual Studio C++ compiler, because it requires a 16GB array, so it needs a more powerful box.
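As a side note, the pointer-cast trick above technically violates C++'s strict aliasing rules; a memcpy-based bit copy is the well-defined way to do the same reinterpretation, and optimizing compilers reduce it to a plain register move. A minimal sketch, assuming the g_global_GELU_table_FP32 table defined above:

#include <cstring>
#include <cstdint>

extern float g_global_GELU_table_FP32[];   // The 16GB table defined above

inline uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof(u));   // Well-defined type pun, optimized away by compilers
    return u;
}

inline float gelu_fast_FP32_safe(float f) {
    return g_global_GELU_table_FP32[float_bits(f)];   // Same lookup, safer bit reinterpretation
}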