Precomputed Lookup Tables

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

One of the simplest methods to speed up activation functions is to use a precomputed table lookup. If you are quantizing to 16 bits, whether FP16 or INT16, then the input to the function is 16 bits, and there are only 2^16=65,536 different possible input values. Hence, your precomputed table of activation function results needs only 65,536x2 bytes = 128KB for outputs in 16-bit precision (i.e. FP16/INT16 outputs are 2 bytes each), or 256KB for 32-bit precision outputs (FP32/INT32). However, it's not nearly as practical for non-quantized 32-bit types (i.e. FP32 or INT32), since 2^32 is about 4 billion possible values, each ideally storing a 4-byte result, so that precomputed table needs about 4x4=16 Gigabytes of RAM.
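
For example, the 16-bit tables are small enough to declare as ordinary global arrays, and a static_assert can double-check the sizes. Here's a minimal sketch (the array names are just illustrative, not the ones used in the examples below):

    #include <cstdint>

    static int16_t g_gelu_table_int16[1u << 16];     // 65,536 x 2 bytes = 128KB
    static float   g_gelu_table_fp32out[1u << 16];   // 65,536 x 4 bytes = 256KB

    static_assert(sizeof(g_gelu_table_int16) == 128 * 1024, "expect a 128KB table");
    static_assert(sizeof(g_gelu_table_fp32out) == 256 * 1024, "expect a 256KB table");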

The simplest way to build the precomputed table for an activation function is to do it dynamically at initialization time. Simply call your non-optimized activation function code, get the results, and store them in the precomputed table. The idea is:

    // INT16 only: the index is the raw 16-bit pattern (0..65535)
    for (int i = 0; i < 65536; i++) {
        short int s16 = (short int)i;  // Reinterpret the pattern as a signed INT16 value
        g_global_GELU_table[i] = aussie_GELU_slow(s16);
    }

When you access the precomputed table, it's simply an indexed access into a global array. So, the fast, precomputed version of GELU is a table lookup like this:

    // INT16 only, doesn't work for FP16!
    gelu_value = g_global_GELU_table[(unsigned short)x];  // Index by the 16-bit pattern

It all sounds very easy, but there are some weird wrinkles. Firstly, C++ doesn't have a portable, standardized native 16-bit float type, so you have to mess around with either declaring your own type (as a class), trying to use platform-specific extensions (e.g. “__fp16”, “_Float16” or C++23's “std::float16_t”), or hacking it by using the “short int” 16-bit integer type.
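
If your compiler does provide a real 16-bit float type, one way around the problem is to index the table by the raw 16-bit pattern of the FP16 value (the same bit-reinterpretation trick used for FP32 below). Here's a possible sketch, assuming C++23's “std::float16_t” (from <stdfloat>, where available) and C++20's “std::bit_cast”, with a hypothetical “gelu_reference” function standing in for the slow computation:

    #include <bit>        // std::bit_cast (C++20)
    #include <cstdint>
    #include <stdfloat>   // std::float16_t (C++23, where available)

    static std::float16_t g_gelu_table_fp16[1u << 16];  // 65,536 x 2 bytes = 128KB

    float gelu_reference(float x);  // Hypothetical slow GELU, computed in FP32

    void gelu_setup_table_fp16()
    {
        for (unsigned int i = 0; i < (1u << 16); i++) {
            // Reinterpret each 16-bit pattern as an FP16 value
            std::float16_t f16 = std::bit_cast<std::float16_t>((uint16_t)i);
            g_gelu_table_fp16[i] = (std::float16_t)gelu_reference((float)f16);
        }
    }

    std::float16_t gelu_fast_fp16(std::float16_t x)
    {
        // Look up the result by the input's raw 16-bit pattern
        return g_gelu_table_fp16[std::bit_cast<uint16_t>(x)];
    }

Some of the 65,536 patterns are NaNs and infinities; the table simply stores whatever the reference function returns for those inputs.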

Secondly, the earlier integer code examples don't really work in practice for FP16, and making them work requires a lot of bit-hacking to convert back and forth between unsigned integers and floating-point numbers. Here's the idea for 32-bit float variables (i.e. non-quantized FP32):

    float g_global_GELU_table_FP32[1ull << 32];  // ~4 billion entries (16GB)
    // ...
    void aussie_GELU_setup_table_FP32()
    {
        // Initialize GELU precomputed table
        unsigned long long int i64 = 0;  // Must be a 64-bit integer type!
        yassert(sizeof(i64) > 4);
        for (; i64 < (1ull << 32); i64++) {
            unsigned int u32 = (unsigned int)i64;  // 64-bit down to a 32-bit pattern
            float f32 = *(float*)&u32;  // Type-cast trick (reinterpret the bits)
            g_global_GELU_table_FP32[u32] = aussie_GELU_basic2(f32);  // FP32
        }
    }

And the fast GELU lookup table version for FP32 becomes:

    float gelu_fast_FP32(float f)    // Table lookup GELU
    {
        unsigned int u32 = *(unsigned int*)&f;  // Reinterpret the float's bits as an index
        return g_global_GELU_table_FP32[u32];   // FP32 version
    }
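
As an aside, the pointer type-pun above is technically undefined behavior under C++ strict-aliasing rules. If C++20 is available, “std::bit_cast” is a cleaner way to extract the same bit pattern; here's a sketch that reuses the same table:

    #include <bit>   // std::bit_cast (C++20)

    float gelu_fast_FP32_v2(float f)   // Table lookup GELU, without pointer punning
    {
        unsigned int u32 = std::bit_cast<unsigned int>(f);
        return g_global_GELU_table_FP32[u32];   // Same precomputed table
    }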

In this FP32 case, the loop iterator has to be a 64-bit integer type, otherwise the loop is infinite, because a 32-bit counter overflows and wraps around to zero without ever failing the loop test. (Note that “long” is still only 32 bits under the MSVS C++ compiler, so “long long” or “uint64_t” is the safer choice.) In reality, the above code doesn't actually work on a low-end laptop with the MSVS C++ compiler anyway, because it requires an array of size 16GB, so it needs a more powerful box.
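
If you do want the FP32 table on a big-memory machine, the 16GB buffer is better allocated on the heap than declared as a global array, so that the allocation can be checked at runtime. Here's a possible sketch (the function and variable names are illustrative, and it reuses the slow “aussie_GELU_basic2” reference function from above):

    #include <cstdint>
    #include <cstring>
    #include <new>

    static float* g_gelu_table_fp32_heap = nullptr;   // 2^32 x 4 bytes = 16GB

    float aussie_GELU_basic2(float x);   // Slow reference GELU (as above)

    bool aussie_GELU_setup_table_FP32_heap()
    {
        g_gelu_table_fp32_heap = new (std::nothrow) float[1ull << 32];
        if (g_gelu_table_fp32_heap == nullptr) return false;   // Not enough RAM
        for (uint64_t i = 0; i < (1ull << 32); i++) {
            uint32_t u32 = (uint32_t)i;
            float f32 = 0.0f;
            std::memcpy(&f32, &u32, sizeof(f32));   // Portable bit reinterpretation
            g_gelu_table_fp32_heap[u32] = aussie_GELU_basic2(f32);
        }
        return true;
    }

Calling this once at startup fills the table; the fast lookup function then just reads from the heap pointer instead of the global array.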

 
