Aussie AI

35. Lookup Tables & Precomputation

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“The purpose of computing is insight, not numbers.”

— Richard Hamming.

 

 

Precomputation with Lookup Tables

Look-up tables (LUTs) are a well-known simple data structure for optimizing code. They have been used to optimize neural networks in various ways. Some examples include:

  • Precomputed activation functions
  • Zero-multiplication networks
  • Approximation of non-linear functions

Precalculation or precomputation is a code optimization where results are partially or fully calculated ahead of time. This method is similar to caching and computation reuse but refers to calculations being performed long before they are needed, often at program startup or compile-time, and stored in lookup tables. Like caching, this method trades extra space for time.

Vectorization of LUTs is possible with hardware acceleration primitives that support parallel memory accesses using integer indices. For example, x86 CPUs have a set of “gather” instructions (available via AVX2 intrinsics) for doing indexed loads from a LUT into the vector registers, and “scatter” instructions (AVX-512) for storing the registers back to an indexed LUT.
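
For instance, here's a minimal sketch of an AVX2 gather from a float LUT using eight integer indices (this assumes a CPU and compiler with AVX2 support, and indices that are already within the table bounds; the function name is illustrative):

    #include <immintrin.h>  // AVX2 intrinsics

    // Gather 8 float LUT entries in parallel from 8 integer indices.
    void lut_gather_avx2(const float* lut, const int* indices, float* out)
    {
        __m256i vidx = _mm256_loadu_si256((const __m256i*)indices);  // load 8 indices
        __m256 vals = _mm256_i32gather_ps(lut, vidx, 4);  // scale = 4 bytes per float
        _mm256_storeu_ps(out, vals);  // store the 8 gathered results
    }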

Typical precalculations are those where the results are computed at program initialization or at compile-time. The best methods generate the results at compile-time, so that they are simply loaded as data, such as numeric constants or pre-initialized data arrays. There are multiple ways to do this:

  • Program startup initialization
  • Lazy evaluation
  • Binary data file
  • Precompiled source code

One method for precomputation of larger amounts of data in an array or lookup table is to perform the initialization dynamically at program startup. A lookup table can be populated with the required results, before the main logic of the program begins. Or alternatively, the idea of “lazy evaluation” allows storing the precomputation into a lookup table only when the program first needs the data.
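
Here's a minimal sketch of the lazy evaluation idea, using sqrtf of small integers as the example (the names are illustrative, and this version is not thread-safe as written):

    #include <math.h>

    #define LAZY_SQRT_MAX (1u << 16)   // illustrative table size

    static float g_lazy_sqrt_table[LAZY_SQRT_MAX];
    static bool  g_lazy_sqrt_valid[LAZY_SQRT_MAX];  // statics start out all false

    // Lazy evaluation: compute and cache each entry the first time it is needed.
    float lazy_sqrt_lookup(unsigned int i)   // caller must ensure i < LAZY_SQRT_MAX
    {
        if (!g_lazy_sqrt_valid[i]) {
            g_lazy_sqrt_table[i] = sqrtf((float)i);  // compute on first use only
            g_lazy_sqrt_valid[i] = true;
        }
        return g_lazy_sqrt_table[i];
    }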

A faster alternative is to calculate all this data offline before program startup, and store the results in a binary data file. This data file can then be loaded into an array at program startup, without needing to perform any of the arithmetic computations. Whether this is beneficial depends on the cost of the computations versus the cost of file loading.
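
A minimal sketch of the binary data file approach might look like this (a raw fwrite/fread of the float array; note that such a file is not portable across platforms with a different endianness or float format):

    #include <stdio.h>

    // Offline: write the precomputed table to a binary data file.
    bool save_table_binary(const char* fname, const float* arr, unsigned int n)
    {
        FILE* fp = fopen(fname, "wb");
        if (!fp) return false;
        size_t nwritten = fwrite(arr, sizeof(float), n, fp);
        fclose(fp);
        return nwritten == n;
    }

    // At startup: load the table without redoing any arithmetic.
    bool load_table_binary(const char* fname, float* arr, unsigned int n)
    {
        FILE* fp = fopen(fname, "rb");
        if (!fp) return false;
        size_t nread = fread(arr, sizeof(float), n, fp);
        fclose(fp);
        return nread == n;
    }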

The logical extension of the precomputation method for a large number of numeric results is to write special C++ code that performs these calculations, but then outputs the results into a text file in the exact format of a C++ source code file (rather than a data file), declaring a global array initialized with the numeric values. This auto-generated C++ code is then compiled and linked with your program.

Example: LUT Precomputation for sqrt

Let's say that you want to optimize a slow non-linear function like “sqrtf” (or “expf” or “logf”). These are good candidates for optimization, and often occur in AI activation functions and Softmax.

The first point is that you'd better do a really good job, because there are actually hardware instructions for these common math functions, even in x86 architectures. So, you could easily optimize this into a table lookup, and find that your C++ code is still slower than the single CPU instruction that's called by the standard C++ library versions. Hence, investigate the C++ intrinsic functions for common math functions before you assume that you can do better than electrons zipping through silicon.
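
For example, here's a minimal sketch of calling the x86 hardware square root directly via SSE intrinsics (x86-specific; modern compilers will typically emit this same sqrtss instruction for a plain sqrtf call anyway, possibly with some extra errno handling):

    #include <xmmintrin.h>  // SSE intrinsics

    // Scalar square root via the x86 SQRTSS hardware instruction.
    float sqrt_hw_sse(float x)
    {
        __m128 v = _mm_set_ss(x);   // put x in the low lane
        v = _mm_sqrt_ss(v);         // hardware square root of the low lane
        return _mm_cvtss_f32(v);    // extract the low lane back to a float
    }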

This example investigates precomputing “sqrtf”, even though that may not be as fast as hardware acceleration. However, the same general ideas apply to precomputing more sophisticated functions, such as Softmax and activation functions, which are not hardware-supported (or not yet, anyway).

The basic method for table lookup optimization is:

  • Declare a big array (the bigger the better).
  • Run a loop sending every value to the real “sqrtf” function.
  • Store each result in the big array.
  • Now you have a precomputed table of all possible values.
  • Later, use an array index lookup to compute the function fast.

How is that any faster? I mean, we've just called “sqrtf” a bazillion times with numbers that we probably won't ever need. Yes, there is extra cost, and we are running slower during program initialization. There are at least two ways to fix this: load the array values from a pre-built binary data file instead, or precompile the array data into a C++ source code file.

However, this complaint underestimates just how many times an AI model will call these functions. The activation functions get called more times than there are chloroplasts on a tree. And furthermore, even with this startup cost, once that is all done and dusted, we have a big array of precomputed data that we can use to speed up the query processing phase (i.e. AI inference), which is our main goal. And in a production environment, any extra startup cost is hopefully amortized over many inference queries.

Example: Precomputing sqrt of integer: For simplicity, we're going to first assume that we're computing a float square root of integers. The function we are precomputing is “int-to-float” type. This makes it easier, because the int can be used as an array index.

Here's my big array with about 65,000 entries:

    #define AUSSIE_SQRT_PRECOMP_MAX (1u<<16)
    float g_sqrt_precomp_table[AUSSIE_SQRT_PRECOMP_MAX];

Here's the unoptimized function “int-to-float” version of “sqrtf” that we are planning to precompute:

    float aussie_sqrtf_basic_int(int x)
    {
        return sqrtf((float)x);
    }

Here's the initialization call to the precomputation routine, sending in the array, the size N, and the function pointer:

    aussie_generic_precompute_int(
        g_sqrt_precomp_table,   // Big array
        AUSSIE_SQRT_PRECOMP_MAX,  // N
        aussie_sqrtf_basic_int    // Function pointer
    );

And here's the code to run the big precomputation loop:

    void aussie_generic_precompute_int(float arr[], unsigned int maxn, float (*fnptr)(int))
    {
        for (unsigned int i = 0; i < maxn; i++) {
                arr[i] = fnptr(i);
        }
    }

So, that's all there is to the startup initialization of the lookup table. Once this function returns, we now have a big array full of data. Here's what the new optimized “sqrtf” looks like:

    float aussie_table_lookup_sqrt(int i)
    {
        return g_sqrt_precomp_table[i];
    }

And we can either make that function “inline” or use a C++ preprocessor macro:

    #define AUSSIE_TABLE_LOOKUP_SQRT_BASIC(i) \
         ( g_sqrt_precomp_table[(i)] )

So, here are a few provisos about this code:

    1. Might be slower than sqrt in hardware (needs benchmarking).

    2. Unsafe array index accesses (e.g. crashes on negatives or larger numbers); see the guarded wrapper sketch after this list.

    3. A 32-bit unsigned int cannot represent a table size of “1<<32”, so the loop counter could overflow and spin infinitely (a 64-bit type such as unsigned long long or size_t would be needed).

    4. The memory size of the precomputed table for 1<<16 is already about 262k (65k times 4 bytes).
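
Regarding the second proviso, one simple mitigation is a guarded wrapper that falls back to the real sqrtf for out-of-range inputs (a sketch only, reusing the table and macro declared above; the bounds check costs a little speed):

    #include <math.h>

    // Safer lookup: falls back to the real sqrtf for out-of-range indices.
    float aussie_table_lookup_sqrt_safe(int i)
    {
        if (i < 0 || i >= (int)AUSSIE_SQRT_PRECOMP_MAX) {
            return sqrtf((float)i);  // fallback path (slow, but no array overrun)
        }
        return g_sqrt_precomp_table[i];
    }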

Float-to-Float Precomputation

Using a precomputed table lookup for a float-to-float function is more complicated than integers. However, this is also the main approximation needed for AI activation functions, or even simple math functions like sqrtf or expf or logf.

Why is it tricky? The reason that float inputs are more difficult is that we need to convert a float into an array index in order to look it up. For example, we could try type casts:

   int offset = (int)f;

But that limits us to only precalculating values for 1.0, 2.0, 3.0, etc. Our approximation works poorly on any fractions, and we also haven't limited the array index to a fixed finite range, so it won't work for any negative values or very large positive values. And the type cast of a float is also slow!

Scaled Multiple: Another idea is that we could scale it upwards to get more decimals:

   int offset = (int) (f * 1000.0f);

This approach at least gives us 3 decimal places: e.g. 1.234 or 23.456, or similar. We will still have to check for negatives and large values to bound it. But again, this is even slower!
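
For completeness, here's a sketch of what the scaled-multiple indexing looks like once the bounds checking is added (the table size is illustrative; this hypothetical table only covers inputs from 0.0 up to about 65.5):

    #define SCALED_LUT_SIZE (1u << 16)   // hypothetical table: 65,536 entries, 3 decimals

    // Scaled-multiple indexing with clamping to the table bounds.
    float scaled_lookup(const float table[], float f)
    {
        int offset = (int)(f * 1000.0f);          // e.g. 1.234 -> 1234
        if (offset < 0) offset = 0;               // clamp negatives
        if (offset >= (int)SCALED_LUT_SIZE)       // clamp large values
            offset = (int)SCALED_LUT_SIZE - 1;
        return table[offset];
    }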

Bitwise Floating-Point Truncations: The above truncation via a floating-point scaled multiple is not very fast. Twiddling the bits is much faster. For example, the standard 32-bit float type has 1 sign bit, 8 exponent bits, and 23 mantissa bits. This is from left-to-right, with the sign bit as the most significant bit and the low-end mantissa bits as the least significant bits. Remember that this is like scientific notation:

  • Number = Mantissa x 2 ^ Exponent

Also, the sign bit makes the whole number negative, if set. Note that the 8-bit exponent field encodes (with a bias) exponents from roughly -126 up to +127, so the format ranges from very small near-zero values around 2^-126 up to very huge values around 2^127.

If the mantissa was in decimal, and it was “1234567” and the exponent was “17” then we'd have:

  • Number = 1.234567 x 10^17

The 23-bit mantissa is stored in binary digits, and with roughly 3.3 binary digits per decimal digit, a 23-bit mantissa gives about 7 decimal digits of precision. Note that the mantissa is effectively 24 bits, not 23, because there's an extra “implicit one” mantissa bit, not that it changes the above calculation much, but you needed to know that for C++ trivia night.

So, if we think about it for a year or two, it becomes obvious that the rightmost bits of the mantissa are simply the rightmost digits in “1.234567”, and if we truncate some of the rightmost bits, it's like truncating a very small fraction (e.g. “1.234567” becomes “1.2345” or whatever).

Hence, a first idea is just to cut off 2 of the 4 bytes of a 32-bit float. This leaves us with 1 sign bit, 8 exponent bits, and 7 mantissa bits (plus 1 implied bit makes 8 mantissa bits). In decimal, the 8-bit mantissa now encodes only about 2 or 3 decimal digits, as if we've truncated “1.234567” to “1.23”.

Incidentally, congratulations, you've created the “bfloat16” type, which is what Google did with TPUs, making a 2-byte float format with 1 sign bit, 8 exponent bits, and 7 stored mantissa bits. So, now you can get into your blue telephone booth, time travel back a decade, file a patent, and retire on your royalties. If you're ever a contestant on Wheel of Fortune you probably won't need to know that the “b” in “bfloat16” stands for “brain float” and that is such a great name. But I digress.

Anyhow, this idea actually works for precomputation. A 2-byte integer in bfloat16 format is easy to extract from a 4-byte FP32 float (i.e., the uppermost two bytes). The trick for bitwise processing is to convert the float to unsigned int, because the bitwise shift operators don't work on float (it's planned for C++37, as I heard at my fungus collector's club trivia night).

   float f32 = 3.14f;
   unsigned u32 = *(unsigned int*)&f32;

Extracting the top-most 2 bytes (16 bits) is simply a right bitshift:

   unsigned ubf16 = ( u32 >> 16 );

Note that there's a good reason we had to use the “unsigned” integer type. The right bitshift operator (>>) on a negative signed value is implementation-defined (at least prior to C++20), so “int” type wouldn't work predictably (or portably) if the floating-point sign bit was set.

The result is a 16-bit unsigned integer to use as the array index. Hence, there are only 1<<16=65,536 entries in our precomputation table. Assuming we store results as 4-byte float values, this makes the precomputation array's memory size about 262KB. What's more, it works for negative float numbers, because the sign bit is still part of that shemozzle, and we also don't need to check any minimum or maximum bounds, because it works for all 32-bit float numbers.
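
Putting those pieces together, here's a minimal sketch of the complete 16-bit table lookup (the table name is illustrative; memcpy is used for the float-to-integer bit copy, which does the same job as the pointer cast above but without the strict-aliasing concerns):

    #include <string.h>  // memcpy

    float g_sqrt_float_16bit_precomp_table[1u << 16];  // 65,536 precomputed results

    // Table-lookup sqrtf using the top 16 bits of the float as the array index.
    float aussie_table_lookup_sqrt_16bit_float(float f)
    {
        unsigned int u32 = 0;
        memcpy(&u32, &f, sizeof(u32));     // bitwise copy of the float's 32 bits
        unsigned int index = (u32 >> 16);  // keep sign, exponent, and 7 mantissa bits
        return g_sqrt_float_16bit_precomp_table[index];
    }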

Precomputing with 24-Bit Lookup Tables: Interestingly, none of the above code is especially tied to 16-bit sizes. The bfloat16 version truncates 32-bit float to 16-bit by truncating the rightmost 16 mantissa bits. But we can actually choose to keep however many mantissa bits we like. The trade-off is that more mantissa bits increase accuracy, but at the cost of needing a much bigger precomputation array (doubling the storage size for each extra bit).

Let's try only cutting the rightmost 8 mantissa bits, leaving us with 24 stored bits total (i.e. 1 sign bit, 8 exponent bits, and 15 stored mantissa bits). The mantissa bits reduce from 23 to 15 (plus one implied bit makes 16), so this now stores about 5 decimal digits (e.g. “1.2345”), giving quite good precision on our results. When I tested the 16-bit version, it had some reasonably large errors of almost 0.1 in computing sqrt, whereas this 24-bit version has much lower errors, as expected.

Code changes are minor. The bitshift operations simply change from 16 bits to 8 bits (i.e. 32-24=8 bits). This is the precomputation loop for 24 bits:

    void aussie_generic_precompute_24bit_float(float farr[], unsigned int maxn, float (*fnptr)(float))
    {
        for (unsigned int u = 0; u < maxn; u++) {
                unsigned int unum = (u << 8u);  // 32-24=8 bits!
                float f = *(float*)&unum;
                farr[u] = fnptr(f);
        }
    }

And this is the call to the precomputation function in the startup phase:

    aussie_generic_precompute_24bit_float(
        g_sqrt_float_24bit_precomp_table, // Bigger array float[1<<24]
        (int)AUSSIE_SQRT_24bit_MAX,    // 1 << 24
        aussie_sqrtf_basic_float       // Function pointer
    );

The table lookup routine also similarly shifts 8 bits, rather than 16, but is otherwise unchanged:

    float aussie_table_lookup_sqrt_24bit_float(float f)
    {
        unsigned u = *(unsigned int*)&f;
        u >>= 8;  // 32-24=8 bits
        return g_sqrt_float_24bit_precomp_table[u];
    }

Note that this only works if we are sure that both “float” and “unsigned int” are 32 bits, so we should check that with compile-time assertions via static_assert. If we are sure of that fact, then not only will it work, but we also don't need to check the array bounds. It won't try a negative array index, and won't overflow no matter what bit pattern we send it in as a float.

But there is one problem. If we send the fast table lookup version the special float value of NaN (“not a number”), then the table lookup routine will actually return a valid numeric answer, which probably isn't what we want. Maybe we need to add a check for that special case, and this needs more testing.
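
Here's a minimal sketch of both safety measures together, assuming we're willing to pay for a NaN test on every call (std::isnan is in <cmath>; memcpy replaces the pointer cast to avoid strict-aliasing issues):

    #include <cmath>    // std::isnan
    #include <cstring>  // memcpy

    static_assert(sizeof(float) == 4, "float must be 32 bits");
    static_assert(sizeof(unsigned int) == 4, "unsigned int must be 32 bits");

    extern float g_sqrt_float_24bit_precomp_table[];  // the 1<<24 table from above

    // NaN-guarded version of the 24-bit table lookup.
    float aussie_table_lookup_sqrt_24bit_float_checked(float f)
    {
        if (std::isnan(f)) return f;   // propagate NaN instead of a bogus table value
        unsigned int u = 0;
        memcpy(&u, &f, sizeof(u));     // bitwise copy of the float
        u >>= 8;                       // 32-24 = 8 bits
        return g_sqrt_float_24bit_precomp_table[u];
    }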

The new size of the precomputation array is 2^24=16,777,216, so we have about 16.7 million results. Full disclosure: I used an AI because I ran out of fingers and toes. If our results are 32-bit float values, our bfloat16 precomputed array above requires about 262KB, and the new size with 24 bits is a lookup table (array) of about 67 megabytes. It wouldn't have worked on my old TRS-80 CoCo in 1986, but it'll work nowadays.

Precalculating C++ Source Files

One way to improve on the precomputation of a big array is to skip it entirely during startup by writing a lot of code. It's like using an AI coding copilot, only it's not really. I mean, come on, the day an AI writes better code than me is the day that I retire to the hologram beach with my robot dog companions.

The idea here is to write a program to generate a C++ source file that contains the global precomputed lookup table. Yes, it's a C++ program that creates part of a C++ program, which is almost like your AI has become self-aware, only one step away from Skynet. Well, maybe not, it's just a dumb C++ program written by a dumb human creating some dumb data.

Anyway, this auto-generated C++ code can be compiled and linked into your C++ program, and used like a global array of data in other parts of the program. Zero calculations are required at runtime, and the data can be read-only.

The benefit is that this auto-generated code method does not even require the time cost of startup initialization for any precomputations. There's not even the cost of data file loading. Instead, the data is auto-loaded by the linker-loader during executable file instantiation (i.e. when the user starts the app). The only downsides for the user are that the size of the executable program increases, which means more disk space usage, and that application program startup may take longer and it will use more memory (regardless of whether it ever needs this precomputed data). Also, various offline tasks take longer for the software developers, such as compilation and linking for testing, which is why we bill per hour.

I tried this out for precalculating GELU with a 24-bit table. The C++ source file was 514k in size for a 24-bit precomputation table of size 1<<24. This is what the auto-generated source code should look like:

    // Precomputed table source code: GELU, "gelu_precomp_24bits.cpp"
    float g_gelu_table_precompute_24bits[] = { 
    0f,
    1.793662034335765850782373866611092648039e-43f,
    3.587324068671531701564747733222185296077e-43f,
    5.380986103007297552347121599833277944116e-43f,
    7.174648137343063403129495466444370592155e-43f,
    ...
    ...
    };

Here's the code to generate the code to generate the code to generate the code:

   void aussie_generic_setup_table_FP32_24bits_PRINT_SOURCE( // Print C++ of 24-bits GELU precomputed table 
        char* nickname,
        char* outfname,
        float (*fnptr)(float),  // e.g. GELU
        int maxn,  // eg. 1<<24
        float arrout[]  // array to store (optional, can be NULL)
    )
    {
        if (!fnptr) {
                yassert(fnptr);
                return;
        }
        // Generate C++ source code so we can pre-compile the precomputed GELU table (24-bits)
        // There are 2^24 = 16.7 million numbers...
        FILE* fp = stdout;
        bool writingfile = false;
        bool add_commented_number = true;
        if (outfname && *outfname) {
                fp = fopen(outfname, "w");
                if (!fp) {
                        yassert(fp);  // file write failed
                        return;  // fail
                }
                writingfile = true;
                add_commented_number = false;  // No extra comments for file output version
        }
        unsigned int u = 0;
        fprintf(fp, "// Precomputed table source code: %s, \"%s\"\n", nickname, outfname);
        fprintf(fp, "float g_gelu_table_precompute_24bits[] = { \n");
        char numbuf[5000] = "";
        for (; u < maxn /*1<<24*/ ; u++) {  // For all 2^24=~16.7M...
                unsigned int uval = u << 8;  // put zeros in the least significant 8 mantissa bits
                float f = AUSSIE_UINT_TO_FLOAT(uval);
                float g = fnptr(f);  // Call GELU or whatever
                if (arrout) arrout[u] = g;  // Store precomputed data (e.g. GELU)...

                // Format: %g means the smaller of %e or %f
                // ... %e is the exponent format (scientific-like format)
                char* buf = numbuf;
                sprintf(buf, "%40.40gf", g);  // Format %g (Number) and suffix "f" (float constant type)
                if (strchr(buf, 'n')) {
                        // Nan or "-nan" ... 
                        strcpy(buf, "0.0 /*nan*/");  // Dummy value for NaN
                }
                // Remove prefix padding spaces...
                while (buf[0] == ' ') buf++;

                // Remove suffix zeros ...
                int len = (int)strlen(buf);
                if (buf[len - 1] == 'f') len--;  // skip suffix f
                if (buf[len - 1] == '0') {
                        while (len > 5) {
                                if (buf[len - 1] == '0' && isdigit(buf[len - 2])) {
                                        if (buf[len] == 'f') {
                                                buf[len - 1] = 'f';  // remove it, but leave 'f'...
                                                buf[len] = 0;
                                        }
                                        else {
                                                buf[len - 1] = 0;  // remove it...
                                                buf[len] = 0;
                                        }
                                        len--;
                                }
                                else break;
                        }
                }

                if (add_commented_number) {
                        fprintf(fp, "%s // (%40.40f) [%u] \n", buf, f, u);
                }
                else {  // No comments...
                        fprintf(fp, "%s,\n", buf);
                }

                // Progress update
                if (u % 100000 == 0 && u != 0) {
                        if (writingfile) fprintf(stdout, "%u -- %s\n", u, buf);  // Progress to stdout...
                        fprintf(fp, "// U= [%u]\n", u);  // Comment occasionally
                }
        }
        fprintf(fp, "}; \n");  // Close initializer...
        if (fp && fp != stdout) fclose(fp);
    }

Conclusions on Source Code Generation: Does it work? Yes and no. It builds the output file quite quickly, zipping through 1<<24 computations and writing to disk. But I can't get this 24-bit version with its 500k CPP source file to actually compile in the Microsoft Visual Studio IDE. Maybe it works on Windows command-line or Linux GCC, but I haven't tried.

Anyway, this self-generating code idea is certainly quite workable for table lookups of approximations for FP16 numbers (16-bit half-precision floating-point), because the lookup table needs to “only” contain 2^16=65,536 numbers. This is about a 200k C++ source file in plain text, and it creates linked data of 65,536 times 4 bytes, or about 256KB of space usage (or half that space if you also store the results as 16-bit numbers rather than 32-bit floats or integers).

And one final point, just between you and me: I know what you're thinking, because an idea is forming in your head. Maybe, just maybe, you could try to precompute an entire AI model into C++ source code, and build it into an EXE file. In theory, you could save the cost of the model loading phase of inference. But it'll be a multi-gigabyte executable file, even if you can build it, and you probably can't anyway, because it'll likely be so large as to crash your C++ compiler. I'm not even sure that Windows or Linux could load such a big executable file. But maybe there are some high-end computing architectures where this might work. Good luck with that! (A better plan is to look into using an ML compiler.)

 
