Example: LUT Precomputation for sqrt

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Let's say that you want to optimize a slow non-linear function like “sqrtf” (or “expf” or “logf”). These are good candidates for optimization, and often occur in AI activation functions and Softmax.

The first point is that you'd better do a really good job, because there are actually hardware instructions for these common math functions, even on x86 architectures. So, you could easily optimize this into a table lookup, and find that your C++ code is still slower than the single CPU instruction that's called by the standard C++ library versions. Hence, investigate the C++ intrinsic functions for common math functions before you assume that you can do better than electrons zipping through silicon.
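As a concrete illustration of that advice, here is a minimal sketch (not from the book's code base) that calls the x86 SSE square root instruction directly via the `_mm_sqrt_ss` intrinsic, falling back to plain `sqrtf` elsewhere; the function name `hw_sqrtf` is my own invention:

```cpp
#include <cmath>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#endif

// Single-precision square root via the hardware SQRTSS instruction
// when SSE is available; otherwise the standard library version,
// which most compilers lower to the same instruction anyway.
inline float hw_sqrtf(float x)
{
#if defined(__SSE__) || defined(_M_X64)
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
#else
    return sqrtf(x);
#endif
}
```

On many compilers, plain `sqrtf` already compiles down to SQRTSS at higher optimization levels, so benchmark before assuming the intrinsic wins.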

This example investigates precomputing “sqrtf”, even though that may not be as fast as hardware acceleration. However, the same ideas apply to precomputing more sophisticated functions, such as Softmax and activation functions, which are not hardware-supported (or not yet, anyway).

The basic method for table lookup optimization is:

  • Declare a big array (the bigger the better).
  • Run a loop sending every value to the real “sqrtf” function.
  • Store each result in the big array.
  • Now you have a precomputed table of all possible values.
  • Later, use an array index lookup to compute the function fast.

How is that any faster? I mean, we've just called “sqrtf” a bazillion times with numbers that we probably won't ever need. Yes, there is an extra cost, and we are running slower during program initialization. There are at least two ways to fix this: load the array values from a pre-built binary data file instead, or precompile the array data into a C++ source code file.

However, this complaint underestimates just how many times an AI model will call these functions. The activation functions get called more times than there are chloroplasts on a tree. And furthermore, even with this startup cost, once that is all done and dusted, we have a big array of precomputed data that we can use to speed up the query processing phase (i.e. AI inference), which is our main goal. And in a production environment, any extra startup cost is hopefully amortized over many inference queries.

Example: Precomputing sqrt of integer: For simplicity, we're going to first assume that we're computing the float square root of an integer. The function we are precomputing has an “int-to-float” signature. This makes things easier, because the int argument can be used directly as an array index.

Here's my big array with 65,536 entries:

    #define AUSSIE_SQRT_PRECOMP_MAX (1u<<16)
    float g_sqrt_precomp_table[AUSSIE_SQRT_PRECOMP_MAX];

Here's the unoptimized function “int-to-float” version of “sqrtf” that we are planning to precompute:

    float aussie_sqrtf_basic_int(int x)
    {
        return sqrtf((float)x);
    }

Here's the initialization call to the precomputation routine, sending in the array, the size N, and the function pointer:

    aussie_generic_precompute_int(
        g_sqrt_precomp_table,   // Big array
        AUSSIE_SQRT_PRECOMP_MAX,  // N
        aussie_sqrtf_basic_int    // Function pointer
    );

And here's the code to run the big precomputation loop:

    void aussie_generic_precompute_int(float arr[], unsigned int maxn, float (*fnptr)(int))
    {
        for (unsigned int i = 0; i < maxn; i++) {
            arr[i] = fnptr((int)i);
        }
    }

So, that's all there is to the startup initialization of the lookup table. Once this function returns, we now have a big array full of data. Here's what the new optimized “sqrtf” looks like:

    float aussie_table_lookup_sqrt(int i)
    {
        return g_sqrt_precomp_table[i];
    }

And we can either make that function “inline” or use a C++ preprocessor macro:

    #define AUSSIE_TABLE_LOOKUP_SQRT_BASIC(i) \
         ( g_sqrt_precomp_table[(i)] )
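Putting the pieces together, here is a small end-to-end sketch that repeats the declarations above so it stands alone; once the precompute call has run, the macro lookup should match the real “sqrtf” exactly for in-range integers:

```cpp
#include <cmath>

#define AUSSIE_SQRT_PRECOMP_MAX (1u<<16)
float g_sqrt_precomp_table[AUSSIE_SQRT_PRECOMP_MAX];

float aussie_sqrtf_basic_int(int x) { return sqrtf((float)x); }

void aussie_generic_precompute_int(float arr[], unsigned int maxn,
                                   float (*fnptr)(int))
{
    for (unsigned int i = 0; i < maxn; i++) {
        arr[i] = fnptr((int)i);
    }
}

#define AUSSIE_TABLE_LOOKUP_SQRT_BASIC(i) ( g_sqrt_precomp_table[(i)] )

// Usage: precompute once at startup...
//   aussie_generic_precompute_int(g_sqrt_precomp_table,
//       AUSSIE_SQRT_PRECOMP_MAX, aussie_sqrtf_basic_int);
// ...then look up as often as needed:
//   AUSSIE_TABLE_LOOKUP_SQRT_BASIC(100)  // same value as sqrtf(100.0f)
```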

So, here are a few provisos about this code:

    1. Might be slower than sqrt in hardware (needs benchmarking).

    2. Unsafe array accesses: a negative index, or one past the end of the table, reads out of bounds (and may crash).

    3. An unsigned int loop counter will overflow and spin forever when precomputing a table of size “1<<32” (change it to a 64-bit type such as size_t).

    4. The memory size of the precomputed table for 1<<16 entries is already 256KB (65,536 floats times 4 bytes).
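Proviso 2 can be addressed with a bounds-checked wrapper. This is a sketch rather than code from the book: the name `aussie_table_lookup_sqrt_safe` is my own, and it returns NaN for negative inputs (matching sqrtf) and falls back to a direct sqrtf call for values past the end of the table:

```cpp
#include <cmath>

#define AUSSIE_SQRT_PRECOMP_MAX (1u<<16)
float g_sqrt_precomp_table[AUSSIE_SQRT_PRECOMP_MAX];

// Bounds-checked table lookup: never reads outside the array.
float aussie_table_lookup_sqrt_safe(int i)
{
    if (i < 0)
        return nanf("");        // sqrtf of a negative is NaN anyway
    if ((unsigned int)i >= AUSSIE_SQRT_PRECOMP_MAX)
        return sqrtf((float)i); // beyond the table: compute directly
    return g_sqrt_precomp_table[i];
}
```

The extra branches cost a little speed, so the unchecked version is still preferable on hot paths where the index is already known to be in range.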

 
