22. Vector Algorithms
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“Happiness is not a state of being.
Happiness is a vector, it is movement.”
— Neal Shusterman.
Vector Dot Product
Vector dot product is the most basic algorithm in an AI engine. All of the tensor operations and matrix multiplications break down into instances of a dot product calculation.
The dot product is so-named because its mathematical notation is a dot. It is also known as the “scalar product” because its result is a scalar (single number), rather than a vector.
The vector dot product takes two vectors as input, and computes a single float number. The algorithm multiplies the corresponding elements of the two vectors, and adds all of these products together.
Here's the code:
float aussie_vecdot_basic(float v1[], float v2[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += v1[i] * v2[i];
    }
    return sum;
}
Properties of the dot product include:
- Two vectors as input.
- Scalar output (single number).
- Can be positive or negative.
- Is zero if either vector is all zeros.
- Can also be zero for two non-zero vectors (e.g. if the vectors are “perpendicular” in 2-D or 3-D space).
- Has a physical meaning related to the “angle” between the two vectors.
- Is an integer if both vectors contain integers.
- Dot product of a vector with itself is the square of the vector's magnitude (equivalently, the vector's L2-squared norm).
- Is very slooow. Dot product-based operations inside matrices and tensors are the main culprit for AI needing all those GPUs.
The dot product differs from the “vector product” of two vectors (also called the “cross product”), which returns a vector and is a completely different mathematical operation. The cross product is mathematically interesting in that it computes a new vector perpendicular to two vectors in 3-D space, but it's not very useful for AI inference. The dot product is where the action's at in big tensors.
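For contrast, here's a minimal sketch of the 3-D cross product (a hypothetical helper, not something the AI engine needs):

// Sketch: 3-D vector cross product (hypothetical helper).
// Computes a vector perpendicular to both inputs.
void aussie_vec3_cross(const float a[3], const float b[3], float vout[3])
{
    vout[0] = a[1] * b[2] - a[2] * b[1];
    vout[1] = a[2] * b[0] - a[0] * b[2];
    vout[2] = a[0] * b[1] - a[1] * b[0];
}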
Vector Norms
Vector norms are measurements of vectors, and are used in various AI algorithms. For example, we can measure if two vectors are “close” to each other.
Vector norms map vectors to a single number. Note that vector norms are not the same thing as the “normalization” layer in a Transformer (i.e. LayerNorm or BatchNorm). Note also that a vector “norm” is not at all related to the similarly-named “normal vector” (a vector perpendicular to a surface). The norm is a number, whereas the normal is a vector, and they're not on speaking terms since that incident last summer.
L2 Norm: The basic norm of a vector is the level-2 (L2) norm, and you probably already know it. This is the length of the vector in physical space, also called the vector's “modulus” or “magnitude” in Mathematics 101. If you treat a vector as a “point” in space, the L2 norm is its straight-line distance from the origin.
The calculation of the L2 norm of a vector is a generalization of Pythagoras's Theorem: sum the squares of all the vector elements, and then take the square root. The code looks like:
float aussie_vector_L2_norm(float v[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += (v[i] * v[i]);  // Square
    }
    return sqrtf(sum);
}
Because we square every element, they all get turned positive. Zero squared is still zero. Once we've summed all the squares, we usually get a big positive number, which we then square root to get a smaller positive number. Hence, the result of the L2 norm is compressing a whole vector down to a single positive floating-point number.
The properties of the L2 norm are:
- Floating-point number (e.g. 0.567 or 5.6789 or 3.0 or whatever).
- Positive number (never negative).
- Zero only if the whole vector is zero.
- Represents the “length” (or “modulus” or “magnitude”) of a vector: its “Euclidean distance” from the origin.
- Usually a non-integer, even if the vector was all integers.
For a simple 2-D or 3-D vector in Senior Math, the L2 norm is the physical length of the vector in 2-D or 3-D space (or the length of the line from the origin to the equivalent point). For AI, which has vectors in 1024 dimensions, or N-dimensional vectors for whatever N is being used, there's not really a physical explanation of the L2 norm that's easy to visualize, but it's still a measure of the length of the vector in N-dimensional space.
Note that the value of the L2 norm is not unique. Two different vectors can have the same value for the L2 norm. In fact, an infinite number of vectors can have the same value, and those vectors are the set of vectors with the same length (magnitude), which will define a sphere in N-dimensional space.
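One common use of the L2 norm is the “closeness” measure mentioned earlier: the Euclidean distance between two vectors is just the L2 norm of their difference. Here's a minimal sketch (the helper name is hypothetical):

#include <math.h>

// Sketch: Euclidean (L2) distance between two vectors,
// i.e. the L2 norm of their element-wise difference
float aussie_vector_distance_L2(const float v1[], const float v2[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float diff = v1[i] - v2[i];
        sum += diff * diff;  // Square
    }
    return sqrtf(sum);
}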
L2-squared norm: A minor modification of the L2 norm is the “squared L2 norm”, which is, as you may have guessed, the square of the L2 norm. To put it another way, it's just the L2 norm without the square-root at the end. The code looks like:
float aussie_vector_L2_squared_norm(float v[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += (v[i] * v[i]);  // Square
    }
    return sum;  // NOT sqrtf(sum)
}
The value of the L2-squared norm is a positive number, but a much larger one. The physical meaning is the square of the physical/Euclidean length of the vector. The L2-squared norm also equals the vector's dot product with itself.
Why use the L2-squared norm? Because it's faster to skip the square-root operation, of course. Also, if the vector contains integers, then the L2-squared norm is also an integer, which can make it even faster to compute for an AI engine running in integer-only mode. The L2-squared norm is just as good as basic L2 for some uses. The properties of L2 and L2-squared norms are very similar except that one is a much larger number. Both are positive and related to Euclidean distance, and both increase monotonically the further the vector is away from the origin.
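As a quick illustration of the dot product equivalence, the L2-squared norm can also be computed by reusing the aussie_vecdot_basic routine from earlier, passing the same vector twice:

// L2-squared norm as a dot product of the vector with itself
float sumsq = aussie_vecdot_basic(v, v, n);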
Level 1 Norm: As you can guess from my calling it the L2 norm, there's also an L1 norm, and there are L3 norms, and more. Let's look at the L1 norm, because it's even simpler, although it's not usually covered when studying vectors in Math class.
The L1 norm is simply the sum of the absolute values of all the vector elements. We don't square them. We don't take the square root. We just make them positive and add them up. The code looks like:
float aussie_vector_L1_norm(float v[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += fabsf(v[i]);  // Absolute value
    }
    return sum;
}
Taking absolute values flips any negative vector elements to positive, so the total can never go negative, and every non-zero element adds to the total. A zero element is fine in the vector, but contributes nothing. The result of the L1 norm is a single positive float number, which can be fractional or whole, ranging from zero upwards (i.e. if you have big numbers in the vector elements, then the L1 norm will also be large).
The properties of the L1 norm are:
- Floating-point number (fractional or whole).
- Positive number (never negative).
- Zero only if all vector elements are zero.
- Physical meaning is an obscure distance measure (the “Manhattan distance”).
- Will be an integer if the vector elements are integers.
What does an L1 norm mean? It's kind of like the distance you'd travel if you walked to the point along each element/dimension of the vector, one axis at a time, and without going backwards (no negatives). So, the L2 norm was the fastest diagonal direct way to get to a point, whereas the L1 norm takes the scenic route, and hence the L1 norm is always at least as large as the L2 norm.
Like the L2 norm, the L1 norm is not unique. Multiple vectors can have the same L1 norm. For example, the vectors (1,2) and (0.5,2.5) both have an L1 norm of 3.0. I'm not really sure what the set of all the vectors with the same L1 norm means. Maybe it's this: all the points that you can walk to from the origin if you travel a certain distance (going forwards-only)?
L3 Norms and Above: The mathematical vector norms can be generalized to L3 and higher norms, even to infinity. For an L3 norm, you cube all the vector elements (made positive by absolute value), sum them, and take the cube root at the end. C++ actually has a cbrtf cube root function, but the more general trick is to remember that a cube root is exponentiation to the power of 1/3 (from Year 10 math), so we can use the “powf” function. Here's the code:
float aussie_vector_L3_norm(float v[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float f = fabsf(v[i]);  // Absolute value
        sum += (f * f * f);     // Cube
    }
    const float frac_third = 1.0f / 3.0f;
    return powf(sum, frac_third);
}
Can you guess what an L4 norm is? The higher order versions are really fun and interesting if you wear socks with your sandals, but not very useful in AI coding.
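If you do want the general case, here's a minimal sketch of an Ln norm for any order p (a hypothetical helper, following the same pattern as the L3 code above): raise the absolute values to the p-th power, sum, then take the p-th root.

#include <math.h>

// Sketch: generalized Lp norm (hypothetical helper).
// L1, L2 and L3 are the special cases p = 1, 2, 3.
float aussie_vector_Lp_norm(const float v[], int n, float p)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += powf(fabsf(v[i]), p);  // |x|^p
    }
    return powf(sum, 1.0f / p);  // p-th root
}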
Matrix Norms
There are norms for matrices, but they're not really that often used. Taking a “measurement” of a matrix via a “norm” (or a “metric”) to compare it to other matrices isn't a common task.
The silly ones are element-wise matrix norms. You can define an L1 or L2 norm on a matrix using the same algorithm over all its elements. You can also find the maximum element inside a matrix, and call that the “max norm” if you like to sound math-ish. The reason I say these are dumb? Because they ignore the structure of the matrix, so it's a kind of “pseudo-norm” of a matrix. It's really just treating a matrix like it's a big, flat vector, and to me it seems more like misusing a vector norm on a matrix.
More sensible matrix norms consider the rows or columns of the matrix as separate vectors. An NxN matrix has N column vectors or N row vectors, so there are N vector norms either way. Should we add them up? No: taking the maximum of the vector-wise L1 or L2 row/column vector norms has a more useful meaning as a matrix norm than the element-wise matrix L1 or L2 pseudo-norms (see the sketch below). You can do this maximum-of-vector-norms either for rows or columns, but not both.
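Here's a minimal sketch of the column-wise version (a hypothetical helper, assuming a row-major NxN matrix stored as a flat array): compute the L1 norm of each column vector, and keep the maximum.

#include <math.h>

// Sketch: matrix norm as the maximum of the column-wise L1 norms
// (hypothetical helper; m[] is a row-major n x n matrix)
float aussie_matrix_colwise_L1_max(const float m[], int n)
{
    float maxnorm = 0.0f;
    for (int col = 0; col < n; col++) {
        float colsum = 0.0f;  // L1 norm of this column vector
        for (int row = 0; row < n; row++)
            colsum += fabsf(m[row * n + col]);
        if (colsum > maxnorm)
            maxnorm = colsum;
    }
    return maxnorm;
}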
Vector Min and Max
Finding the maximum or minimum element of a vector is useful, and somewhat relevant to the L1/L2 norms. The maximum is a kind of “metric” of the size of a vector. Also, the maximum function over a vector is used in “greedy decoding” to pick the word with the highest predicted probability, which is then output. The minimum function would give us the least likely word, which might also be interesting, if useless.
The simple linear code for vector max is:
float aussie_vector_max(float v[], int n)  // Maximum
{
    float vmax = v[0];
    for (int i = 1 /*not 0*/; i < n; i++) {
        if (v[i] > vmax)
            vmax = v[i];
    }
    return vmax;
}
The vector minimum function looks similar in sequential C++ code:
float aussie_vector_min(float v[], int n)  // Minimum
{
    float vmin = v[0];
    for (int i = 1 /*not 0*/; i < n; i++) {
        if (v[i] < vmin)
            vmin = v[i];
    }
    return vmin;
}
These functions are crying out for optimizations: loop unrolling, pointer arithmetic, etc. However, what they really need is vectorization. There are parallelized max and min primitives in GPUs and CPU-based AVX intrinsics that you can use.
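For example, here's a minimal sketch of vector max using AVX intrinsics (assuming an x86 CPU with AVX support and n >= 8; the function name is hypothetical):

#include <immintrin.h>

// Sketch: vector max, 8 floats at a time via AVX (assumes n >= 8)
float aussie_vector_max_avx(const float v[], int n)
{
    __m256 maxv = _mm256_loadu_ps(&v[0]);  // First 8 elements
    int i = 8;
    for (; i + 8 <= n; i += 8) {
        maxv = _mm256_max_ps(maxv, _mm256_loadu_ps(&v[i]));
    }
    float arr[8];
    _mm256_storeu_ps(arr, maxv);  // Reduce the 8 lanes sequentially
    float vmax = arr[0];
    for (int j = 1; j < 8; j++) {
        if (arr[j] > vmax) vmax = arr[j];
    }
    for (; i < n; i++) {  // Leftovers if n is not a multiple of 8
        if (v[i] > vmax) vmax = v[i];
    }
    return vmax;
}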
Top-K Vector Algorithm
The top-k algorithm is more complicated than vector max or min: find the largest k elements in a vector.
Note that “maximum” is the same as top-k with k=1.
If you want the short version of the top-k story in C++, there's a std::partial_sort function that sorts the top k elements, and there's also std::sort for a full array sort.
The top-k algorithm is quite important in AI engines. It gives us “top-k decoding” which is how to choose which word to output. The whole encoder-decoder computes a vector giving us the probability that each word should be output next. Using the maximum probability word gives us “greedy decoding” which always outputs the most likely word. But that's kind of boring and predictable, so top-k decoding randomly chooses between the k most likely words (e.g. top-50), which is still very accurate and also more interesting because it has some creative variation.
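As a sketch of that random choice (a hypothetical helper; assumes the top-k probabilities have already been collected in vout[], largest first), one simple approach is to renormalize the k probabilities and sample proportionally:

#include <stdlib.h>

// Sketch: pick a random index 0..k-1, weighted by probability
int aussie_top_k_sample(const float vout[], int k)
{
    float total = 0.0f;
    for (int i = 0; i < k; i++)
        total += vout[i];  // Renormalization factor
    float r = ((float)rand() / (float)RAND_MAX) * total;
    for (int i = 0; i < k; i++) {
        r -= vout[i];
        if (r <= 0.0f)
            return i;  // Index into the top-k list
    }
    return k - 1;  // Guard against floating-point rounding
}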
Example: Hand-coded top-2 algorithm: Since top-1 is just the maximum of a vector, we can also write a fairly simple linear scan for k=2. The basic idea is to scan through and keep track of the two largest values as we go.
void aussie_vector_top_k_2(float v[], int n, float vout[])
{
    // Order the first 2 elements
    float vmax1 = v[0], vmax2 = v[1];
    if (v[1] > v[0]) {
        vmax1 = v[1];  // Reverse them
        vmax2 = v[0];
    }
    for (int i = 2 /*not 0*/; i < n; i++) {
        if (v[i] > vmax2) {  // Bigger than the smallest one
            if (v[i] > vmax1) {  // Bigger than both (shuffle)
                vmax2 = vmax1;
                vmax1 = v[i];
            }
            else {  // In the middle (fix 2nd only)
                vmax2 = v[i];
            }
        }
    }
    vout[0] = vmax1;  // Biggest
    vout[1] = vmax2;  // 2nd biggest
}
Note that the above top-2 algorithm is still impractical for our word decoding algorithm. We need to know not only the top probabilities, but also which two indices in the vector had those probabilities, because that's how we know which words map to which probabilities. So, we'd need to modify the above code to track and return the two array indices as well (or instead).
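Here's a minimal sketch of that index-tracking variant (a hypothetical modification of the code above):

// Sketch: top-2 scan that returns the two array indices
// (hypothetical variant of aussie_vector_top_k_2)
void aussie_vector_top_k_2_indices(const float v[], int n, int iout[])
{
    // Order the first 2 elements
    int imax1 = 0, imax2 = 1;
    if (v[1] > v[0]) {
        imax1 = 1;  // Reverse them
        imax2 = 0;
    }
    for (int i = 2 /*not 0*/; i < n; i++) {
        if (v[i] > v[imax2]) {      // Bigger than the smallest one
            if (v[i] > v[imax1]) {  // Bigger than both (shuffle)
                imax2 = imax1;
                imax1 = i;
            }
            else {                  // In the middle (fix 2nd only)
                imax2 = i;
            }
        }
    }
    iout[0] = imax1;  // Index of the biggest
    iout[1] = imax2;  // Index of the 2nd biggest
}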
Example: Shuffle top-k algorithm: For a larger value of k the code becomes more complicated. The above code for k=2 motivates the general idea for a brute-force algorithm: shuffle sort the first k elements, and then scan the rest, shuffling any larger items up into place. We can merge the two shuffling phases into one block of code that handles both the startup and ongoing scan.
void aussie_vector_top_k_shuffle(float v[], int n, int k, float vout[])
{
    yassert(n >= k);
    vout[0] = v[0];
    int nout = 1;
    for (int i = 1 /*not 0*/; i < n; i++) {
        float fnew = v[i];
        int maxj;
        if (nout < k) {  // Startup: append, then shuffle into place
            vout[nout++] = fnew;
            maxj = nout - 2;
        }
        else {  // Full: start at the smallest of the top-k
            maxj = nout - 1;
        }
        for (int j = maxj; j >= 0; j--) {
            if (fnew > vout[j]) {  // Shuffle & insert
                if (j + 1 < k)
                    vout[j + 1] = vout[j];  // Shuffle down
                vout[j] = fnew;
                // Keep going
            }
            else {
                break;  // Done: fnew is already in place (or missed out)
            }
        }
    }
}
The above example is a simplistic and inefficient top-k algorithm, not to mention that it was horribly fiddly and failed my unit tests for hours (i.e. a special kind of “fun”). Several loop optimizations suggest themselves: loop sectioning for the outer i loop to do the first k iterations as a separate loop (avoiding lots of tests against k), and loop peeling of the first iteration of the inner j loop (i.e. j==maxj). This version should also be extended to track the indices from where the top-k values came.
Theoretical Top-K Algorithms: There's a lot of theory about computing the top-k function of an array for large k values, which is exactly the structure of our “logit” word probability vectors (we want k=50 or more). These theoretical top-k algorithm papers mainly consider sequential processing, rather than vectorization. Even so, top-k is not a simple linear scan like the max or min functions, but it doesn't need to be as slow as shuffling.
Example: Top-k with qsort sorting: The simplest method for large k is to sort the array with a fast method (e.g. the Quicksort algorithm) and then pick off the top k elements from the sorted array. In C++ there are the std::sort methods or the older-style qsort function. Here's an example using the C++ standard qsort function:
int aussie_top_k_qsort_cmp(void const* addr1, void const* addr2)
{
    float f1 = *(float*)addr1;
    float f2 = *(float*)addr2;
    if (f1 < f2) return +1;  // Reversed (descending)
    else if (f1 > f2) return -1;
    else return 0;
}

void aussie_vector_top_k_qsort(float v[], int n, int k, float vout[])
{
    // Top-k with general k (qsort algorithm)
    // Sort the array
    qsort(v, n, sizeof(vout[0]), aussie_top_k_qsort_cmp);
    // Copy top-k elements
    for (int i = 0; i < k; i++)
        vout[i] = v[i];
}

Top-k with qsort and permutation array: We really need a version that returns the indices of the probabilities, rather than just their values. So, I coded up a qsort version that sorts via a permutation array, and then returns the top-k of these permutation indices.
void aussie_permutation_identity(int permut[], int n)
{
    for (int i = 0; i < n; i++)
        permut[i] = i;
}

float* g_float_array_for_qsort = NULL;

int aussie_top_k_qsort_permutation_cmp(void const* addr1, void const* addr2)
{
    int index1 = *(int*)addr1;
    int index2 = *(int*)addr2;
    float f1 = g_float_array_for_qsort[index1];
    float f2 = g_float_array_for_qsort[index2];
    if (f1 < f2) return +1;  // Reverse (descending)
    else if (f1 > f2) return -1;
    else return 0;
}

void aussie_vector_top_k_qsort_permut(float v[], int n, int k,
                                      float vout[], int permut_out[])
{
    // Create a dynamic permutation array
    int* permut_arr = ::new int[n];
    // Identity permutation
    aussie_permutation_identity(permut_arr, n);
    // Sort the array (by permutation)
    g_float_array_for_qsort = v;
    qsort(permut_arr, n, sizeof(permut_arr[0]),
          aussie_top_k_qsort_permutation_cmp);
    // Copy top-k elements
    for (int i = 0; i < k; i++) {
        permut_out[i] = permut_arr[i];
        vout[i] = v[permut_arr[i]];
    }
    delete[] permut_arr;
}
Top-k without sorting: Sorting the whole array is somewhat wasteful if we only want the top 50 elements. There are various faster top-k algorithms that don't fully sort the array.
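One classic approach is a min-heap of size k: scan the array once, keeping only the k largest elements seen so far, for O(n log k) work rather than O(n log n). Here's a minimal sketch using std::priority_queue (the helper name is hypothetical; assumes n >= k):

#include <queue>
#include <vector>
#include <functional>

// Sketch: top-k via a size-k min-heap (no full sort).
// The heap's top is the smallest of the current top-k candidates.
void aussie_vector_top_k_heap(const float v[], int n, int k, float vout[])
{
    std::priority_queue<float, std::vector<float>,
                        std::greater<float> > heap;
    for (int i = 0; i < n; i++) {
        if ((int)heap.size() < k) {
            heap.push(v[i]);
        }
        else if (v[i] > heap.top()) {
            heap.pop();  // Evict the smallest candidate
            heap.push(v[i]);
        }
    }
    // Pop off smallest-first, filling vout[] back-to-front (descending)
    for (int i = k - 1; i >= 0; i--) {
        vout[i] = heap.top();
        heap.pop();
    }
}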
Standard C++ top-k libraries: Note that the standard C++ libraries have support for sorting algorithms on std::vector, such as std::sort. There is also a top-k version called std::partial_sort, which sorts only the top k elements of an array, and these can then be selected for the top-k algorithm. Presumably, std::partial_sort is faster than std::sort, because it doesn't fully sort the whole array, but I haven't tested it. There is also std::nth_element, which partitions the array around its nth-ranked element, and is similar to top-k.
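Here's a minimal sketch of top-k via std::partial_sort (assuming a raw float array and descending order via std::greater; the function name is hypothetical):

#include <algorithm>
#include <functional>

// Sketch: top-k via std::partial_sort, which sorts only
// the first k positions (the k largest, with std::greater)
void aussie_vector_top_k_partial_sort(float v[], int n, int k, float vout[])
{
    std::partial_sort(v, v + k, v + n, std::greater<float>());
    for (int i = 0; i < k; i++)
        vout[i] = v[i];  // Largest first
}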