Aussie AI
23. Tensors
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“The Matrix is everywhere. It is all around us.
It is the world that has been pulled over your eyes
to blind you from the truth.
A prison for your mind.”
— The Matrix, 1999.
What are Tensors?
Tensors are terrifying at first! I avoided learning about them for ages. All those nested loops are scary. But eventually it dawned on me that they're just three-dimensional arrays, and the computations are nothing harder than multiplication and addition.
An important point is that “tensors” in Computer Science are much different to the mathematical forms used in Physics. AI tensors are used in “linear algebra” and are much more basic than the 4-D space-time tensors in Einstein's theory of general relativity. Which may explain why all those brainy physicists are so smug, despite being unable to predict if it'll rain tomorrow.
Tensors in AI are simply multi-dimensional arrays, and are usually 3-dimensional in AI engines. Each slice of a 3-D tensor is a two-dimensional matrix. And like vectors and matrices, tensors have these basic properties:
(a) Each element stores a single number (i.e. no strings or objects).
(b) All elements have the same data type (e.g. int or float).
(c) Elements may be positive, negative or zero.
(d) There are no missing elements. The concept of “missing” can only be represented by zero in a normal tensor.
There are exceptions, of course. There are “sparse tensors” that can represent elements as missing. Also, you can technically store strings or objects in a C++ three-dimensional array, but then it's more of a misuse of a tensor. Numbers are where it's at.
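In C++ terms, here is a minimal sketch of a 3-D tensor as a plain three-dimensional array (the sizes and values are just illustrative):

#include <cstdio>

int main() {
    // A tiny 3-D tensor: 2 slices, each a 3x4 matrix of floats.
    float tensor[2][3][4] = {};   // every element is a single number, all the same type

    tensor[0][1][2] = 3.14f;      // elements can be positive...
    tensor[1][0][0] = -1.0f;      // ...negative, or zero (the default here)

    // Each slice tensor[i] is an ordinary 3x4 matrix.
    printf("tensor[0][1][2] = %f\n", tensor[0][1][2]);
    return 0;
}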
AI Tensor Computations
Everything's a tensor in AI. Tensors are technically the superset of all of the computational structures, and the number of dimensions is called the “rank” or “dimension” or “axes” of a tensor. Matrices are rank-2 tensors, vectors are rank-1 tensors, and even scalars are rank-0 tensors.
Another way to think about tensors is in terms of nested loops. Scanning a vector requires one loop, and a matrix needs two nested loops. Tensor operations require three or more nested loops to process all their data.
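For example, here's a sketch of the three nested loops needed to visit every element of a small 3-D tensor (the dimension sizes are arbitrary examples):

const int D1 = 2, D2 = 3, D3 = 4;   // example tensor dimensions

// Sum every element of a 3-D tensor: one nested loop per dimension.
float tensor_sum(const float t[D1][D2][D3])
{
    float sum = 0.0f;
    for (int i = 0; i < D1; i++)           // loop over slices
        for (int j = 0; j < D2; j++)       // loop over rows of each slice
            for (int k = 0; k < D3; k++)   // loop over columns of each row
                sum += t[i][j][k];
    return sum;
}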
Conceptually, there's a hierarchy of operations on model data in an AI engine:
- 3-D tensor operations break down into 2-D matrix multiplications.
- 2-D matrix multiplications break down into vector dot products.
- 1-D vector dot products break down to a single float number (a scalar).
- 0-D scalars are probabilities between 0 and 1 (known as “logits”).
The final output logits are probabilities indicating whether to write “squirrel” or “aardvark” if the prior word was “ferocious”. I really don't know the answer to that, but your AI engine sure does, and to about three decimal places.
Note that I said “conceptually” not “in practice”, because the model data doesn't really get processed in a nice hierarchy like that. The real C++ code that does tensor operations sends off random strips of data to the GPU in a pattern like a turkey in a breakdancing competition. It's fun to watch, but who knows what's going on.
Neural Network Theory and Tensors
I'm not going to take you in detail through the theory of how neural networks function. But in broad strokes, there are “neurons” in layers, where each neuron has a “signal,” and there are also connections between neurons that forward the strength of a signal on to the next layer of neurons. So, each neuron is connected to every neuron in the previous layer by an “arc,” and on that arc is a “weight” that says how strong or weak to consider the incoming neuron's signal.
But how do we get to tensors from that? Not obvious.
Let's step back a little and be one with the neuron. So, we are just one neuron in a layer of 100 neurons. And the previous layer has 100 neurons, and we are “fully connected” with arcs from every one of those 100 prior neurons. With 100 neurons in the previous layer, our little lonely neuron has to consider the signals from all of the 100 neurons in the prior layer, with 100 weights on the arcs to help decide how much attention to pay to each of the 100 prior neurons.
Consider the previous layer of 100 neurons as a “vector” of each neuron's computed values. What this means is that every one of the 100 prior neurons has a number for its computed signal, so we have a vector of 100 signal numbers from the prior layer (i.e. a vector full of 100 neuron computed values).
Again, our little neuron has to receive a computed signal value from every one of the 100 prior-layer neurons, so we have 100 arcs coming into our little neuron, each with its own number, which is the “weight” of that arc. The computed value of a prior neuron is multiplied by the “weight” on its arc (i.e. there are 100 weights, one for each arc). So, every one of the arcs from the 100 neurons in the prior layer has a weight, and what does that sound like? A vector of weights.
So, we have the computed values of the 100 prior-layer neurons in a vector, where each one of those 100 signal values is multiplied by a weight from a vector of 100 weights. Hence, we've arrived at pairwise multiplication, where we multiply 100 neuron values times 100 associated weights. In other words, we've got an element-wise multiplication of two vectors (100 values times 100 weights), which creates a vector of 100 multiplication results.
But our little neuron cannot have 100 computed values; it can really only have one number, the total computed signal for our current neuron. There are various things we could do to “reduce” our interim vector of 100 multiplications, but the simplest is to add them all up, which gives us one number. Now we have one number, and it's the computed signal value for our current neuron.
Umm, I remember that from High School. If we multiply two vectors together with the numbers in pairs, and then add it all up: vector dot product.
In summary, we have a vector dot product for our single neuron in the current layer, based on two vectors from the prior layer (the vector of 100 calculated neuron values, and the vector of 100 weights).
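In C++, a basic non-vectorized sketch of that dot product is just one loop of multiply-and-accumulate (the function and parameter names are mine, just for illustration):

// Dot product of the prior layer's signals and one neuron's incoming weights.
// Returns the single number that becomes our neuron's computed signal.
float dot_product(const float signals[], const float weights[], int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += signals[i] * weights[i];   // pairwise multiply, then accumulate
    }
    return sum;   // e.g. n = 100 for a prior layer of 100 neurons
}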
But this is just for our one lonely neuron. Except, it's not lonely, since it has 99 friends, because it's in a layer of 100 neurons itself. So, our neuron and its 99 friends in the current layer, all have to do a different dot product computation because the weights are different for each set of arcs into each neuron. We have a whole vector of 100 neurons in the current layer, for which we have to compute dot products of 100 values times 100 weights (i.e. using the prior layer). So, we have to do 100 vector dot products to calculate the result for our neuron and its 99 friends. If we do 100 repetitions of vector dot products, this sounds like...matrix multiplication.
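Here's a sketch of that in C++: one dot product per current-layer neuron, which amounts to multiplying a 100x100 weight matrix by the prior layer's signal vector (the names and sizes are just illustrative):

const int N = 100;   // neurons per layer (example size)

// Compute the current layer's N signals from the prior layer's N signals.
// weights[i] holds the N incoming arc weights for neuron i of the current layer.
void layer_forward(const float weights[N][N], const float prior[N], float current[N])
{
    for (int i = 0; i < N; i++) {       // one dot product per current-layer neuron
        float sum = 0.0f;
        for (int j = 0; j < N; j++)
            sum += weights[i][j] * prior[j];
        current[i] = sum;
    }
}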
But that's not all. There's a third dimension based on the “tokens” in the prompt, which is represented by an “embeddings” vector. And with this third dimension thrown in, well, then it's a whole vector worth of matrix multiplications, and we get to a 3-D operation called a “tensor product.” Tensors are three-dimensional blocks full of numbers (i.e. cubes or rectangular prisms), which generalize two-dimensional matrices, which generalize one-dimensional vectors, which generalize zero-dimensional scalars. And if you have any common sense, you've stopped reading this section by now, so I'm not going to try explaining this mind-bending tensor stuff any further.
Tensor Arithmetic
Tensors are a convenient and efficient representation of multi-dimensional data. Since AI computations involve a lot of matrix multiplications, it is useful to represent a sequence of matrix operations as a tensor operation.
Importantly, the arithmetic performed is the same. Using a tensor is computationally efficient for parallelization of algorithms, and also mathematically concise for theoretical analysis, but is not some fantastically amazing new AI algorithm. It's just crunching lots of numbers with the standard methods. Usually, it's the same as an array of matrices, where you do matrix multiplication on each one.
In practice, tensor kernels will send out different chunks of that computation all over the place for parallel speedup, but it's still computing the exact same numbers as if you did it all brute-force in nested loops. You could even follow along with a pen and paper, except that the computer is better because it won't forget to carry the negative sign.
Tensor shape. Another point is the shape of a tensor. I'm sure you know that matrices may be square or rectangular in shape, but can't be a skewed parallelogram or a circle. Yes, you're right, there are triangular matrices, but now you're messing up my nice clean point.
Anyway, a 3-D tensor can have different sizes on each of its three dimensions. Hence, a 3-D tensor can be a cube if all three sizes are identical, but usually they have the shape of a more general rectangular prism. And it still has a brick-like shape, and can't really represent a triangle, cone, or sphere. Tensors are much less scary if you sing Everything is Awesome while you code the nested loops.
Unary Tensor Operations
As with a 2-D matrix, there are various simple operations we can define on a single tensor. These element-wise operations apply individually to each tensor element:
- Clear or set to a value
- Add or subtract a scalar
- Multiply or divide by a scalar
Similarly, we could apply a particular unary mathematical function to each element separately: square root, exponentiation, natural logarithm, and more.
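Here's a sketch of a couple of those unary element-wise operations on a small fixed-size 3-D tensor (the dimension names and sizes are arbitrary examples):

#include <cmath>

const int A = 2, B = 3, C = 4;   // example tensor dimensions

// Multiply every element of a 3-D tensor by a scalar, in place.
void tensor_scale(float t[A][B][C], float scalar)
{
    for (int i = 0; i < A; i++)
        for (int j = 0; j < B; j++)
            for (int k = 0; k < C; k++)
                t[i][j][k] *= scalar;
}

// Apply a unary math function (here sqrtf) to every element, in place.
void tensor_sqrt(float t[A][B][C])
{
    for (int i = 0; i < A; i++)
        for (int j = 0; j < B; j++)
            for (int k = 0; k < C; k++)
                t[i][j][k] = sqrtf(t[i][j][k]);
}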
Binary Elementwise Tensor Operations
Adding two matrices means simply adding each pair of elements in the matrix, which only works if the two matrices have the same size and shape. The same idea generalizes to the addition of tensor elements of two tensors with the same size (i.e. all three dimensions are the same). Hence, we can do element-wise binary arithmetic on each element in two tensors to create a third tensor of the same size:
- Addition or subtraction
- Multiplication or division
- Maximum or minimum
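As a sketch, element-wise addition of two same-sized tensors looks like this in C++; the other element-wise binary operations just swap out the operator (or use a min/max test):

const int A = 2, B = 3, C = 4;   // example dimensions; all three must match between tensors

// Element-wise addition: result[i][j][k] = t1[i][j][k] + t2[i][j][k]
void tensor_add(const float t1[A][B][C], const float t2[A][B][C], float result[A][B][C])
{
    for (int i = 0; i < A; i++)
        for (int j = 0; j < B; j++)
            for (int k = 0; k < C; k++)
                result[i][j][k] = t1[i][j][k] + t2[i][j][k];
}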
Note that element-wise multiplication of tensor elements is not “tensor multiplication” in the same way that matrix multiplication isn't just paired multiplications of the elements in two matrices. Such an element-wise multiplication is called the “Hadamard product” of matrices, and is so useless that I don't think I was ever taught that in High School. The Hadamard product is not what is used by AI inference computations, but I've seen a few research papers where it was proposed as an optimization (probably unsuccessfully). Matrix multiplication is more complex, with its row-by-column vector dot product multiplications, and so is generalizing that to tensors.
That's how we get to “tensor product” of two tensors. It's really just nested loops doing matrix multiplications on slices of each tensor. And then matrix multiplications are just nested loops doing vector dot products. Like I said, tensors are just three-dimensional arrays doing multiplication and addition.
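Here's a sketch of that idea as a batched matrix multiplication: the outer loop walks the slices, and the inner loops are an ordinary matrix multiply of the corresponding slices (the dimension names and sizes are just illustrative):

const int SLICES = 2, ROWS = 3, INNER = 4, COLS = 5;   // example dimensions

// Batched matrix multiplication: for each slice, multiply a ROWSxINNER matrix
// by an INNERxCOLS matrix, giving a ROWSxCOLS result slice.
void tensor_matmul(const float a[SLICES][ROWS][INNER],
                   const float b[SLICES][INNER][COLS],
                   float out[SLICES][ROWS][COLS])
{
    for (int s = 0; s < SLICES; s++) {            // loop over slices
        for (int i = 0; i < ROWS; i++) {          // each row of slice s of a
            for (int j = 0; j < COLS; j++) {      // each column of slice s of b
                float sum = 0.0f;                 // vector dot product of row and column
                for (int k = 0; k < INNER; k++)
                    sum += a[s][i][k] * b[s][k][j];
                out[s][i][j] = sum;
            }
        }
    }
}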
Sparse Tensors
Sparse tensors occur when most of the values are zero. These are a generalization of sparse vectors and sparse matrices, and offer the same advantages: compressed storage and faster arithmetic operations (by skipping operations involving zero).
The level of sparsity required for optimization usually means 80-90% of the weights are zero. With so few non-zero values, tensor arithmetic involves fewer operations and the memory requirements are low (i.e. store only the non-zero weights). Such sparsity is often the result of a “pruning” optimization (see Chapter 33), but there are also obscure theoretical means to get sparse tensors using tensor algebra (let's not even go there!).
When there is a high degree of sparsity, such as when 80-90% of the values are zero, it becomes more efficient to use alternative algorithms. Sparse tensors can be stored in a permutation index format, where only the index locations of non-zero items are stored (e.g. storing a four-tuple with the non-zero value and the three indices at which it is located in the tensor). Operations on sparse tensors can use the alternative storage format to create much more efficient kernels that avoid most of the computations involving the missing zero values.
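Here's a sketch of that kind of storage format in C++: each non-zero value is stored as a four-tuple of the value plus its three indices, and operations loop over only those entries (the struct and function names are mine, just for illustration):

#include <vector>

// One non-zero entry of a sparse 3-D tensor: the value plus its three indices.
struct SparseEntry {
    float value;
    int i, j, k;
};

// A sparse tensor is just the list of its non-zero entries (plus its shape).
struct SparseTensor3D {
    int dim1, dim2, dim3;
    std::vector<SparseEntry> entries;   // only the non-zero values are stored
};

// Example operation: scale every (non-zero) element by a scalar.
// Zero elements stay zero, so we simply skip them entirely.
void sparse_scale(SparseTensor3D& t, float scalar)
{
    for (auto& e : t.entries) {
        e.value *= scalar;
    }
}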
Parallelization of sparse tensor operations is a double optimization, because there are fewer operations (only on non-zero weights), and you can parallelize them as well. Although a permuted index data format is not the usual contiguous memory space amenable to vectorization, there are other methods to vectorize permutation indices, such as with “gather” and “scatter” SIMD operations.
A sparse tensor product kernel parallelized on a GPU goes at the speed of light. Which reminds me of those Physicists again. They've stopped reading by now, so we can talk freely. I mean, they've been arguing about whether light is a blob or a wiggle for 100 years. Any computer programmer could solve this in ten seconds. Here's the algorithm:
if (always_bounces_off_stuff(light)) {
    // Particle
} else if (always_goes_around_things(light)) {
    // Wave
} else {
    // Neither (obviously)
}