Aussie AI

Loop Optimizations

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Loops are often sources of inefficiency and can be optimized in numerous ways. And the basic algorithms for neural networks are full of loops, nested to multiple levels. Increasing GPU data throughput is one of the main goals of loop optimizations.

Types of Loop Optimizations

Loop transformation optimizations may include:

Loop Fusion

Loop fusion is a well-known code optimization where two parallel loops are merged into a single loop. This does not change the amount of in-loop computation in either loop body, but reduces the loop overhead of the exit test by half. There is also often a benefit from data locality that reduces data movement and temporary data storage, which can also improve overall speed.

Note that loop fusion is not great for parallelization, because complicated loop bodies are actually harder to parallelize. Most of the benefits arise in traditional sequential code execution, which is why its theory dates back many decades. For modern parallel execution on GPUs, loop fusion is often a poor choice, and more benefits may arise from loop fission (the opposite of fusion) and loop vectorization.

Example: Loop Fusion: The general idea is to combine the bodies of two loops into a single loop. Here is a simplistic example with the (non-fused) loops for initializing two vectors using two sequential loops:

   for (i = 0; i < n; i++) v1[i] = 0;
   for (i = 0; i < n; i++) v2[i] = 0;

And here is the version with loop fusion:

   for (i = 0; i < n; i++) {
       v1[i] = 0;
       v2[i] = 0;
   }

Note that the loop fusion version incurs the same number of assignments for initialization, but only half of the loop overhead cost (i.e. half of the "i < n" and "i++" operators have been optimized away). And for the sake of argument, let's pretend we don't know a better way to initialize a vector in C++ like memset or calloc, or load-time global or static variable initialization.

AI Loop Fusion: There are plenty of loops happening in AI inference engines. The idea of loop fusion as applied to neural networks and Transformers is usually realized via kernel operator fusion. However, there are also some AI papers on loop fusion and its use in graph optimizations done by machine learning compilers:

Loop Fusion General Coding: Research papers specific to loop fusion as a general programming technique (written about since the 1990s):

Loop Perforation

Code perforation or loop perforation is an optimization technique that trades accuracy for speed, by skipping some computations or iterations of a loop in a probabilistic manner. Randomly skipping some percentage of the loop bodies doesn't sound like a good plan, but it has its merits. The intentional introduction of randomness to code is known as a "stochastic" algorithm; if unintentional, it's known as a "bug", but now you can claim you were adding stochastic functionality. Essentially, it is similar to approximations, but generalized to any iterative code.

Example: Loop Perforation: Here is an example of adding loop perforation to a vector dot product computation. This is an incredibly slow version, and is not recommended, but is just to give the idea:

    float yapi_vecdot_perforated_slow(float v1[], float v2[], int n, int percent_perforation)
    {
        // Loop perforation -- vector dot product
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            if ((rand() % 100) + 1 <= percent_perforation) {
                // This iteration is perforated...
                continue; // Skip it...
            }
            sum += v1[i] * v2[i];
        }
        return sum;
    }

AI Loop Perforation: Various research has shown that AI models have some resilience to the introduction of randomness, which speeds up inference, and "stochastic" algorithms can even be beneficial to accuracy when used during training. Research on loop perforation in AI engines:

General Loop Perforation Optimizations: Research on loop perforation as a general programming optimization technique:

Loop Unrolling

Loop unrolling is a code optimization where the body of a loop is repeated in sequential code. This speeds up the algorithm because the overhead of the loop iteration test is avoided. In some cases, the entire loop can be unrolled, usually when the loop iterations are finite and known. In other cases, the loop body can be repeated multiple times, and thereby the loop test only occurs every few iterations.

For an AI engine, loop unrolling is used as an optimization in a few places. It is one of the optimizations used by kernel fusion, along with loop fusion and others. The logical extension of loop unrolling is done by machine learning compilers, at least from a conceptual point of view. These ML compilers unroll the inference loop and the lower-level loops in matrix operations, thereby creating a finite graph representation of the entire inference sequence. If everything is unrolled, there are no loops in the graph (an "acyclic" graph) and it is of finite size. The process of model inference is propagation of data through the graph. There are many "graph optimizations" that can be made on this graph representation of the AI model.

Example: C++ Loop Unrolling of Vector Dot Product. Here is the basic C++ non-unrolled vector dot product code:

   float yapi_vecdot_basic(float v1[], float v2[], int n)   // Basic vector dot product
   {
	float sum = 0.0;
	for (int i = 0; i < n; i++) {
		sum += v1[i] * v2[i];
	}
	return sum;
   }

If we know the value of n, e.g. that n=5, then we can completely unroll it:

   return v1[0] * v2[0]
        + v1[1] * v2[1]
        + v1[2] * v2[2]
        + v1[3] * v2[3]
        + v1[4] * v2[4]
        ;

If we don't know the value of n, we can still unroll multiple iterations. Here's an example of 4-level loop unrolling of a vector dot product in C++, assuming that n is a multiple of 4:

   float yapi_vecdot_unroll4_basic(float v1[], float v2[], int n)  // Loop-unrolled Vector dot product 
   {
	if (n % 4 != 0) {
		yassert(n % 4 == 0);
		return 0.0; // fail
	}
	float sum = 0.0;
	for (int i = 0; i < n; ) {
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
	}
	return sum;
   }

And here's a generalization of that 4-level unrolling with extra code to handle the leftover cases if n is not a multiple of 4. Although the extra cases look messy, they are not actually the main performance bottleneck.

    float yapi_vecdot_unroll4_better(float v1[], float v2[], int n)  // Loop-unrolled Vector dot product 
    {   
	int i = 0;
	float sum = 0.0;
	if (n % 4 != 0) {
		// Handle the extra cases...
		switch (n % 4) {
		case 1: 
			sum += v1[i] * v2[i]; i++; 
			break;
		case 2: 
			sum += v1[i] * v2[i]; i++;
			sum += v1[i] * v2[i]; i++;
			break;
		case 3:
			sum += v1[i] * v2[i]; i++;
			sum += v1[i] * v2[i]; i++;
			sum += v1[i] * v2[i]; i++;
			break;
		default: yassert_not_reached(); break;
		} // end switch
		// Keep going below with the rest of the vector
	}
	for (; i < n; ) {  // Unrolled 4 times...
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
	}
	return sum;
    }

This code is just an example for explanation. There are various further code optimizations that can be done for production-level efficiency. For example, we would change it to use pointer arithmetic rather than array indices; we might try replacing the four i++ operators with a single i+=4 (using offset indices such as v1[i+1] in the loop body); we could change the integer mod operator (%) to a bitwise-and test (i.e., use "n & 3" not "n % 4", which works since 4 is a power-of-two); and it also might be better to use "+" rather than "+=". Finally, if we carefully code the leftover cases, the main loop could be unrolled to many more levels than just four.
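
For illustration, here's a hedged sketch combining several of those micro-optimizations (pointer arithmetic, the bitwise-and remainder test, and handling the leftovers before the main unrolled loop); the function name is mine, and real-world gains depend on the compiler:

    float yapi_vecdot_unroll4_pointer(float v1[], float v2[], int n)  // Sketch: unrolled + pointer arithmetic
    {
        float sum = 0.0f;
        int extras = n & 3;  // Same as n % 4, since 4 is a power-of-two
        for (; extras > 0; extras--, v1++, v2++, n--) {
            sum += (*v1) * (*v2);  // Handle the leftover iterations first
        }
        for (float* endv1 = v1 + n; v1 != endv1; v1 += 4, v2 += 4) {
            sum += v1[0] * v2[0]   // 4 iterations per loop test
                 + v1[1] * v2[1]
                 + v1[2] * v2[2]
                 + v1[3] * v2[3];
        }
        return sum;
    }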

AI Research Papers on Loop Unrolling: Papers with discussion of loop unrolling include:

General Research on Loop Unrolling: Papers on loop unrolling as a general code optimization technique include:

Loop Tiling

Loop tiling is a method of executing sub-parts of a loop in a way that maximizes data locality, increases cache utilization, and improves parallel execution. This is also called "loop blocking" because it processes the data a "block" at a time. The sub-partitions of the data are called "tiles" or "blocks".

This technique is used inside neural network code to improve the efficiency of MatMuls and thereby get better throughput of tensor calculations from a GPU. A tiled version of MatMul processes "tiles" or "blocks" of the matrix at a time (small square or rectangular sections), with the aim of keeping small parts of the matrix in the cache while they are processed. The algorithm progresses across the matrix a tile/block at a time, rather than scanning all the way down one dimension (row or column). The same number of arithmetic operations are performed as in a non-tiled MatMul, but data locality and cache freshness should improve the overall speed.
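
As a rough illustration, here's a minimal sketch of a tiled MatMul in plain C++ (square row-major matrices; the tile size of 32 is an arbitrary choice of mine and should be tuned to the cache):

    const int TILE = 32;  // Tile size: tune to cache dimensions

    void matmul_tiled(const float* A, const float* B, float* C, int n)
    {
        // C += A * B for n x n row-major matrices.
        // Assumes C is zero-initialized and n is a multiple of TILE.
        for (int i0 = 0; i0 < n; i0 += TILE)
            for (int j0 = 0; j0 < n; j0 += TILE)
                for (int k0 = 0; k0 < n; k0 += TILE)
                    // Multiply one TILE x TILE block; the tiles stay cache-resident
                    for (int i = i0; i < i0 + TILE; i++)
                        for (int k = k0; k < k0 + TILE; k++) {
                            float a = A[i * n + k];
                            for (int j = j0; j < j0 + TILE; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }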

AI Research on Loop Tiling. Loop tiling is a powerful technique for speeding up GPU throughput via increased pipelining. Research papers on the use of loop tiling include:

General Research on Loop Tiling: Papers on loop tiling as a general loop optimization and coding technique include:

Loop Fission

Loop fission is an optimization that is the opposite of loop fusion. Instead of fusing two loops into one, we take one loop and split parts of it into two loops. This is often called "loop fission", but has also been called other names such as "loop splitting" or "loop distribution."

Loop fission can be more efficient for parallel execution (e.g. vectorization for GPUs), but is often slower for sequential execution. Whereas loop fusion aims to remove the overhead of one of the loops, loop fission tolerates an increased loop overhead in return for simpler loop bodies that can be parallelized. The kernel optimization of "kernel fission" is based on loop fission, and loop fission is one technique used to achieve vectorization for GPUs.

The main reason to use loop fission is hardware acceleration via loop parallelization. A complicated single loop can often run faster as two loops if hardware acceleration can be accessed. This is true even if the two loops must run sequentially, because the iterations of each loop are parallelized, but there's a double benefit if the two whole loops can also run in parallel.

Example: Loop Fission in BatchNorm: A good example arises in part of the code for batch normalization. Each element of the vector needs to have two operations performed on it: subtract the mean (re-centering) and multiply by a variance factor (re-scaling). The naive implementation of the second half of BatchNorm looks like this:

    float denom = sqrtf(variance + epsilon);  // Scaling factor
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - fmean) / denom; // Normalize all elements to re-center and scale
    }

This is difficult to hardware accelerate because it's unlikely that there's a combined "subtract-and-then-divide" operation to apply to all elements of a vector in parallel. One possibility is that there's a combined "add-and-then-multiply" operation, in which case we can use the negative of the additive factor and the reciprocal of the scaling factor. However, assuming there's not, loop fission can be used to split the single complicated loop into two sequential loops.

    float negmean = -fmean;  // Use negative for addition not subtraction
    float denom = sqrtf(variance + epsilon);  // This is like std. deviation, but adjusted/smoothed by epsilon
    float recip = 1.0f / denom;  // Use reciprocal multiply not division
    yapi_vector_add_scalar(v, n, negmean);     // Loop 1: Re-center using mean
    yapi_vector_multiply_scalar(v, n, recip);  // Loop 2: Re-scale by factor

Each of the two loops is now easy to hardware accelerate, because they are both very simple vector operations: "multiply-by-scalar" and "add-scalar". Every platform is likely to have hardware acceleration APIs for those simpler operations. So, to summarize, we got an explosion to hypersonic rocket speed using atomic operations with loop fission. Isn't that just the bomb?
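
For completeness, here are minimal sketches of what those two helper loops might look like in plain C++ before any platform-specific acceleration (the names match the example above, but the bodies are my own illustration):

    void yapi_vector_add_scalar(float v[], int n, float c)  // Loop 1: v[i] += c
    {
        for (int i = 0; i < n; i++) v[i] += c;
    }

    void yapi_vector_multiply_scalar(float v[], int n, float c)  // Loop 2: v[i] *= c
    {
        for (int i = 0; i < n; i++) v[i] *= c;
    }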

Research Papers on Loop Fission: Here's some research on loop fission for parallelization (see also kernel operator fission):

Loop Reversal

Loop reversal is the optimization of making a loop run backwards. The speedup can occur in two ways:

  • Data locality: sometimes there is better data locality to go backwards (but not often).
  • Loop zero test: testing with zero is often faster than a less-than ("i < n") comparison. This optimization is called "looping down to zero."

Reversing a loop can be considered when the order of loop iterations doesn't matter, such as for a commutative operation like addition. Hence, it works for vector dot product.

Example: Basic Loop Reversal: Programmers are used to the typical loop structure:

    for (i = 0; i < n; i++)   // Forwards

But sometimes the reversal can work better:

    for (i = n - 1; i >= 0; i--)    // Backwards

Bug alert! Note that i has to be a signed type such as plain old int, not unsigned or size_t, because the "i >= 0" test will never fail with an unsigned type.

Example: Reversed Vector Dot Product in C++. Loop reversal can be used on vector dot product. Here's the basic idea:

    float yapi_vecdot_reverse_basic(float v1[], float v2[], int n)   // REVERSED basic vector dot product
    {
	float sum = 0.0;
	for (int i = n - 1; i >= 0; i--) {  // Note: i cannot be "unsigned" or "size_t" type
		sum += v1[i] * v2[i];
	}
	return sum;
    }

Here's a slightly better example where i has been removed, and the parameter n is decremented instead. So we have a double benefit of going backwards:

    float yapi_vecdot_reverse_basic2(float v1[], float v2[], int n)   // REVERSED basic vector dot product #2
    {
	float sum = 0.0;
	n--;  // Use "n" not "i"
	for (; n >= 0; n--) {  // Note: n cannot be "unsigned" or "size_t" type
		sum += v1[n] * v2[n];
	}
	return sum;
    }

But the above example still has the "n >= 0" test, which isn't that much faster than "i < n". Maybe the C++ compiler will convert it to a sign bit test, or maybe not. We can do better by using a faster equality test (i.e. "==" or "!=") to test for zero.

    float yapi_vecdot_reverse_zerotest(float v1[], float v2[], int n)   // Reversed-with-zero-test vector dot product
    {
	float sum = 0.0;
	int i = n - 1;
	do {
		sum += v1[i] * v2[i];
		i--;
	} while (i != 0);   // Zero-test is faster than ">=" test...
	sum += v1[0] * v2[0];  // Don't skip the last one!
	return sum;
    }

Note that we had to handle "0" as a special case after the loop, which was accidentally omitted in my first coding attempt. Unit tests can be helpful sometimes! But this code has other bugs, such as an infinite loop if passed a vector of size n=1. And the variable "i" should be written out of this version, too. And then it should be modified to use faster pointer arithmetic rather than array indices. And then I should give up and go buy an expensive GPU instead.

Research on Loop Reversal: Research papers on loop reversal as a generic programming optimization technique:

Loop Code Motion

Loop code motion is moving loop-invariant code from inside the loop body to the pre-initialization code before the loop. Any computation that yields the same value on every iteration should not be performed inside the loop body; instead, it should be pre-calculated before the loop and stored in a temporary variable. This is sometimes called "hoisting" the code out of the loop.

Example: Loop Code Motion: One common example of unnecessary recalculation of loop-invariant values is in the loop test. The code in the boolean test for the loop is actually part of the loop body.

An example of code that re-calculates the loop limit:

   for (i = 0; i < vec.num_elements(); i++) {
      ...
   }

The "num_elements" call is probably loop-invariant, assuming the vector doesn't change size during processing. Maybe the "num_elements" function is declared "inline" and the C++ compiler will fix it anyway. Nevertheless, this is a candidate for loop code motion, using a temporary variable instead:

   int n = vec.num_elements();  // Loop-invariant calculation moved outside loop body
   for (i = 0; i < n; i++) {
      ...
   }

General Research on Loop Code Motion: Papers on loop code motion as a general programming optimization started in the 1990s or earlier, usually from the point of view of getting programming language compilers to do loop-invariant code motion automatically:

Loop Distribution

Loop distribution is a loop optimization that is a special case of "loop code motion" where the hoisted code is a conditional test, creating two loops. Some early papers in the 1990s called it "loop unswitching". And some papers use the term "loop distribution" to refer to the general optimization of splitting a loop into two parallel loops, which we call "loop fission".

The goal of loop distribution is to move a conditional test (i.e., an if-test) out of the loop body, applying loop code motion to the if-test, in order to create two loops. This sounds similar to loop fission, but loop distribution is a more general optimization that doesn't require parallelization to get a speed improvement (whereas loop fission does). Instead, loop distribution gets a benefit in ordinary sequential execution, because the if-test computation moves out of the loop to a once-only pre-initialization test (i.e., it is "hoisted"), rather than running on every iteration, and the result is two separate loops on two pathways. Note that only one of the two loops actually gets executed, and they are not executed in parallel, so this technique is not really a type of loop fission.

Example: Loop Distribution: Here's a dummy example of implementing an add-or-subtract function using a passed-in Boolean flag.

    void yapi_vector_addition_slow(float v[], int n, bool do_addition, float scalar)
    {
	for (int i = 0; i < n; i++) {
		if (do_addition) v[i] += scalar;  // Add scalar
		else v[i] -= scalar;   // Subtraction
	}
    }

The problem is that the if-test, "if(do_addition)", is computed for every loop iteration, and yet "do_addition" is a loop-invariant variable. The faster version uses loop distribution to move the if-test into the loop initialization, and then splits the two pathways inside the loop into two separate loops. Here's the faster version:

    void yapi_vector_addition_loop_distribution(float v[], int n, bool do_addition, float scalar)
    {
	if (do_addition) { // Add scalar
		for (int i = 0; i < n; i++) {
			v[i] += scalar;  // Addition
		}
	}
	else {  // Subtract scalar
		for (int i = 0; i < n; i++) {
			v[i] -= scalar;   // Subtraction
		}
	}
    }

This example is still far from optimal. For starters, it should be using pointer arithmetic rather than array indices.

Research Papers on Loop Distribution: Some research papers address the loop distribution optimization, but it's not as commonly researched as loop unrolling or code motion. Note that some papers use the term "loop distribution" when really they are talking about parallelization optimizations, which is technically the slightly different technique of "loop fission". Here's some references:

Loop Reordering

Loop reordering is changing the ordering of two loops, or more commonly, changing the nesting of two nested loops, to reverse the inner and outer loops. Loop reordering is an optimization that doesn't reduce the total number of computations, but improves data locality and cache access speed by doing the operations in a different order. This reduces the cost of accessing the data into memory (or low-level caches), rather than the cost of the arithmetic. It is therefore related to memory/dataflow optimizations and pipelining optimizations.

In neural networks, there are many loops, and many ways of nesting them. Convolution layers can have as many as seven levels of nested loops. Hence, there are various research papers exploring different orders in which to perform the various computations.
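
As a simple illustration, consider the classic "i-j-k" versus "i-k-j" orderings of a MatMul kernel. Both compute the same result, but with row-major storage the second ordering scans B and C sequentially in the inner loop (this sketch assumes n x n row-major matrices with C zero-initialized):

    // i-j-k order: inner loop strides down a column of B (poor locality for row-major)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];

    // i-k-j order: inner loop accesses B and C consecutively (better locality)
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];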

Research papers on loop reordering:

Pointer Arithmetic

Pointer arithmetic is a tricky C++ optimization that gets rid of an incremented index variable by incrementing a pointer instead.

Here's a vector dot product with basic incremented loop variable:

    float yapi_vecdot_basic(float v1[], float v2[], int n)   // Basic vector dot product
    {
	float sum = 0.0;
	for (int i = 0; i < n; i++) {
		sum += v1[i] * v2[i];
	}
	return sum;
    }

And here's the same code when converted to pointer arithmetic:

    float yapi_vecdot_pointer_arithmetic(float v1[], float v2[], int n)   // Pointer arithmetic vector dot product
    {
	float sum = 0.0;
	float* endv1 = v1 + n;  // v1 start plus n*4 bytes
	for (; v1 < endv1; v1++,v2++) {
		sum += (*v1) * (*v2);
	}
	return sum;
    }

How does this work? We got rid of the temporary variable "i" by using pointer arithmetic ("*v1") instead of array indices ("v1[i]"). We are also using the function parameters "v1" and "v2" as temporary local variables, as permitted in C++, so we don't need an extra temporary pointer variable. The way this works with pointer arithmetic is v1 and v2 are treated as pointers, which works due to the near-equivalence of pointers and arrays in C++. The for loop incrementers "v1++" and "v2++" are both adding 4 bytes (the size of a float) to the pointers (and note they are separated by the C++ comma operator). The "endv1" end marker is calculated as the address of "v1[0]" plus "n*4" bytes, because the "+" operator in "v1+n" is pointer arithmetic addition scaled by the size of the pointed-to object, rather than normal integer addition.

Note that a further micro-optimization is possible. We can change the less-than test ("v1 < endv1") to an inequality test ("v1 != endv1"), because equality tests are slightly faster than less-than tests. Since this test is effectively inside the loop and done every iteration, this might be worth doing. The trade-off is safety: it'll become an infinite loop if you get the pointer math slightly wrong, but hey, your code has no bugs, right?

Loop Coalescing

Loop coalescing is a loop optimization that flattens two nested loops into one non-nested loop. The speed benefit can arise from simplifying the loop, which makes it easier to parallelize via hardware acceleration, and possibly from a different data access pattern that improves data locality and cache freshness. This optimization is not always possible, as nested loop logic is often quite complicated, and flattening a nested loop may actually worsen data locality in many instances. However, the linear nature of a simple loop can make the code to send off chunks to a GPU much easier.
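
Here's a minimal sketch of coalescing a doubly nested loop over a rows x cols matrix into one flat loop, assuming the matrix is stored contiguously in row-major order:

    // Nested version: two levels of loop overhead
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            m[i * cols + j] *= 2.0f;

    // Coalesced version: one flat loop, easier to chunk out to parallel hardware
    int total = rows * cols;
    for (int x = 0; x < total; x++)
        m[x] *= 2.0f;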

Research on Loop Coalescing: Various research papers mention loop coalescing as an optimization technique:

Loop Collapsing

Loop collapsing is closely related to loop coalescing, since both aim to flatten nested loops, but loop collapsing usually refers to the special case where the nesting is perfect and the inner loop's bounds do not depend on the outer loop variable, so the combined iteration space can be flattened exactly.
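
One place collapsing shows up in practice is OpenMP's "collapse" clause, which merges the iteration spaces of perfectly nested loops into a single parallel range. Here's a minimal sketch (requires compiling with OpenMP enabled):

    void scale_matrix(float* m, int rows, int cols, float factor)
    {
        // collapse(2) flattens the i/j iteration space into one range of
        // rows*cols iterations, which OpenMP then divides among threads.
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i * cols + j] *= factor;
    }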

Research on Loop Collapsing: Papers on collapsing nested loops:

Loop Peeling

Loop peeling is a type of loop unrolling that unravels only the first few iterations of a long loop. It is also similar to loop splitting into two sections, where the first section covers the early range, and the second section is the main loop over all remaining iterations.

Loop peeling is beneficial to the overall loop efficiency if there is code in the loop body that is only required for one or two early iterations, which can then be removed from the main loop body. Similarly, there can be benefit in unraveling the last few iterations of a loop, which is a similar technique.
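
Here's a minimal sketch of peeling the first iteration out of a loop whose body contains a test that only matters when i is zero (assuming n >= 1):

    // Before: the "i == 0" test runs on every iteration
    for (int i = 0; i < n; i++) {
        if (i == 0) b[i] = a[i];        // First element has no left neighbor
        else b[i] = a[i] + a[i - 1];
    }

    // After peeling: first iteration hoisted out, main loop body simplified
    b[0] = a[0];
    for (int i = 1; i < n; i++) {
        b[i] = a[i] + a[i - 1];
    }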

Research on Loop Peeling: There are research papers on loop peeling as an optimization technique:

Loop Splitting

Loop splitting refers to splitting the sequential iterations of a loop into two loops, which each perform part of the original loop's iterations. Loop splitting is sometimes called "loop sectioning". Note that "loop peeling" is a special case of loop splitting where the first section is a small range of a few initial iterations.

Loop splitting produces at least two "split-out" loops, one for the early iterations and one for the remainder, although a loop can also be split into more than two loops. Each split-out loop is shorter than the original loop. Unlike loop fission, the two loops operate over different subportions of the iterator variable's range, so the total number of iterations stays the same, rather than doubling as it does in loop fission.

This optimization is beneficial if the loop body has different sections of code that only relate to a subset of the iterator range. Hence, the loop bodies of the two loops can be reduced to execute less code. Overall, there is still the same number of iterations performed in the two loops combined, but each loop performs only a proportion of the original iterations.
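
Here's a minimal sketch of loop splitting, where a range test inside the loop body can be converted into two shorter loops over disjoint subranges (assuming 0 <= k <= n):

    // Before: the "i < k" test runs on all n iterations
    for (int i = 0; i < n; i++) {
        if (i < k) v[i] = 0.0f;     // Early range: zero out
        else v[i] *= scale;         // Remaining range: re-scale
    }

    // After splitting: same total iterations, but simpler loop bodies
    for (int i = 0; i < k; i++) v[i] = 0.0f;
    for (int i = k; i < n; i++) v[i] *= scale;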

Research on Loop Splitting: There are research papers on splitting loops as an optimization technique:

Loop Interchange

Loop interchange is an optimization of nested loops that switches the inner and outer loops. In a typical nested loop, the outer loop body and loop test are executed rarely, almost lazily, whereas the inner loop body is scrambling along in a frantic mess. Loop interchange simply switches them, reversing their roles.

Why is this an optimization? Although the same number of loop iterations still occur in total, and the newly-made inner loop body is also thrashed, various improvements can arise from reversing the iterator variables. One possible improvement is in data locality, which can reduce cache misses and speed up the overall execution. Note that this benefit is not guaranteed just by switching loops, and sometimes loop interchange can worsen data locality; careful analysis is needed. Another possibility is that reversing nested loops can create opportunities to apply other loop optimizations to the new inner loop.
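
Here's a minimal sketch of interchanging a nested loop so that the inner loop walks along the rows of a row-major matrix instead of down its columns; the iterations are identical, but the second form touches consecutive memory:

    // Before: inner loop varies i, so each access jumps cols floats ahead (cache-unfriendly)
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            m[i * cols + j] += 1.0f;

    // After interchange: inner loop varies j, so accesses are consecutive
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            m[i * cols + j] += 1.0f;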

Research on Loop Interchange: There are various research papers about loop interchange as an optimization technique for nested loop structures:

Loop Sentinel

Suppose we have to scan a vector to check whether there's a negative element. Here's the simplest code:

    bool yapi_vector_has_negative_basic(float v[], int n)
    {
	for (int i = 0; i < n; i++) {
		if (v[i] < 0.0) return true;  // Found negative
	}
	return false;  // No negatives
    }

Assuming that n is large, the bottleneck is the loop body, in which there are 4 operations:

  • Loop test "i < n"
  • Array lookup: "v[i]"
  • Less-than test: " < 0.0"
  • Incrementer: "i++"

Now, we can get rid of the array operation "v[i]" by switching to pointer arithmetic. And this also reduces the "i < n" test to a pointer equality test (slightly better), and the "i++" becomes a pointer increment (not much different, and could even be worse since pointer increment is effectively "+=4"). So we basically still have 3 operations in the bottleneck code.

    bool yapi_vector_has_negative_pointer_arithmetic(float v[], int n)
    {
	float* endv = &v[n];
	for ( ; v != endv; v++) {
		if (*v < 0.0) return true;  // Found negative
	}
	return false;  // No negatives
    }

Can we do better?

Loop sentinels can do better.

Here's how it works with a sentinel: pretend we succeed, even if we don't. The idea is to add a dummy "fake success" element into the array so that we always succeed at the end. Then it's not necessary to test whether we're at the end, so this removes the "i < n" or "ptr == endptr" tests in the loop condition.

Here's the loop sentinel version:

   bool yapi_vector_has_negative_sentinel(float v[], int n)
   {
	v[n] = -99.0;  // Dummy negative (BUG!)
	int i = 0;
	for ( ; /*GONE!*/; i++) {
		if (v[i] < 0.0) break;  // Found negative
	}
	if (i == n) return false;  // At the dummy (fake success)
	return true;  // Found a negative (for real)
   }

How does this work? Since the test is looking for negatives, this code adds a dummy negative at the end. Hence, the loop will always succeed at the end, and the loop test can be removed without creating an infinite loop. The only change is that we can't just "return true" inside the loop if we find a negative, because it might be at the end, which should be "return false" for the dummy success.

Bug alert! Unfortunately, this code is buggy! The sentinel assignment "v[n]" is actually an array bounds overflow. Or even if it's not, we've also modified the array with a dummy value. One way to fix it is to always have every vector in the whole program have an extra safety element in its size, which is a ghastly solution.

A better way to fix it is to manipulate the last valid element "v[n-1]" instead. Then, we have to remember to fix it before we leave town.

    bool yapi_vector_has_negative_sentinel2(float v[], int n)
    {
	float save = v[n - 1];  // Save it!
	v[n - 1] = -99.0;  // Dummy negative at end
	int i = 0;
	for ( ; /*GONE!*/; i++) {
		if (v[i] < 0.0) break;  // Found negative
	}
	v[n - 1] = save;  // Restore it!
	if (i == n - 1) {
		// At the dummy (fake success)
		if (save < 0.0) return true;  // Must check it!
		return false;  
	}
	return true;  // Found a negative (for real)
    }

This sentinel example now has 3 main operations in the loop path and is hopefully bug-free. This is a 25% reduction at the cost of a few extra overhead tests and assignments before and after the loop. Sentinels also work with pointer arithmetic, so we can double down on optimizations by doing both. We just have to change "v[i]" to use pointer arithmetic instead (i.e. "*v"), and then the code has changed from 4 operations to 2 operations on the critical path, which is a 50% improvement in speed. Here's my code:

    bool yapi_vector_has_negative_sentinel3(float v[], int n)
    {
	// Sentinel + Pointer Arithmetic
	float *savev = &v[n - 1];
	float save = *savev;  // Save it!
	*savev = -99.0;  // Dummy negative at end v[n - 1]
	for (; /*GONE!*/; v++) {
		if (*v < 0.0) break;  // Found negative
	}
	*savev = save;  // Restore it!
	if (v == savev) {
		// At the dummy (fake success)
		if (save < 0.0) return true;  // Must check it!
		return false;
	}
	return true;  // Found a negative (for real)
    }

The above code is slow and linear for a CPU. Scanning a big vector could at least be multi-threaded to test multiple chunks in parallel, although this ends up doing some unnecessary computations, because it loses the benefit that the loop exits early once it finds a single negative. Is there a better way using hardware acceleration? Maybe there's a parallel accelerated method to check all the sign bits of the elements of a vector? That can go on the "todo" list for now!

Hashed Sentinel Objects: Interestingly, the idea of loop sentinels can be generalized beyond the above linear array search to more complex data structures, such as hash tables or binary trees. The trick is to use a sentinel object.

How to do it? Consider a hash table with a chained linked list for collision resolution. The main loop almost always has a "ptr!=NULL" test to detect the end of the chain list. The idea of sentinels for hash tables is to use a pointer to a dummy global object instead of the NULL pointer at the end of each chain. When doing a search, the sentinel object's key is set to the search key (a fake success), and we can then remove the "ptr!=NULL" test from the hot section of the loop.

Worth doing? This might be worthwhile if the main hash search loop is a bottleneck. But it's a lot of work, rather than just a change to a search function. For other routines where we can't use a sentinel, the main loop test changes to a "ptr!=globalptr" pointer comparison. All of the hash table data structure methods need to change to use the global sentinel pointer instead of NULL (e.g. insertions, deletions, scanning, etc.). Might be easier to go buy a bigger GPU.
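
Here's a rough sketch of the chained hash table idea (the types and names are mine for illustration). Every chain, including an empty bucket, ends at a pointer to the global sentinel node rather than NULL, and a search plants the sought key in the sentinel first, guaranteeing the loop terminates without a NULL test:

    struct HashNode {
        int key;
        float value;
        HashNode* next;  // Chains end at &g_sentinel, never at NULL
    };

    static HashNode g_sentinel = { 0, 0.0f, &g_sentinel };  // Global sentinel node

    HashNode* chain_search(HashNode* head, int key)  // head == &g_sentinel for an empty bucket
    {
        g_sentinel.key = key;               // Plant fake success at the end of the chain
        HashNode* p = head;
        while (p->key != key) p = p->next;  // Hot loop: no NULL test needed
        return (p == &g_sentinel) ? NULL : p;  // Fake success means "not found"
    }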

Research on Loop Sentinels: Loop sentinels are a well-known trick, dating back as far as the wonderful Knuth treatises.

Loop Strip Mining (Loop Sectioning)

Loop strip mining is a loop optimization that processes or "mines" an array in contiguous "strips", typically with the strip length chosen to match the hardware vector width or cache characteristics. It is related to "loop tiling", which operates on arrays in two dimensions, whereas strip mining applies to one-dimensional arrays. Loop strip mining is also called "loop sectioning".
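
Here's a minimal sketch of strip mining a long vector loop into fixed-size strips (the strip length of 8 is an arbitrary choice here; in practice it would match the hardware vector width):

    const int STRIP = 8;  // E.g., one SIMD register's worth of floats

    // Outer loop advances strip by strip; the fixed-size inner loop is a vectorization candidate
    int i = 0;
    for (; i + STRIP <= n; i += STRIP) {
        for (int j = i; j < i + STRIP; j++)
            v[j] *= scale;
    }
    for (; i < n; i++)  // Leftover partial strip
        v[i] *= scale;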

Research on Loop Strip Mining: Papers on loop strip mining:

Loop Spreading

Loop spreading is an optimization of two non-nested sequential loops that have different iteration ranges. Typically, this refers to cases where the end ranges differ significantly (if the loop ranges differ only by an off-by-one issue, then only loop normalization is required). Loop spreading modifies one of the loops so that part of it fully overlaps with the other loop. Hence, that subloop can be fused with the other loop, and possibly parallelized. The remaining iterations that don't overlap are then handled in a follow-up partial loop (for one of the loops only).
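
Here's a small sketch of the idea (assuming m < n): the longer loop is split so that its first m iterations align with the shorter loop's range, making those two subloops candidates for fusion or parallel execution:

    // Before: two sequential loops with different ranges (m < n)
    for (int i = 0; i < n; i++) a[i] += 1.0f;
    for (int i = 0; i < m; i++) b[i] *= 2.0f;

    // After spreading: the overlapping range is fused; the remainder gets a follow-up loop
    for (int i = 0; i < m; i++) {
        a[i] += 1.0f;
        b[i] *= 2.0f;
    }
    for (int i = m; i < n; i++) a[i] += 1.0f;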

Loop Normalization

Loop normalization is not directly an optimization, but a preliminary loop transformation that can make further loop optimizations easier, such as fusing two loops with loop fusion, or parallelizing each loop with loop fission or vectorization. The goal of loop normalization is to make the loop iteration variables run over the same range. Hence, loop normalization is needed when two loops in sequence start at different offsets (e.g., one starts at i=1 and the other at i=0).
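
For example, a loop starting at i=1 can be normalized to start at i=0 by adjusting the subscripts, so that it runs over the same iteration range as a neighboring loop:

    // Before: iteration range is 1..n
    for (int i = 1; i <= n; i++) a[i] = 0.0f;

    // After normalization: same work, iteration range is now 0..n-1
    for (int i = 0; i < n; i++) a[i + 1] = 0.0f;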

Research on Loop Normalization: Papers on loop normalization:

Loop Skewing

Loop skewing is a somewhat mind-bending method to change nested loops to make them more parallelizable. This technique applies when there are two nested loops, but the inner loop is difficult to parallelize because of a dependency on the outer loop variable.

The loop skewing solution is far from obvious. The range bounds of the inner loop are changed by "skewing" them by a factor based on the outer loop variable. And then, by some magical potion, this somehow breaks the dependence on the outer loop, and then the inner loop can run fast on a GPU. Who knew?
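
Here's a sketch of the classic wavefront example, assuming a is an N x M array. In the original loop nest, each element depends on its upper and left neighbors, so the inner loop is sequential; after skewing to iterate over diagonals t = i + j, every iteration on one diagonal depends only on the previous diagonal, so the inner loop can run in parallel:

    // Before: a[i][j] depends on a[i-1][j] and a[i][j-1]; inner loop is sequential
    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];

    // After skewing: iterate diagonals t = i + j; iterations within a diagonal are independent
    for (int t = 2; t < N + M - 1; t++) {
        int ilo = (t - M + 1 > 1) ? t - M + 1 : 1;
        int ihi = (t - 1 < N - 1) ? t - 1 : N - 1;
        for (int i = ilo; i <= ihi; i++) {  // This loop is now parallelizable
            int j = t - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }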

Research on Loop Skewing: Papers on loop skewing:

Other Research on Loop Optimizations

Various other papers on loop optimizations:

More AI Research

Read more about: