
Sequential vs Parallel Loop Optimizations

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Loops are a common source of inefficiency and can be optimized in numerous ways. The basic algorithms of neural networks are full of loops, nested to multiple levels in tensor operations. Increasing the throughput of data through the GPU is one of the main goals of loop optimizations.

Not all loop transformations are created equal. Some are best suited to sequential code optimizations, whereas others are used to parallelize loops for vectorization.

Loop transformations that are good for both sequential and parallel loop optimization include:

  • Loop unrolling — repeat the loop body to reduce loop test overhead and expose the body to parallelization (see the sketch after this list).
  • Loop peeling — unroll the first few iterations.
  • Loop coalescing — flatten nested loops.
  • Loop splitting — split out subportions of the iteration range.
  • Loop collapsing — another way to flatten nested loops.
  • Loop interchange — switch the inner and outer loop iterators of nested loops.
  • Loop reordering — change the ranges and arrangements of inner/outer nested loops.
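
As a concrete illustration of the first technique, here is a minimal sketch of loop unrolling applied to a vector dot product, unrolled by a factor of 4. The function names are illustrative only, not from any particular library.

    #include <cstddef>

    // Plain loop: one multiply-add per iteration, plus a loop test each time.
    float dot_product_basic(const float* a, const float* b, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Unrolled by 4: fewer loop tests, and the four independent multiply-adds
    // give the compiler more scope to vectorize the body.
    float dot_product_unrolled4(const float* a, const float* b, std::size_t n) {
        float sum = 0.0f;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            sum += a[i] * b[i]
                 + a[i + 1] * b[i + 1]
                 + a[i + 2] * b[i + 2]
                 + a[i + 3] * b[i + 3];
        }
        for (; i < n; i++) {  // leftover iterations when n is not a multiple of 4
            sum += a[i] * b[i];
        }
        return sum;
    }

Modern compilers often unroll simple loops automatically at higher optimization levels, but manual unrolling of hot inner loops is still common in kernel code.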

Some loop transformations are mainly sequential improvements and are not parallelizations in themselves, although they can sometimes help with parallelization by enabling a follow-up parallelizing transformation. Loop transformations that tend to be good for sequential code optimizations but not parallelization include:

  • Loop fusion — combine or “fuse” the bodies of two loops.
  • Duff's device — amusing but impractical coding trick for loop unrolling.
  • Loop code motion — move or “hoist” loop-invariant calculations from the loop body to pre-loop initialization (see the sketch after this list).
  • Loop perforation — randomly skip a subset of loop iterations; it's really a thing.
  • Loop sentinel — append a fake “sentinel” value to the data so the loop's exit test can be simplified; fake it till you make it.
  • Loop iterator strength reduction — change “*” to “+” if you can.
  • Loop reversal — iterate backwards down to zero; going backwards, and yet, still making progress!
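
Here is a minimal sketch of loop code motion, using a hypothetical vector-scaling routine in which an invariant scaling factor is hoisted out of the loop; the function names and the temperature parameter are illustrative only.

    #include <cmath>
    #include <cstddef>

    // Before: the scaling factor is recomputed on every iteration,
    // even though it does not depend on the loop variable.
    void scale_vector_slow(float* v, std::size_t n, float temperature) {
        for (std::size_t i = 0; i < n; i++) {
            v[i] *= 1.0f / std::sqrt(temperature);  // loop-invariant expression
        }
    }

    // After: the invariant expression is hoisted to pre-loop initialization,
    // so the loop body is reduced to a single multiplication.
    void scale_vector_hoisted(float* v, std::size_t n, float temperature) {
        const float scale = 1.0f / std::sqrt(temperature);  // computed once
        for (std::size_t i = 0; i < n; i++) {
            v[i] *= scale;
        }
    }

The same hoisting idea applies to pointer dereferences, array index computations, and function calls whose results do not change during the loop.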

Parallelizing loop optimizations whose main goal is vectorization of the loop body include:

  • Loop fission — opposite of loop fusion; split a single loop body into two loops (see the sketch after this list).
  • Loop tiling — process sub-parts of contiguous data in separate loops.
  • Loop distribution — split two sub-parts of a loop body into two simpler separate loops.
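
Here is a minimal sketch of loop fission, assuming a hypothetical loop whose body computes two unrelated output vectors; splitting it into two simpler loops gives a vectorizing compiler (or a GPU kernel launcher) an easier target. The function names are illustrative only.

    #include <cstddef>

    // Before fission: two unrelated computations share one loop,
    // which can hamper vectorization of either one.
    void fused_body(float* out1, float* out2, const float* x, std::size_t n) {
        for (std::size_t i = 0; i < n; i++) {
            out1[i] = x[i] * 2.0f;
            out2[i] = x[i] + 1.0f;
        }
    }

    // After fission: each loop has a single simple body that
    // can be vectorized (or parallelized) independently.
    void fissioned_body(float* out1, float* out2, const float* x, std::size_t n) {
        for (std::size_t i = 0; i < n; i++) {
            out1[i] = x[i] * 2.0f;
        }
        for (std::size_t i = 0; i < n; i++) {
            out2[i] = x[i] + 1.0f;
        }
    }

The trade-off is that the input vector is traversed twice, so fission pays off when each smaller loop vectorizes or caches better than the combined one.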

 
