Aussie AI

Loop Optimizations

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Loops are often sources of inefficiency and can be optimized in numerous ways. And the basic algorithms for neural networks are full of loops, nested to multiple levels. Increasing GPU data throughput is one of the main goals of loop optimizations.

Types of Loop Optimizations

Loop transformation optimizations may include:

Loop Fusion

Loop fusion is a well-known code optimization where two parallel loops are merged into a single loop. This does not change the amount of in-loop computation in either loop body, but reduces the loop overhead of the exit test by half. There is also often a benefit from data locality that reduces data movement and temporary data storage, which can also improve overall speed.

Note that loop fusion is not great for parallelization, because complicated loop bodies are actually harder to parallelize. Most of the benefits arise in traditional sequential code execution, which is why its theory dates back many decades. For modern parallel execution on GPUs, loop fusion is often a poor choice, and more benefits may arise from loop fission (the opposite of fusion) and loop vectorization.

Example: Loop Fusion: The general idea is to combine the bodies of two loops into a single loop. Here is a simplistic example with the (non-fused) loops for initializing two vectors using two sequential loops:

   for (i = 0; i < n; i++) v1[i] = 0;
   for (i = 0; i < n; i++) v2[i] = 0;

And here is the version with loop fusion:

   for (i = 0; i < n; i++) {
       v1[i] = 0;
       v2[i] = 0;
   }

Note that the loop fusion version incurs the same number of assignments for initialization, but only half of the loop overhead cost (i.e. half of the "i < n" and "i++" operators have been optimized away). And for the sake of argument, let's pretend we don't know a better way to initialize a vector in C++ like memset or calloc, or load-time global or static variable initialization.

AI Loop Fusion: There are plenty of loops happening in AI inference engines. The idea of loop fusion as applied to neural networks and Transformers is usually realized via kernel operator fusion. However, there are also some AI papers on loop fusion and its use in graph optimizations done by machine learning compilers:

Loop Fusion General Coding: Research papers specific to loop fusion as a general programming technique (written about since the 1990s):

Loop Perforation

Code perforation or loop perforation is an optimization technique that trades accuracy for speed, by skipping some computations or iterations of a loop in a probabilistic manner. Randomly skipping some percentage of the loop bodies doesn't sound like a good plan, but it has its merits. The intentional introduction of randomness to code is known as a "stochastic" algorithm; if unintentional, it's known as a "bug", but now you can claim you were adding stochastic functionality. Essentially, it is similar to approximations, but generalized to any iterative code.

Example: Loop Perforation: Here is an example of adding loop perforation to a vector dot product computation. This is an incredibly slow version, and is not recommended, but is just to give the idea:

    float yapi_vecdot_perforated_slow(float v1[], float v2[], int n, int percent_perforation)
    {
        // Loop perforation -- vector dot product
        float sum = 0.0;
        for (int i = 0; i < n; i++) {
            if ((rand() % 100) + 1 <= percent_perforation) {
                // This iteration is perforated...
                continue; // Skip it...
            }
            sum += v1[i] * v2[i];
        }
        return sum;
    }

AI Loop Perforation: Various research has shown that AI models have some resilience to the introduction of randomness, which speeds up inference, and "stochastic" algorithms can even be beneficial to accuracy when used during training. Research on loop perforation in AI engines:

General Loop Perforation Optimizations: Research on loop perforation as a general programming optimization technique:

Loop Unrolling

Loop unrolling is a code optimization where the body of a loop is repeated in sequential code. This speeds up the algorithm because the overhead of the loop iteration test is avoided. In some cases, the entire loop can be unrolled, usually when the loop iterations are finite and known. In other cases, the loop body can be repeated multiple times, and thereby the loop test only occurs every few iterations.

For an AI engine, loop unrolling is used as an optimization in a few places. It is one of the optimizations used by kernel fusion, along with loop fusion and others. The logical extension of loop unrolling is done by machine learning compilers, at least from a conceptual point of view. These ML compilers unroll the inference loop and the lower-level loops in matrix operations, thereby creating a finite graph representation of the entire inference sequence. If everything is unrolled, there are no loops in the graph (an "acyclic" graph) and it is of finite size. The process of model inference is propagation of data through the graph. There are many "graph optimizations" that can be made on this graph representation of the AI model.

Example: C++ Loop Unrolling of Vector Dot Product. Here is the basic C++ non-unrolled vector dot product code:

   float yapi_vecdot_basic(float v1[], float v2[], int n)   // Basic vector dot product
   {
	float sum = 0.0;
	for (int i = 0; i < n; i++) {
		sum += v1[i] * v2[i];
	}
	return sum;
   }

If we know the value of n, e.g. that n=5, then we can completely unroll it:

   return v1[0] * v2[0]
        + v1[1] * v2[1]
        + v1[2] * v2[2]
        + v1[3] * v2[3]
        + v1[4] * v2[4]
        ;

If we don't know the value of n, we can still unroll multiple iterations. Here's an example of 4-level loop unrolling of a vector dot product in C++, assuming that n is a multiple of 4:

   float yapi_vecdot_unroll4_basic(float v1[], float v2[], int n)  // Loop-unrolled Vector dot product 
   {
	if (n % 4 != 0) {
		yassert(n % 4 == 0);
		return 0.0; // fail
	}
	float sum = 0.0;
	for (int i = 0; i < n; ) {
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
	}
	return sum;
   }

And here's a generalization of that 4-level unrolling with extra code to handle the leftover cases if n is not a multiple of 4. Although the extra cases look messy, they are not actually the main performance bottleneck.

    float yapi_vecdot_unroll4_better(float v1[], float v2[], int n)  // Loop-unrolled Vector dot product 
    {   
	int i = 0;
	float sum = 0.0;
	if (n % 4 != 0) {
		// Handle the extra cases...
		switch (n % 4) {
		case 1: 
			sum += v1[i] * v2[i]; i++; 
			break;
		case 2: 
			sum += v1[i] * v2[i]; i++;
			sum += v1[i] * v2[i]; i++;
			break;
		case 3:
			sum += v1[i] * v2[i]; i++;
			sum += v1[i] * v2[i]; i++;
			sum += v1[i] * v2[i]; i++;
			break;
		default: yassert_not_reached(); break;
		} // end switch
		// Keep going below with the rest of the vector
	}
	for (; i < n; ) {  // Unrolled 4 times...
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
		sum += v1[i] * v2[i]; i++;
	}
	return sum;
    }

This code is just an example for explanation. There are various further code optimizations that can be done for production-level efficiency. For example, we would change it to use pointer arithmetic rather than array indices; we might try replacing the four i++ operators with a single i+=4 (using offset indices such as v1[i+1] in the loop body); we could change the integer mod operator (%) to a bitwise-and test (i.e., use "n & 3" not "n % 4", which works since 4 is a power-of-two); and it also might be better to use "+" rather than "+=". Finally, if we carefully code the leftover cases, the main loop could be unrolled to many more levels than just four.
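
For illustration, here's a hedged sketch combining several of those micro-optimizations (pointer arithmetic, the bitwise-and remainder test, and handling the leftovers before the main unrolled loop); the function name is mine, and real-world gains depend on the compiler:

    float yapi_vecdot_unroll4_pointer(float v1[], float v2[], int n)  // Sketch: unrolled + pointer arithmetic
    {
        float sum = 0.0f;
        int extras = n & 3;  // Same as n % 4, since 4 is a power-of-two
        for (; extras > 0; extras--, v1++, v2++, n--) {
            sum += (*v1) * (*v2);  // Handle the leftover iterations first
        }
        for (float* endv1 = v1 + n; v1 != endv1; v1 += 4, v2 += 4) {
            sum += v1[0] * v2[0]   // 4 iterations per loop test
                 + v1[1] * v2[1]
                 + v1[2] * v2[2]
                 + v1[3] * v2[3];
        }
        return sum;
    }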

AI Research Papers on Loop Unrolling: Papers with discussion of loop unrolling include:

General Research on Loop Unrolling: Papers on loop unrolling as a general code optimization technique include:

Loop Tiling

Loop tiling is a method of executing sub-parts of a loop in a way that maximizes data locality, increases cache utilization, and improves parallel execution. This is also called "loop blocking" because it processes the data a "block" at a time. The sub-partitions of the data are called "tiles" or "blocks".

This technique is used inside neural network code to improve the efficiency of MatMuls and thereby get better throughput of tensor calculations from a GPU. A tiled version of MatMul processes "tiles" or "blocks" of the matrix at a time (small square or rectangular sections), with the aim of keeping small parts of the matrix in the cache while they are processed. The algorithm progresses across the matrix a tile/block at a time, rather than scanning all the way down one dimension (row or column). The same number of arithmetic operations are performed as in a non-tiled MatMul, but data locality and cache freshness should improve the overall speed.
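
As a rough illustration, here's a minimal sketch of a tiled MatMul in plain C++ (square row-major matrices; the tile size of 32 is an arbitrary choice of mine and should be tuned to the cache):

    const int TILE = 32;  // Tile size: tune to cache dimensions

    void matmul_tiled(const float* A, const float* B, float* C, int n)
    {
        // C += A * B for n x n row-major matrices.
        // Assumes C is zero-initialized and n is a multiple of TILE.
        for (int i0 = 0; i0 < n; i0 += TILE)
            for (int j0 = 0; j0 < n; j0 += TILE)
                for (int k0 = 0; k0 < n; k0 += TILE)
                    // Multiply one TILE x TILE block; the tiles stay cache-resident
                    for (int i = i0; i < i0 + TILE; i++)
                        for (int k = k0; k < k0 + TILE; k++) {
                            float a = A[i * n + k];
                            for (int j = j0; j < j0 + TILE; j++)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }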

AI Research on Loop Tiling. Loop tiling is a powerful technique for speeding up GPU throughput via increased pipelining. Research papers on the use of loop tiling include:

General Research on Loop Tiling: Papers on loop tiling as a general loop optimization and coding technique include:

Loop Fission

Loop fission is an optimization that is the opposite of loop fusion. Instead of fusing two loops into one, we take one loop and split parts of it into two loops. This is often called "loop fission", but has also been called other names such as "loop splitting" or "loop distribution."

Loop fission can be more efficient for parallel execution (e.g. vectorization for GPUs), but is often slower for sequential execution. Whereas loop fusion aims to remove the overhead of one of the loops, loop fission tolerates an increased loop overhead in return for simpler loop bodies that can be parallelized. The kernel optimization of "kernel fission" is based on loop fission, and loop fission is one technique used to achieve vectorization for GPUs.

The main reason to use loop fission is hardware acceleration via loop parallelization. A complicated single loop can often run faster as two loops if hardware acceleration can be accessed. This is true even if the two loops must run sequentially, because the iterations of each loop are parallelized, but there's a double benefit if the two whole loops can also run in parallel.

Example: Loop Fission in BatchNorm: A good example arises in part of the code for batch normalization. Each element of the vector needs to have two operations performed on it: subtract the mean (re-centering) and multiply by a variance factor (re-scaling). The naive implementation of the second half of BatchNorm looks like this:

    float denom = sqrtf(variance + epsilon);  // Scaling factor
    for (int i = 0; i < n; i++) {
        v[i] = (v[i] - fmean) / denom; // Normalize all elements to re-center and scale
    }

This is difficult to hardware accelerate because it's unlikely that there's a combined "subtract-and-then-divide" operation to apply to all elements of a vector in parallel. One possibility is that there's a combined "add-and-then-multiply" operation, in which case we can use the negative of the additive factor and the reciprocal of the scaling factor. However, assuming there's not, loop fission can be used to split the single complicated loop into two sequential loops.

    float negmean = -fmean;  // Use negative for addition not subtraction
    float denom = sqrtf(variance + epsilon);  // This is like std. deviation, but adjusted/smoothed by epsilon
    float recip = 1.0f / denom;  // Use reciprocal multiply not division
    yapi_vector_add_scalar(v, n, negmean);     // Loop 1: Re-center using mean
    yapi_vector_multiply_scalar(v, n, recip);  // Loop 2: Re-scale by factor

Each of the two loops is now easy to hardware accelerate, because they are both very simple vector operations: "multiply-by-scalar" and "add-scalar". Every platform is likely to have hardware acceleration APIs for those simpler operations. So, to summarize, we got an explosion to hypersonic rocket speed using atomic operations with loop fission. Isn't that just the bomb?
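
For completeness, here are minimal sketches of what those two helper loops might look like in plain C++ before any platform-specific acceleration (the names match the example above, but the bodies are my own illustration):

    void yapi_vector_add_scalar(float v[], int n, float c)  // Loop 1: v[i] += c
    {
        for (int i = 0; i < n; i++) v[i] += c;
    }

    void yapi_vector_multiply_scalar(float v[], int n, float c)  // Loop 2: v[i] *= c
    {
        for (int i = 0; i < n; i++) v[i] *= c;
    }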

Research Papers on Loop Fission: Here's some research on loop fission for parallelization (see also kernel operator fission):

Loop Reversal

Loop reversal is the optimization of making a loop run backwards. The speedup can occur in two ways:

  • Data locality: sometimes there is better data locality to go backwards (but not often).
  • Loop zero test: testing with zero is often faster than a less-than ("i < n") comparison. This optimization is called "looping down to zero."

Reversing a loop can be considered when the order of loop iterations doesn't matter, such as for a commutative operation like addition. Hence, it works for vector dot product.

Example: Basic Loop Reversal: Programmers are used to the typical loop structure:

    for (i = 0; i < n; i++)   // Forwards

But sometimes the reversal can work better:

    for (i = n - 1; i >= 0; i--)    // Backwards

Bug alert! Note that i has to be a signed type such as plain old int, not unsigned or size_t, because the "i >= 0" test will never fail with an unsigned type.

Example: Reversed Vector Dot Product in C++. Loop reversal can be used on vector dot product. Here's the basic idea:

    float yapi_vecdot_reverse_basic(float v1[], float v2[], int n)   // REVERSED basic vector dot product
    {
	float sum = 0.0;
	for (int i = n - 1; i >= 0; i--) {  // Note: i cannot be "unsigned" or "size_t" type
		sum += v1[i] * v2[i];
	}
	return sum;
    }

Here's a slightly better example where i has been removed, and the parameter n is decremented instead. So we have a double benefit of going backwards:

    float yapi_vecdot_reverse_basic2(float v1[], float v2[], int n)   // REVERSED basic vector dot product #2
    {
	float sum = 0.0;
	n--;  // Use "n" not "i"
	for (; n >= 0; n--) {  // Note: n cannot be "unsigned" or "size_t" type
		sum += v1[n] * v2[n];
	}
	return sum;
    }

But the above example still has the "n >= 0" test, which isn't that much faster than "i < n". Maybe the C++ compiler will convert it to a sign bit test, or maybe not. We can do better by using a faster equality test (i.e. "==" or "!=") to test for zero.

    float yapi_vecdot_reverse_zerotest(float v1[], float v2[], int n)   // Reversed-with-zero-test vector dot product
    {
	float sum = 0.0;
	int i = n - 1;
	do {
		sum += v1[i] * v2[i];
		i--;
	} while (i != 0);   // Zero-test is faster than ">=" test...
	sum += v1[0] * v2[0];  // Don't skip the last one!
	return sum;
    }

Note that we had to handle "0" as a special case after the loop, which was accidentally omitted in my first coding attempt. Unit tests can be helpful sometimes! But this code has other bugs, such as an infinite loop if passed a vector of size n=1. And the variable "i" should be written out of this version, too. And then it should be modified to use faster pointer arithmetic rather than array indices. And then I should give up and go buy an expensive GPU instead.

Research on Loop Reversal: Research papers on loop reversal as a generic programming optimization technique:

Loop Code Motion

Loop code motion is moving loop-invariant code from inside the loop body to the pre-initialization code before the loop. Any computation that yields the same value on every iteration should not be performed inside the loop body; instead, it should be pre-calculated before the loop and stored in a temporary variable. This is sometimes called "hoisting" the code out of the loop.

Example: Loop Code Motion: One common example of unnecessary recalculation of loop-invariant values is in the loop test. The code in the boolean test for the loop is actually part of the loop body.

An example of code that re-calculates the loop limit:

   for (i = 0; i < vec.num_elements(); i++) {
      ...
   }

The "num_elements" call is probably loop-invariant, assuming the vector doesn't change size during processing. Maybe the "num_elements" function is declared "inline" and the C++ compiler will fix it anyway. Nevertheless, this is a candidate for loop code motion, using a temporary variable instead:

   int n = vec.num_elements();  // Loop-invariant calculation moved outside loop body
   for (i = 0; i < n; i++) {
      ...
   }

General Research on Loop Code Motion: Papers on loop code motion as a general programming optimization started in the 1990s or earlier, usually from the point of view of getting programming language compilers to do loop-invariant code motion automatically:

Loop Distribution

Loop distribution is a loop optimization that is a special case of "loop code motion" where the hoisted code is a conditional test, creating two loops. Some early papers in the 1990s called it "loop unswitching". And some papers use the term "loop distribution" to refer to the general optimization of splitting a loop into two parallel loops, which we call "loop fission".

The goal of loop distribution is to move a conditional test (i.e., an if-test) out of the loop body, applying loop code motion to the if-test, in order to create two loops. This sounds similar to loop fission, but loop distribution is a more general optimization that doesn't require parallelization to get a speed improvement (whereas loop fission does). Instead, loop distribution gets a benefit in ordinary sequential execution, because the if-test computation moves out of the loop to a once-only pre-initialization test (i.e., it is "hoisted"), rather than running on every iteration, and the result is two separate loops on two pathways. Note that only one of the two loops actually gets executed, and they are not executed in parallel, so this technique is not really a type of loop fission.

Example: Loop Distribution: Here's a dummy example of implementing an add-or-subtract function using a passed-in Boolean flag.

    void yapi_vector_addition_slow(float v[], int n, bool do_addition, float scalar)
    {
	for (int i = 0; i < n; i++) {
		if (do_addition) v[i] += scalar;  // Add scalar
		else v[i] -= scalar;   // Subtraction
	}
    }

The problem is that the if-test, "if(do_addition)", is computed for every loop iteration, and yet "do_addition" is a loop-invariant variable. The faster version uses loop distribution to move the if-test into the loop initialization, and then splits the two pathways inside the loop into two separate loops. Here's the faster version:

    void yapi_vector_addition_loop_distribution(float v[], int n, bool do_addition, float scalar)
    {
	if (do_addition) { // Add scalar
		for (int i = 0; i < n; i++) {
			v[i] += scalar;  // Addition
		}
	}
	else {  // Subtract scalar
		for (int i = 0; i < n; i++) {
			v[i] -= scalar;   // Subtraction
		}
	}
    }

This example is still far from optimal. For starters, it should be using pointer arithmetic rather than array indices.

Research Papers on Loop Distribution: Some research papers address the loop distribution optimization, but it's not as commonly researched as loop unrolling or code motion. Note that some papers use the term "loop distribution" when really they are talking about parallelization optimizations, which is technically the slightly different technique of "loop fission". Here's some references:

Loop Reordering

Loop reordering is changing the ordering of two loops, or more commonly, changing the nesting of two nested loops, to reverse the inner and outer loops. Loop reordering is an optimization that doesn't reduce the total number of computations, but improves data locality and cache access speed by doing the operations in a different order. This reduces the cost of accessing the data into memory (or low-level caches), rather than the cost of the arithmetic. It is therefore related to memory/dataflow optimizations and pipelining optimizations.

In neural networks, there are many loops, and many ways of nesting them. Convolution layers can have as many as seven levels of nested loops. Hence, there are various research papers exploring different orders in which to perform the various computations.
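
As a simple illustration, consider the classic "i-j-k" versus "i-k-j" orderings of a MatMul kernel. Both compute the same result, but with row-major storage the second ordering scans B and C sequentially in the inner loop (this sketch assumes n x n row-major matrices with C zero-initialized):

    // i-j-k order: inner loop strides down a column of B (poor locality for row-major)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];

    // i-k-j order: inner loop accesses B and C consecutively (better locality)
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];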

Research papers on loop reordering:

Pointer Arithmetic

Pointer arithmetic is a tricky C++ optimization that gets rid of an incremented index variable by incrementing a pointer instead.

Here's a vector dot product with basic incremented loop variable:

    float yapi_vecdot_basic(float v1[], float v2[], int n)   // Basic vector dot product
    {
	float sum = 0.0;
	for (int i = 0; i < n; i++) {
		sum += v1[i] * v2[i];
	}
	return sum;
    }

And here's the same code when converted to pointer arithmetic:

    float yapi_vecdot_pointer_arithmetic(float v1[], float v2[], int n)   // Pointer arithmetic vector dot product
    {
	float sum = 0.0;
	float* endv1 = v1 + n;  // v1 start plus n*4 bytes
	for (; v1 < endv1; v1++,v2++) {
		sum += (*v1) * (*v2);
	}
	return sum;
    }

How does this work? We got rid of the temporary variable "i" by using pointer arithmetic ("*v1") instead of array indices ("v1[i]"). We are also using the function parameters "v1" and "v2" as temporary local variables, as permitted in C++, so we don't need an extra temporary pointer variable. The way this works with pointer arithmetic is v1 and v2 are treated as pointers, which works due to the near-equivalence of pointers and arrays in C++. The for loop incrementers "v1++" and "v2++" are both adding 4 bytes (the size of a float) to the pointers (and note they are separated by the C++ comma operator). The "endv1" end marker is calculated as the address of "v1[0]" plus "n*4" bytes, because the "+" operator in "v1+n" is pointer arithmetic addition scaled by the size of the pointed-to object, rather than normal integer addition.

Note that a further micro-optimization is possible. We can change the less-than test ("v1 < endv1") to an inequality test ("v1 != endv1"), because equality tests are slightly faster than less-than tests. Since this test is effectively inside the loop and done every iteration, this might be worth doing. The trade-off is safety: it'll become an infinite loop if you get the pointer math slightly wrong, but hey, your code has no bugs, right?

Loop Coalescing

Loop coalescing is a loop optimization that flattens two nested loops into one non-nested loop. The speed benefit can arise from simplifying the loop, which makes it easier to parallelize via hardware acceleration, and possibly from a different data access pattern that improves data locality and cache freshness. This optimization is not always possible, as nested loop logic is often quite complicated, and flattening a nested loop may actually worsen data locality in many instances. However, the linear nature of a simple loop can make the code to send off chunks to a GPU much easier.
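
Here's a minimal sketch of coalescing a doubly nested loop over a rows x cols matrix into one flat loop, assuming the matrix is stored contiguously in row-major order:

    // Nested version: two levels of loop overhead
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            m[i * cols + j] *= 2.0f;

    // Coalesced version: one flat loop, easier to chunk out to parallel hardware
    int total = rows * cols;
    for (int x = 0; x < total; x++)
        m[x] *= 2.0f;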

Research on Loop Coalescing: Various research papers mention loop coalescing as an optimization technique:

Loop Collapsing

Loop collapsing is closely related to loop coalescing, since both aim to flatten nested loops, but loop collapsing usually refers to the special case where the nesting is perfect and the inner loop's bounds do not depend on the outer loop variable, so the combined iteration space can be flattened exactly.
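
One place collapsing shows up in practice is OpenMP's "collapse" clause, which merges the iteration spaces of perfectly nested loops into a single parallel range. Here's a minimal sketch (requires compiling with OpenMP enabled):

    void scale_matrix(float* m, int rows, int cols, float factor)
    {
        // collapse(2) flattens the i/j iteration space into one range of
        // rows*cols iterations, which OpenMP then divides among threads.
        #pragma omp parallel for collapse(2)
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i * cols + j] *= factor;
    }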

Research on Loop Collapsing: Papers on collapsing nested loops:

Loop Peeling

Loop peeling is a type of loop unrolling that unravels only the first few iterations of a long loop. It is also similar to loop splitting into two sections, where the first section covers the early range, and the second section is the main loop over all remaining iterations.

Loop peeling is beneficial to the overall loop efficiency if there is code in the loop body that is only required for one or two early iterations, which can then be removed from the main loop body. Similarly, there can be benefit in unraveling the last few iterations of a loop, which is a similar technique.
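
Here's a minimal sketch of peeling the first iteration out of a loop whose body contains a test that only matters when i is zero (assuming n >= 1):

    // Before: the "i == 0" test runs on every iteration
    for (int i = 0; i < n; i++) {
        if (i == 0) b[i] = a[i];        // First element has no left neighbor
        else b[i] = a[i] + a[i - 1];
    }

    // After peeling: first iteration hoisted out, main loop body simplified
    b[0] = a[0];
    for (int i = 1; i < n; i++) {
        b[i] = a[i] + a[i - 1];
    }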

Research on Loop Peeling: There are research papers on loop peeling as an optimization technique:

Loop Splitting

Loop splitting refers to splitting the sequential iterations of a loop into two loops, which each perform part of the original loop's iterations. Loop splitting is sometimes called "loop sectioning". Note that "loop peeling" is a special case of loop splitting where the first section is a small range of a few initial iterations.

Loop splitting produces at least two "split-out" loops, one for the early iterations and one for the remainder, although a loop can also be split into more than two loops. Each split-out loop is shorter than the original loop. Unlike loop fission, the two loops operate over different subportions of the iterator variable's range, so the total number of iterations stays the same, rather than doubling as it does in loop fission.

This optimization is beneficial if the loop body has different sections of code that only relate to a subset of the iterator range. Hence, the loop bodies of the two loops can be reduced to execute less code. Overall, there is still the same number of iterations performed in the two loops combined, but each loop performs only a proportion of the original iterations.
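
Here's a minimal sketch of loop splitting, where a range test inside the loop body can be converted into two shorter loops over disjoint subranges (assuming 0 <= k <= n):

    // Before: the "i < k" test runs on all n iterations
    for (int i = 0; i < n; i++) {
        if (i < k) v[i] = 0.0f;     // Early range: zero out
        else v[i] *= scale;         // Remaining range: re-scale
    }

    // After splitting: same total iterations, but simpler loop bodies
    for (int i = 0; i < k; i++) v[i] = 0.0f;
    for (int i = k; i < n; i++) v[i] *= scale;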

Research on Loop Splitting: There are research papers on splitting loops as an optimization technique:

Loop Interchange

Loop interchange is an optimization of nested loops that switches the inner and outer loops. In a typical nested loop, the outer loop body and loop test are executed rarely, almost lazily, whereas the inner loop body is scrambling along in a frantic mess. Loop interchange simply switches them, reversing their roles.

Why is this an optimization? Although the same number of loop iterations still occur in total, and the newly-made inner loop body is also thrashed, various improvements can arise from reversing the iterator variables. One possible improvement is in data locality, which can reduce cache misses and speed up the overall execution. Note that this benefit is not guaranteed just by switching loops, and sometimes loop interchange can worsen data locality; careful analysis is needed. Another possibility is that reversing nested loops can create opportunities to apply other loop optimizations to the new inner loop.
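
Here's a minimal sketch of interchanging a nested loop so that the inner loop walks along the rows of a row-major matrix instead of down its columns; the iterations are identical, but the second form touches consecutive memory:

    // Before: inner loop varies i, so each access jumps cols floats ahead (cache-unfriendly)
    for (int j = 0; j < cols; j++)
        for (int i = 0; i < rows; i++)
            m[i * cols + j] += 1.0f;

    // After interchange: inner loop varies j, so accesses are consecutive
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            m[i * cols + j] += 1.0f;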

Research on Loop Interchange: There are various research papers about loop interchange as an optimization technique for nested loop structures:

Loop Sentinel

Suppose we have to scan a vector to check whether there's a negative element. Here's the simplest code:

    bool yapi_vector_has_negative_basic(float v[], int n)
    {
	for (int i = 0; i < n; i++) {
		if (v[i] < 0.0) return true;  // Found negative
	}
	return false;  // No negatives
    }

Assuming that n is large, the bottleneck is the loop body, in which there are 4 operations:

  • Loop test "i < n"
  • Array lookup: "v[i]"
  • Less-than test: " < 0.0"
  • Incrementer: "i++"

Now, we can get rid of the array operation "v[i]" by switching to pointer arithmetic. And this also reduces the "i < n" test to a pointer equality test (slightly better), and the "i++" becomes a pointer increment (not much different, and could even be worse since pointer increment is effectively "+=4"). So we basically still have 3 operations in the bottleneck code.

    bool yapi_vector_has_negative_pointer_arithmetic(float v[], int n)
    {
	float* endv = &v[n];
	for ( ; v != endv; v++) {
		if (*v < 0.0) return true;  // Found negative
	}
	return false;  // No negatives
    }

Can we do better?

Loop sentinels can do better.

Here's how it works with a sentinel: pretend we succeed, even if we don't. The idea is to add a dummy "fake success" element into the array so that we always succeed at the end. Then it's not necessary to test whether we're at the end, so this removes the "i < n" or "ptr == endptr" tests in the loop condition.

Here's the loop sentinel version:

   bool yapi_vector_has_negative_sentinel(float v[], int n)
   {
	v[n] = -99.0;  // Dummy negative (BUG!)
	int i = 0;
	for ( ; /*GONE!*/; i++) {
		if (v[i] < 0.0) break;  // Found negative
	}
	if (i == n) return false;  // At the dummy (fake success)
	return true;  // Found a negative (for real)
   }

How does this work? Since the test is looking for negatives, this code adds a dummy negative at the end. Hence, the loop will always succeed at the end, and the loop test can be removed without creating an infinite loop. The only change is that we can't just "return true" inside the loop if we find a negative, because it might be at the end, which should be "return false" for the dummy success.

Bug alert! Unfortunately, this code is buggy! The sentinel assignment "v[n]" is actually an array bounds overflow. Or even if it's not, we've also modified the array with a dummy value. One way to fix it is to always have every vector in the whole program have an extra safety element in its size, which is a ghastly solution.

A better way to fix it is to manipulate the last valid element "v[n-1]" instead. Then, we have to remember to fix it before we leave town.

    bool yapi_vector_has_negative_sentinel2(float v[], int n)
    {
	float save = v[n - 1];  // Save it!
	v[n - 1] = -99.0;  // Dummy negative at end
	int i = 0;
	for ( ; /*GONE!*/; i++) {
		if (v[i] < 0.0) break;  // Found negative
	}
	v[n - 1] = save;  // Restore it!
	if (i == n - 1) {
		// At the dummy (fake success)
		if (save < 0.0) return true;  // Must check it!
		return false;  
	}
	return true;  // Found a negative (for real)
    }

This sentinel example now has 3 main operations in the loop path and is hopefully bug-free. This is a 25% reduction at the cost of a few extra overhead tests and assignments before and after the loop. Sentinels also work with pointer arithmetic, so we can double down on optimizations by doing both. We just have to change "v[i]" to use pointer arithmetic instead (i.e. "*v"), and then the code has changed from 4 operations to 2 operations on the critical path, which is a 50% improvement in speed. Here's my code:

    bool yapi_vector_has_negative_sentinel3(float v[], int n)
    {
	// Sentinel + Pointer Arithmetic
	float *savev = &v[n - 1];
	float save = *savev;  // Save it!
	*savev = -99.0;  // Dummy negative at end v[n - 1]
	for (; /*GONE!*/; v++) {
		if (*v < 0.0) break;  // Found negative
	}
	*savev = save;  // Restore it!
	if (v == savev) {
		// At the dummy (fake success)
		if (save < 0.0) return true;  // Must check it!
		return false;
	}
	return true;  // Found a negative (for real)
    }

The above code is slow and linear for a CPU. Scanning a big vector could at least be multi-threaded to test multiple chunks in parallel, although this ends up doing some unnecessary computations, because it loses the benefit that the loop exits early once it finds a single negative. Is there a better way using hardware acceleration? Maybe there's a parallel accelerated method to check all the sign bits of the elements of a vector? That can go on the "todo" list for now!

Hashed Sentinel Objects: Interestingly, the idea of loop sentinels can be generalized beyond the above linear array search to more complex data structures, such as hash tables or binary trees. The trick is to use a sentinel object.

How to do it? Consider a hash table with a chained linked list for collision resolution. The main loop almost always has a "ptr!=NULL" test to detect the end of the chain list. The idea of sentinels for hash tables is to use a pointer to a dummy global object instead of the NULL pointer at the end of each chain. When doing a search, the sentinel object's key is set to the search key (a fake success), and we can then remove the "ptr!=NULL" test from the hot section of the loop.

Worth doing? This might be worthwhile if the main hash search loop is a bottleneck. But it's a lot of work, rather than just a change to a search function. For other routines where we can't use a sentinel, the main loop test changes to a "ptr!=globalptr" pointer comparison. All of the hash table data structure methods need to change to use the global sentinel pointer instead of NULL (e.g. insertions, deletions, scanning, etc.). Might be easier to go buy a bigger GPU.
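
Here's a rough sketch of the chained hash table idea (the types and names are mine for illustration). Every chain, including an empty bucket, ends at a pointer to the global sentinel node rather than NULL, and a search plants the sought key in the sentinel first, guaranteeing the loop terminates without a NULL test:

    struct HashNode {
        int key;
        float value;
        HashNode* next;  // Chains end at &g_sentinel, never at NULL
    };

    static HashNode g_sentinel = { 0, 0.0f, &g_sentinel };  // Global sentinel node

    HashNode* chain_search(HashNode* head, int key)  // head == &g_sentinel for an empty bucket
    {
        g_sentinel.key = key;               // Plant fake success at the end of the chain
        HashNode* p = head;
        while (p->key != key) p = p->next;  // Hot loop: no NULL test needed
        return (p == &g_sentinel) ? NULL : p;  // Fake success means "not found"
    }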

Research on Loop Sentinels: Loop sentinels are a well-known trick, dating back as far as the wonderful Knuth treatises.

Loop Strip Mining (Loop Sectioning)

Loop strip mining is a loop optimization that processes or "mines" an array in contiguous "strips", typically with the strip length chosen to match the hardware vector width or cache characteristics. It is related to "loop tiling", which operates on arrays in two dimensions, whereas strip mining applies to one-dimensional arrays. Loop strip mining is also called "loop sectioning".
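
Here's a minimal sketch of strip mining a long vector loop into fixed-size strips (the strip length of 8 is an arbitrary choice here; in practice it would match the hardware vector width):

    const int STRIP = 8;  // E.g., one SIMD register's worth of floats

    // Outer loop advances strip by strip; the fixed-size inner loop is a vectorization candidate
    int i = 0;
    for (; i + STRIP <= n; i += STRIP) {
        for (int j = i; j < i + STRIP; j++)
            v[j] *= scale;
    }
    for (; i < n; i++)  // Leftover partial strip
        v[i] *= scale;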

Research on Loop Strip Mining: Papers on loop strip mining:

Loop Spreading

Loop spreading is an optimization of two non-nested sequential loops that have different iteration ranges. Typically, this refers to cases where the end ranges differ significantly (if the loop ranges differ only by an off-by-one issue, then only loop normalization is required). Loop spreading modifies one of the loops so that part of it fully overlaps with the other loop. Hence, that subloop can be fused with the other loop, and possibly parallelized. The remaining iterations that don't overlap are then handled in a follow-up partial loop (for one of the loops only).
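
Here's a small sketch of the idea (assuming m < n): the longer loop is split so that its first m iterations align with the shorter loop's range, making those two subloops candidates for fusion or parallel execution:

    // Before: two sequential loops with different ranges (m < n)
    for (int i = 0; i < n; i++) a[i] += 1.0f;
    for (int i = 0; i < m; i++) b[i] *= 2.0f;

    // After spreading: the overlapping range is fused; the remainder gets a follow-up loop
    for (int i = 0; i < m; i++) {
        a[i] += 1.0f;
        b[i] *= 2.0f;
    }
    for (int i = m; i < n; i++) a[i] += 1.0f;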

Loop Normalization

Loop normalization is not directly an optimization, but a preliminary loop transformation that can make further loop optimizations easier, such as fusing two loops with loop fusion, or parallelizing each loop with loop fission or vectorization. The goal of loop normalization is to make the loop iteration variables run over the same range. Hence, loop normalization is needed when two loops in sequence start at different offsets (e.g., one starts at i=1 and the other at i=0).
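
For example, a loop starting at i=1 can be normalized to start at i=0 by adjusting the subscripts, so that it runs over the same iteration range as a neighboring loop:

    // Before: iteration range is 1..n
    for (int i = 1; i <= n; i++) a[i] = 0.0f;

    // After normalization: same work, iteration range is now 0..n-1
    for (int i = 0; i < n; i++) a[i + 1] = 0.0f;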

Research on Loop Normalization: Papers on loop normalization:

Loop Skewing

Loop skewing is a somewhat mind-bending method to change nested loops to make them more parallelizable. This technique applies when there are two nested loops, but the inner loop is difficult to parallelize because of a dependency on the outer loop variable.

The loop skewing solution is far from obvious. The range bounds of the inner loop are changed by "skewing" them by a factor based on the outer loop variable. And then, by some magical potion, this somehow breaks the dependence on the outer loop, and then the inner loop can run fast on a GPU. Who knew?
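
Here's a sketch of the classic wavefront example, assuming a is an N x M array. In the original loop nest, each element depends on its upper and left neighbors, so the inner loop is sequential; after skewing to iterate over diagonals t = i + j, every iteration on one diagonal depends only on the previous diagonal, so the inner loop can run in parallel:

    // Before: a[i][j] depends on a[i-1][j] and a[i][j-1]; inner loop is sequential
    for (int i = 1; i < N; i++)
        for (int j = 1; j < M; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];

    // After skewing: iterate diagonals t = i + j; iterations within a diagonal are independent
    for (int t = 2; t < N + M - 1; t++) {
        int ilo = (t - M + 1 > 1) ? t - M + 1 : 1;
        int ihi = (t - 1 < N - 1) ? t - 1 : N - 1;
        for (int i = ilo; i <= ihi; i++) {  // This loop is now parallelizable
            int j = t - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }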

Research on Loop Skewing: Papers on loop skewing:

Other Research on Loop Optimizations

Various other papers on loop optimizations:

More AI Research

Read more about: