
Deep Learning Compiler Optimization Techniques

  • Last Updated 22 November, 2024
  • by David Spuler, Ph.D.

Deep Learning Compilers are a specialized type of software tool, used by ML software engineers, that takes an AI model and generates optimized code to run that model's inference. They go by various alternative names, such as graph compilers, ML compilers, model compilers, or just "AI compilers". These compilers are lower-level software than the common AI frameworks, and they are unrelated to the traditional programming language compilers used for everyday programming in languages such as C++ or Python.

The input to a deep learning compiler is the model, and the output is runnable code that implements that model on a particular computing platform. Typically, these compilers first convert the model into an intermediate representation (IR), which can then be compiled for multiple target platforms. These compilers are often an important part of running models on an edge device.
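For example, here is a minimal sketch of invoking one such compiler stack via PyTorch 2.x's torch.compile, which captures the model's graph and generates optimized kernels for the current platform; the toy model and shapes are illustrative assumptions, not from any particular documentation:

    import torch
    import torch.nn as nn

    # A tiny toy model standing in for a real LLM (illustrative only).
    model = nn.Sequential(
        nn.Linear(512, 2048),
        nn.ReLU(),
        nn.Linear(2048, 512),
    )

    # Ask the compiler to capture the model's computation graph and generate
    # optimized code for the current platform (CPU or GPU backend).
    compiled_model = torch.compile(model)

    # Inference now runs through the compiled code path.
    x = torch.randn(1, 512)
    with torch.no_grad():
        y = compiled_model(x)
    print(y.shape)  # torch.Size([1, 512])

Other compiler stacks (e.g., ONNX-based or vendor-specific toolchains) follow the same general pattern: model in, intermediate representation, platform-specific code out.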

The Graph Nature of AI Computation

The Transformer architectures that run the inference and training of LLMs have a few peculiar characteristics:

  • No looping.
  • No alternative pathways (i.e., no "if" statements)

Hence, when you split out all the ways that the input tokens and the weights are processed, you can note that:

  • It's finite
  • It's a fixed sequence

The input tokens enter the pathway as a vector of tokens, which is then converted to a vector-per-token representation called embeddings. These embedding values propagate through the Transformer components as "activations" (i.e., the intermediate computed values), and there is a huge amount of computation from multiplying them by all of the weights. But there are very few points where the computation can take a different pathway. Generally speaking, the fixed pathway looks like this (see the sketch after the list):

  • Text converted to token vectors
  • Each token converts to a vectorized embedding
  • Each set of embedding vectors propagated through multiple layers
  • Each layer contains standard subcomponents (i.e., attention modules, activation functions, and feed-forward networks).
  • After all layers, the computed values ("logits") are converted to a token.
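
As a rough illustration of how fixed this pathway is, here is a minimal NumPy sketch of a single forward pass; the shapes, layer count, random weights, and simplified "attention" step are all assumptions for illustration, not a real Transformer:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, d, num_layers = 1000, 64, 4               # assumed toy sizes

    embed = rng.standard_normal((vocab, d))          # embedding table
    W_attn = rng.standard_normal((num_layers, d, d)) # stand-in "attention" weights
    W_ffn = rng.standard_normal((num_layers, d, d))  # stand-in feed-forward weights
    unembed = rng.standard_normal((d, vocab))        # projection back to vocabulary logits

    tokens = np.array([1, 42, 7])     # step 1: the text, already tokenized
    x = embed[tokens]                 # step 2: each token becomes an embedding vector

    for layer in range(num_layers):   # step 3: propagate through every layer, in order
        x = x @ W_attn[layer]         #   simplified stand-in for the attention module
        x = np.maximum(x, 0.0)        #   activation function (ReLU)
        x = x @ W_ffn[layer]          #   simplified stand-in for the feed-forward network

    logits = x[-1] @ unembed          # step 4: logits used to choose the output token

Note that the whole pass is a fixed sequence of matrix operations: nothing in the data changes which operations run.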

Where's the "if" statement? Well, there's very few. The main one is at the last step, whereby the output token is chosen, which is called the "decoding algorithm" (e.g., greedy decoding always chooses the highest probability token/word to output).

Anyway, the point of all this is that every single step along the pathway is fixed. And if you write out all the pathways, with "nodes" for each subcomponent, what you end up with is, ta da, a graph.

In fact, it has these properties:

  • Finite (fixed number of layers with a fixed number of subcomponents).
  • No cycles (because there are no loops).

Hence, it's a finite, directed acyclic graph (DAG).

Yes, I know, there are exceptions. The whole engine does loop back at the end of outputting a token, restarting the layers to compute the next token (called autoregressive decoding). Also, some optimizations of this process add selection tests: early exit optimizations add a test after one or more layers, and KV caching optimizations add tests for whether we have a valid cache. But the basic point remains that the computation of a single output token is a DAG. Hence, graph compilers can "compile" that DAG into a fixed set of computations.
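To make that concrete, here is a minimal sketch of the outer autoregressive loop wrapping the single-token DAG, reusing the toy definitions from the earlier sketch; forward_pass is a hypothetical placeholder name, not a real engine API:

    def forward_pass(tokens):
        # The single-token DAG: the fixed, branch-free layer pipeline sketched above.
        # This is the part that a graph compiler can compile ahead of time.
        x = embed[np.array(tokens)]
        for layer in range(num_layers):
            x = x @ W_attn[layer]
            x = np.maximum(x, 0.0)
            x = x @ W_ffn[layer]
        return x[-1] @ unembed

    tokens = [1, 42, 7]                   # the prompt, already tokenized
    for _ in range(20):                   # autoregressive loop: sits outside the DAG
        logits = forward_pass(tokens)     # one fixed DAG evaluation per output token
        next_token = int(np.argmax(logits))
        tokens.append(next_token)         # loop back: the new token feeds the next step

The inner forward_pass is the DAG that the compiler optimizes; the outer loop, plus any early-exit or cache-validity tests, sits around that compiled graph.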

Efficiency Optimization Techniques

The main goal of deep learning compilers is optimization, and they may offer a variety of different optimization features. Some example types of optimizations that ML compilers can use include:

List of Machine Learning Compiler Platforms

Here is a brief list of some of the major ML compilers (graph compilers) that you can use:

Survey Papers on Machine Learning Compilers

Reviews of the many deep learning compilers available:

Research Papers on Machine Learning Compilers

Research on deep learning compilers and papers related to specific usage of their methods:

Kernel Fusion in ML Compilers

Kernel fusion is the optimization of merging two operations into a single operator. The combined operator is faster than running the two operations sequentially, because fusing them can reduce redundant computations or memory accesses, lowering the overall cost. Read more about kernel operator fusion techniques.
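As a toy illustration of the idea, here is a minimal Python sketch contrasting two separate elementwise passes (bias add, then ReLU) with one fused pass that avoids writing out the intermediate array; the function names and explicit loop are illustrative assumptions, not any particular compiler's output:

    import numpy as np

    def unfused(x, bias):
        # Two separate "kernels": each one is a full pass over memory.
        y = x + bias               # kernel 1: bias add (writes an intermediate array)
        return np.maximum(y, 0.0)  # kernel 2: ReLU (reads that intermediate back in)

    def fused(x, bias):
        # One fused kernel: a single pass, no intermediate array in memory.
        out = np.empty_like(x)
        for i in range(x.size):
            v = x.flat[i] + bias.flat[i]          # bias add...
            out.flat[i] = v if v > 0.0 else 0.0   # ...and ReLU in the same loop body
        return out

    x = np.random.randn(8)
    bias = np.random.randn(8)
    assert np.allclose(unfused(x, bias), fused(x, bias))

In Python the fused loop is slower, of course, but a compiler emits it as native code for the target hardware, and the saving comes from reducing memory traffic by skipping the intermediate array.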

Deep learning compilers use kernel fusion as one of the major optimizations on the execution graph. Compiler-executed kernel operator fusion research papers:

More AI Research

Read more about: