Aussie AI
Deep Learning Compiler Optimization Techniques
-
Last Updated 22 November, 2024
-
by David Spuler, Ph.D.
Deep Learning Compilers are a specialized type of software tool used by ML software engineers: they take an AI model and generate optimized code to run that model's inference. There are various alternative names, such as graph compilers, ML compilers, model compilers, or even just "AI compilers". These compilers are lower-level software than the common AI software frameworks. They are also distinct from the traditional programming language compilers used for everyday programming in languages such as C++ or Python.
The input to a deep learning compiler is the model, and the output is runnable code that implements that model on a particular computing platform. Typically, these compilers first convert the model into an intermediate representation (IR), which can then be compiled for multiple target platforms. These compilers are often an important part of running models on edge devices.
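As a rough sketch of this workflow (assuming PyTorch 2.x; the toy model and shapes are purely illustrative, not a real LLM), the torch.compile API hands a model to a deep learning compiler backend, which traces it into a graph and emits optimized code for the target platform:

    # Minimal sketch: handing a model to a deep learning compiler.
    # Assumes PyTorch 2.x; the toy model and shapes are illustrative only.
    import torch
    import torch.nn as nn

    model = nn.Sequential(       # stand-in for a real model
        nn.Linear(512, 2048),
        nn.GELU(),
        nn.Linear(2048, 512),
    ).eval()

    # torch.compile traces the model into a graph and passes it to a compiler
    # backend (TorchInductor by default), which generates optimized,
    # platform-specific kernels.
    compiled_model = torch.compile(model)

    x = torch.randn(1, 512)
    with torch.no_grad():
        y = compiled_model(x)    # first call triggers compilation
    print(y.shape)               # torch.Size([1, 512])

Other compilers follow the same general pattern: export the model (e.g., to ONNX), compile it for the target hardware, then run the generated artifact.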
The Graph Nature of AI Computation
The Transformer architectures that run the inference and training of LLMs have a few peculiar characteristics:
- No looping.
- No alternative pathways (i.e., no "if" statements).
Hence, when you split out all the ways that the input tokens and the weights are processed, you can note that:
- It's finite
- It's a fixed sequence
The input tokens enter the pathway as a vector of tokens, which are then converted to a per-token vector representation called embeddings. These embedding values propagate through the Transformer components as "activations" (i.e., intermediate computed values), and there is a huge amount of computation from multiplying these activations by the model's weights. But there are very few points where the computation can take a different pathway. Generally speaking, the fixed pathway looks like this:
- Text converted to token vectors
- Each token is converted to an embedding vector.
- The embedding vectors are propagated through multiple layers.
- Each layer contains standard subcomponents (i.e., an attention module, activation functions, and a feed-forward network).
- After all layers, the computed values ("logits") are converted to a token.
Where's the "if" statement? Well, there's very few. The main one is at the last step, whereby the output token is chosen, which is called the "decoding algorithm" (e.g., greedy decoding always chooses the highest probability token/word to output).
Anyway, the point of all this is that every single step along the pathway is fixed. And if you write out all the pathways, with "nodes" for each subcomponent, you find that it is, ta da, a graph.
In fact, it has these properties:
- Finite (fixed number of layers with a fixed number of subcomponents).
- No cycles (because there are no loops).
Hence, it's a finite, directed acyclic graph (DAG).
Yes, I know, there are exceptions. The whole engine does loop back at the end of outputting a token, and restarts the layers to begin computing the next token (called autoregressive decoding). Also, some optimizations of this process add selection tests. Early exit optimizations add a test after one or more layers. KV caching optimizations add tests for whether we have a valid cache. But the basic point is that the computation of a single output token is a DAG. Hence, graph compilers can "compile" that DAG into a fixed set of computations.
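To make this concrete, here is a toy sketch (an illustrative data structure, not any real compiler's IR) of a two-layer, single-token forward pass written as a DAG, where every node names the nodes it consumes and there are no branches or cycles:

    # Toy sketch: a single-token forward pass as a DAG of named nodes.
    # The node format is illustrative, not a real compiler's intermediate representation.
    graph = [
        {"id": "tokens",     "op": "tokenize",    "inputs": []},
        {"id": "embed",      "op": "embedding",   "inputs": ["tokens"]},
        {"id": "attn_1",     "op": "attention",   "inputs": ["embed"]},
        {"id": "ffn_1",      "op": "ffn",         "inputs": ["attn_1"]},
        {"id": "attn_2",     "op": "attention",   "inputs": ["ffn_1"]},
        {"id": "ffn_2",      "op": "ffn",         "inputs": ["attn_2"]},
        {"id": "logits",     "op": "unembedding", "inputs": ["ffn_2"]},
        {"id": "next_token", "op": "argmax",      "inputs": ["logits"]},
    ]

    # Because the graph is finite and acyclic, a compiler can schedule it as one
    # fixed sequence of kernels (here, simply a walk in topological order).
    for node in graph:
        print(node["id"], "=", node["op"], "(", ", ".join(node["inputs"]), ")")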
Efficiency Optimization Techniques
The main goal of deep learning compilers is optimization, and they may offer a variety of optimization features. Some example types of optimizations that ML compilers can apply include:
- Hardware acceleration (i.e. support for various GPUs, CPUs, and other hardware platforms)
- Parallelization (i.e. manage the process of sending data to one or more GPUs in parallel)
- Graph optimization (see the sketch after this list)
- Kernel operator fusion
- Pipelining support (e.g. via scheduling, loop tiling, etc.)
- Vectorization
- Dataflow optimizations
- Memory optimizations
- Code optimizations (e.g. loop unrolling)
- Serving efficiency
- Inference optimizations (generally)
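As a toy illustration of a graph-level optimization (a simplified sketch in Python, not any particular compiler's pass or IR), the code below scans an operator graph and rewrites each matmul-followed-by-add pair into a single fused node:

    # Toy graph-optimization pass: fuse each "matmul" + "add" (bias) pair into
    # one node. The node format is illustrative, not a real compiler IR.
    def fuse_matmul_add(nodes):
        fused = []
        skip_next = False
        for i, node in enumerate(nodes):
            if skip_next:
                skip_next = False
                continue
            nxt = nodes[i + 1] if i + 1 < len(nodes) else None
            if node["op"] == "matmul" and nxt is not None and nxt["op"] == "add":
                fused.append({"op": "fused_matmul_add",
                              "inputs": node["inputs"] + nxt["inputs"][1:]})
                skip_next = True   # the "add" node has been absorbed
            else:
                fused.append(node)
        return fused

    ops = [
        {"op": "matmul", "inputs": ["x", "W"]},
        {"op": "add",    "inputs": ["matmul_out", "b"]},
        {"op": "relu",   "inputs": ["add_out"]},
    ]
    print(fuse_matmul_add(ops))
    # [{'op': 'fused_matmul_add', 'inputs': ['x', 'W', 'b']}, {'op': 'relu', ...}]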
List of Machine Learning Compiler Platforms
Here is a brief list of some of the major ML compilers (graph compilers) that you can use:
- TVM (Apache): https://tvm.apache.org/
- XLA (TensorFlow): https://www.tensorflow.org/xla
- Glow (PyTorch)
- Halide
- Keras
- SciKit
- CNTK
- MXNet
- LLVM
- Google MLIR
- JAX
- TensorRT (NVIDIA)
- ONNX Runtime
- OpenVINO (Intel)
Survey Papers on Machine Learning Compilers
Reviews of the many deep learning compilers available:
- S. Kalapothas, M. Galetakis, G. Flamis, F. Plessas, and P. Kitsos, A survey on RISC-V-based machine learning ecosystem, Information, vol. 14, no. 2, p. 64, 2023, https://www.mdpi.com/2078-2489/14/2/64, PDF: https://www.academia.edu/98345984/A_Survey_on_RISC_V_Based_Machine_Learning_Ecosystem
- Amnon Geifman, April 28, 2021, Graph Compilers for Deep Learning: Definition, Pros & Cons, and Popular Examples, https://deci.ai/blog/graph-compilers/
- M. Li et al., 2021, "The Deep Learning Compiler: A Comprehensive Survey," in IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 3, pp. 708-727, 1 March 2021, doi: 10.1109/TPDS.2020.3030548. https://ieeexplore.ieee.org/abstract/document/9222299
- Hongbin Zhang, Mingjie Xing, Yanjun Wu, Chen Zhao, 2023, Compiler Technologies in Deep Learning Co-Design: A Survey, Intelligent Computing, vol. 2, article 0040, DOI: 10.34133/icomputing.0040, https://spj.science.org/doi/full/10.34133/icomputing.0040
- Inas Bachiri, September 2024, A Literature Review on Combining Neural Architecture Search and Compiler Optimizations for Neural Network Acceleration, DOI:10.13140/RG.2.2.10612.16009, Thesis for: Master in Computer Science, https://www.researchgate.net/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration https://www.researchgate.net/profile/Inas-Bachiri/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration/links/66ed912c6b101f6fa4f3d6ce/A-Literature-Review-on-Combining-Neural-Architecture-Search-and-Compiler-Optimizations-for-Neural-Network-Acceleration.pdf
- NL Rane, SK Mallick, O Kaya, 2024, Tools and frameworks for machine learning and deep learning: A review, https://www.researchgate.net/profile/Nitin-Rane-2/publication/385116208_Tools_and_frameworks_for_machine_learning_and_deep_learning_A_review/links/671a2811df4b534d4ef2e15d/Tools-and-frameworks-for-machine-learning-and-deep-learning-A-review.pdf
Research Papers on Machine Learning Compilers
Research on deep learning compilers and papers related to specific usage of their methods:
- Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., Cowan, M., Wang, L., Hu, Y., Ceze, L., Guestrin, C., and Krishnamurthy, A., TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 578–594, Carlsbad, CA, 2018, https://arxiv.org/abs/1802.04799
- Lattner, C. and Adve, V., 2004, LLVM: A compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization. CGO 2004., pp. 75–86, https://ieeexplore.ieee.org/abstract/document/1281665, https://www.llvm.org/pubs/2004-01-30-CGO-LLVM.pdf (Early LLVM paper from 2004.)
- Lattner, C., Amini, M., Bondhugula, U., Cohen, A., Davis, A., Pienaar, J. A., Riddle, R., Shpeisman, T., Vasilache, N., and Zinenko, O., MLIR: Scaling Compiler Infrastructure for Domain Specific Computation. In Proceedings of the 19th ACM/IEEE International Symposium on Code Generation and Optimization, 2021. https://ieeexplore.ieee.org/abstract/document/9370308, PDF: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/85bf23fe88bd5c7ff60365bd0c6882928562cbeb.pdf, Code: https://mlir.llvm.org/
- Haichen Shen, Jared Roesch, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, Yida Wang, 2021, Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference, Proceedings of Machine Learning and Systems 3 (MLSys 2021), https://proceedings.mlsys.org/paper_files/paper/2021/hash/5b47430e24a5a1f9fe21f0e8eb814131-Abstract.html, https://arxiv.org/abs/2006.03031
- Rotem, N., Fix, J., Abdulrasool, S., Deng, S., Dzhabarov, R., Hegeman, J., Levenstein, R., Maher, B., Nadathur, S., Olesen, J., Park, J., Rakhov, A., and Smelyanskiy, M., 2018, Glow: Graph lowering compiler techniques for neural networks. CoRR, abs/1805.00907, https://arxiv.org/abs/1805.00907
- Lingxiao Ma, Zhiqiang Xie, Zhi Yang, Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, and Lidong Zhou. 2020. Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks. In OSDI 2020. USENIX Association, 881–897. https://typeset.io/papers/rammer-enabling-holistic-deep-learning-compiler-12fbfi80ej
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Experiments with various fusion methods in training. Note: code uses deprecated nvFuser compiler.)
- Daniel Snider, Ruofan Liang, Jan 2023, Operator Fusion in XLA: Analysis and Evaluation, https://arxiv.org/pdf/2301.13062.pdf
- Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations, Sep 2023, https://arxiv.org/pdf/2309.08978.pdf
- Y Ma, Y Cao, S Vrudhula, J Seo, 2017, Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In: FPGA 2017 - Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, February 2017, PDF: https://ieeexplore.ieee.org/ielaam/92/8396231/8330049-aam.pdf
- Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim Hazelwood, and David Brooks. Exploiting parallelism opportunities with deep learning frameworks. ACM Trans. Archit. Code Optim., 18(1), 2021. https://arxiv.org/abs/1908.04705 (Analysis of parallelization in terms of overhead of scheduling and intra-operator parallelization such as multi-threaded MatMul operations.)
- Huang, B.-Y., Lyubomirsky, S., Li, Y., He, M., Tambe, T., Smith, G. H., Gaonkar, A., Canumalla, V., Wei, G., Gupta, A., Tatlock, Z., & Malik, S., 2022, Specialized accelerators and compiler flows: Replacing accelerator apis with a formal software/hardware interface, ArXiv abs/2203.00218, https://arxiv.org/abs/2203.00218
- Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., et al. Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, 2017, https://arxiv.org/abs/1701.03980, Code: http://github.com/clab/dynet
- Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf (Detailed analysis of ML compiler optimizations, including code examples and a list of which optimizations each compiler supports, as of 2019.)
- William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko, 2023, High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, February 2023, Pages 119–134, https://dl.acm.org/doi/abs/10.1145/3572848.3577475, PDF: https://dl.acm.org/doi/pdf/10.1145/3572848.3577475
- Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 193–205. https://doi.org/10.1109/CGO.2019.8661197
- Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 943–948, https://ieeexplore.ieee.org/document/8115709
- Sivathanu, M., Chugh, T., Singapuram, S. S., and Zhou, L., 2019, Astra: Exploiting predictability to optimize deep learning. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS’19, pp. 909–923, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450362405. doi: 10.1145/3297858.3304072. https://doi.org/10.1145/3297858.3304072, https://dl.acm.org/doi/10.1145/3297858.3304072, https://www.microsoft.com/en-us/research/publication/astra-exploiting-predictability-to-optimize-deep-learning/, PDF: https://www.microsoft.com/en-us/research/uploads/prod/2019/02/astra-asplos19.pdf (Optimization methods in PyTorch and TensorFlow.)
- W Chen, Y Wang, Y Xu, C Gao, C Liu, 2022, A framework for neural network architecture and compile co-optimization, https://dl.acm.org/doi/abs/10.1145/3533251, PDF: https://dl.acm.org/doi/pdf/10.1145/3533251
- P Tillet, HT Kung, D Cox, 2019, Triton: an intermediate language and compiler for tiled neural network computations, Proceedings of the 3rd ACM SIGPLAN, http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
- P Gibson, 2023, Compiler-centric across-stack deep learning acceleration, Ph.D. thesis, University of Glasgow, https://theses.gla.ac.uk/83959/1/2023GibsonPhD.pdf Code: https://github.com/Wheest/bib-boi/
- Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, Lidong Zhou, October 2023, PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation, SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles, Pages 331–347, https://doi.org/10.1145/3600006.3613139 https://dl.acm.org/doi/abs/10.1145/3600006.3613139 (Deep learning compiler for dynamic sparsity.)
- Y Ma, Y Cao, S Vrudhula, J Seo, 2017, Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks, Slides PDF: https://www.isfpga.org/past/fpga2017/slides/D1_S1_04.pdf
- David Spuler, March 2024, Chapter 28. Deslugging AI Engines, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- ONNX, 2023, Graph Optimizations in ONNX Runtime, https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- XLA Team, March 2017, XLA - TensorFlow, compiled, https://developers.googleblog.com/2017/03/xla-tensorflow-compiled.html
- Luhui Hu, Nov 18, 2022, AI Compilers Demystified. Accelerate AI/ML through compilation: NVIDIA TensorRT, ONNX Runtime, Apache TVM, etc., https://medium.com/geekculture/ai-compilers-ae28afbc4907
- Karthee Sivalingam, Nina Mujkanovic, 2022, Graph compilers for AI training and inference, CRAY EMEA Research Lab, https://www.sodalite.eu/sites/sodalite/files/public/content-files/articles/graph-compilers-proof2-blog.pdf
Kernel Fusion in ML Compilers
Kernel fusion is the optimization of merging two operations into a single operator. The combined operator is thereby faster than running the two operations sequentially. The optimization may reduce computation or memory accesses, lowering the overall cost. Read more about kernel operator fusion techniques.
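As a rough numerical sketch of the idea (plain NumPy with made-up shapes, not a real compiler-generated kernel), the unfused version below runs three separate operations and writes two intermediate arrays, whereas the fused operator computes relu(x @ W + b) as a single step; a real compiler would emit one kernel that avoids the intermediate memory traffic:

    # Toy sketch of kernel fusion: unfused vs. fused relu(x @ W + b).
    # Shapes and data are illustrative.
    import numpy as np

    x = np.random.randn(4, 8)
    W = np.random.randn(8, 16)
    b = np.random.randn(16)

    # Unfused: three separate "kernels", each materializing an intermediate.
    t1 = x @ W                        # kernel 1: matrix multiply
    t2 = t1 + b                       # kernel 2: bias add
    y_unfused = np.maximum(t2, 0.0)   # kernel 3: ReLU activation

    # Fused: one combined operator; a compiler-generated fused kernel would
    # compute this without writing t1 and t2 back to memory.
    def fused_matmul_bias_relu(x, W, b):
        return np.maximum(x @ W + b, 0.0)

    y_fused = fused_matmul_bias_relu(x, W, b)
    assert np.allclose(y_unfused, y_fused)   # same result, fewer memory round-trips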
Deep learning compilers use kernel fusion as one of the major optimizations on the execution graph. Compiler-executed kernel operator fusion research papers:
- Huan Song, Jayaraman J. Thiagarajan, Prasanna Sattigeri, Karthikeyan Natesan Ramamurthy, Andreas Spanias, Dec 2016, A Deep Learning Approach To Multiple Kernel Fusion, https://arxiv.org/abs/1612.09007
- X Cai, Y Wang, L Zhang, 2022, Optimus: An operator fusion framework for deep neural networks, ACM Transactions on Embedded Computing Systems, Vol. 22, No. 1, Article 1. October 2022, https://dl.acm.org/doi/pdf/10.1145/3520142 (Operator fusion theory for hardware accelerators and deep learning compilers.)
- S Zheng, S Chen, P Song, R Chen, X Li, 2023, Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), https://ieeexplore.ieee.org/abstract/document/10071018/ (Analysis of operator fusion on hardware accelerators.)
- A. Ashari, S. Tatikonda, M. Boehm, B. Reinwald, K. Campbell, J. Keenleyside, et al., "On optimizing machine learning workloads via kernel fusion", Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming PPoPP 2015, pp. 173-182, February 7-11, 2015, https://doi.org/10.1145/2688500.2688521
- Matthias Boehm, Berthold Reinwald, Dylan Hutchison, Alexandre V. Evfimievski, and Prithviraj Sen. 2018. On optimizing operator fusion plans for large-scale machine learning in systemml. arXiv:1801.00829. https://arxiv.org/abs/1801.00829
- Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. 2019. TASO: optimizing deep learning computation with automatic generation of graph substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 47–62. https://doi.org/10.1145/3341301.3359630 PDF: https://cs.stanford.edu/~padon/taso-sosp19.pdf
- Rasmus Munk Larsen and Tatiana Shpeisman. 2023. TensorFlow graph optimization with Grappler, https://www.tensorflow.org/guide/graph_optimization
- Daniel Snider, Ruofan Liang, Jan 2023, Operator Fusion in XLA: Analysis and Evaluation, https://arxiv.org/pdf/2301.13062.pdf
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Experiments with various fusion methods in training. Note: code uses deprecated nvFuser compiler.)
- Robert Lim, 2019, Methods for accelerating machine learning in high performance computing, Report AREA-2019-01, School of Computer and Data Sciences, University of Oregon, https://www.cs.uoregon.edu/Reports/AREA-201901-Lim.pdf (Extensive analysis of ML compiler optimizations. Table 8 lists the fusion operations supported by various ML compilers, as of 2019.)
- B Qiao, O Reiche, F Hannig, 2019, From loop fusion to kernel fusion: A domain-specific approach to locality optimization, 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), https://ieeexplore.ieee.org/document/8661176 (Theory of loop fusion generalized to graph kernel fusion for image processing.)
- H Peng, R Ran, Y Luo, J Zhao, S Huang, K Thorat, 2023, LinGCN: Structural Linearized Graph Convolutional Network for Homomorphically Encrypted Inference, https://arxiv.org/pdf/2309.14331.pdf (Has "operator fusion for node-wise activation functions"; see also "Section A.4 Further Detail of Operator Fusion".)
- Christian Sarofeen, Piotr Bialecki, Jie Jiang, Kevin Stephano, Masaki Kozuki, Neal Vaidya, and Stas Bekman, August 2022, Introducing nvFuser, a deep learning compiler for PyTorch, https://pytorch.org/blog/introducing-nvfuser-a-deep-learning-compiler-for-pytorch/, Project: https://github.com/pytorch/pytorch/projects/30 (Note: nvFuser deep learning compiler for Pytorch, but seems to be deprecated.)
- N. Rotem, J. Fix, S. Abdulrasool, S. Deng, R. Dzhabarov, J. Hegeman, R. Levenstein, B. Maher, N. Satish, J. Olesen, J. Park, A. Rakhov, and M. Smelyanskiy, “Glow: Graph lowering compiler techniques for neural networks,” CoRR, vol. abs/1805.00907, 2018. http://arxiv.org/abs/1805.00907 Code: http://github.com/pytorch/glow (Paper introduces PyTorch Glow, with much coverage of its kernel operator fusion and "operator stacking" methods.)
- T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Q. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, TVM: end-to-end optimization stack for deep learning, CoRR, vol. abs/1802.04799, 2018. http://arxiv.org/abs/1802.04799
- Xia, C., Zhao, J., Sun, Q., Wang, Z., Wen, Y., Feng, X., Cui, H., 2023, Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions, The ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 27 Apr-01 May 2023, San Diego, USA. https://eprints.whiterose.ac.uk/203681/, PDF: https://eprints.whiterose.ac.uk/203681/1/asplos24.pdf
- Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. Equality saturation for tensor graph superoptimization. Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2101.01332
- Zhen Zheng, Xuanda Yang, et al. 2022. AStitch: enabling a new multidimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. 359–373. https://dl.acm.org/doi/abs/10.1145/3503222.3507723
- Riyadh Baghdadi, Jessica Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming Zhang, Patricia Suriana, Shoaib Kamil, and Saman Amarasinghe. 2019. Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code. In 2019 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 193–205. https://doi.org/10.1109/CGO.2019.8661197
- Huanting Wang, Zhanyong Tang, et al. 2022. Automating Reinforcement Learning Architecture Design for Code Optimization. Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (Seoul, South Korea) (CC 2022). Association for Computing Machinery, New York, NY, USA, 129–143. https://doi.org/10.1145/3497776.3517769
- Fredrik Kjolstad, Stephen Chou, David Lugato, Shoaib Kamil, and Saman Amarasinghe. 2017. Taco: A tool to generate tensor algebra kernels. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 943–948, https://ieeexplore.ieee.org/document/8115709
- Jia, Z., Padon, O., Thomas, J., Warszawski, T., Zaharia, M., Aiken, A., 2019, Taso: Optimizing deep learning computation with automatic generation of graph substitutions. Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP’19, pp. 47–62, New York, NY, USA, 2019. ACM. ISBN 9781450368735. doi: 10.1145/3341301.3359630. https://doi.org/10.1145/3341301.3359630, https://dl.acm.org/doi/10.1145/3341301.3359630
- Jangda, A. and Bondhugula, U., An Effective Fusion and Tile Size Model for PolyMage, ACM Trans. Program. Lang. Syst., 42(3), November 2020. ISSN 0164-0925. doi: 10.1145/3404 https://dl.acm.org/doi/10.1145/3404846
- Ma, L., Xie, Z., Yang, Z., Xue, J., Miao, Y., Cui, W., Hu, W., Yang, F., Zhang, L., and Zhou, L., RAMMER: enabling holistic deep learning compiler optimizations with rtasks, In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 881–897. USENIX Association, November 2020. ISBN 978-1-939133-19-9. https://www.usenix.org/conference/osdi20/presentation/ma, https://dl.acm.org/doi/10.5555/3488766.3488816
- Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S., Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’13, pp. 519–530, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.2462176. http://doi.acm.org/10.1145/2491956.2462176, PDF: https://people.csail.mit.edu/jrk/halide-pldi13.pdf
- Vasilache, N., Zinenko, O., Theodoridis, T., Goyal, P., Devito, Z., Moses, W. S., Verdoolaege, S., Adams, A., and Cohen, A., 2019, The next 700 accelerated layers: From mathematical expressions of network computation graphs to accelerated gpu kernels, automatically. ACM Trans. Archit. Code Optim., 16(4), October 2019. ISSN 1544-3566. doi: 10.1145/3355606. https://doi.org/10.1145/3355606, https://dl.acm.org/doi/fullHtml/10.1145/3355606
- Ding, Y., Zhu, L., Jia, Z., Pekhimenko, G., and Han, S. Ios, 2021, Inter-operator scheduler for cnn acceleration. In Smola, A., Dimakis, A., and Stoica, I. (eds.), Proceedings of Machine Learning and Systems, volume 3, pp. 1–14, 2021. https://arxiv.org/abs/2011.01302, PDF: https://proceedings.mlsys.org/paper/2021/file/38b3eff8baf56627478ec76a704e9b52-Paper.pdf, Code: https://github.com/mit-han-lab/inter-operator-scheduler (Examines improvements to single-operator parallelization, "intra-operator", to parallelization improvements across multiple operators, "inter-operator".)
- Irigoin, F. and Triolet, R., Supernode partitioning. In Proc. of the 15th ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages, POPL’88, pp. 319–329, New York, NY, USA, 1988. ACM. ISBN 0-89791-252-7. doi: 10.1145/73560. 73588. http://doi.acm.org/10.1145/73560.73588
- Jie Zhao, Xiong Gao, Ruijie Xia, Zhaochuang Zhang, Deshi Chen, Lei Chen, Renwei Zhang, Zhen Geng, Bin Cheng, Xuefeng Jin, 2022, Apollo: Automatic partition-based operator fusion through layer by layer optimization, https://proceedings.mlsys.org/paper_files/paper/2022/hash/e175e8a86d28d935be4f43719651f86d-Abstract.html PDF: https://proceedings.mlsys.org/paper_files/paper/2022/file/e175e8a86d28d935be4f43719651f86d-Paper.pdf, PDF: https://yaozhujia.github.io/assets/pdf/mlsys2022-paper.pdf
More AI Research
Read more about: