Aussie AI
Optimizing Matrix Multiplication Algorithms and MatMul/GEMM Kernels
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
Matrix Multiplication Research
With papers on this topic since the 1970s, you'd think we'd be done optimizing matrix multiplication code by now. Although the earliest work was on sequential algorithms, even parallel matrix multiplication algorithms have decades of theory behind them. And yet there is a continual stream of research papers on MatMul/GEMM optimizations.
Feel free to blame the AI bubble for this. AI engines are built on matrix multiplication, and it occurs in all of these standard LLM components:
- Attention algorithms (self-attention) — uses three special-purpose QKV matrices.
- Feed-Forward Networks (FFNs) — uses two matrix multiplications with an intervening activation function (e.g., RELU or SwiGLU).
- Embedding and unembedding matrices.
Note that attention and FFN blocks occur in every layer of an LLM.
There's quite a lot of work involved in optimizing attention and FFNs. The reason for the ongoing research is that there are many nuances and special cases. Some examples:
- Sparse versus dense matrices (i.e. SpMM versus GEMM algorithms).
- Matrix-matrix versus matrix-vector multiplications.
- Rectangular versus square matrices, including tall or flat rectangular matrices.
- Special shapes of matrices (e.g. triangular).
Optimizations to the matrix processing code must also deal with:
- Compute-bound versus memory-bound algorithms.
- Different computational versus memory costs in different hardware architectures (e.g., CPUs, GPUs, NPUs).
- Generalization from 2D matrices to 3D tensors (i.e., "tensor slices").
- Integer versus floating-point arithmetic.
- Quantization to fewer bits, such as 8-bit integer or 4-bit kernels (also reduced-bit floating-point FP8 and even FP4 kernels).
- Permutation-based vector/matrix/tensor algorithms.
- Sparse matrix storage compression formats (there are various in-memory methods).
- Non-negative matrices (e.g., activations arising after RELU, or fixed non-negatives from scaling the weight matrices).
There are also some features specific to AI engine algorithms and their use of matrices:
- The prefill phase (prompt processing) is compute-bound, but in the decoding phase the FFNs are compute-bound because their weight matrices are static, whereas the QKV attention computations are memory-bound because the KV cache is not static data.
- FFNs involve two matrix multiplications separated by an element-wise activation function, so kernel fusion is possible (see the sketch after this list).
- Attention kernels have a particular QKV algorithm, for which there are many specialty kernels: e.g., Group Query Attention (GQA), Multi-Query Attention (MQA), Flash Attention, Paged Attention, etc.
- Transposed matrices are used in the QKV algorithm, adding another wrinkle.
- Embedding and unembedding matrices are used to convert between tokens and embedding vectors.
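To make the FFN structure and the fusion opportunity above concrete, here is a minimal C++ sketch (a decoding-phase, single-vector case) with the RELU activation fused into the output step of the first multiplication; the function name and dimensions are illustrative assumptions, not any particular engine's kernel:

    #include <vector>
    #include <algorithm>

    // y[m] = W2[m x h] * RELU(W1[h x n] * x[n]); all matrices stored row-major.
    void ffn_fused_relu(const std::vector<float>& W1, const std::vector<float>& W2,
                        const std::vector<float>& x, std::vector<float>& y,
                        int n, int h, int m) {
        std::vector<float> hidden(h);
        for (int i = 0; i < h; ++i) {
            float sum = 0.0f;
            for (int j = 0; j < n; ++j)
                sum += W1[i * n + j] * x[j];
            hidden[i] = std::max(sum, 0.0f);  // RELU fused into the first MatMul's output step
        }
        for (int i = 0; i < m; ++i) {         // second MatMul (y must be sized m)
            float sum = 0.0f;
            for (int j = 0; j < h; ++j)
                sum += W2[i * h + j] * hidden[j];
            y[i] = sum;
        }
    }

Fusing the activation this way avoids a separate pass over the intermediate vector, a small but free saving in memory traffic.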
And there are some theoretical alternative methods of handling matrix multiplications:
- Low-rank matrix factorization algorithms (e.g., SVD).
- Approximate matrix multiplication
- Element-wise multiplication (Hadamard product)
Since many of these techniques are orthogonal, you can see the combinatorial growth of the various sub-cases. We might be reading GEMM papers for a while!
Optimizing Matrix Multiplication
Matrix multiplications are the basis of neural network processing. The main bottleneck is the multiplication operation deep inside the nested loops of the matrix multiplier. The main optimization in modern times is hardware acceleration, running many of those multiplications in parallel.
The main techniques to use in faster MatMul/GEMM kernels include:
- GPU versions of kernels
- Advanced theoretical matrix multiplication algorithms (e.g. Strassen's)
- Approximate matrix multiplication
Researchers have been optimizing matrix multiplication for decades, so you might think it's all done and dusted. Well, firstly, much of that older research concerns sequential algorithms rather than parallel ones. Furthermore, most of it focused on reducing the number of multiplications (i.e., FLOPs), whereas the cost of memory access is the greater issue for very large matrices. Hence, there's a steady stream of matrix multiplication optimization papers, with no end in sight. Some areas of focus for this AI-related matrix multiplication research include:
- Quantized kernels — e.g. INT4 or INT8 GEMM kernels
- Memory-efficient matrix multiplication (generally)
- Memory-efficient attention-specific optimizations (e.g., Flash attention, Paged attention)
- Sparse matrix kernels (see also sparsity optimizations)
- Low-rank matrix factorization
- Specialized versions (e.g., matrix-vector, square vs rectangular, etc.)
- Advanced parallelization (e.g., block-level)
- Tensor generalizations — add another dimension!
Several of the components of a basic AI Transformer engine (for LLMs) use matrix multiplications in a specialized way; see also the more specific optimizations of those components.
Naive Matrix Multiplication Algorithm
The simplest algorithm to compute matrix multiplication is a cubic-complexity algorithm, which involves three nested loops (a minimal sketch is shown after the references below). There are various papers with examples:
- Geeks for Geeks, 31 Oct, 2024, Program to multiply two matrices, https://www.geeksforgeeks.org/c-program-multiply-two-matrices/
- Geeks for Geeks, 21 Sep, 2023, Multiply Two Matrices in C++, https://www.geeksforgeeks.org/cpp-matrix-multiplication/
- Codes Cracker, Nov 2024, C Program to Multiply Two Matrices, https://codescracker.com/c/program/c-program-multiply-two-matrices.htm
- Know Program, Nov 2024, Matrix Operations in C | Addition, Multiplication, Transpose, https://www.knowprogram.com/c-programming/matrix-operations-in-c/
- Nadav Rotem, Nov 2024, High-Performance Matrix Multiplication, https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
- Ulrich Drepper, October 23, 2007, Memory part 5: What programmers can do, https://lwn.net/Articles/255364/
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
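For reference, a minimal C++ version of that naive algorithm looks roughly like this (row-major storage assumed); every serious kernel discussed below is ultimately an optimization of these three loops:

    #include <vector>

    // C[m x p] = A[m x n] * B[n x p], naive cubic-complexity version.
    void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int m, int n, int p) {
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < p; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * p + j];  // one multiply-add per innermost iteration
                C[i * p + j] = sum;
            }
        }
    }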
Survey Papers on Matrix Multiplication Kernels (MatMul/GEMM)
Survey papers on matrix multiplication include:
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, Vivek Sarkar, 1 Aug 2016, A survey of sparse matrix-vector multiplication performance on large matrices, https://arxiv.org/abs/1608.00636
- S. Afroz, M. Tahaseen, F. Ahmed, K. S. Farshee and M. N. Huda, "Survey on matrix multiplication algorithms," 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), Dhaka, Bangladesh, 2016, pp. 151-155, doi: 10.1109/ICIEV.2016.7759985. https://ieeexplore.ieee.org/abstract/document/7759985
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
Surveys on Sparse Matrix Multiplication
Review papers on sparse matrix multiplication (SpMM):
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, Vivek Sarkar, 1 Aug 2016, A survey of sparse matrix-vector multiplication performance on large matrices, https://arxiv.org/abs/1608.00636
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
Sparse Matrix Kernels
Sparse matrices are those with lots of zeros and few non-zero values. There are various ways to optimize MatMul/GEMM for sparsity.
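As background for the papers below, here is a minimal sketch of a sparse matrix-vector multiply in Compressed Sparse Row (CSR) format, which stores and multiplies only the non-zero values; the struct layout and names are illustrative assumptions:

    #include <vector>

    struct CsrMatrix {
        int rows;
        std::vector<float> values;    // non-zero values, stored row by row
        std::vector<int> col_index;   // column of each non-zero value
        std::vector<int> row_start;   // rows+1 offsets into values/col_index
    };

    // y = A * x, skipping all zero entries of A (y must be sized A.rows).
    void spmv_csr(const CsrMatrix& A, const std::vector<float>& x, std::vector<float>& y) {
        for (int i = 0; i < A.rows; ++i) {
            float sum = 0.0f;
            for (int k = A.row_start[i]; k < A.row_start[i + 1]; ++k)
                sum += A.values[k] * x[A.col_index[k]];
            y[i] = sum;
        }
    }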
- Y Yang, JS Emer, D Sanchez, 2024, Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications, MIT, https://yang-yifan.github.io/papers/isca24_trapezoid.pdf
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song, 19 Sep 2023, Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, https://arxiv.org/abs/2309.10285 Code: https://github.com/AlibabaResearch/flash-llm (Unstructured pruning on tensor cores in GPUs with sparse MatMul optimizations.)
- Hongyaoxing Gu, 11 Mar 2024, A method for accelerating low precision operations by sparse matrix multiplication, https://arxiv.org/abs/2403.06924v1
- Haque, S.A.; Choudhury, N.; Hossain, S. Matrix Multiplication with Diagonals: Structured Sparse Matrices and Beyond. In Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, Jinan, China, 17–19 June 2023; pp. 69–76. https://doi.org/10.1145/3606043.3606053
- Sardar Anisul Haque,Mohammad Tanvir Parvez, Shahadat Hossain, Jan 2024, GPU Algorithms for Structured Sparse Matrix Multiplication with Diagonal Storage Schemes, https://www.mdpi.com/1999-4893/17/1/31
- D. Mukunoki, M. Kawai and T. Imamura, 2023, Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor, 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 2023, pp. 608-615, doi: 10.1109/MCSoC60832.2023.00094, https://ieeexplore.ieee.org/abstract/document/10387875
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Helin Cheng, Wenxuan Li, Yuechen Lu, and Weifeng Liu. 2023. HASpGEMM: Heterogeneity-Aware Sparse General Matrix-Matrix Multiplication on Modern Asymmetric Multicore Processors. In Proceedings of the 52nd International Conference on Parallel Processing (ICPP '23). Association for Computing Machinery, New York, NY, USA, 807–817. https://doi.org/10.1145/3605573.3605611 https://dl.acm.org/doi/abs/10.1145/3605573.3605611
- Chunxu Lin, Wensheng Luo, Yixiang Fang, Chenhao Ma, Xilin Liu, and Yuchi Ma. 2024. On Efficient Large Sparse Matrix Chain Multiplication. Proc. ACM Manag. Data 2, 3, Article 156 (June 2024), 27 pages. https://doi.org/10.1145/3654959 https://dl.acm.org/doi/abs/10.1145/3654959
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- NVIDIA, 2024, cuSparse, https://docs.nvidia.com/cuda/cusparse/index.html
- Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
- Zhang, H., Ma, W., Yuan, W. et al. Mixed-precision block incomplete sparse approximate preconditioner on Tensor core. CCF Trans. HPC 6, 54–67 (2024). https://doi.org/10.1007/s42514-023-00165-9 https://link.springer.com/article/10.1007/s42514-023-00165-9
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- Kaige Zhang, Xiaoyan Liu, Hailong Yang, Tianyu Feng, Xinyu Yang, Yi Liu, Zhongzhi Luan, and Depei Qian. 2024. Jigsaw: Accelerating SpMM with Vector Sparsity on Sparse Tensor Core. In Proceedings of the 53rd International Conference on Parallel Processing (ICPP '24). Association for Computing Machinery, New York, NY, USA, 1124–1134. https://doi.org/10.1145/3673038.3673108 https://dl.acm.org/doi/abs/10.1145/3673038.3673108
- Bobby Yan, Alexander J. Root, Trevor Gale, David Broman, Fredrik Kjolstad, 20 Jun 2024 (v2), Scorch: A Library for Sparse Deep Learning, https://arxiv.org/abs/2405.16883
- Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad, 21 Aug 2024, Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication, https://arxiv.org/abs/2408.11988
- Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
- Noah Amsel, Tyler Chen, Feyza Duman Keles, Diana Halikias, Cameron Musco, Christopher Musco, 26 Mar 2024 (v3), Fixed-sparsity matrix approximation from matrix-vector products, https://arxiv.org/abs/2402.09379
- Peiming Liu, Alexander J. Root, Anlun Xu, Yinying Li, Fredrik Kjolstad, Aart C. Bik, 2024, Compiler Support for Sparse Tensor Convolutions, https://rootjalex.github.io/publications/oopsla2024-spconv.pdf
- Pranav Dangi, Zhenyu Bai, Dhananjaya Wijerathne, Rohan Juneja, 2024, ZeD: A Generalized Accelerator for Variably Sparse Matrix Computations in ML, https://pranavdangi.github.io/papers/PACT24.pdf
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Anton Lokhmotov, 17 Nov 2015 (v2), GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication, https://arxiv.org/abs/1511.03742
- Xiaobo Lu, Jianbin Fang, Lin Peng, Chun Huang, Zidong Du, Yongwei Zhao, and Zheng Wang. 2024. Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product. ACM Trans. Archit. Code Optim. Just Accepted (August 2024). https://doi.org/10.1145/3688612 https://dl.acm.org/doi/abs/10.1145/3688612
- Patrik Okanovic, Grzegorz Kwasniewski, Paolo Sylos Labini, Maciej Besta, Flavio Vella, Torsten Hoefler, 21 Aug 2024, High Performance Unstructured SpMM Computation Using Tensor Cores, https://arxiv.org/abs/2408.11551
- Takuma Yamaguchi and Federico Busato, Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores, Mar 19, 2021, https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/
- OpenAI, December 6, 2017, Block-sparse GPU kernels, https://openai.com/index/block-sparse-gpu-kernels/ https://cdn.openai.com/blocksparse/blocksparsepaper.pdf https://github.com/openai/blocksparse
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- R. L. Castro, D. Andrade and B. B. Fraguela, "STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning," in IEEE Access, vol. 12, pp. 70581-70599, 2024, doi: 10.1109/ACCESS.2024.3402326. https://ieeexplore.ieee.org/abstract/document/10534045 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10534045
- Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
Tiled MatMul/GEMM
Tiling (also called blocking) partitions the matrices into small sub-blocks that fit in cache or GPU shared memory, so that each loaded block is reused many times before being evicted (a minimal sketch appears after the references below). Research papers on tiled MatMul/GEMM optimizations:
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- Cong Guo; Fengchen Xue; Jingwen Leng; Yuxian Qiu, May 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- David Spuler, March 2024, Tiled Matrix-Vector Multiplication, in Generative AI in C++, https://www.aussieai.com/book/ch34-tiled-matrix-vector-multiplication
- NVIDIA, 2024, Matrix Multiplication Background User's Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html https://docs.nvidia.com/deeplearning/performance/pdf/Matrix-Multiplication-Background-User-Guide.pdf
- Y. Song, Y. Meng, B. Chen, S. Chen and Y. Kang, 2024, SALTM: Accelerating Large Transformers in Multi-device System with 2D Model Partitioning Method, Integrated Circuits and Systems, doi: 10.23919/ICS.2024.3458897, https://ieeexplore.ieee.org/abstract/document/10678935 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10678935
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- A. Jangda, S. Maleki, M. M. Dehnavi, M. Musuvathi and O. Saarikivi, "A Framework for Fine-Grained Synchronization of Dependent GPU Kernels," 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Edinburgh, United Kingdom, 2024, pp. 93-105, doi: 10.1109/CGO57630.2024.10444873. https://ieeexplore.ieee.org/abstract/document/10444873
- Zhongkai Yu, Shengwen Liang, Tianyun Ma, Yunke Cai, Ziyuan Nan, Di Huang, Xinkai Song, Yifan Hao, Jie Zhang, Tian Zhi, Yongwei Zhao, Zidong Du, Xing Hu, Qi Guo, Tianshi Chen, 24 Sep 2024, Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM, https://arxiv.org/abs/2409.15654
- Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu, 1 Nov 2024, V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM, https://arxiv.org/abs/2411.00915
- Nadav Rotem, Nov 2024, High-Performance Matrix Multiplication, https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Book Excerpt from "Generative AI in C++", https://www.aussieai.com/book/ch34-matmul-gemm
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
- Lai, J., and Seznec, A. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. International Symposium on Code Generation and Optimization (CGO '13), 1–10. https://inria.hal.science/file/index/docid/789958/filename/112_Lai.pdf
- Andrew Kerr, Duane Merrill, Julien Demouth and John Tran, Dec 05, 2017, CUTLASS: Fast Linear Algebra in CUDA C++, https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- Fengguang Song, Stanimire Tomov, Jack Dongarra, 2012, Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems, ICS’12, June 25–29, 2012, San Servolo Island, Venice, Italy. https://icl.utk.edu/files/publications/2012/icl-utk-495-2012.pdf
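Here is a minimal CPU-side sketch of the tiled loop structure, assuming row-major storage and an illustrative tile size; real kernels additionally vectorize the inner loop, and GPU versions stage the tiles in shared memory:

    #include <vector>
    #include <algorithm>

    constexpr int TILE = 64;  // illustrative block size; tune for the cache hierarchy

    // C[m x p] += A[m x n] * B[n x p]; C is assumed zero-initialized.
    void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int m, int n, int p) {
        for (int ii = 0; ii < m; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < p; jj += TILE)
                    // Multiply one TILE x TILE block while it stays in cache.
                    for (int i = ii; i < std::min(ii + TILE, m); ++i)
                        for (int k = kk; k < std::min(kk + TILE, n); ++k) {
                            float a = A[i * n + k];
                            for (int j = jj; j < std::min(jj + TILE, p); ++j)
                                C[i * p + j] += a * B[k * p + j];
                        }
    }

The point of the blocking is that each TILE-sized block of A and B is reused many times while it is still resident in cache, rather than being streamed from main memory on every pass.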
CUDA Matrix Multiplication
There are various papers with examples of coding MatMul/GEMM in CUDA C++ (a minimal cuBLAS host-call sketch follows the references below):
- Samuel S. Cho, 2011, CUDA Thread Basics, https://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf
- Mark Harris, Feb 18, 2013, An Efficient Matrix Transpose in CUDA C/C++, https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
- Cris Cecka, Feb 27, 2017, Pro Tip: cuBLAS Strided Batched Matrix Multiply, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
- Dominik Grewe and Anton Lokhmotov.2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, (Newport Beach, California,U SA)(GPGPU-4).Association for Computing Machinery, NewYork, NY, USA, https://doi.org/10.1145/1964179.1964196
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Lei Mao, March 19, 2023, CUDA Coalesced Memory Access, https://leimao.github.io/blog/CUDA-Coalesced-Memory-Access/
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- NVIDIA, 2024, cuSparse, https://docs.nvidia.com/cuda/cusparse/index.html
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
- Gray, S. 2014. A full walk through of the SGEMM implementation, https://github.com/NervanaSystems/maxas/wiki/SGEMM
- Lai, J., and Seznec, A. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. International Symposium on Code Generation and Optimization (CGO '13), 1–10. https://inria.hal.science/file/index/docid/789958/filename/112_Lai.pdf
- Andrew Kerr, Duane Merrill, Julien Demouth and John Tran, Dec 05, 2017, CUTLASS: Fast Linear Algebra in CUDA C++, https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Rajib Nath, Stanimire Tomov, and Jack Dongarra, November 18, 2010, An Improved Magma Gemm For Fermi Graphics Processing Units, The International Journal of High Performance Computing Applications, Volume 24, Issue 4, https://doi.org/10.1177/1094342010385729 https://journals.sagepub.com/doi/10.1177/1094342010385729
- Fengguang Song, Stanimire Tomov, Jack Dongarra, 2012, Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems, ICS’12, June 25–29, 2012, San Servolo Island, Venice, Italy. https://icl.utk.edu/files/publications/2012/icl-utk-495-2012.pdf
- Wang Zhiyong, 2022, NVIDIA SGEMM Practice: Step-by-step optimization of CUDA SGEMM, https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE
- Siboehm, 2023, Fast CUDA SGEMM from Scratch, https://github.com/siboehm/SGEMM_CUDA
- Edward Kandrot, 2023, cuda_matmul: Optimized CUDA matmul with benchmarks, https://github.com/ekandrot/cuda_matmul
- Harshit Kumar, June 7, 2024, Matrix Multiplication in CUDA, https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda
- NVIDIA, 2024, Matrix Multiplication Background User's Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
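For comparison with the hand-written kernels in these articles, here is a minimal host-side C++ sketch that hands the multiplication to cuBLAS (the usual performance baseline); it assumes the CUDA toolkit is available, omits error checking, and relies on cuBLAS's column-major storage convention:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // C[M x N] = A[M x K] * B[K x N], all matrices in column-major order.
    void gemm_cublas(int M, int N, int K, const float* hA, const float* hB, float* hC) {
        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, sizeof(float) * M * K);
        cudaMalloc((void**)&dB, sizeof(float) * K * N);
        cudaMalloc((void**)&dC, sizeof(float) * M * N);
        cudaMemcpy(dA, hA, sizeof(float) * M * K, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, sizeof(float) * K * N, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    M, N, K, &alpha, dA, M, dB, K, &beta, dC, M);

        cudaMemcpy(hC, dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }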
Lookup Table (LUT) for MatMul/GEMM
The use of table lookups can sometimes reduce, or even eliminate, the multiplication operations. This method has been used in various MatMul/GEMM kernels, usually when the values are quantized to a much smaller range than full floating-point.
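One common form of the idea, sketched below under illustrative assumptions (4-bit weights and a 16-entry dequantization table), precomputes each activation's product with all 16 possible weight levels once, so that every output row then needs only table lookups and additions rather than per-element multiplications:

    #include <vector>
    #include <cstdint>

    // y = W * x, where W holds 4-bit quantized weights (one per byte, values 0..15)
    // and levels[] holds the 16 dequantized weight values. y must be sized m.
    void lut_matvec(const std::vector<float>& x,         // n activations
                    const std::vector<uint8_t>& W,       // m x n quantized weights, row-major
                    const std::vector<float>& levels,    // 16 dequantized levels
                    std::vector<float>& y, int m, int n) {
        // Precompute table[i][v] = x[i] * levels[v]: only 16 multiplies per activation.
        std::vector<float> table(n * 16);
        for (int i = 0; i < n; ++i)
            for (int v = 0; v < 16; ++v)
                table[i * 16 + v] = x[i] * levels[v];

        // Each of the m output rows now uses only lookups and additions.
        for (int r = 0; r < m; ++r) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i)
                sum += table[i * 16 + W[r * n + i]];
            y[r] = sum;
        }
    }

The 16*n multiplications in the setup phase are amortized across all m output rows, so the larger m is, the closer the kernel gets to being multiplication-free.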
Research on LUT MatMul/GEMM algorithms:
- SZ Lin, YC Chen, YH Chang, TW Kuo, HP Li, 2024, LUTIN: Efficient Neural Network Inference with Table Lookup, ISLPED ’24, August 5-7, 2024, Newport Beach, CA, USA, https://dl.acm.org/doi/pdf/10.1145/3665314.3670804
- Davis Blalock, John Guttag, 21 Jun 2021, Multiplying Matrices Without Multiplying, https://arxiv.org/abs/2106.10860
- Q. Deng, Y. Zhang, M. Zhang and J. Yang, "LAcc: Exploiting Lookup Table-based Fast and Accurate Vector Multiplication in DRAM-based CNN Accelerator," 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2019, pp. 1-6. https://ieeexplore.ieee.org/document/8806810 PDF: https://dl.acm.org/doi/pdf/10.1145/3316781.3317845
- Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee, 31 Aug 2020 (v2), BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs, https://arxiv.org/abs/2005.09904
- Jie Ran, Rui Lin, Jason Chun Lok Li, Jiajun Zhou, Ngai Wong, 13 Aug 2022, PECAN: A Product-Quantized Content Addressable Memory Network, https://arxiv.org/abs/2208.13571
Matrix-Vector Multiplication
Matrix-vector multiplication is also called Vector Matrix Multiplication (VMM) or General Matrix-Vector multiplication (GEMV).
- Wenjie Li; Aokun Hu; Ningyi Xu; Guanghui He, Jan 2024, Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models IEEE Transactions on Circuits and Systems I: Regular Papers (Early Access), https://ieeexplore.ieee.org/abstract/document/10400181/ (Quantization software algorithms done in a hardware-aware co-designed method to optimize hardware matrix-vector multiplication.)
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 17 Jul 2024, Improved Algorithms for Kernel Matrix-Vector Multiplication, LCFM 2024, https://openreview.net/forum?id=7CCzyEtZXH https://openreview.net/pdf?id=7CCzyEtZXH
- Han Xu, Yutong Li, Shihao Ji, 12 Sep 2024, LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs, https://arxiv.org/abs/2409.11424 (Matrix multiplications are 97% of computations, which are optimized with a pipelined matrix-vector operation.)
- D. Mukunoki, M. Kawai and T. Imamura, 2023, Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor, 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 2023, pp. 608-615, doi: 10.1109/MCSoC60832.2023.00094, https://ieeexplore.ieee.org/abstract/document/10387875
- Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
- Noah Amsel, Tyler Chen, Feyza Duman Keles, Diana Halikias, Cameron Musco, Christopher Musco, 26 Mar 2024 (v3), Fixed-sparsity matrix approximation from matrix-vector products, https://arxiv.org/abs/2402.09379
- David Spuler, March 2024, Tiled Matrix-Vector Multiplication, in Generative AI in C++, https://www.aussieai.com/book/ch34-tiled-matrix-vector-multiplication
- Md Tawsif Rahman Chowdhury, Huynh Quang Nguyen Vo, Paritosh Ramanan, Murat Yildirim, Gozde Tutuncuoglu, 10 Sep 2024, The Lynchpin of In-Memory Computing: A Benchmarking Framework for Vector-Matrix Multiplication in RRAMs, https://arxiv.org/abs/2409.06140
- Mohamed Assem Ibrahim, Mahzabeen Islam, Shaizeen Aga, 1 Apr 2024 (v2), Balanced Data Placement for GEMV Acceleration with Processing-In-Memory, https://arxiv.org/abs/2403.20297
- Sandeep Kaur Kingra, Vivek Parmar, Shubham Negi, Sufyan Khan, Boris Hudec, Tuo-Hung Hou, Manan Suri, 10 Jun 2020, Methodology for Realizing VMM with Binary RRAM Arrays: Experimental Demonstration of Binarized-ADALINE Using OxRAM Crossbar, https://arxiv.org/abs/2006.05657
- Florian Freye, Jie Lou, Christian Lanius, Tobias Gemmeke, 21 May 2024 (v2), Merits of Time-Domain Computing for VMM -- A Quantitative Comparison, https://arxiv.org/abs/2403.18367
- Alexandre Siron, Sean Molesky, 25 Jun 2024, A Split Fast Fourier Transform Algorithm for Block Toeplitz Matrix-Vector Multiplication, https://arxiv.org/abs/2406.17981
- Lingnan Xia, Hua Ma, Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10721583
- Hyeji Kim, Yeongmin Lee, Chun-Gi Lyuh, 28 October 2024, PF-GEMV: Utilization maximizing architecture in fast matrix–vector multiplication for GPT-2 inference, https://doi.org/10.4218/etrij.2024-0111 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2024-0111 https://onlinelibrary.wiley.com/doi/epdf/10.4218/etrij.2024-0111
- Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, Vivek Sarkar, 1 Aug 2016, A survey of sparse matrix-vector multiplication performance on large matrices. https://arxiv.org/abs/1608.00636
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Book Excerpt from "Generative AI in C++", https://www.aussieai.com/book/ch34-matmul-gemm
- Donghyeon Yi, Seoyoung Lee, Jongho Kim, Junyoung Kim, Sohmyung Ha, Ik Joon Chang, Minkyu Je, 22 Nov 2024, FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration, https://arxiv.org/abs/2411.14733
- MA Ibrahim, M Islam, S Aga, Nov 2024, PIMnast: Balanced Data Placement for GEMV Acceleration with Processing-In-Memory, https://conferences.computer.org/sc-wpub/pdfs/SC-W2024-6oZmigAQfgJ1GhPL0yE3pS/555400a970/555400a970.pdf
Transpose Optimizations in Matrix Multiplication
One of the basic methods to optimize matrix multiplication is to operate on a matrix's transpose, because this leads to contiguous memory accesses, which are faster (i.e., it's a memory access speed optimization, not a reduction in computations). The underlying issue is whether matrices are stored in row-major or column-major order, since the two operands of a matrix-matrix multiplication want opposite storage orders. It can even be beneficial in some cases to compute the transpose on-the-fly (which has quadratic complexity), storing it as a temporary matrix, just to speed up the full matrix-matrix operation (which has cubic complexity), as in the sketch after the list below. Hence, there are various optimizations, including:
- Fast computation of the transpose.
- Caching a pre-computed transpose.
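A minimal sketch of the on-the-fly option: transpose B once (quadratic cost) so that the inner loop of the cubic-cost multiplication reads both operands sequentially; row-major storage is assumed:

    #include <vector>

    // C[m x p] = A[m x n] * B[n x p], via a temporary transpose BT[p x n].
    void matmul_transposed_b(const std::vector<float>& A, const std::vector<float>& B,
                             std::vector<float>& C, int m, int n, int p) {
        std::vector<float> BT(p * n);
        for (int k = 0; k < n; ++k)            // O(n*p) transpose of B
            for (int j = 0; j < p; ++j)
                BT[j * n + k] = B[k * p + j];

        for (int i = 0; i < m; ++i)
            for (int j = 0; j < p; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)    // both operands now accessed sequentially
                    sum += A[i * n + k] * BT[j * n + k];
                C[i * p + j] = sum;
            }
    }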
Research papers on handling a transpose, or creating one, for faster MatMul include:
- Harald Prokop, May 21, 1999, Cache-Oblivious Algorithms, Master of Science Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, http://supertech.csail.mit.edu/papers/Prokop99.pdf
- Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou, 21 Feb 2022, I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels https://arxiv.org/abs/2202.10217
- Paul Springer, Paolo Bientinesi, 7 Nov 2017 (v3), Design of a high-performance GEMM-like Tensor-Tensor Multiplication, https://arxiv.org/abs/1607.00145
- David Spuler, March 2024, Chapter 29. Caching Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- David Spuler, March 2024, Cached or Precomputed Transpose, in Generative AI in C++, https://www.aussieai.com/book/ch29-cached-precomputed-transpose
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Mark Harris, Feb 18, 2013, An Efficient Matrix Transpose in CUDA C/C++, https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
- Lei Mao, March 19, 2023, CUDA Coalesced Memory Access, https://leimao.github.io/blog/CUDA-Coalesced-Memory-Access/
- Viviana Arrigoni, Annalisa Massini, 6 Feb 2019, Fast Strassen-based A^t A Parallel Multiplication, https://arxiv.org/abs/1902.02104
- Viviana Arrigoni, Filippo Maggioli, Annalisa Massini, Emanuele Rodolà, 25 Oct 2021, Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose, https://arxiv.org/abs/2110.13042
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Book Excerpt from "Generative AI in C++", https://www.aussieai.com/book/ch34-matmul-gemm
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
Matrix Multiplication Optimization in AI Models
Most AI inference algorithms are based on matrix multiplications, or similarly on vector dot products. The primary optimization is to parallelize matrix or vector operations using hardware acceleration in GPUs or other chips. However, several improvements to matrix and vector algorithms have also been found. Research papers specific to matrix multiplication include:
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Nov 2022, https://arxiv.org/abs/2208.07339
- Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, Tianwei Zhang, Iron: Private Inference on Transformers, Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022), https://proceedings.neurips.cc/paper_files/paper/2022/hash/64e2449d74f84e5b1a5c96ba7b3d308e-Abstract-Conference.html
- François Charton, Linear algebra with transformers, Nov 2022, https://arxiv.org/abs/2112.01898
- Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis & Pushmeet Kohli, Discovering faster matrix multiplication algorithms with reinforcement learning, Nature, volume 610, pp.47–53 (2022), https://www.nature.com/articles/s41586-022-05172-4
- Huang, J., Yu, C. D. & Geijn, R. A. V. D. Strassen’s algorithm reloaded on GPUs. ACM Trans. Math. Softw. 46, 1–22 (2020). https://dl.acm.org/doi/10.1145/3372419
- Tschannen, M., Khanna, A. & Anandkumar, A, StrassenNets: deep learning with a multiplication budget. In International Conference on Machine Learning 4985–4994 (PMLR, 2018). https://arxiv.org/abs/1712.03942
- Jason Cong and Bingjun Xiao. Minimizing computation in convolutional neural networks. In Artificial Neural Networks and Machine Learning–ICANN 2014, pages 281–290. Springer, 2014. PDF: https://vast.cs.ucla.edu/sites/default/files/publications/CNN_ICANN14.pdf
- Ruofan Wu, Feng Zhang, Jiawei Guan, Zhen Zheng, Xiaoyong Du, and Xipeng Shen. DREW: Efficient Winograd CNN Inference with Deep Reuse. In Proceedings of the ACM Web Conference 2022, pages 1807–1816, 2022. https://dl.acm.org/doi/10.1145/3485447.3511985 PDF: https://research.csc.ncsu.edu/picture/publications/papers/www2022.pdf (Combines the Winograd matrix multiplication algorithm with computation reuse.)
- F Zhang, R Wu, J Guan, Z Zheng, X Guo, 2023, Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device, IEEE Transactions on Knowledge and Data Engineering, Volume 35, Issue 10, 01 October 2023, https://ieeexplore.ieee.org/abstract/document/10106424/ (Further analysis of DREW, which combines Winograd optimizations with data reuse.)
- NM Cicek, X Shen, O Ozturk, 2022, Energy Efficient Boosting of GEMM Accelerators for DNN via Reuse, ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5, Article No. 43, pp 1–26, https://doi.org/10.1145/3503469, https://dl.acm.org/doi/10.1145/3503469, PDF: https://dl.acm.org/doi/pdf/10.1145/3503469 (Uses computation reuse to speed up matrix multiplications.)
- R Snytsar, Oct 2023, Accelerating Machine Learning Primitives on Commodity Hardware, arXiv preprint arXiv:2310.05218, https://arxiv.org/pdf/2310.05218.pdf (Uses the "sliding window" technique to optimize general matrix multiplication on edge devices.)
- Andrew Anderson, Aravind Vasudevan, Cormac Keane, David Gregg, Sep 2017, Low-memory GEMM-based convolution algorithms for deep neural networks, arXiv preprint arXiv:1709.03395, https://arxiv.org/abs/1709.03395
- Aravind Vasudevan, Andrew Anderson, David Gregg, 2017, Parallel Multi Channel Convolution using General Matrix Multiplication, 2017 IEEE 28th international conference on application-specific systems, architectures and processors (ASAP). pp. 19–24. https://arxiv.org/abs/1704.04428
- H Li, J Choi, Y Kwon, JH Ahn, Oct 2023, A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models, IEEE Computer Architecture Letters, https://ieeexplore.ieee.org/abstract/document/10285300/ (Tiled version of matrix multiplication with SVD factorization.)
- Dennis Sebastian Rieber, 2023, Deployment of Deep Neural Networks on Dedicated Hardware Accelerators, Ph.D. thesis, Doctor of Natural Sciences, Ruprecht–Karls University Heidelberg, PDF: https://archiv.ub.uni-heidelberg.de/volltextserver/32994/1/dissertationPDFA.pdf (Fusion and fission optimizations with example algorithms on p.40 and p.45.)
- Dukhan, M. 2019. The indirect convolution algorithm, https://arxiv.org/abs/1907.02129 (Optimizing an alternative to GEMM for convolutions.)
- Zlateski, A., Jia, Z., Li, K., Durand, F., The anatomy of efficient fft and winograd convolutions on modern cpus. Proceedings of the ACM International Conference on Supercomputing (New York, NY, USA, 2019), ICS ’19, Association for Computing Machinery, p. 414–424. PDF: https://dl.acm.org/doi/pdf/10.1145/3330345.3330382
- William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko, 2023, High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, February 2023, Pages 119–134, https://dl.acm.org/doi/abs/10.1145/3572848.3577475, PDF: https://dl.acm.org/doi/pdf/10.1145/3572848.3577475
- chengzeyi, Oct 2023, Stable Fast, https://github.com/chengzeyi/stable-fast (Highly optimized inference engine with fused GEMM)
- Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S., Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’13, pp. 519–530, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.2462176. http://doi.acm.org/10.1145/2491956.2462176, PDF: https://people.csail.mit.edu/jrk/halide-pldi13.pdf
- Caleb Tung, Nicholas Eliopoulos, Purvish Jajal, Gowri Ramshankar, Chen-Yun Yang, Nicholas Synovic, Xuecen Zhang, Vipin Chaudhary, George K. Thiruvathukal, Yung-Hsiang Lu, Oct 2023, An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions, https://arxiv.org/pdf/2310.07782.pdf, Code: https://github.com/PurdueCAM2Project/focused-convolutions
- H Liu, F Deng, 2023, Convolutional Acceleration Algorithm Combining Loop Optimization and Automatic Scheduling, 2023 International Conference for Advancement in Technology (ICONAT), https://ieeexplore.ieee.org/abstract/document/10080410/
- Liangzhen Lai, Naveen Suda, Vikas Chandra, 2018, CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs, arXiv preprint arXiv:1801.06601, https://arxiv.org/abs/1801.06601 PDF: https://arxiv.org/pdf/1801.06601
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Many pseudo-code examples of kernel operator fusion, e.g. shows pseudo-code of fusing RELU into MatMul, and a fused MatMul-addbias-RELU.)
- Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian, 4 Jun 2024, Scalable MatMul-free Language Modeling, https://arxiv.org/abs/2406.02528 Code: https://github.com/ridgerchu/matmulfreellm (Uses addition via ternary quantization and elementwise Hadamard products to replace MatMul.)
- Y Yang, JS Emer, D Sanchez, 2024, Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications, MIT, https://yang-yifan.github.io/papers/isca24_trapezoid.pdf
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 7 May 2024, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv preprint arXiv:2405.04532, https://arxiv.org/abs/2405.04532 Project: https://hanlab.mit.edu/projects/qserve Code: https://github.com/mit-han-lab/qserve (Efficient quantized inference on GPUs using 4-bit weights, 8-bit activations, and 4-bit KV cache, mostly via a GEMM speedup.)
- Soroush Ghodrati, Sean Kinzer, Hanyang Xu, Rohan Mahapatra, Yoonsung Kim, Byung Hoon Ahn, Dong Kai Wang, Lavanya Karthikeyan, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Hadi Esmaeilzadeh, April 2024, Tandem processor: Grappling with emerging operators in neural networks, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 1165–1182, https://doi.org/10.1145/3620665.3640365 https://dl.acm.org/doi/abs/10.1145/3620665.3640365 Code: https://actlab-genesys.github.io (Reviews hardware acceleration of all sub-layer kernel operators, with a focus beyond just GEMM/MatMul operators.)
- Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren, 21 Apr 2024, SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile, https://arxiv.org/abs/2404.13528 (Choosing optimal tensor memory layouts to optimize low-level operator kernels.)
- Tomasz Kolinko, 2024, Effort: A possibly new algorithm for LLM Inference, https://kolinko.github.io/effort/equations.html https://kolinko.github.io/effort/bucketmul.html (A method of doing sparse MatMul computations to get faster inference.)
- Justine Tunney, March 31st 2024, LLaMA Now Goes Faster on CPUs, https://justine.lol/matmul/ Code: https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/sgemm.cpp Code: https://github.com/mozilla-Ocho/llamafile (Improved on-device benchmarks for PC CPU platforms, with Intel, AMD or M2 chips, for Mistral 7B models with 8-bit quantization by optimizing MatMul in the llama.cpp inference engine.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Dec 2023, Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Optimized LLM inference using kernel fusion of GEMM with element-wise operations for better data movement, and also advanced management of the KV cache.)
- Harald Prokop, May 21, 1999, Cache-Oblivious Algorithms, Master of Science Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, http://supertech.csail.mit.edu/papers/Prokop99.pdf
- Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen, May 2023, Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs, https://arxiv.org/abs/2305.01024 (Focuses on error tolerance of failures within matrix multiplication algorithms.)
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Vipul Vaibhaw, Jan 13, 2021, Matrix Multiplication: Optimizing the code from 6 hours to 1 sec, https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa (Practical analysis of coding optimizations of matrix multiplication in Python and C++ for various speedups.)
- Wenjie Li; Aokun Hu; Ningyi Xu; Guanghui He, Jan 2024, Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models IEEE Transactions on Circuits and Systems I: Regular Papers (Early Access), https://ieeexplore.ieee.org/abstract/document/10400181/ (Quantization software algorithms done in a hardware-aware co-designed method to optimize hardware matrix-vector multiplication.)
- Yuzong Chen, Jordan Dotzel, Mohamed S. Abdelfattah, 5 Nov 2023, M4BRAM: Mixed-Precision Matrix-Matrix Multiplication in FPGA Block RAMs, https://arxiv.org/abs/2311.02758 (Extends FPGA hardware for matrix multiplication to mixed-precision data, such as from mixed-precision quantization.)
- Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen, Aug 2020, Sparse GPU Kernels for Deep Learning, https://arxiv.org/abs/2006.10901
- Ziheng Wang, Aug 2020, SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference, https://arxiv.org/abs/2008.11849
- Robert A. van de Geijn, Enrique S. Quintana-Ort´ı, 2007, The Science of Programming Matrix Computations, https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf Code: http://www.cs.utexas.edu/users/flame/
- X Xie, H Peng, A Hasan, S Huang, J Zhao, 2023, Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks https://arxiv.org/abs/2308.11825 (Kernel for sparse matrix multiplication with block-level tiling as example.)
- Darshan C. Ganji, Saad Ashfaq, Ehsan Saboori, Sudhakar Sah, Saptarshi Mitra, MohammadHossein AskariHemmat, Alexander Hoffman, Ahmed Hassanien, Mathieu Léonardon, 18 Apr 2023, DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables, https://arxiv.org/abs/2304.09049
- Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo, 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, https://www.computer.org/csdl/journal/tc/5555/01/10436533/1UwVolp0wta
- Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh, 12 Mar 2024, IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers, https://arxiv.org/abs/2403.07339v1 (Faster GEMM with low-bitwidth integer approximations.)
- Babak Hejazi, Introducing Grouped GEMM APIs in cuBLAS and More Performance Updates, Jun 12, 2024, NVIDIA Technical Blog, https://developer.nvidia.com/blog/introducing-grouped-gemm-apis-in-cublas-and-more-performance-updates/
- Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou, 21 Feb 2022, I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels https://arxiv.org/abs/2202.10217
- Paul Springer, Paolo Bientinesi, 7 Nov 2017 (v3), Design of a high-performance GEMM-like Tensor-Tensor Multiplication, https://arxiv.org/abs/1607.00145
- Intel, Fast DistilBERT on CPUs, 2022, https://arxiv.org/pdf/2211.07715.pdf
- Kazushige Goto, Robert A. van de Geijn, 2008, Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software, Volume 34, Issue 3, Article No.: 12, pp 1–25, https://dl.acm.org/doi/10.1145/1356052.1356053 (The GotoBLAS algorithm for matrix multiplication.)
- DY Fu, S Arora, J Grogan, I Johnson, S Eyuboglu, Oct 2023, Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture https://arxiv.org/pdf/2310.12109.pdf
- DN Parikh, RA van de Geijn, GM Henry, 2023, Cascading GEMM: High Precision from Low Precision, arXiv preprint arXiv:2303.04353, https://arxiv.org/abs/2303.04353
- Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Stanford University, 2018, General Matrix Multiply (GeMM), https://spatial-lang.org/gemm
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
- Emily Cerf, June 20, 2024, Researchers run high-performing large language model on the energy needed to power a lightbulb, https://news.ucsc.edu/2024/06/matmul-free-llm.html
- Benj Edwards, 26 June, 2024, Researchers upend AI status quo by eliminating matrix multiplication in LLMs, https://arstechnica.com/information-technology/2024/06/researchers-upend-ai-status-quo-by-eliminating-matrix-multiplication-in-llms/
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960
- Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song, 19 Sep 2023, Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, https://arxiv.org/abs/2309.10285 Code: https://github.com/AlibabaResearch/flash-llm (Unstructured pruning on tensor cores in GPUs with sparse MatMul optimizations.)
- Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
- Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
- Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
- Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee, 21 Jun 2024 (v4), Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs, https://arxiv.org/abs/2402.10517 Code: https://github.com/SNU-ARC/any-precision-llm
- Keren Zhou, Karthik Ganapathi Subramanian, Po-Hsun Lin, Matthias Fey, Binqian Yin, Jiajia Li, 03 June 2024, FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, Pages 511-524, https://doi.org/10.1145/3650200.3656593 https://dl.acm.org/doi/abs/10.1145/3650200.3656593
- Jinendra Malekar, Mohammed E. Elbtity, Ramtin Zand, 21 Aug 2024, Matmul or No Matmal in the Era of 1-bit LLMs, https://arxiv.org/abs/2408.11939
- Kristina Wilson, Clifford Li, Hon Man Lau, Kwai Wong, Stanimire Tomov, 17 July 2024, Implementing Single-precision and Half-precision Tensor Operations, PEARC '24: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, Article No.: 109, Pages 1-4, https://doi.org/10.1145/3626203.3670601
- Abhinav Jangda, Mohit Yadav, 20 February 2024, Fast Kronecker Matrix-Matrix Multiplication on GPUs, PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Pages 390 - 403, https://doi.org/10.1145/3627535.3638489 https://dl.acm.org/doi/abs/10.1145/3627535.3638489
- John Yang, Le An, Su Inn Park, 11 Jun 2024, ReduceFormer: Attention with Tensor Reduction by Summation, https://arxiv.org/abs/2406.07488
- D. Mukunoki, M. Kawai and T. Imamura, 2023, Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor, 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 2023, pp. 608-615, doi: 10.1109/MCSoC60832.2023.00094, https://ieeexplore.ieee.org/abstract/document/10387875
- C. Wei, H. Jia, Y. Zhang, J. Yao, C. Li and W. Cao, Sep 2024, IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs, IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 9, pp. 1672-1689, Sept. 2024, doi: 10.1109/TPDS.2024.3432579, https://ieeexplore.ieee.org/abstract/document/10607886
- Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 17 Jul 2024, Improved Algorithms for Kernel Matrix-Vector Multiplication, LCFM 2024, https://openreview.net/forum?id=7CCzyEtZXH https://openreview.net/pdf?id=7CCzyEtZXH
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Helin Cheng, Wenxuan Li, Yuechen Lu, and Weifeng Liu. 2023. HASpGEMM: Heterogeneity-Aware Sparse General Matrix-Matrix Multiplication on Modern Asymmetric Multicore Processors. In Proceedings of the 52nd International Conference on Parallel Processing (ICPP '23). Association for Computing Machinery, New York, NY, USA, 807–817. https://doi.org/10.1145/3605573.3605611 https://dl.acm.org/doi/abs/10.1145/3605573.3605611
- Chunxu Lin, Wensheng Luo, Yixiang Fang, Chenhao Ma, Xilin Liu, and Yuchi Ma. 2024. On Efficient Large Sparse Matrix Chain Multiplication. Proc. ACM Manag. Data 2, 3, Article 156 (June 2024), 27 pages. https://doi.org/10.1145/3654959 https://dl.acm.org/doi/abs/10.1145/3654959
- NVIDIA, 2024, cuSparse, https://docs.nvidia.com/cuda/cusparse/index.html
- Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
- Cong Guo; Fengchen Xue; Jingwen Leng; Yuxian Qiu, May 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad, 21 Aug 2024, Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication, https://arxiv.org/abs/2408.11988
- Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair, 3 Sep 2024, Global Optimizations & Lightweight Dynamic Logic for Concurrency, https://arxiv.org/abs/2409.02227 (Parallelizing GEMM at a granular level.)
- Pranav Dangi, Zhenyu Bai, Dhananjaya Wijerathne, Rohan Juneja, 2024, ZeD: A Generalized Accelerator for Variably Sparse Matrix Computations in ML, https://pranavdangi.github.io/papers/PACT24.pdf
- Pete Warden, April 20, 2015, Why GEMM is at the heart of deep learning, https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Stefan Hadjis, Firas Abuzaid, Ce Zhang, Christopher Ré, 26 May 2015 (v2), Caffe con Troll: Shallow Ideas to Speed Up Deep Learning, https://arxiv.org/abs/1504.04343
- Roman Dubtsov, Evarist Fomenko and Babak Hejazi, Feb 01, 2023, New cuBLAS 12.0 Features and Matrix Multiplication Performance on NVIDIA Hopper GPUs, https://developer.nvidia.com/blog/new-cublas-12-0-features-and-matrix-multiplication-performance-on-nvidia-hopper-gpus/
- NVIDIA, 2024, Matrix Multiplication Background User's Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html https://docs.nvidia.com/deeplearning/performance/pdf/Matrix-Multiplication-Background-User-Guide.pdf
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Andy He and Evan Williams, 2023, Computational Complexity of Matrix Multiplication, CS 6810 Fall 2023, Cornell University, https://courses.cs.cornell.edu/cs6810/2023fa/Matrix.pdf
- Adnan Hoque, Less Wright, Chih-Chieh Yang, Mudhakar Srivatsa, Raghu Ganti, 22 Feb 2024 (v2), Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition, https://arxiv.org/abs/2402.00025
- Anton Lokhmotov, 17 Nov 2015 (v2), GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication, https://arxiv.org/abs/1511.03742
- Davis Blalock, John Guttag, 21 Jun 2021, Multiplying Matrices Without Multiplying, https://arxiv.org/abs/2106.10860
- Y. Song, Y. Meng, B. Chen, S. Chen and Y. Kang, 2024, SALTM: Accelerating Large Transformers in Multi-device System with 2D Model Partitioning Method, Integrated Circuits and Systems, doi: 10.23919/ICS.2024.3458897, https://ieeexplore.ieee.org/abstract/document/10678935 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10678935
- Cong Li and Yutao Xu. 2024. Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference. In Proceedings of the 17th ACM International Systems and Storage Conference (SYSTOR '24). Association for Computing Machinery, New York, NY, USA, 53–67. https://doi.org/10.1145/3688351.3689153 https://dl.acm.org/doi/10.1145/3688351.3689153 PDF: https://dl.acm.org/doi/pdf/10.1145/3688351.3689153
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Han Xu, Yutong Li, Shihao Ji, 12 Sep 2024, LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs, https://arxiv.org/abs/2409.11424 (Matrix multiplications are 97% of computations, which are optimized with a pipelined matrix-vector operation.)
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Chendi Li, Yufan Xu, Sina Mahdipour Saravani, and Ponnuswamy Sadayappan. 2024. Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS '24). Association for Computing Machinery, New York, NY, USA, 549–561. https://doi.org/10.1145/3650200.3656626 https://dl.acm.org/doi/abs/10.1145/3650200.3656626 PDF: https://dl.acm.org/doi/pdf/10.1145/3650200.3656626
- Bagus Hanindhito and Lizy K. John. 2024. Accelerating ML Workloads using GPU Tensor Cores: The Good, the Bad, and the Ugly. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE '24). Association for Computing Machinery, New York, NY, USA, 178–189. https://doi.org/10.1145/3629526.3653835 https://dl.acm.org/doi/abs/10.1145/3629526.3653835 PDF: https://lca.ece.utexas.edu/pubs/Hanindhito_AcceleratingMLWorkloads.pdf
- C. Wang, P. Song, H. Zhao, F. Zhang, J. Wang and L. Zhang, "High-Utilization GPGPU Design for Accelerating GEMM Workloads: An Incremental Approach," 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, Singapore, 2024, pp. 1-5, doi: 10.1109/ISCAS58744.2024.10558334. https://ieeexplore.ieee.org/abstract/document/10558334
- Cris Cecka, Feb 27, 2017, Pro Tip: cuBLAS Strided Batched Matrix Multiply, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
- Hansung Kim, Ruohan Yan, Joshua You, Tieliang Vamber Yang, Yakun Sophia Shao, 22 Aug 2024, Virgo: Cluster-level Matrix Unit Integration in GPUs for Scalability and Energy Efficiency, https://arxiv.org/abs/2408.12073 (Hardware optimization for GEMM operations.)
- Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura, 20 Sep 2024, Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit, https://arxiv.org/abs/2409.13313
- Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang, 26 Sep 2024 (v2), INT-FlashAttention: Enabling Flash Attention for INT8 Quantization, https://arxiv.org/abs/2409.16997 https://github.com/INT-FlashAttention2024/INT-FlashAttention
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Theo Gregersen, Pratyush Patel, Esha Choukse, 26 Sep 2024, Input-Dependent Power Usage in GPUs, https://arxiv.org/abs/2409.18324
- Enrique Galvez, Feb 2024, A study of Convolutions for Efficient Inference of Deep Neural Networks on Embedded Processors, Master's Thesis, France, University de Lyon, https://largo.lip6.fr/~cassagnea/docs/reports/Galvez2024%20-%20A%20Study%20of%20Convolutions%20for%20Efficient%20Inference%20of%20Deep%20Neural%20Networks%20on%20Embedded%20Processors.pdf
- Yuxiao Fan, 2024, Design and research of high-performance convolutional neural network accelerator based on Chipyard, ICAITA-2024, Journal of Physics: Conference Series 2858 (2024) 012001, IOP Publishing, doi:10.1088/1742-6596/2858/1/012001, https://iopscience.iop.org/article/10.1088/1742-6596/2858/1/012001/pdf (Convolution optimization on a RISC-V architecture.)
- Adrián Castelló, Héctor Martínez, Sandra Catalán, Francisco D. Igual, Enrique S. Quintana-Ortí, 2024, Experience-guided, Mixed-precision Matrix Multiplication with Apache TVM for ARM processors https://www.researchsquare.com/article/rs-5017241/v1 (Optimization of GEMM for sparse and low-bit quantized versions for TVM, with code examples.)
- Haiyue Ma, Jian Liu, Ronny Krashinsky, 10 Oct 2024, Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM, https://arxiv.org/abs/2410.07531
- Bettayeb, M., Halawani, Y., Khan, M.U. et al. Efficient memristor accelerator for transformer self-attention functionality. Sci Rep 14, 24173 (2024). https://doi.org/10.1038/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z.pdf
- Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
- Or Ordentlich, Yury Polyanskiy, 17 Oct 2024, Optimal Quantization for Matrix Multiplication, https://arxiv.org/abs/2410.13780
- Nitish Satya Murthy, Francky Catthoor, and Marian Verhelst. 2024. Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors. J. Syst. Archit. 154, C (Sep 2024). https://doi.org/10.1016/j.sysarc.2024.103236 https://dl.acm.org/doi/abs/10.1016/j.sysarc.2024.103236
- Lingnan Xia, Hua Ma, Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10721583
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Hyeji Kim, Yeongmin Lee, Chun-Gi Lyuh, 28 October 2024, PF-GEMV: Utilization maximizing architecture in fast matrix–vector multiplication for GPT-2 inference, https://doi.org/10.4218/etrij.2024-0111 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2024-0111 https://onlinelibrary.wiley.com/doi/epdf/10.4218/etrij.2024-0111
- X. Shen et al., "HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3487781. https://ieeexplore.ieee.org/abstract/document/10737419/
- Ulrich Drepper, October 23, 2007, Memory part 5: What programmers can do, https://lwn.net/Articles/255364/
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Book Excerpt from "Generative AI in C++", https://www.aussieai.com/book/ch34-matmul-gemm
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
- Gray, S. 2014. A full walk through of the SGEMM implementation, https://github.com/NervanaSystems/maxas/wiki/SGEMM
- Lai, J., and Seznec, A. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. International Symposium on Code Generation and Optimization (CGO '13), 1–10. https://inria.hal.science/file/index/docid/789958/filename/112_Lai.pdf
- Andrew Lavin, Scott Gray, 10 Nov 2015 (v2), Fast Algorithms for Convolutional Neural Networks, https://arxiv.org/abs/1509.09308
- Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst, 24 Nov 2024, Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format, https://arxiv.org/abs/2411.15982
- Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton, 9 Aug 2024, Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge, https://arxiv.org/abs/2408.05152
- Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
Fast Matrix Multiplication Computation Theory
The main techniques for faster multiplication of general matrices include:
- Strassen's algorithm
- Winograd's algorithm
- Fast Fourier Transform (FFT) methods
Matrix multiplications can also be sped up by using specialized matrices:
- Low-rank matrix factorization (see the sketch after this list)
- Sparse matrices
- Special matrix methods (e.g. Butterfly matrices, Monarch matrices, etc.)
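As a concrete illustration of the low-rank factorization approach, if an n x n weight matrix W is well approximated by the product of two thin factors U (n x r) and V (r x n), with rank r much smaller than n, then a matrix-vector product W*x can be computed as U*(V*x) in roughly 2nr multiply-adds instead of n^2. The minimal C++ sketch below shows the idea; the function names are illustrative only and not taken from any particular library.

    #include <cstddef>
    #include <vector>

    // Multiply a (rows x cols) row-major matrix M by a vector x of length cols.
    static std::vector<float> matvec(const std::vector<float>& M,
                                     std::size_t rows, std::size_t cols,
                                     const std::vector<float>& x) {
        std::vector<float> y(rows, 0.0f);
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                y[i] += M[i * cols + j] * x[j];
        return y;
    }

    // Low-rank product: W ~= U (n x r) * V (r x n), so W*x ~= U*(V*x).
    // Cost is roughly 2*n*r multiply-adds instead of n*n.
    std::vector<float> lowrank_matvec(const std::vector<float>& U,
                                      const std::vector<float>& V,
                                      std::size_t n, std::size_t r,
                                      const std::vector<float>& x) {
        std::vector<float> t = matvec(V, r, n, x);  // t = V*x, r values
        return matvec(U, n, r, t);                  // y = U*t, n values
    }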
General Matrix Multiplication Research
Papers on the low-level algorithms for optimizing matrix multiplication for general matrices (i.e. all matrices, not just a subtype):
- Bläser, M. Fast matrix multiplication. Theory Comput. 5, 1–60 (2013), https://www.semanticscholar.org/paper/Fast-Matrix-Multiplication-Bl%C3%A4ser/23443bee5576f05801fed6c1494517be2ad2e742
- Hopcroft, J. E. & Kerr, L. R., On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 30–36 (1971), https://epubs.siam.org/doi/10.1137/0120004
- Lim, L.-H., Tensors in computations. Acta Numer. 30, 555–764 (2021). https://arxiv.org/abs/2106.08090
- Pan, V. Y., Fast feasible and unfeasible matrix multiplication. Preprint, 2018, https://arxiv.org/abs/1804.04102
- Schönhage, A., Partial and total matrix multiplication. SIAM J. Comput. 10, 434–455 (1981), https://epubs.siam.org/doi/10.1137/0210032
- Le Gall, F. Powers of tensors and fast matrix multiplication. In International Symposium on Symbolic and Algebraic Computation 296–303 (ACM, 2014), https://arxiv.org/abs/1401.7714
- Strassen, V. The asymptotic spectrum of tensors and the exponent of matrix multiplication. In 27th Annual Symposium on Foundations of Computer Science 49–54 (IEEE, 1986), https://ieeexplore.ieee.org/document/4568194
- Coppersmith, D. & Winograd, S., Matrix multiplication via arithmetic progressions. In ACM Symposium on Theory of Computing 1–6 (ACM, 1987), https://dl.acm.org/doi/10.1145/28395.28396
- Alman, J. & Williams, V. V., A refined laser method and faster matrix multiplication. In ACM-SIAM Symposium on Discrete Algorithms 522–539 (SIAM, 2021), https://simons.berkeley.edu/news/refined-laser-method-faster-matrix-multiplication-breakthroughs
- Smirnov, A. V., The bilinear complexity and practical algorithms for matrix multiplication. Comput. Math. Math. Phys. 53, 1781–1795 (2013). PDF: https://cs.uwaterloo.ca/~eschost/Exam/Smirnov.pdf
- Heule, M. J., Kauers, M. & Seidl, M. New ways to multiply 3 × 3-matrices. J. Symb. Comput. 104, 899–916 (2021), https://arxiv.org/abs/1905.10192
- Sedoglavic, A., A non-commutative algorithm for multiplying (7 × 7) matrices using 250 multiplications. 2017, https://arxiv.org/abs/1712.07935
- Ye, K. & Lim, L.-H., Fast structured matrix computations: tensor rank and Cohn–Umans method. Found. Comput. Math. 18, 45–95 (2018). https://arxiv.org/abs/1601.00292
- Benson, A. R. & Ballard, G., A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Not. 50, 42–53 (2015). https://arxiv.org/abs/1409.2908
- Huang, J., Smith, T. M., Henry, G. M. & Van De Geijn, R. A., Strassen’s algorithm reloaded. In International Conference for High Performance Computing, Networking, Storage and Analysis 690–701 (IEEE, 2016). https://ieeexplore.ieee.org/document/7877137
- Alexandre Sedoglavic, Yet another catalogue of fast matrix multiplication algorithms, Université de Lille, 2021, https://fmm.univ-lille.fr/, Bibliography: https://fmm.univ-lille.fr/biblio.html (An extensive bibliography with 280 papers on matrix multiplication!)
- Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021, doi:10.1109/CVPR.2016.435 https://arxiv.org/abs/1509.09308 (Uses Winograd's algorithm.)
- A Freudenberg, 2023, Novel techniques for accelerating statistical operations on compressed genomic data, Ph.D. Thesis, University of Mannheim, https://madoc.bib.uni-mannheim.de/65226/1/Dissertation_Freudenberg.pdf (Faster GPU-accelerated matrix multiplication algorithms in the context of genetics.)
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf
- Wang, H., Ma, C., 2021, An optimization of im2col, an important method of CNNs, based on continuous address access. 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE). pp. 314–320. IEEE (2021) https://ieeexplore.ieee.org/document/9342343
- Jörg Arndt, 2010, Matters Computational: Ideas, Algorithms, Source Code, https://dl.acm.org/doi/10.5555/1941953, https://www.jjj.de/fxt/fxtpage.html#fxtbook, Code: https://www.jjj.de/bitwizardry/bitwizardrypage.html
- Allen, F. E., and Cocke, J. 1972. A catalogue of optimizing transformations. In Design and Optimization of Compilers, Prentice-Hall, Englewood Cliffs, N.J., pp. 1–30. PDF: https://www.clear.rice.edu/comp512/Lectures/Papers/1971-allen-catalog.pdf (Matrix multiplication is used for examples of loop optimizations, if you like reading old-style code from the 1970s.)
- Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, and P Sadayappan. 2015. On optimizing machine learning workloads via kernel fusion. ACM SIGPLAN Notices 50, 8 (2015), 173–182. https://dl.acm.org/doi/10.1145/2688500.2688521, PDF: https://mboehm7.github.io/resources/ppopp2015.pdf (Fused MatMul.)
- Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4, Article 83 (jul 2016), 11 pages. https://doi.org/10.1145/2897824.2925952, https://research.google/pubs/pub45525/, PDF: https://dl.acm.org/doi/pdf/10.1145/2897824.2925952 (GEMM optimization discussed.)
- Ashari, A., Tatikonda, S., Boehm, M., Reinwald, B., Campbell, K., Keenleyside, J., and Sadayappan, P., 2015, On optimizing machine learning workloads via kernel fusion. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pp. 173–182, New York, NY, USA. Association for Computing Machinery. ISBN 9781450332057. doi: 10.1145/2688500.2688521. https://dl.acm.org/doi/10.1145/2688500.2688521, PDF: https://mboehm7.github.io/resources/ppopp2015.pdf (Analysis of kernel fusion including for sparse and dense matrix multiplication.)
- Shiyao Xu; Jingfei Jiang; Jinwei Xu; Chaorun Liu; Yuanhong He; Xiaohang Liu, 2022, Sparkle: A High Efficient Sparse Matrix Multiplication Accelerator for Deep Learning, 2022 IEEE 40th International Conference on Computer Design (ICCD) https://ieeexplore.ieee.org/document/9978530
- Darshan C. Ganji, Saad Ashfaq, Ehsan Saboori, Sudhakar Sah, Saptarshi Mitra, MohammadHossein AskariHemmat, Alexander Hoffman, Ahmed Hassanien, Mathieu Léonardon, 18 Apr 2023, DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables, https://arxiv.org/abs/2304.09049
- T. Ogita, S. Rump, S. Oishi, 1 June 2005, Accurate Sum and Dot Product, SIAM J. Sci. Comput., DOI: 10.1137/030601818, Corpus ID: 10641044, PDF: http://www.oishi.info.waseda.ac.jp/~oishi/papers/OgRuOi05.pdf
- Gunnels JA, Henry GM, van de Geijn RA (2001) A family of high-performance matrix multiplication algorithms. Computational Science—ICCS 2001. Springer Berlin Heidelberg, pp 51–60. https://doi.org/10.1007/3-540-45545-0_15
- Sedoglavic, A. 2017, A non-commutative algorithm for multiplying 5x5 matrices using 99 multiplications. Preprint at https://arxiv.org/abs/1707.06860 (2017). https://arxiv.org/abs/1707.06860
- Burichenko, V. P., 2014, On symmetries of the Strassen algorithm. Preprint at https://arxiv.org/abs/1408.6273 (2014).
- Grochow, J. A. & Moore, C. 2017, Designing Strassen’s algorithm. Preprint at https://arxiv.org/abs/1708.09398 (2017). https://arxiv.org/abs/1708.09398
- Grochow, J. A. & Moore, C., 2016, Matrix multiplication algorithms from group orbits. Preprint at https://arxiv.org/abs/1612.01527 (2016).
- Elser, V., 2016, A network that learns Strassen multiplication. J. Mach. Learn. Res. 17, 3964–3976 (2016). https://arxiv.org/abs/1601.07227
- Tschannen, M., Khanna, A. & Anandkumar, A, 2018, StrassenNets: deep learning with a multiplication budget. In International Conference on Machine Learning 4985–4994 (PMLR, 2018). https://arxiv.org/abs/1712.03942
- Wagner, A. Z., 2021, Constructions in combinatorics via neural networks. Preprint at https://arxiv.org/abs/2104.14516 (2021). https://arxiv.org/abs/2104.14516
- Laderman, 1976, http://dx.doi.org/10.1090/S0002-9904-1976-13988-2
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Robert A. van de Geijn, Enrique S. Quintana-Ortí, December 2007, The Science of Programming Matrix Computations, First Edition, https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf
- Junki Park, Hyunsung Yoon, Daehyun Ahn, Jungwook Choi, and Jae-Joon Kim. 2020. OPTIMUS: OPTImized Matrix Multiplication Structure for Transformer neural network accelerator. Proceedings of Machine Learning and Systems 2 (2020), 363-378. https://proceedings.mlsys.org/paper_files/paper/2020/file/91ba7292e5388b90b58d0b839a7f19ec-Paper.pdf
- Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud, March 2021, Accelerating deep neural networks implementation: A survey, https://doi.org/10.1049/cdt2.12016 PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016
Strassen Matrix Multiplication
The Strassen method is based on a way to multiply 2x2 matrices using seven arithmetic multiplications rather than the usual eight. Applying this scheme recursively to 2x2 blocks of larger matrices yields a faster algorithm for general matrices, reducing the asymptotic complexity of matrix-matrix multiplication from O(n^3) to roughly O(n^2.81): the exponent drops from log base 2 of 8 (i.e., 3) to log base 2 of 7 (about 2.807).
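To make the 2x2 case concrete, the following minimal C++ sketch shows the seven Strassen products and the four output combinations. The entries are scalars here for simplicity; in a practical kernel they would be submatrix blocks, with the recursion falling back to an ordinary tiled GEMM below some cutoff size. This is an illustrative sketch of the textbook formulas, not production code.

    // One level of Strassen's 2x2 scheme: 7 multiplications instead of 8.
    // Entries are scalars here; in practice they are submatrix blocks.
    struct Mat2 { double a11, a12, a21, a22; };

    Mat2 strassen2x2(const Mat2& A, const Mat2& B) {
        double m1 = (A.a11 + A.a22) * (B.a11 + B.a22);
        double m2 = (A.a21 + A.a22) * B.a11;
        double m3 = A.a11 * (B.a12 - B.a22);
        double m4 = A.a22 * (B.a21 - B.a11);
        double m5 = (A.a11 + A.a12) * B.a22;
        double m6 = (A.a21 - A.a11) * (B.a11 + B.a12);
        double m7 = (A.a12 - A.a22) * (B.a21 + B.a22);
        Mat2 C;
        C.a11 = m1 + m4 - m5 + m7;  // = a11*b11 + a12*b21
        C.a12 = m3 + m5;            // = a11*b12 + a12*b22
        C.a21 = m2 + m4;            // = a21*b11 + a22*b21
        C.a22 = m1 - m2 + m3 + m6;  // = a21*b12 + a22*b22
        return C;
    }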
- Strassen, V. The asymptotic spectrum of tensors and the exponent of matrix multiplication. In 27th Annual Symposium on Foundations of Computer Science 49–54 (IEEE, 1986), https://ieeexplore.ieee.org/document/4568194
- Huang, J., Yu, C. D. & Geijn, R. A. V. D. Strassen’s algorithm reloaded on GPUs. ACM Trans. Math. Softw. 46, 1–22 (2020). https://dl.acm.org/doi/10.1145/3372419
- Tschannen, M., Khanna, A. & Anandkumar, A, StrassenNets: deep learning with a multiplication budget. In International Conference on Machine Learning 4985–4994 (PMLR, 2018). https://arxiv.org/abs/1712.03942
- Huang, J., Smith, T. M., Henry, G. M. & Van De Geijn, R. A., Strassen’s algorithm reloaded. In International Conference for High Performance Computing, Networking, Storage and Analysis 690–701 (IEEE, 2016). https://ieeexplore.ieee.org/document/7877137
- Burichenko, V. P., 2014, On symmetries of the Strassen algorithm. Preprint at https://arxiv.org/abs/1408.6273 (2014).
- Grochow, J. A. & Moore, C. 2017, Designing Strassen’s algorithm. Preprint at https://arxiv.org/abs/1708.09398 (2017). https://arxiv.org/abs/1708.09398
- Elser, V., 2016, A network that learns Strassen multiplication. J. Mach. Learn. Res. 17, 3964–3976 (2016). https://arxiv.org/abs/1601.07227
- Jianyu Huang, Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn, 3 May 2016, Implementing Strassen's Algorithm with BLIS, https://arxiv.org/abs/1605.01078
- Gianfranco Bilardi, Lorenzo De Stefani, 7 May 2016, The I/O complexity of Strassen's matrix multiplication with recomputation, https://arxiv.org/abs/1605.02224
- Christian Ikenmeyer, Vladimir Lysikov, 13 Feb 2019 (v2), Strassen's 2x2 matrix multiplication algorithm: A conceptual perspective, https://arxiv.org/abs/1708.08083
- Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn, 24 Aug 2018, Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs, https://arxiv.org/abs/1808.07984
- Chandan Misra, Sourangshu Bhattacharya, Soumya K. Ghosh, 22 Nov 2018 (v2), Stark: Fast and Scalable Strassen's Matrix Multiplication using Apache Spark, https://arxiv.org/abs/1811.07325
- Viviana Arrigoni, Annalisa Massini, 6 Feb 2019, Fast Strassen-based A^t A Parallel Multiplication, https://arxiv.org/abs/1902.02104
- Viviana Arrigoni, Filippo Maggioli, Annalisa Massini, Emanuele Rodolà, 25 Oct 2021, Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose, https://arxiv.org/abs/2110.13042
- Osman B. Guney, Suayb S. Arslan, 10 Oct 2022, Fault-Tolerant Strassen-Like Matrix Multiplication, https://arxiv.org/abs/2210.04441
- Paolo D'Alberto, 20 Dec 2023, Strassen's Matrix Multiplication Algorithm Is Still Faster, https://arxiv.org/abs/2312.12732
- Jean-Guillaume Dumas, Clément Pernet, Alexandre Sedoglavic, 28 Jun 2024 (v2), Strassen's algorithm is not optimally accurate, https://arxiv.org/abs/2402.05630
- Afzal Ahmad, Linfeng Du, Wei Zhang, 4 Jun 2024, Fast and Practical Strassen's Matrix Multiplication using FPGAs, https://arxiv.org/abs/2406.02088
- Tomonori Kouya, 7 Oct 2014, Accelerated Multiple Precision Matrix Multiplication using Strassen's Algorithm and Winograd's Variant, https://arxiv.org/abs/1410.1599
- Tomonori Kouya, 29 Oct 2015, Performance evaluation of multiple precision matrix multiplications using parallelized Strassen and Winograd algorithms https://arxiv.org/abs/1510.08642
- Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, Wei Zhou, 18 May 2009 (v5), Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm, https://arxiv.org/abs/0707.2347
- Jean-Guillaume Dumas, Victor Pan, 17 Dec 2016, Fast Matrix Multiplication and Symbolic Computation, https://arxiv.org/abs/1612.05766
Winograd Matrix Multiplication
Research papers on using the Winograd or the Coppersmith-Winograd algorithm:
- Coppersmith, D. & Winograd, S., Matrix multiplication via arithmetic progressions. In ACM Symposium on Theory of Computing 1–6 (ACM, 1987), https://dl.acm.org/doi/10.1145/28395.28396
- Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021, doi:10.1109/CVPR.2016.435 https://arxiv.org/abs/1509.09308 (Uses Winograd's algorithm.)
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf
- Ruofan Wu, Feng Zhang, Jiawei Guan, Zhen Zheng, Xiaoyong Du, and Xipeng Shen. DREW: Efficient Winograd CNN Inference with Deep Reuse. In Proceedings of the ACM Web Conference 2022, pages 1807–1816, 2022. https://dl.acm.org/doi/10.1145/3485447.3511985 PDF: https://research.csc.ncsu.edu/picture/publications/papers/www2022.pdf (Combines the Winograd matrix multiplication algorithm with computation reuse.)
- F Zhang, R Wu, J Guan, Z Zheng, X Guo, 2023, Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device, IEEE Transactions on Knowledge and Data Engineering, Volume 35, Issue 10, 01 October 2023, https://ieeexplore.ieee.org/abstract/document/10106424/ (Further analysis of DREW, which combines Winograd optimizations with data reuse.)
- Zlateski, A., Jia, Z., Li, K., Durand, F., The anatomy of efficient fft and winograd convolutions on modern cpus. Proceedings of the ACM International Conference on Supercomputing (New York, NY, USA, 2019), ICS ’19, Association for Computing Machinery, p. 414–424. PDF: https://dl.acm.org/doi/pdf/10.1145/3330345.3330382
- Tomonori Kouya, 7 Oct 2014, Accelerated Multiple Precision Matrix Multiplication using Strassen's Algorithm and Winograd's Variant, https://arxiv.org/abs/1410.1599
- Tomonori Kouya, 29 Oct 2015, Performance evaluation of multiple precision matrix multiplications using parallelized Strassen and Winograd algorithms https://arxiv.org/abs/1510.08642
- François Le Gall, Florent Urrutia, 6 Nov 2017 (v2), Improved Rectangular Matrix Multiplication using Powers of the Coppersmith-Winograd Tensor, https://arxiv.org/abs/1708.05622
- Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, Wei Zhou, 18 May 2009 (v5), Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm, https://arxiv.org/abs/0707.2347
- Enrique Galvez, Feb 2024, A study of Convolutions for Efficient Inference of Deep Neural Networks on Embedded Processors, Master's Thesis, France, University de Lyon, https://largo.lip6.fr/~cassagnea/docs/reports/Galvez2024%20-%20A%20Study%20of%20Convolutions%20for%20Efficient%20Inference%20of%20Deep%20Neural%20Networks%20on%20Embedded%20Processors.pdf
- Protsenko, V., Kryzhanovskiy, V., Filippov, A. (2025). Quantization-Friendly Winograd Transformations for Convolutional Neural Networks. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15116. Springer, Cham. https://doi.org/10.1007/978-3-031-73636-0_11 https://link.springer.com/chapter/10.1007/978-3-031-73636-0_11
FFT Matrix Multiplication
Research papers using the Fast Fourier Transform (FFT) for matrix multiplication:
- Zlateski, A., Jia, Z., Li, K., Durand, F., The anatomy of efficient fft and winograd convolutions on modern cpus. Proceedings of the ACM International Conference on Supercomputing (New York, NY, USA, 2019), ICS ’19, Association for Computing Machinery, p. 414–424. PDF: https://dl.acm.org/doi/pdf/10.1145/3330345.3330382
- Alexandre Siron, Sean Molesky, 25 Jun 2024, A Split Fast Fourier Transform Algorithm for Block Toeplitz Matrix-Vector Multiplication, https://arxiv.org/abs/2406.17981
- Andrew Lavin, Scott Gray, 10 Nov 2015 (v2), Fast Algorithms for Convolutional Neural Networks, https://arxiv.org/abs/1509.09308
Approximate Matrix Multiplication
Approximate Matrix Multiplication (AMM) is a family of model optimization techniques that replace exact matrix multiplications with approximations, avoiding much of the cost of arithmetic multiplication at the price of some accuracy. These methods are usually distinct from quantization, are not restricted to particular subclasses of matrices, and draw on more advanced matrix theory.
Note that these algorithms operate at the high level of how matrices are multiplied with other matrices or with vectors (e.g., avoiding some vector dot products), whereas there are also low-level optimizations of the arithmetic operation of multiplying two numbers (see advanced mathematics). These two classes of research are not the same, and are largely orthogonal to each other.
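As one concrete member of the AMM family, the randomized Monte Carlo sampling algorithms of Drineas, Kannan, and Mahoney (cited in the list below) approximate C = A*B by summing only s of the K rank-one outer products A[:,k]*B[k,:], sampling column/row pairs with probability proportional to their norms and rescaling so that the estimate is unbiased. The rough C++ sketch below illustrates the idea under those assumptions; the function name and structure are ours, and edge cases (such as all-zero inputs) are ignored.

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Approximate C (n x m) = A (n x K) * B (K x m), row-major, by sampling s
    // of the K outer products A[:,k] * B[k,:] with probability p_k proportional
    // to ||A[:,k]|| * ||B[k,:]||, rescaling each sample by 1/(s * p_k).
    std::vector<double> approx_matmul(const std::vector<double>& A,
                                      const std::vector<double>& B,
                                      std::size_t n, std::size_t K, std::size_t m,
                                      std::size_t s, unsigned seed = 42) {
        // Sampling probabilities from column/row norms.
        std::vector<double> p(K, 0.0);
        double total = 0.0;
        for (std::size_t k = 0; k < K; ++k) {
            double na = 0.0, nb = 0.0;
            for (std::size_t i = 0; i < n; ++i) na += A[i * K + k] * A[i * K + k];
            for (std::size_t j = 0; j < m; ++j) nb += B[k * m + j] * B[k * m + j];
            p[k] = std::sqrt(na) * std::sqrt(nb);
            total += p[k];
        }
        for (std::size_t k = 0; k < K; ++k) p[k] /= total;

        std::mt19937 gen(seed);
        std::discrete_distribution<std::size_t> pick(p.begin(), p.end());

        std::vector<double> C(n * m, 0.0);
        for (std::size_t t = 0; t < s; ++t) {
            std::size_t k = pick(gen);
            double scale = 1.0 / (static_cast<double>(s) * p[k]);
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < m; ++j)
                    C[i * m + j] += scale * A[i * K + k] * B[k * m + j];
        }
        return C;  // Unbiased estimate of A*B; variance shrinks as s grows.
    }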
Research papers on approximate matrix multiplication include:
- Davis Blalock, John Guttag, "Multiplying Matrices Without Multiplying", arXiv:2106.10860v1, June 2021, https://arxiv.org/abs/2106.10860
- Anthony Alford, "MIT Researchers Open-Source Approximate Matrix Multiplication Algorithm MADDNESS", InfoQ, October 5, 2021, https://www.infoq.com/news/2021/10/mit-matrix-multiplication/
- Anirban Dasgupta, Ravi Kumar, Tamás Sarlós, "A Sparse Johnson-Lindenstrauss Transform", In Proceedings of the forty-second ACM symposium on Theory of computing, pp.341–350, 2010, https://arxiv.org/abs/1004.4240
- Mairal, J., Bach, F., Ponce, J., and Sapiro, G., "Online dictionary learning for sparse coding". In Proceedings of the 26th annual international conference on machine learning, pp. 689–696, 2009.
- Brian Plancher, Camelia D. Brumar, Iulian Brumar, Lillian Pentecost, Saketh Rama, David Brooks, 2019, Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019 IEEE High Performance Extreme Computing Conference (HPEC), DOI: 10.1109/HPEC.2019.8916468, https://ieeexplore.ieee.org/abstract/document/8916468 PDF: https://brianplancher.com/files/Applied_Approx_MM.pdf
- A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, “Matrix approximation and projective clustering via volume sampling,” in Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2006, pp.1117–1126, https://dl.acm.org/doi/10.5555/1109557.1109681
- P. Drineas, R. Kannan, and M. W. Mahoney, “Fast monte carlo algorithms for matrices I: Approximating matrix multiplication,” SIAM Journal on Computing, vol. 36, no. 1, pp. 132–157, 2006, https://epubs.siam.org/doi/10.1137/S0097539704442684, https://doi.org/10.1137/S0097539704442684
- P. Drineas, R. Kannan, and M. W. Mahoney, “Fast monte carlo algorithms for matrices ii: Computing a low-rank approximation to a matrix,” SIAM Journal on computing, vol. 36, no. 1, pp. 158–183, 2006, PDF: https://www.stat.berkeley.edu/~mmahoney/pubs/matrix2_SICOMP.pdf
- P. Drineas, R. Kannan, and M. W. Mahoney, “Fast monte carlo algorithms for matrices iii: Computing a compressed approximate matrix decomposition,” SIAM Journal on Computing, vol. 36, no. 1, pp. 184–206, 2006, PDF: https://www.stat.berkeley.edu/~mmahoney/pubs/matrix3_SICOMP.pdf
- A. S. Householder and G. Young, “Matrix approximation and latent roots,” The American Mathematical Monthly, vol. 45, no. 3, pp. 165–171, 1938, https://www.jstor.org/stable/2302980
- R. Pagh, “Compressed matrix multiplication,” ACM Transactions on Computation Theory (TOCT), vol. 5, no. 3, p. 9, 2013, https://arxiv.org/abs/1108.1320
- N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011, https://arxiv.org/abs/0909.4061
- T. Sarlos, “Improved approximation algorithms for large matrices via random projections,” in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on. IEEE, 2006, pp. 143-152, https://ieeexplore.ieee.org/document/4031351
- G. H. Golub, A. Hoffman, and G. W. Stewart, “A generalization of the eckart-young-mirsky matrix approximation theorem,” Linear Algebra and its applications, vol. 88, pp. 317–327, 1987, https://www.sciencedirect.com/science/article/pii/0024379587901145
- A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha, “A generalized maximum entropy approach to bregman co-clustering and matrix approximation,” Journal of Machine Learning Research, vol. 8, no. Aug, pp. 1919–1986, 2007, https://dl.acm.org/doi/10.1145/1014052.1014111, PDF: https://www.cs.utexas.edu/users/inderjit/public_papers/kdd_bregman_coclustering.pdf
- M. Adelman and M. Silberstein, “Faster neural network training with approximate tensor operations,” arXiv preprint arXiv:1805.08079, 2018, https://arxiv.org/abs/1805.08079
- Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
- Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh, 12 Mar 2024, IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers, https://arxiv.org/abs/2403.07339v1 (Faster GEMM with low-bitwidth integer approximations.)
- Uzair Shah, Jens Schneider, Giovanni Pintore, Enrico Gobbetti, Mahmood Alzubaidi, Mowafa Househ, Marco Agus, 2024, EleViT: exploiting element-wise products for designing efficient and lightweight vision transformers, https://publications.crs4.it/pubdocs/2024/SSPGAHA24/cvprt4v2024-elevit.pdf
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 17 Jul 2024, Improved Algorithms for Kernel Matrix-Vector Multiplication, LCFM 2024, https://openreview.net/forum?id=7CCzyEtZXH https://openreview.net/pdf?id=7CCzyEtZXH
- Noah Amsel, Tyler Chen, Feyza Duman Keles, Diana Halikias, Cameron Musco, Christopher Musco, 26 Mar 2024 (v3), Fixed-sparsity matrix approximation from matrix-vector products, https://arxiv.org/abs/2402.09379
- SZ Lin, YC Chen, YH Chang, TW Kuo, HP Li, 2024, LUTIN: Efficient Neural Network Inference with Table Lookup, ISLPED ’24, August 5-7, 2024, Newport Beach, CA, USA, https://dl.acm.org/doi/pdf/10.1145/3665314.3670804
- Q. Deng, Y. Zhang, M. Zhang and J. Yang, "LAcc: Exploiting Lookup Table-based Fast and Accurate Vector Multiplication in DRAM-based CNN Accelerator," 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2019, pp. 1-6. https://ieeexplore.ieee.org/document/8806810 PDF: https://dl.acm.org/doi/pdf/10.1145/3316781.3317845
- Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee, 31 Aug 2020 (v2), BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs, https://arxiv.org/abs/2005.09904
- Hongyaoxing Gu, 27 May 2024, LRAMM -- Low precision approximates GEMM via RSVD, https://arxiv.org/abs/2405.16917
- Yi Ren, Ruge Xu, Xinfei Guo, Weikang Qian, 27 Nov 2024, FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs--Down to 2 Bits! https://arxiv.org/abs/2411.18055
More AI Research
Read more about: