Aussie AI
Optimizing Matrix Multiplication Algorithms and MatMul/GEMM Kernels
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
Matrix Multiplication Research
With papers on this topic since the 1970s, you'd think we'd be done optimizing matrix multiplication code by now. Although the earliest work was on sequential algorithms, even parallel matrix multiplication algorithms have decades of theory behind them. And yet there is a continual stream of research papers on MatMul/GEMM optimizations.
Feel free to blame the AI bubble for this. AI engines are built on matrix multiplication, and it occurs in all of these standard LLM components:
- Attention algorithms (self-attention) — uses three special-purpose QKV matrices.
- Feed-Forward Networks (FFNs) — uses two matrix multiplications with an intervening activation function (e.g., RELU or SwiGLU).
- Embedding and unembedding matrices.
Note that attention and FFN blocks occur in every layer of an LLM.
There's quite a lot of work involved in optimizing attention and FFNs. The reason for the ongoing research is that there are many nuances and special cases. Some examples:
- Sparse versus dense matrices (i.e. SpMM versus GEMM algorithms).
- Matrix-matrix versus matrix-vector multiplications.
- Rectangular versus square matrices, including tall or flat rectangular matrices.
- Special shapes of matrices (e.g. triangular).
Optimizations to the matrix processing code must also deal with:
- Compute-bound versus memory-bound algorithms.
- Different computational versus memory costs in different hardware architectures (e.g., CPUs, GPUs, NPUs).
- Generalization from 2D matrices to 3D tensors (i.e., "tensor slices").
- Integer versus floating-point arithmetic.
- Quantization to fewer bits, such as 8-bit integer or 4-bit kernels (also reduced-bit floating-point FP8 and even FP4 kernels).
- Permutation-based vector/matrix/tensor algorithms.
- Sparse matrix storage compression formats (there are various in-memory methods).
- Non-negative matrices (e.g., activations arising after RELU, or fixed non-negatives from scaling the weight matrices).
There are also some features specific to AI engine algorithms and their use of matrices:
- The prefill phase (prompt processing) is compute-bound, but in the decoding phase the FFNs are compute-bound because their weight matrices are static, whereas the QKV attention computations are memory-bound because the KV cache is not static data.
- FFNs involve two matrix multiplications separated by an element-wise activation function, so kernel fusion is possible (see the sketch after this list).
- Attention kernels have a particular QKV algorithm, for which there are many specialty kernels: e.g., Group Query Attention (GQA), Multi-Query Attention (MQA), Flash Attention, Paged Attention, etc.
- Transposed matrices are used in the QKV algorithm, adding another wrinkle.
- Embedding and unembedding matrices are used to convert between tokens and embedding vectors.
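To make the FFN structure and the fusion opportunity above concrete, here is a minimal C++ sketch (a decoding-phase, single-vector case) with the RELU activation fused into the output step of the first multiplication; the function name and dimensions are illustrative assumptions, not any particular engine's kernel:

    #include <vector>
    #include <algorithm>

    // y[m] = W2[m x h] * RELU(W1[h x n] * x[n]); all matrices stored row-major.
    void ffn_fused_relu(const std::vector<float>& W1, const std::vector<float>& W2,
                        const std::vector<float>& x, std::vector<float>& y,
                        int n, int h, int m) {
        std::vector<float> hidden(h);
        for (int i = 0; i < h; ++i) {
            float sum = 0.0f;
            for (int j = 0; j < n; ++j)
                sum += W1[i * n + j] * x[j];
            hidden[i] = std::max(sum, 0.0f);  // RELU fused into the first MatMul's output step
        }
        for (int i = 0; i < m; ++i) {         // second MatMul (y must be sized m)
            float sum = 0.0f;
            for (int j = 0; j < h; ++j)
                sum += W2[i * h + j] * hidden[j];
            y[i] = sum;
        }
    }

Fusing the activation this way avoids a separate pass over the intermediate vector, a small but free saving in memory traffic.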
And there are some theoretical alternative methods of handling matrix multiplications:
- Low-rank matrix factorization algorithms (e.g., SVD).
- Approximate matrix multiplication
- Element-wise multiplication (Hadamard product)
Since many of these techniques are orthogonal, you can see the combinatorial growth of the various sub-cases. We might be reading GEMM papers for a while!
Optimizing Matrix Multiplication
Matrix multiplications are the basis of neural network processing. The main bottleneck is the multiplication operation deep inside the nested loops of the matrix multiplier. The main optimization in modern times is hardware acceleration, running many of those multiplications in parallel.
The main techniques to use in faster MatMul/GEMM kernels include:
- GPU versions of kernels
- Advanced theoretical matrix multiplication algorithms (e.g. Strassen's)
- Approximate matrix multiplication
Researchers have been optimizing matrix multiplication for decades, so you might think it's all done and dusted. Well, firstly, much of that older research concerns sequential algorithms rather than parallel ones. Furthermore, most of it focused on reducing the number of multiplications (i.e., FLOPs), whereas the cost of memory access is the greater issue for very large matrices. Hence, there's a steady stream of matrix multiplication optimization papers, with no end in sight. Some areas of focus for this AI-related matrix multiplication research include:
- Quantized kernels — e.g. INT4 or INT8 GEMM kernels
- Memory-efficient matrix multiplication (generally)
- Memory-efficient attention-specific optimizations (e.g., Flash attention, Paged attention)
- Sparse matrix kernels (see also sparsity optimizations)
- Low-rank matrix factorization
- Specialized versions (e.g., matrix-vector, square vs rectangular, etc.)
- Advanced parallelization (e.g., block-level)
- Tensor generalizations — add another dimension!
Several of the components of a basic AI Transformer engine (for LLMs) use matrix multiplications in a specialized way; see also the more specific optimizations of those components.
Naive Matrix Multiplication Algorithm
The simplest algorithm to compute matrix multiplication is a cubic-complexity algorithm, which involves three nested loops (a minimal sketch is shown after the references below). There are various papers with examples:
- Geeks for Geeks, 31 Oct, 2024, Program to multiply two matrices, https://www.geeksforgeeks.org/c-program-multiply-two-matrices/
- Geeks for Geeks, 21 Sep, 2023, Multiply Two Matrices in C++, https://www.geeksforgeeks.org/cpp-matrix-multiplication/
- Codes Cracker, Nov 2024, C Program to Multiply Two Matrices, https://codescracker.com/c/program/c-program-multiply-two-matrices.htm
- Know Program, Nov 2024, Matrix Operations in C | Addition, Multiplication, Transpose, https://www.knowprogram.com/c-programming/matrix-operations-in-c/
- Nadav Rotem, Nov 2024, High-Performance Matrix Multiplication, https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
- Ulrich Drepper, October 23, 2007, Memory part 5: What programmers can do, https://lwn.net/Articles/255364/
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
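For reference, a minimal C++ version of that naive algorithm looks roughly like this (row-major storage assumed); every serious kernel discussed below is ultimately an optimization of these three loops:

    #include <vector>

    // C[m x p] = A[m x n] * B[n x p], naive cubic-complexity version.
    void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int m, int n, int p) {
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < p; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += A[i * n + k] * B[k * p + j];  // one multiply-add per innermost iteration
                C[i * p + j] = sum;
            }
        }
    }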
Survey Papers on Matrix Multiplication Kernels (MatMul/GEMM)
Survey papers on matrix multiplication include:
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, Vivek Sarkar, 1 Aug 2016, A survey of sparse matrix-vector multiplication performance on large matrices, https://arxiv.org/abs/1608.00636
- S. Afroz, M. Tahaseen, F. Ahmed, K. S. Farshee and M. N. Huda, "Survey on matrix multiplication algorithms," 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), Dhaka, Bangladesh, 2016, pp. 151-155, doi: 10.1109/ICIEV.2016.7759985. https://ieeexplore.ieee.org/abstract/document/7759985
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
Surveys on Sparse Matrix Multiplication
Review papers on sparse matrix multiplication (SpMM):
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, Vivek Sarkar, 1 Aug 2016, A survey of sparse matrix-vector multiplication performance on large matrices, https://arxiv.org/abs/1608.00636
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
Sparse Matrix Kernels
Sparse matrices are those with lots of zeros and few non-zero values. There are various ways to optimize MatMul/GEMM for sparsity.
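As background for the papers below, here is a minimal sketch of a sparse matrix-vector multiply in Compressed Sparse Row (CSR) format, which stores and multiplies only the non-zero values; the struct layout and names are illustrative assumptions:

    #include <vector>

    struct CsrMatrix {
        int rows;
        std::vector<float> values;    // non-zero values, stored row by row
        std::vector<int> col_index;   // column of each non-zero value
        std::vector<int> row_start;   // rows+1 offsets into values/col_index
    };

    // y = A * x, skipping all zero entries of A (y must be sized A.rows).
    void spmv_csr(const CsrMatrix& A, const std::vector<float>& x, std::vector<float>& y) {
        for (int i = 0; i < A.rows; ++i) {
            float sum = 0.0f;
            for (int k = A.row_start[i]; k < A.row_start[i + 1]; ++k)
                sum += A.values[k] * x[A.col_index[k]];
            y[i] = sum;
        }
    }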
- Y Yang, JS Emer, D Sanchez, 2024, Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications, MIT, https://yang-yifan.github.io/papers/isca24_trapezoid.pdf
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song, 19 Sep 2023, Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, https://arxiv.org/abs/2309.10285 Code: https://github.com/AlibabaResearch/flash-llm (Unstructured pruning on tensor cores in GPUs with sparse MatMul optimizations.)
- Hongyaoxing Gu, 11 Mar 2024, A method for accelerating low precision operations by sparse matrix multiplication, https://arxiv.org/abs/2403.06924v1
- Haque, S.A.; Choudhury, N.; Hossain, S. Matrix Multiplication with Diagonals: Structured Sparse Matrices and Beyond. In Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, Jinan, China, 17–19 June 2023; pp. 69–76. https://doi.org/10.1145/3606043.3606053
- Sardar Anisul Haque,Mohammad Tanvir Parvez, Shahadat Hossain, Jan 2024, GPU Algorithms for Structured Sparse Matrix Multiplication with Diagonal Storage Schemes, https://www.mdpi.com/1999-4893/17/1/31
- D. Mukunoki, M. Kawai and T. Imamura, 2023, Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor, 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 2023, pp. 608-615, doi: 10.1109/MCSoC60832.2023.00094, https://ieeexplore.ieee.org/abstract/document/10387875
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Helin Cheng, Wenxuan Li, Yuechen Lu, and Weifeng Liu. 2023. HASpGEMM: Heterogeneity-Aware Sparse General Matrix-Matrix Multiplication on Modern Asymmetric Multicore Processors. In Proceedings of the 52nd International Conference on Parallel Processing (ICPP '23). Association for Computing Machinery, New York, NY, USA, 807–817. https://doi.org/10.1145/3605573.3605611 https://dl.acm.org/doi/abs/10.1145/3605573.3605611
- Chunxu Lin, Wensheng Luo, Yixiang Fang, Chenhao Ma, Xilin Liu, and Yuchi Ma. 2024. On Efficient Large Sparse Matrix Chain Multiplication. Proc. ACM Manag. Data 2, 3, Article 156 (June 2024), 27 pages. https://doi.org/10.1145/3654959 https://dl.acm.org/doi/abs/10.1145/3654959
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- NVIDIA, 2024, cuSparse, https://docs.nvidia.com/cuda/cusparse/index.html
- Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
- Zhang, H., Ma, W., Yuan, W. et al. Mixed-precision block incomplete sparse approximate preconditioner on Tensor core. CCF Trans. HPC 6, 54–67 (2024). https://doi.org/10.1007/s42514-023-00165-9 https://link.springer.com/article/10.1007/s42514-023-00165-9
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- Kaige Zhang, Xiaoyan Liu, Hailong Yang, Tianyu Feng, Xinyu Yang, Yi Liu, Zhongzhi Luan, and Depei Qian. 2024. Jigsaw: Accelerating SpMM with Vector Sparsity on Sparse Tensor Core. In Proceedings of the 53rd International Conference on Parallel Processing (ICPP '24). Association for Computing Machinery, New York, NY, USA, 1124–1134. https://doi.org/10.1145/3673038.3673108 https://dl.acm.org/doi/abs/10.1145/3673038.3673108
- Bobby Yan, Alexander J. Root, Trevor Gale, David Broman, Fredrik Kjolstad, 20 Jun 2024 (v2), Scorch: A Library for Sparse Deep Learning, https://arxiv.org/abs/2405.16883
- Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad, 21 Aug 2024, Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication, https://arxiv.org/abs/2408.11988
- Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
- Noah Amsel, Tyler Chen, Feyza Duman Keles, Diana Halikias, Cameron Musco, Christopher Musco, 26 Mar 2024 (v3), Fixed-sparsity matrix approximation from matrix-vector products, https://arxiv.org/abs/2402.09379
- Peiming Liu, Alexander J. Root, Anlun Xu, Yinying Li, Fredrik Kjolstad, Aart C. Bik, 2024, Compiler Support for Sparse Tensor Convolutions, https://rootjalex.github.io/publications/oopsla2024-spconv.pdf
- Pranav Dangi, Zhenyu Bai, Dhananjaya Wijerathne, Rohan Juneja, 2024, ZeD: A Generalized Accelerator for Variably Sparse Matrix Computations in ML, https://pranavdangi.github.io/papers/PACT24.pdf
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Anton Lokhmotov, 17 Nov 2015 (v2), GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication, https://arxiv.org/abs/1511.03742
- Xiaobo Lu, Jianbin Fang, Lin Peng, Chun Huang, Zidong Du, Yongwei Zhao, and Zheng Wang. 2024. Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product. ACM Trans. Archit. Code Optim. Just Accepted (August 2024). https://doi.org/10.1145/3688612 https://dl.acm.org/doi/abs/10.1145/3688612
- Patrik Okanovic, Grzegorz Kwasniewski, Paolo Sylos Labini, Maciej Besta, Flavio Vella, Torsten Hoefler, 21 Aug 2024, High Performance Unstructured SpMM Computation Using Tensor Cores, https://arxiv.org/abs/2408.11551
- Takuma Yamaguchi and Federico Busato, Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores, Mar 19, 2021, https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/
- OpenAI, December 6, 2017, Block-sparse GPU kernels, https://openai.com/index/block-sparse-gpu-kernels/ https://cdn.openai.com/blocksparse/blocksparsepaper.pdf https://github.com/openai/blocksparse
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- R. L. Castro, D. Andrade and B. B. Fraguela, "STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning," in IEEE Access, vol. 12, pp. 70581-70599, 2024, doi: 10.1109/ACCESS.2024.3402326. https://ieeexplore.ieee.org/abstract/document/10534045 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10534045
- Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
Tiled MatMul/GEMM
Tiling (also called blocking) partitions the matrices into small sub-blocks that fit in cache or GPU shared memory, so that each loaded block is reused many times before being evicted (a minimal sketch appears after the references below). Research papers on tiled MatMul/GEMM optimizations:
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- Cong Guo; Fengchen Xue; Jingwen Leng; Yuxian Qiu, May 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- David Spuler, March 2024, Tiled Matrix-Vector Multiplication, in Generative AI in C++, https://www.aussieai.com/book/ch34-tiled-matrix-vector-multiplication
- NVIDIA, 2024, Matrix Multiplication Background User's Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html https://docs.nvidia.com/deeplearning/performance/pdf/Matrix-Multiplication-Background-User-Guide.pdf
- Y. Song, Y. Meng, B. Chen, S. Chen and Y. Kang, 2024, SALTM: Accelerating Large Transformers in Multi-device System with 2D Model Partitioning Method, Integrated Circuits and Systems, doi: 10.23919/ICS.2024.3458897, https://ieeexplore.ieee.org/abstract/document/10678935 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10678935
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- A. Jangda, S. Maleki, M. M. Dehnavi, M. Musuvathi and O. Saarikivi, "A Framework for Fine-Grained Synchronization of Dependent GPU Kernels," 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Edinburgh, United Kingdom, 2024, pp. 93-105, doi: 10.1109/CGO57630.2024.10444873. https://ieeexplore.ieee.org/abstract/document/10444873
- Zhongkai Yu, Shengwen Liang, Tianyun Ma, Yunke Cai, Ziyuan Nan, Di Huang, Xinkai Song, Yifan Hao, Jie Zhang, Tian Zhi, Yongwei Zhao, Zidong Du, Xing Hu, Qi Guo, Tianshi Chen, 24 Sep 2024, Cambricon-LLM: A Chiplet-Based Hybrid Architecture for On-Device Inference of 70B LLM, https://arxiv.org/abs/2409.15654
- Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu, 1 Nov 2024, V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM, https://arxiv.org/abs/2411.00915
- Nadav Rotem, Nov 2024, High-Performance Matrix Multiplication, https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Book Excerpt from "Generative AI in C++", https://www.aussieai.com/book/ch34-matmul-gemm
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
- Lai, J., and Seznec, A. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. International Symposium on Code Generation and Optimization (CGO '13), 1–10. https://inria.hal.science/file/index/docid/789958/filename/112_Lai.pdf
- Andrew Kerr, Duane Merrill, Julien Demouth and John Tran, Dec 05, 2017, CUTLASS: Fast Linear Algebra in CUDA C++, https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- Fengguang Song, Stanimire Tomov, Jack Dongarra, 2012, Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems, ICS’12, June 25–29, 2012, San Servolo Island, Venice, Italy. https://icl.utk.edu/files/publications/2012/icl-utk-495-2012.pdf
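Here is a minimal CPU-side sketch of the tiled loop structure, assuming row-major storage and an illustrative tile size; real kernels additionally vectorize the inner loop, and GPU versions stage the tiles in shared memory:

    #include <vector>
    #include <algorithm>

    constexpr int TILE = 64;  // illustrative block size; tune for the cache hierarchy

    // C[m x p] += A[m x n] * B[n x p]; C is assumed zero-initialized.
    void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                      std::vector<float>& C, int m, int n, int p) {
        for (int ii = 0; ii < m; ii += TILE)
            for (int kk = 0; kk < n; kk += TILE)
                for (int jj = 0; jj < p; jj += TILE)
                    // Multiply one TILE x TILE block while it stays in cache.
                    for (int i = ii; i < std::min(ii + TILE, m); ++i)
                        for (int k = kk; k < std::min(kk + TILE, n); ++k) {
                            float a = A[i * n + k];
                            for (int j = jj; j < std::min(jj + TILE, p); ++j)
                                C[i * p + j] += a * B[k * p + j];
                        }
    }

The point of the blocking is that each TILE-sized block of A and B is reused many times while it is still resident in cache, rather than being streamed from main memory on every pass.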
CUDA Matrix Multiplication
There are various papers with examples of coding MatMul/GEMM in CUDA C++ (a minimal cuBLAS host-call sketch follows the references below):
- Samuel S. Cho, 2011, CUDA Thread Basics, https://users.wfu.edu/choss/CUDA/docs/Lecture%205.pdf
- Mark Harris, Feb 18, 2013, An Efficient Matrix Transpose in CUDA C/C++, https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
- Cris Cecka, Feb 27, 2017, Pro Tip: cuBLAS Strided Batched Matrix Multiply, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
- Dominik Grewe and Anton Lokhmotov.2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, (Newport Beach, California,U SA)(GPGPU-4).Association for Computing Machinery, NewYork, NY, USA, https://doi.org/10.1145/1964179.1964196
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Lei Mao, March 19, 2023, CUDA Coalesced Memory Access, https://leimao.github.io/blog/CUDA-Coalesced-Memory-Access/
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- NVIDIA, 2024, cuSparse, https://docs.nvidia.com/cuda/cusparse/index.html
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
- Gray, S. 2014. A full walk through of the SGEMM implementation, https://github.com/NervanaSystems/maxas/wiki/SGEMM
- Lai, J., and Seznec, A. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. International Symposium on Code Generation and Optimization (CGO '13), 1–10. https://inria.hal.science/file/index/docid/789958/filename/112_Lai.pdf
- Andrew Kerr, Duane Merrill, Julien Demouth and John Tran, Dec 05, 2017, CUTLASS: Fast Linear Algebra in CUDA C++, https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Rajib Nath, Stanimire Tomov, and Jack Dongarra, November 18, 2010, An Improved Magma Gemm For Fermi Graphics Processing Units, The International Journal of High Performance Computing Applications, Volume 24, Issue 4, https://doi.org/10.1177/1094342010385729 https://journals.sagepub.com/doi/10.1177/1094342010385729
- Fengguang Song, Stanimire Tomov, Jack Dongarra, 2012, Enabling and Scaling Matrix Computations on Heterogeneous Multi-Core and Multi-GPU Systems, ICS’12, June 25–29, 2012, San Servolo Island, Venice, Italy. https://icl.utk.edu/files/publications/2012/icl-utk-495-2012.pdf
- Wang Zhiyong, 2022, NVIDIA SGEMM Practice: Step-by-step optimization of CUDA SGEMM, https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE
- Siboehm, 2023, Fast CUDA SGEMM from Scratch, https://github.com/siboehm/SGEMM_CUDA
- Edward Kandrot, 2023, cuda_matmul: Optimized CUDA matmul with benchmarks, https://github.com/ekandrot/cuda_matmul
- Harshit Kumar, June 7, 2024, Matrix Multiplication in CUDA, https://kharshit.github.io/blog/2024/06/07/matrix-multiplication-cuda
- NVIDIA, 2024, Matrix Multiplication Background User's Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html
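For comparison with the hand-written kernels in these articles, here is a minimal host-side C++ sketch that hands the multiplication to cuBLAS (the usual performance baseline); it assumes the CUDA toolkit is available, omits error checking, and relies on cuBLAS's column-major storage convention:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // C[M x N] = A[M x K] * B[K x N], all matrices in column-major order.
    void gemm_cublas(int M, int N, int K, const float* hA, const float* hB, float* hC) {
        float *dA, *dB, *dC;
        cudaMalloc((void**)&dA, sizeof(float) * M * K);
        cudaMalloc((void**)&dB, sizeof(float) * K * N);
        cudaMalloc((void**)&dC, sizeof(float) * M * N);
        cudaMemcpy(dA, hA, sizeof(float) * M * K, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, sizeof(float) * K * N, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    M, N, K, &alpha, dA, M, dB, K, &beta, dC, M);

        cudaMemcpy(hC, dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }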
Lookup Table (LUT) for MatMul/GEMM
The use of table lookups can sometimes reduce, or even eliminate, the multiplication operations. This method has been used in various MatMul/GEMM kernels, usually when the values are quantized to a much smaller range than full floating-point.
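One common form of the idea, sketched below under illustrative assumptions (4-bit weights and a 16-entry dequantization table), precomputes each activation's product with all 16 possible weight levels once, so that every output row then needs only table lookups and additions rather than per-element multiplications:

    #include <vector>
    #include <cstdint>

    // y = W * x, where W holds 4-bit quantized weights (one per byte, values 0..15)
    // and levels[] holds the 16 dequantized weight values. y must be sized m.
    void lut_matvec(const std::vector<float>& x,         // n activations
                    const std::vector<uint8_t>& W,       // m x n quantized weights, row-major
                    const std::vector<float>& levels,    // 16 dequantized levels
                    std::vector<float>& y, int m, int n) {
        // Precompute table[i][v] = x[i] * levels[v]: only 16 multiplies per activation.
        std::vector<float> table(n * 16);
        for (int i = 0; i < n; ++i)
            for (int v = 0; v < 16; ++v)
                table[i * 16 + v] = x[i] * levels[v];

        // Each of the m output rows now uses only lookups and additions.
        for (int r = 0; r < m; ++r) {
            float sum = 0.0f;
            for (int i = 0; i < n; ++i)
                sum += table[i * 16 + W[r * n + i]];
            y[r] = sum;
        }
    }

The 16*n multiplications in the setup phase are amortized across all m output rows, so the larger m is, the closer the kernel gets to being multiplication-free.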
Research on LUT MatMul/GEMM algorithms:
- SZ Lin, YC Chen, YH Chang, TW Kuo, HP Li, 2024, LUTIN: Efficient Neural Network Inference with Table Lookup, ISLPED ’24, August 5-7, 2024, Newport Beach, CA, USA, https://dl.acm.org/doi/pdf/10.1145/3665314.3670804
- Davis Blalock, John Guttag, 21 Jun 2021, Multiplying Matrices Without Multiplying, https://arxiv.org/abs/2106.10860
- Q. Deng, Y. Zhang, M. Zhang and J. Yang, "LAcc: Exploiting Lookup Table-based Fast and Accurate Vector Multiplication in DRAM-based CNN Accelerator," 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2019, pp. 1-6. https://ieeexplore.ieee.org/document/8806810 PDF: https://dl.acm.org/doi/pdf/10.1145/3316781.3317845
- Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee, 31 Aug 2020 (v2), BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs, https://arxiv.org/abs/2005.09904
- Jie Ran, Rui Lin, Jason Chun Lok Li, Jiajun Zhou, Ngai Wong, 13 Aug 2022, PECAN: A Product-Quantized Content Addressable Memory Network, https://arxiv.org/abs/2208.13571
Matrix-Vector Multiplication
Matrix-vector multiplication is also called Vector Matrix Multiplication (VMM) or General Matrix-Vector multiplication (GEMV).
- Wenjie Li; Aokun Hu; Ningyi Xu; Guanghui He, Jan 2024, Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models IEEE Transactions on Circuits and Systems I: Regular Papers (Early Access), https://ieeexplore.ieee.org/abstract/document/10400181/ (Quantization software algorithms done in a hardware-aware co-designed method to optimize hardware matrix-vector multiplication.)
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 17 Jul 2024, Improved Algorithms for Kernel Matrix-Vector Multiplication, LCFM 2024, https://openreview.net/forum?id=7CCzyEtZXH https://openreview.net/pdf?id=7CCzyEtZXH
- Han Xu, Yutong Li, Shihao Ji, 12 Sep 2024, LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs, https://arxiv.org/abs/2409.11424 (Matrix multiplications are 97% of computations, which are optimized with a pipelined matrix-vector operation.)
- D. Mukunoki, M. Kawai and T. Imamura, 2023, Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor, 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 2023, pp. 608-615, doi: 10.1109/MCSoC60832.2023.00094, https://ieeexplore.ieee.org/abstract/document/10387875
- Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
- Noah Amsel, Tyler Chen, Feyza Duman Keles, Diana Halikias, Cameron Musco, Christopher Musco, 26 Mar 2024 (v3), Fixed-sparsity matrix approximation from matrix-vector products, https://arxiv.org/abs/2402.09379
- David Spuler, March 2024, Tiled Matrix-Vector Multiplication, in Generative AI in C++, https://www.aussieai.com/book/ch34-tiled-matrix-vector-multiplication
- Md Tawsif Rahman Chowdhury, Huynh Quang Nguyen Vo, Paritosh Ramanan, Murat Yildirim, Gozde Tutuncuoglu, 10 Sep 2024, The Lynchpin of In-Memory Computing: A Benchmarking Framework for Vector-Matrix Multiplication in RRAMs, https://arxiv.org/abs/2409.06140
- Mohamed Assem Ibrahim, Mahzabeen Islam, Shaizeen Aga, 1 Apr 2024 (v2), Balanced Data Placement for GEMV Acceleration with Processing-In-Memory, https://arxiv.org/abs/2403.20297
- Sandeep Kaur Kingra, Vivek Parmar, Shubham Negi, Sufyan Khan, Boris Hudec, Tuo-Hung Hou, Manan Suri, 10 Jun 2020, Methodology for Realizing VMM with Binary RRAM Arrays: Experimental Demonstration of Binarized-ADALINE Using OxRAM Crossbar, https://arxiv.org/abs/2006.05657
- Florian Freye, Jie Lou, Christian Lanius, Tobias Gemmeke, 21 May 2024 (v2), Merits of Time-Domain Computing for VMM -- A Quantitative Comparison, https://arxiv.org/abs/2403.18367
- Alexandre Siron, Sean Molesky, 25 Jun 2024, A Split Fast Fourier Transform Algorithm for Block Toeplitz Matrix-Vector Multiplication, https://arxiv.org/abs/2406.17981
- Lingnan Xia, Hua Ma, Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10721583
- Hyeji Kim, Yeongmin Lee, Chun-Gi Lyuh, 28 October 2024, PF-GEMV: Utilization maximizing architecture in fast matrix–vector multiplication for GPT-2 inference, https://doi.org/10.4218/etrij.2024-0111 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2024-0111 https://onlinelibrary.wiley.com/doi/epdf/10.4218/etrij.2024-0111
- Max Grossman, Christopher Thiele, Mauricio Araya-Polo, Florian Frank, Faruk O. Alpak, Vivek Sarkar, 1 Aug 2016, A survey of sparse matrix-vector multiplication performance on large matrices. https://arxiv.org/abs/1608.00636
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Book Excerpt from "Generative AI in C++", https://www.aussieai.com/book/ch34-matmul-gemm
- Donghyeon Yi, Seoyoung Lee, Jongho Kim, Junyoung Kim, Sohmyung Ha, Ik Joon Chang, Minkyu Je, 22 Nov 2024, FLARE: FP-Less PTQ and Low-ENOB ADC Based AMS-PiM for Error-Resilient, Fast, and Efficient Transformer Acceleration, https://arxiv.org/abs/2411.14733
- MA Ibrahim, M Islam, S Aga, Nov 2024, PIMnast: Balanced Data Placement for GEMV Acceleration with Processing-In-Memory, https://conferences.computer.org/sc-wpub/pdfs/SC-W2024-6oZmigAQfgJ1GhPL0yE3pS/555400a970/555400a970.pdf
Transpose Optimizations in Matrix Multiplication
One of the basic methods to optimize matrix multiplication is to operate on a matrix's transpose, because this leads to contiguous memory accesses, which are faster (i.e., it's a memory access speed optimization, not a reduction in computations). The underlying issue is whether matrices are stored in row-major or column-major order, since the two operands of a matrix-matrix multiplication want opposite storage orders. It can even be beneficial in some cases to compute the transpose on-the-fly (which has quadratic complexity), storing it as a temporary matrix, just to speed up the full matrix-matrix operation (which has cubic complexity), as in the sketch after the list below. Hence, there are various optimizations, including:
- Fast computation of the transpose.
- Caching a pre-computed transpose.
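A minimal sketch of the on-the-fly option: transpose B once (quadratic cost) so that the inner loop of the cubic-cost multiplication reads both operands sequentially; row-major storage is assumed:

    #include <vector>

    // C[m x p] = A[m x n] * B[n x p], via a temporary transpose BT[p x n].
    void matmul_transposed_b(const std::vector<float>& A, const std::vector<float>& B,
                             std::vector<float>& C, int m, int n, int p) {
        std::vector<float> BT(p * n);
        for (int k = 0; k < n; ++k)            // O(n*p) transpose of B
            for (int j = 0; j < p; ++j)
                BT[j * n + k] = B[k * p + j];

        for (int i = 0; i < m; ++i)
            for (int j = 0; j < p; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)    // both operands now accessed sequentially
                    sum += A[i * n + k] * BT[j * n + k];
                C[i * p + j] = sum;
            }
    }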
Research papers on handling a transpose, or creating one, for faster MatMul include:
- Harald Prokop, May 21, 1999, Cache-Oblivious Algorithms, Master of Science Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, http://supertech.csail.mit.edu/papers/Prokop99.pdf
- Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou, 21 Feb 2022, I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels https://arxiv.org/abs/2202.10217
- Paul Springer, Paolo Bientinesi, 7 Nov 2017 (v3), Design of a high-performance GEMM-like Tensor-Tensor Multiplication, https://arxiv.org/abs/1607.00145
- David Spuler, March 2024, Chapter 29. Caching Optimizations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- David Spuler, March 2024, Cached or Precomputed Transpose, in Generative AI in C++, https://www.aussieai.com/book/ch29-cached-precomputed-transpose
- Dung Le, Jul 30, 2020, CUDA Memory Management & Use cases, https://medium.com/distributed-knowledge/cuda-memory-management-use-cases-f9d340f7c704
- Mark Harris, Feb 18, 2013, An Efficient Matrix Transpose in CUDA C/C++, https://developer.nvidia.com/blog/efficient-matrix-transpose-cuda-cc/
- Lei Mao, March 19, 2023, CUDA Coalesced Memory Access, https://leimao.github.io/blog/CUDA-Coalesced-Memory-Access/
- Viviana Arrigoni, Annalisa Massini, 6 Feb 2019, Fast Strassen-based A^t A Parallel Multiplication, https://arxiv.org/abs/1902.02104
- Viviana Arrigoni, Filippo Maggioli, Annalisa Massini, Emanuele Rodolà, 25 Oct 2021, Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose, https://arxiv.org/abs/2110.13042
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Book Excerpt from "Generative AI in C++", https://www.aussieai.com/book/ch34-matmul-gemm
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
Matrix Multiplication Optimization in AI Models
Most AI inference algorithms are based on matrix multiplications, or similarly on vector dot products. The primary optimization is to parallelize matrix or vector operations using hardware acceleration in GPUs or other chips. However, several improvements to matrix and vector algorithms have also been found. Research papers specific to matrix multiplication include:
- Tim Dettmers, Mike Lewis, Younes Belkada, Luke Zettlemoyer, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Nov 2022, https://arxiv.org/abs/2208.07339
- Meng Hao, Hongwei Li, Hanxiao Chen, Pengzhi Xing, Guowen Xu, Tianwei Zhang, Iron: Private Inference on Transformers, Part of Advances in Neural Information Processing Systems 35 (NeurIPS 2022), https://proceedings.neurips.cc/paper_files/paper/2022/hash/64e2449d74f84e5b1a5c96ba7b3d308e-Abstract-Conference.html
- François Charton, Linear algebra with transformers, Nov 2022, https://arxiv.org/abs/2112.01898
- Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis & Pushmeet Kohli, Discovering faster matrix multiplication algorithms with reinforcement learning, Nature, volume 610, pp.47–53 (2022), https://www.nature.com/articles/s41586-022-05172-4
- Huang, J., Yu, C. D. & Geijn, R. A. V. D. Strassen’s algorithm reloaded on GPUs. ACM Trans. Math. Softw. 46, 1–22 (2020). https://dl.acm.org/doi/10.1145/3372419
- Tschannen, M., Khanna, A. & Anandkumar, A, StrassenNets: deep learning with a multiplication budget. In International Conference on Machine Learning 4985–4994 (PMLR, 2018). https://arxiv.org/abs/1712.03942
- Jason Cong and Bingjun Xiao. Minimizing computation in convolutional neural networks. In Artificial Neural Networks and Machine Learning–ICANN 2014, pages 281–290. Springer, 2014. PDF: https://vast.cs.ucla.edu/sites/default/files/publications/CNN_ICANN14.pdf
- Ruofan Wu, Feng Zhang, Jiawei Guan, Zhen Zheng, Xiaoyong Du, and Xipeng Shen. DREW: Efficient Winograd CNN Inference with Deep Reuse. In Proceedings of the ACM Web Conference 2022, pages 1807–1816, 2022. https://dl.acm.org/doi/10.1145/3485447.3511985 PDF: https://research.csc.ncsu.edu/picture/publications/papers/www2022.pdf (Combines the Winograd matrix multiplication algorithm with computation reuse.)
- F Zhang, R Wu, J Guan, Z Zheng, X Guo, 2023, Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device, IEEE Transactions on Knowledge and Data Engineering, Volume 35, Issue 10, 01 October 2023, https://ieeexplore.ieee.org/abstract/document/10106424/ (Further analysis of DREW, which combines Winograd optimizations with data reuse.)
- NM Cicek, X Shen, O Ozturk, 2022, Energy Efficient Boosting of GEMM Accelerators for DNN via Reuse, ACM Transactions on Design Automation of Electronic Systems, Volume 27, Issue 5, Article No. 43, pp 1–26, https://doi.org/10.1145/3503469, https://dl.acm.org/doi/10.1145/3503469, PDF: https://dl.acm.org/doi/pdf/10.1145/3503469 (Uses computation reuse to speed up matrix multiplications.)
- R Snytsar, Oct 2023, Accelerating Machine Learning Primitives on Commodity Hardware, arXiv preprint arXiv:2310.05218, https://arxiv.org/pdf/2310.05218.pdf (Uses the "sliding window" technique to optimize general matrix multiplication on edge devices.)
- Andrew Anderson, Aravind Vasudevan, Cormac Keane, David Gregg, Sep 2017, Low-memory GEMM-based convolution algorithms for deep neural networks, arXiv preprint arXiv:1709.03395, https://arxiv.org/abs/1709.03395
- Aravind Vasudevan, Andrew Anderson, David Gregg, 2017, Parallel Multi Channel Convolution using General Matrix Multiplication, 2017 IEEE 28th international conference on application-specific systems, architectures and processors (ASAP). pp. 19–24. https://arxiv.org/abs/1704.04428
- H Li, J Choi, Y Kwon, JH Ahn, Oct 2023, A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models, IEEE Computer Architecture Letters, https://ieeexplore.ieee.org/abstract/document/10285300/ (Tiled version of matrix multiplication with SVD factorization.)
- Dennis Sebastian Rieber, 2023, Deployment of Deep Neural Networks on Dedicated Hardware Accelerators, Ph.D. thesis, Doctor of Natural Sciences, Ruprecht–Karls University Heidelberg, PDF: https://archiv.ub.uni-heidelberg.de/volltextserver/32994/1/dissertationPDFA.pdf (Fusion and fission optimizations with example algorithms on p.40 and p.45.)
- Dukhan, M. 2019. The indirect convolution algorithm, https://arxiv.org/abs/1907.02129 (Optimizing an alternative to GEMM for convolutions.)
- Zlateski, A., Jia, Z., Li, K., Durand, F., The anatomy of efficient fft and winograd convolutions on modern cpus. Proceedings of the ACM International Conference on Supercomputing (New York, NY, USA, 2019), ICS ’19, Association for Computing Machinery, p. 414–424. PDF: https://dl.acm.org/doi/pdf/10.1145/3330345.3330382
- William S. Moses, Ivan R. Ivanov, Jens Domke, Toshio Endo, Johannes Doerfert, Oleksandr Zinenko, 2023, High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs, PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, February 2023, Pages 119–134, https://dl.acm.org/doi/abs/10.1145/3572848.3577475, PDF: https://dl.acm.org/doi/pdf/10.1145/3572848.3577475
- chengzeyi, Oct 2023, Stable Fast, https://github.com/chengzeyi/stable-fast (Highly optimized inference engine with fused GEMM)
- Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S., Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI’13, pp. 519–530, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2014-6. doi: 10.1145/2491956.2462176. http://doi.acm.org/10.1145/2491956.2462176, PDF: https://people.csail.mit.edu/jrk/halide-pldi13.pdf
- Caleb Tung, Nicholas Eliopoulos, Purvish Jajal, Gowri Ramshankar, Chen-Yun Yang, Nicholas Synovic, Xuecen Zhang, Vipin Chaudhary, George K. Thiruvathukal, Yung-Hsiang Lu, Oct 2023, An automated approach for improving the inference latency and energy efficiency of pretrained CNNs by removing irrelevant pixels with focused convolutions, https://arxiv.org/pdf/2310.07782.pdf, Code: https://github.com/PurdueCAM2Project/focused-convolutions
- H Liu, F Deng, 2023, Convolutional Acceleration Algorithm Combining Loop Optimization and Automatic Scheduling, 2023 International Conference for Advancement in Technology (ICONAT), https://ieeexplore.ieee.org/abstract/document/10080410/
- Liangzhen Lai, Naveen Suda, Vikas Chandra, 2018, CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs, arXiv preprint arXiv:1801.06601, https://arxiv.org/abs/1801.06601 PDF: https://arxiv.org/pdf/1801.06601
- N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege, A. Adams, and A. Cohen, 2018, Tensor comprehensions: Framework-agnostic high-performance machine learning abstractions, CoRR, vol. abs/1802.04730, http://arxiv.org/abs/1802.04730 (Many pseudo-code examples of kernel operator fusion, e.g. shows pseudo-code of fusing RELU into MatMul, and a fused MatMul-addbias-RELU.)
- Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian, 4 Jun 2024, Scalable MatMul-free Language Modeling, https://arxiv.org/abs/2406.02528 Code: https://github.com/ridgerchu/matmulfreellm (Uses addition via ternary quantization and elementwise Hadamard products to replace MatMul.)
- Y Yang, JS Emer, D Sanchez, 2024, Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications, MIT, https://yang-yifan.github.io/papers/isca24_trapezoid.pdf
- Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han, 7 May 2024, QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving, arXiv preprint arXiv:2405.04532, https://arxiv.org/abs/2405.04532 Project: https://hanlab.mit.edu/projects/qserve Code: https://github.com/mit-han-lab/qserve (Efficient quantized inference on GPUs using 4-bit weights, 8-bit activations, and 4-bit KV cache, mostly via a GEMM speedup.)
- Soroush Ghodrati, Sean Kinzer, Hanyang Xu, Rohan Mahapatra, Yoonsung Kim, Byung Hoon Ahn, Dong Kai Wang, Lavanya Karthikeyan, Amir Yazdanbakhsh, Jongse Park, Nam Sung Kim, Hadi Esmaeilzadeh, April 2024, Tandem processor: Grappling with emerging operators in neural networks, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 1165–1182, https://doi.org/10.1145/3620665.3640365 https://dl.acm.org/doi/abs/10.1145/3620665.3640365 Code: https://actlab-genesys.github.io (Reviews hardware acceleration of all sub-layer kernel operators, with a focus beyond just GEMM/MatMul operators.)
- Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren, 21 Apr 2024, SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile, https://arxiv.org/abs/2404.13528 (Choosing optimal tensor memory layouts to optimize low-level operator kernels.)
- Tomasz Kolinko, 2024, Effort: A possibly new algorithm for LLM Inference, https://kolinko.github.io/effort/equations.html https://kolinko.github.io/effort/bucketmul.html (A method of doing sparse MatMul computations to get faster inference.)
- Justine Tunney, March 31st 2024, LLaMA Now Goes Faster on CPUs, https://justine.lol/matmul/ Code: https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/sgemm.cpp Code: https://github.com/mozilla-Ocho/llamafile (Improved on-device benchmarks for PC CPU platforms, with Intel, AMD or M2 chips, for Mistral 7B models with 8-bit quantization by optimizing MatMul in the llama.cpp inference engine.)
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Dec 2023, Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Optimized LLM inference using kernel fusion of GEMM with element-wise operations for better data movement, and also advanced management of the KV cache.)
- Harald Prokop, May 21, 1999, Cache-Oblivious Algorithms, Master of Science Thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, http://supertech.csail.mit.edu/papers/Prokop99.pdf
- Shixun Wu, Yujia Zhai, Jinyang Liu, Jiajun Huang, Zizhe Jian, Bryan M. Wong, Zizhong Chen, May 2023, Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs, https://arxiv.org/abs/2305.01024 (Focuses on error tolerance of failures within matrix multiplication algorithms.)
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Vipul Vaibhaw, Jan 13, 2021, Matrix Multiplication: Optimizing the code from 6 hours to 1 sec, https://vaibhaw-vipul.medium.com/matrix-multiplication-optimizing-the-code-from-6-hours-to-1-sec-70889d33dcfa (Practical analysis of coding optimizations of matrix multiplication in Python and C++ for various speedups.)
- Wenjie Li; Aokun Hu; Ningyi Xu; Guanghui He, Jan 2024, Quantization and Hardware Architecture Co-Design for Matrix-Vector Multiplications of Large Language Models IEEE Transactions on Circuits and Systems I: Regular Papers (Early Access), https://ieeexplore.ieee.org/abstract/document/10400181/ (Quantization software algorithms done in a hardware-aware co-designed method to optimize hardware matrix-vector multiplication.)
- Yuzong Chen, Jordan Dotzel, Mohamed S. Abdelfattah, 5 Nov 2023, M4BRAM: Mixed-Precision Matrix-Matrix Multiplication in FPGA Block RAMs, https://arxiv.org/abs/2311.02758 (Extends FPGA hardware for matrix multiplication to mixed-precision data, such as from mixed-precision quantization.)
- Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen, Aug 2020, Sparse GPU Kernels for Deep Learning, https://arxiv.org/abs/2006.10901
- Ziheng Wang, Aug 2020, SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference, https://arxiv.org/abs/2008.11849
- Robert A. van de Geijn, Enrique S. Quintana-Ort´ı, 2007, The Science of Programming Matrix Computations, https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf Code: http://www.cs.utexas.edu/users/flame/
- X Xie, H Peng, A Hasan, S Huang, J Zhao, 2023, Accel-GCN: High-Performance GPU Accelerator Design for Graph Convolution Networks https://arxiv.org/abs/2308.11825 (Kernel for sparse matrix multiplication with block-level tiling as example.)
- Darshan C. Ganji, Saad Ashfaq, Ehsan Saboori, Sudhakar Sah, Saptarshi Mitra, MohammadHossein AskariHemmat, Alexander Hoffman, Ahmed Hassanien, Mathieu Léonardon, 18 Apr 2023, DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables, https://arxiv.org/abs/2304.09049
- Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo, 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, https://www.computer.org/csdl/journal/tc/5555/01/10436533/1UwVolp0wta
- Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh, 12 Mar 2024, IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers, https://arxiv.org/abs/2403.07339v1 (Faster GEMM with low-bitwidth integer approximations.)
- Babak Hejazi, Introducing Grouped GEMM APIs in cuBLAS and More Performance Updates, Jun 12, 2024, NVIDIA Technical Blog, https://developer.nvidia.com/blog/introducing-grouped-gemm-apis-in-cublas-and-more-performance-updates/
- Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou, 21 Feb 2022, I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels https://arxiv.org/abs/2202.10217
- Paul Springer, Paolo Bientinesi, 7 Nov 2017 (v3), Design of a high-performance GEMM-like Tensor-Tensor Multiplication, https://arxiv.org/abs/1607.00145
- Intel, Fast DistilBERT on CPUs, 2022, https://arxiv.org/pdf/2211.07715.pdf
- Kazushige Goto, Robert A. van de Geijn, 2008, Anatomy of high-performance matrix multiplication, ACM Transactions on Mathematical Software, Volume 34, Issue 3, Article No.: 12, pp 1–25, https://dl.acm.org/doi/10.1145/1356052.1356053 (The GotoBLAS algorithm for matrix multiplication.)
- DY Fu, S Arora, J Grogan, I Johnson, S Eyuboglu, Oct 2023, Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture https://arxiv.org/pdf/2310.12109.pdf
- DN Parikh, RA van de Geijn, GM Henry, 2023, Cascading GEMM: High Precision from Low Precision, arXiv preprint arXiv:2303.04353, https://arxiv.org/abs/2303.04353
- Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Stanford University, 2018, General Matrix Multiply (GeMM), https://spatial-lang.org/gemm
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
- Emily Cerf, June 20, 2024, Researchers run high-performing large language model on the energy needed to power a lightbulb, https://news.ucsc.edu/2024/06/matmul-free-llm.html
- Benj Edwards, 26 June, 2024, Researchers upend AI status quo by eliminating matrix multiplication in LLMs, https://arstechnica.com/information-technology/2024/06/researchers-upend-ai-status-quo-by-eliminating-matrix-multiplication-in-llms/
- Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Kaixin Xu, Zhe Wang, Chunyun Chen, Xue Geng, Jie Lin, Xulei Yang, Min Wu, Xiaoli Li, Weisi Lin, 2 Jul 2024, LPViT: Low-Power Semi-structured Pruning for Vision Transformers, https://arxiv.org/abs/2407.02068 (Block-level pruning to give a granular type of structured pruning which speeds up MatMul/GEMM by skipping whole blocks or tiles.)
- Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
- Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim, 15 Jul 2024, Fast Matrix Multiplications for Lookup Table-Quantized LLMs, https://arxiv.org/abs/2407.10960
- Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song, 19 Sep 2023, Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, https://arxiv.org/abs/2309.10285 Code: https://github.com/AlibabaResearch/flash-llm (Unstructured pruning on tensor cores in GPUs with sparse MatMul optimizations.)
- Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
- Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang, 12 Aug 2024, LUT Tensor Core: Lookup Table Enables Efficient Low-Bit LLM Inference Acceleration, https://arxiv.org/abs/2408.06003 (Lookup tables for mixed-precision MatMul/GEMM kernels using low-bit quantization mixed with full precision.)
- Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei, 16 Aug 2024, ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models, https://arxiv.org/abs/2408.08554
- Yeonhong Park, Jake Hyun, SangLyul Cho, Bonggeun Sim, Jae W. Lee, 21 Jun 2024 (v4), Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs, https://arxiv.org/abs/2402.10517 Code: https://github.com/SNU-ARC/any-precision-llm
- Keren Zhou, Karthik Ganapathi Subramanian, Po-Hsun Lin, Matthias Fey, Binqian Yin, Jiajia Li, 03 June 2024, FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, Pages 511-524, https://doi.org/10.1145/3650200.3656593 https://dl.acm.org/doi/abs/10.1145/3650200.3656593
- Jinendra Malekar, Mohammed E. Elbtity, Ramtin Zand, 21 Aug 2024, Matmul or No Matmal in the Era of 1-bit LLMs, https://arxiv.org/abs/2408.11939
- Kristina Wilson, Clifford Li, Hon Man Lau, Kwai Wong, Stanimire Tomov, 17 July 2024, Implementing Single-precision and Half-precision Tensor Operations, PEARC '24: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing, Article No.: 109, Pages 1-4, https://doi.org/10.1145/3626203.3670601
- Abhinav Jangda, Mohit Yadav, 20 February 2024, Fast Kronecker Matrix-Matrix Multiplication on GPUs, PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Pages 390 - 403, https://doi.org/10.1145/3627535.3638489 https://dl.acm.org/doi/abs/10.1145/3627535.3638489
- John Yang, Le An, Su Inn Park, 11 Jun 2024, ReduceFormer: Attention with Tensor Reduction by Summation, https://arxiv.org/abs/2406.07488
- D. Mukunoki, M. Kawai and T. Imamura, 2023, Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor, 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 2023, pp. 608-615, doi: 10.1109/MCSoC60832.2023.00094, https://ieeexplore.ieee.org/abstract/document/10387875
- C. Wei, H. Jia, Y. Zhang, J. Yao, C. Li and W. Cao, Sep 2024, IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs, IEEE Transactions on Parallel and Distributed Systems, vol. 35, no. 9, pp. 1672-1689, Sept. 2024, doi: 10.1109/TPDS.2024.3432579, https://ieeexplore.ieee.org/abstract/document/10607886
- Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 17 Jul 2024, Improved Algorithms for Kernel Matrix-Vector Multiplication, LCFM 2024, https://openreview.net/forum?id=7CCzyEtZXH https://openreview.net/pdf?id=7CCzyEtZXH
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Helin Cheng, Wenxuan Li, Yuechen Lu, and Weifeng Liu. 2023. HASpGEMM: Heterogeneity-Aware Sparse General Matrix-Matrix Multiplication on Modern Asymmetric Multicore Processors. In Proceedings of the 52nd International Conference on Parallel Processing (ICPP '23). Association for Computing Machinery, New York, NY, USA, 807–817. https://doi.org/10.1145/3605573.3605611 https://dl.acm.org/doi/abs/10.1145/3605573.3605611
- Chunxu Lin, Wensheng Luo, Yixiang Fang, Chenhao Ma, Xilin Liu, and Yuchi Ma. 2024. On Efficient Large Sparse Matrix Chain Multiplication. Proc. ACM Manag. Data 2, 3, Article 156 (June 2024), 27 pages. https://doi.org/10.1145/3654959 https://dl.acm.org/doi/abs/10.1145/3654959
- NVIDIA, 2024, cuSparse, https://docs.nvidia.com/cuda/cusparse/index.html
- Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
- Cong Guo; Fengchen Xue; Jingwen Leng; Yuxian Qiu, May 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad, 21 Aug 2024, Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication, https://arxiv.org/abs/2408.11988
- Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair, 3 Sep 2024, Global Optimizations & Lightweight Dynamic Logic for Concurrency, https://arxiv.org/abs/2409.02227 (Parallelizing GEMM at a granular level.)
- Pranav Dangi, Zhenyu Bai, Dhananjaya Wijerathne, Rohan Juneja, 2024, ZeD: A Generalized Accelerator for Variably Sparse Matrix Computations in ML, https://pranavdangi.github.io/papers/PACT24.pdf
- Pete Warden, April 20, 2015, Why GEMM is at the heart of deep learning, https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, Evan Shelhamer, 18 Dec 2014 (v3), cuDNN: Efficient Primitives for Deep Learning, https://arxiv.org/abs/1410.0759
- Stefan Hadjis, Firas Abuzaid, Ce Zhang, Christopher Ré, 26 May 2015 (v2), Caffe con Troll: Shallow Ideas to Speed Up Deep Learning, https://arxiv.org/abs/1504.04343
- Roman Dubtsov, Evarist Fomenko and Babak Hejazi, Feb 01, 2023, New cuBLAS 12.0 Features and Matrix Multiplication Performance on NVIDIA Hopper GPUs, https://developer.nvidia.com/blog/new-cublas-12-0-features-and-matrix-multiplication-performance-on-nvidia-hopper-gpus/
- NVIDIA, 2024, Matrix Multiplication Background User's Guide, https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html https://docs.nvidia.com/deeplearning/performance/pdf/Matrix-Multiplication-Background-User-Guide.pdf
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Andy He and Evan Williams, 2023, Computational Complexity of Matrix Multiplication, CS 6810 Fall 2023, Cornell University, https://courses.cs.cornell.edu/cs6810/2023fa/Matrix.pdf
- Adnan Hoque, Less Wright, Chih-Chieh Yang, Mudhakar Srivatsa, Raghu Ganti, 22 Feb 2024 (v2), Accelerating a Triton Fused Kernel for W4A16 Quantized Inference with SplitK work decomposition, https://arxiv.org/abs/2402.00025
- Anton Lokhmotov, 17 Nov 2015 (v2), GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication, https://arxiv.org/abs/1511.03742
- Davis Blalock, John Guttag, 21 Jun 2021, Multiplying Matrices Without Multiplying, https://arxiv.org/abs/2106.10860
- Y. Song, Y. Meng, B. Chen, S. Chen and Y. Kang, 2024, SALTM: Accelerating Large Transformers in Multi-device System with 2D Model Partitioning Method, Integrated Circuits and Systems, doi: 10.23919/ICS.2024.3458897, https://ieeexplore.ieee.org/abstract/document/10678935 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10678935
- Cong Li and Yutao Xu. 2024. Foreseer: Knowledge-Driven Acceleration of Memory-Bound Matrix Multiplications for Large Language Model Inference. In Proceedings of the 17th ACM International Systems and Storage Conference (SYSTOR '24). Association for Computing Machinery, New York, NY, USA, 53–67. https://doi.org/10.1145/3688351.3689153 https://dl.acm.org/doi/10.1145/3688351.3689153 PDF: https://dl.acm.org/doi/pdf/10.1145/3688351.3689153
- Bin Xiao, Lei Su, 4 Sep 2024, ISO: Overlap of Computation and Communication within Seqenence For LLM Inference, https://arxiv.org/abs/2409.11155
- Ganesh Bikshandi, Jay Shah, 19 Dec 2023, A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library, https://arxiv.org/abs/2312.11918 https://research.colfax-intl.com/nvidia-hopper-flashattention-2/
- Han Xu, Yutong Li, Shihao Ji, 12 Sep 2024, LlamaF: An Efficient Llama2 Architecture Accelerator on Embedded FPGAs, https://arxiv.org/abs/2409.11424 (Matrix multiplications are 97% of computations, which are optimized with a pipelined matrix-vector operation.)
- Li, Y., Dongarra, J., Tomov, S. (2009). A Note on Auto-tuning GEMM for GPUs. In: Allen, G., Nabrzyski, J., Seidel, E., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds) Computational Science – ICCS 2009. Lecture Notes in Computer Science, vol 5544. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01970-8_89 https://link.springer.com/chapter/10.1007/978-3-642-01970-8_89 PDF: https://link.springer.com/content/pdf/10.1007/978-3-642-01970-8_89.pdf
- Dominik Grewe and Anton Lokhmotov. 2011. Automatically Generating and Tuning GPU Code for Sparse Matrix-Vector Multiplication from a High-Level Representation. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units (Newport Beach, California, USA) (GPGPU-4). Association for Computing Machinery, New York, NY, USA, https://doi.org/10.1145/1964179.1964196
- Neda Seifi, Abdullah Al-Mamun, 2024, Optimizing Memory Access Efficiency in CUDA Kernel via Data Layout Technique, Journal of Computer and Communications, 2024, 12, 124-139, DOI: 10.4236/jcc.2024.125009, https://www.scirp.org/journal/paperinformation?paperid=133500 PDF: https://www.scirp.org/pdf/jcc2024125_91732699.pdf (Fast CUDA matrix multiplication using data locality of memory accesses, by using diagonal data access patterns for coalesced access.)
- Chendi Li, Yufan Xu, Sina Mahdipour Saravani, and Ponnuswamy Sadayappan. 2024. Accelerated Auto-Tuning of GPU Kernels for Tensor Computations. In Proceedings of the 38th ACM International Conference on Supercomputing (ICS '24). Association for Computing Machinery, New York, NY, USA, 549–561. https://doi.org/10.1145/3650200.3656626 https://dl.acm.org/doi/abs/10.1145/3650200.3656626 PDF: https://dl.acm.org/doi/pdf/10.1145/3650200.3656626
- Bagus Hanindhito and Lizy K. John. 2024. Accelerating ML Workloads using GPU Tensor Cores: The Good, the Bad, and the Ugly. In Proceedings of the 15th ACM/SPEC International Conference on Performance Engineering (ICPE '24). Association for Computing Machinery, New York, NY, USA, 178–189. https://doi.org/10.1145/3629526.3653835 https://dl.acm.org/doi/abs/10.1145/3629526.3653835 PDF: https://lca.ece.utexas.edu/pubs/Hanindhito_AcceleratingMLWorkloads.pdf
- C. Wang, P. Song, H. Zhao, F. Zhang, J. Wang and L. Zhang, "High-Utilization GPGPU Design for Accelerating GEMM Workloads: An Incremental Approach," 2024 IEEE International Symposium on Circuits and Systems (ISCAS), Singapore, Singapore, 2024, pp. 1-5, doi: 10.1109/ISCAS58744.2024.10558334. https://ieeexplore.ieee.org/abstract/document/10558334
- Cris Cecka, Feb 27, 2017, Pro Tip: cuBLAS Strided Batched Matrix Multiply, https://developer.nvidia.com/blog/cublas-strided-batched-matrix-multiply/
- Hansung Kim, Ruohan Yan, Joshua You, Tieliang Vamber Yang, Yakun Sophia Shao, 22 Aug 2024, Virgo: Cluster-level Matrix Unit Integration in GPUs for Scalability and Energy Efficiency, https://arxiv.org/abs/2408.12073 (Hardware optimization for GEMM operations.)
- Yuki Uchino, Katsuhisa Ozaki, Toshiyuki Imamura, 20 Sep 2024, Performance Enhancement of the Ozaki Scheme on Integer Matrix Multiplication Unit, https://arxiv.org/abs/2409.13313
- Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang, 26 Sep 2024 (v2), INT-FlashAttention: Enabling Flash Attention for INT8 Quantization, https://arxiv.org/abs/2409.16997 https://github.com/INT-FlashAttention2024/INT-FlashAttention
- Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang, 26 Sep 2024, Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores, https://arxiv.org/abs/2409.17870
- Stephen Jones, March 2024, CUDA: New Features and Beyond, GTC 2024, https://www.nvidia.com/en-us/on-demand/session/gtc24-s62400/
- Theo Gregersen, Pratyush Patel, Esha Choukse, 26 Sep 2024, Input-Dependent Power Usage in GPUs, https://arxiv.org/abs/2409.18324
- Enrique Galvez, Feb 2024, A study of Convolutions for Efficient Inference of Deep Neural Networks on Embedded Processors, Master's Thesis, France, University de Lyon, https://largo.lip6.fr/~cassagnea/docs/reports/Galvez2024%20-%20A%20Study%20of%20Convolutions%20for%20Efficient%20Inference%20of%20Deep%20Neural%20Networks%20on%20Embedded%20Processors.pdf
- Yuxiao Fan, 2024, Design and research of high-performance convolutional neural network accelerator based on Chipyard, ICAITA-2024, Journal of Physics: Conference Series 2858 (2024) 012001, IOP Publishing, doi:10.1088/1742-6596/2858/1/012001, https://iopscience.iop.org/article/10.1088/1742-6596/2858/1/012001/pdf (Convolution optimization on a RISC-V architecture.)
- Adrián Castelló, Héctor Martínez, Sandra Catalán, Francisco D. Igual, Enrique S. Quintana-Ortí, 2024, Experience-guided, Mixed-precision Matrix Multiplication with Apache TVM for ARM processors https://www.researchsquare.com/article/rs-5017241/v1 (Optimization of GEMM for sparse and low-bit quantized versions for TVM, with code examples.)
- Haiyue Ma, Jian Liu, Ronny Krashinsky, 10 Oct 2024, Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM, https://arxiv.org/abs/2410.07531
- Bettayeb, M., Halawani, Y., Khan, M.U. et al. Efficient memristor accelerator for transformer self-attention functionality. Sci Rep 14, 24173 (2024). https://doi.org/10.1038/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z.pdf
- Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
- Or Ordentlich, Yury Polyanskiy, 17 Oct 2024, Optimal Quantization for Matrix Multiplication, https://arxiv.org/abs/2410.13780
- Nitish Satya Murthy, Francky Catthoor, and Marian Verhelst. 2024. Optimization of block-scaled integer GeMMs for efficient DNN deployment on scalable in-order vector processors. J. Syst. Archit. 154, C (Sep 2024). https://doi.org/10.1016/j.sysarc.2024.103236 https://dl.acm.org/doi/abs/10.1016/j.sysarc.2024.103236
- Lingnan Xia, Hua Ma, Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10721583
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Hyeji Kim, Yeongmin Lee, Chun-Gi Lyuh, 28 October 2024, PF-GEMV: Utilization maximizing architecture in fast matrix–vector multiplication for GPT-2 inference, https://doi.org/10.4218/etrij.2024-0111 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2024-0111 https://onlinelibrary.wiley.com/doi/epdf/10.4218/etrij.2024-0111
- X. Shen et al., "HotaQ: Hardware Oriented Token Adaptive Quantization for Large Language Models," in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, doi: 10.1109/TCAD.2024.3487781. https://ieeexplore.ieee.org/abstract/document/10737419/
- Ulrich Drepper, October 23, 2007, Memory part 5: What programmers can do, https://lwn.net/Articles/255364/
- David Spuler, March 2024, Chapter 34. MatMul/GEMM, Book Excerpt from "Generative AI in C++", https://www.aussieai.com/book/ch34-matmul-gemm
- Siboehm, December 2022, How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog, https://siboehm.com/articles/22/CUDA-MMM
- Gray, S. 2014. A full walk through of the SGEMM implementation, https://github.com/NervanaSystems/maxas/wiki/SGEMM
- Lai, J., and Seznec, A. 2013. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. International Symposium on Code Generation and Optimization (CGO '13), 1–10. https://inria.hal.science/file/index/docid/789958/filename/112_Lai.pdf
- Andrew Lavin, Scott Gray, 10 Nov 2015 (v2), Fast Algorithms for Convolutional Neural Networks, https://arxiv.org/abs/1509.09308
- Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst, 24 Nov 2024, Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format, https://arxiv.org/abs/2411.15982
- Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton, 9 Aug 2024, Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge, https://arxiv.org/abs/2408.05152
- Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
- Character.AI, Nov 21, 2024, Optimizing AI Inference at Character.AI (Part Deux), https://research.character.ai/optimizing-ai-inference-at-character-ai-part-deux/ (Optimization techniques discussed include INT8, Flash attention 3, kernel fusion of KV dequantization and attention, MQA parallelization, producer-consumer CUDA warp specialization, fused matrix transpose, and more.)
Fast Matrix Multiplication Computation Theory
The main techniques for faster multiplication of general matrices include:
- Strassen's algorithm
- Winograd's algorithm
- Fast Fourier Transform (FFT) methods
Matrix multiplications can also be sped up by using specialized matrices:
- Low-rank matrix factorization (see the sketch after this list)
- Sparse matrices
- Special matrix methods (e.g. Butterfly matrices, Monarch matrices, etc.)
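As a concrete illustration of the low-rank factorization approach, if an n x n weight matrix W is well approximated by the product of two thin factors U (n x r) and V (r x n), with rank r much smaller than n, then a matrix-vector product W*x can be computed as U*(V*x) in roughly 2nr multiply-adds instead of n^2. The minimal C++ sketch below shows the idea; the function names are illustrative only and not taken from any particular library.

    #include <cstddef>
    #include <vector>

    // Multiply a (rows x cols) row-major matrix M by a vector x of length cols.
    static std::vector<float> matvec(const std::vector<float>& M,
                                     std::size_t rows, std::size_t cols,
                                     const std::vector<float>& x) {
        std::vector<float> y(rows, 0.0f);
        for (std::size_t i = 0; i < rows; ++i)
            for (std::size_t j = 0; j < cols; ++j)
                y[i] += M[i * cols + j] * x[j];
        return y;
    }

    // Low-rank product: W ~= U (n x r) * V (r x n), so W*x ~= U*(V*x).
    // Cost is roughly 2*n*r multiply-adds instead of n*n.
    std::vector<float> lowrank_matvec(const std::vector<float>& U,
                                      const std::vector<float>& V,
                                      std::size_t n, std::size_t r,
                                      const std::vector<float>& x) {
        std::vector<float> t = matvec(V, r, n, x);  // t = V*x, r values
        return matvec(U, n, r, t);                  // y = U*t, n values
    }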
General Matrix Multiplication Research
Papers on the low-level algorithms for optimizing matrix multiplication for general matrices (i.e. all matrices, not just a subtype):
- Bläser, M. Fast matrix multiplication. Theory Comput. 5, 1–60 (2013), https://www.semanticscholar.org/paper/Fast-Matrix-Multiplication-Bl%C3%A4ser/23443bee5576f05801fed6c1494517be2ad2e742
- Hopcroft, J. E. & Kerr, L. R., On minimizing the number of multiplications necessary for matrix multiplication. SIAM J. Appl. Math. 20, 30–36 (1971), https://epubs.siam.org/doi/10.1137/0120004
- Lim, L.-H., Tensors in computations. Acta Numer. 30, 555–764 (2021). https://arxiv.org/abs/2106.08090
- Pan, V. Y., Fast feasible and unfeasible matrix multiplication. Preprint, 2018, https://arxiv.org/abs/1804.04102
- Schönhage, A., Partial and total matrix multiplication. SIAM J. Comput. 10, 434–455 (1981), https://epubs.siam.org/doi/10.1137/0210032
- Le Gall, F. Powers of tensors and fast matrix multiplication. In International Symposium on Symbolic and Algebraic Computation 296–303 (ACM, 2014), https://arxiv.org/abs/1401.7714
- Strassen, V. The asymptotic spectrum of tensors and the exponent of matrix multiplication. In 27th Annual Symposium on Foundations of Computer Science 49–54 (IEEE, 1986), https://ieeexplore.ieee.org/document/4568194
- Coppersmith, D. & Winograd, S., Matrix multiplication via arithmetic progressions. In ACM Symposium on Theory of Computing 1–6 (ACM, 1987), https://dl.acm.org/doi/10.1145/28395.28396
- Alman, J. & Williams, V. V., A refined laser method and faster matrix multiplication. In ACM-SIAM Symposium on Discrete Algorithms 522–539 (SIAM, 2021), https://simons.berkeley.edu/news/refined-laser-method-faster-matrix-multiplication-breakthroughs
- Smirnov, A. V., The bilinear complexity and practical algorithms for matrix multiplication. Comput. Math. Math. Phys. 53, 1781–1795 (2013). PDF: https://cs.uwaterloo.ca/~eschost/Exam/Smirnov.pdf
- Heule, M. J., Kauers, M. & Seidl, M. New ways to multiply 3 × 3-matrices. J. Symb. Comput. 104, 899–916 (2021), https://arxiv.org/abs/1905.10192
- Sedoglavic, A., A non-commutative algorithm for multiplying (7 × 7) matrices using 250 multiplications. 2017, https://arxiv.org/abs/1712.07935
- Ye, K. & Lim, L.-H., Fast structured matrix computations: tensor rank and Cohn–Umans method. Found. Comput. Math. 18, 45–95 (2018). https://arxiv.org/abs/1601.00292
- Benson, A. R. & Ballard, G., A framework for practical parallel fast matrix multiplication. ACM SIGPLAN Not. 50, 42–53 (2015). https://arxiv.org/abs/1409.2908
- Huang, J., Smith, T. M., Henry, G. M. & Van De Geijn, R. A., Strassen’s algorithm reloaded. In International Conference for High Performance Computing, Networking, Storage and Analysis 690–701 (IEEE, 2016). https://ieeexplore.ieee.org/document/7877137
- Alexandre Sedoglavic, Yet another catalogue of fast matrix multiplication algorithms, Université de Lille, 2021, https://fmm.univ-lille.fr/, Bibliography: https://fmm.univ-lille.fr/biblio.html (An extensive bibliography with 280 papers on matrix multiplication!)
- Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021, doi:10.1109/CVPR.2016.435 https://arxiv.org/abs/1509.09308 (Uses Winograd's algorithm.)
- A Freudenberg, 2023, Novel techniques for accelerating statistical operations on compressed genomic data, Ph.D. Thesis, University of Mannheim, https://madoc.bib.uni-mannheim.de/65226/1/Dissertation_Freudenberg.pdf (Faster GPU-accelerated matrix multiplication algorithms in the context of genetics.)
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf
- Wang, H., Ma, C., 2021, An optimization of im2col, an important method of CNNs, based on continuous address access. 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE). pp. 314–320. IEEE (2021) https://ieeexplore.ieee.org/document/9342343
- Jörg Arndt, 2010, Matters Computational: Ideas, Algorithms, Source Code, https://dl.acm.org/doi/10.5555/1941953, https://www.jjj.de/fxt/fxtpage.html#fxtbook, Code: https://www.jjj.de/bitwizardry/bitwizardrypage.html
- Allen, F. E., and Cocke, J. 1972. A catalogue of optimizing transformations. In Design and Optimization of Compilers, Prentice-Hall, Englewood Cliffs, N.J., pp. 1–30. PDF: https://www.clear.rice.edu/comp512/Lectures/Papers/1971-allen-catalog.pdf (Matrix multiplication is used for examples of loop optimizations, if you like reading old-style code from the 1970s.)
- Arash Ashari, Shirish Tatikonda, Matthias Boehm, Berthold Reinwald, Keith Campbell, John Keenleyside, and P Sadayappan. 2015. On optimizing machine learning workloads via kernel fusion. ACM SIGPLAN Notices 50, 8 (2015), 173–182. https://dl.acm.org/doi/10.1145/2688500.2688521, PDF: https://mboehm7.github.io/resources/ppopp2015.pdf (Fused MatMul.)
- Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan-Kelley, and Kayvon Fatahalian. 2016. Automatically Scheduling Halide Image Processing Pipelines. ACM Trans. Graph. 35, 4, Article 83 (jul 2016), 11 pages. https://doi.org/10.1145/2897824.2925952, https://research.google/pubs/pub45525/, PDF: https://dl.acm.org/doi/pdf/10.1145/2897824.2925952 (GEMM optimization discussed.)
- Ashari, A., Tatikonda, S., Boehm, M., Reinwald, B., Campbell, K., Keenleyside, J., and Sadayappan, P., 2015, On optimizing machine learning workloads via kernel fusion. In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2015, pp. 173–182, New York, NY, USA. Association for Computing Machinery. ISBN 9781450332057. doi: 10.1145/2688500.2688521. https://dl.acm.org/doi/10.1145/2688500.2688521, PDF: https://mboehm7.github.io/resources/ppopp2015.pdf (Analysis of kernel fusion including for sparse and dense matrix multiplication.)
- Shiyao Xu; Jingfei Jiang; Jinwei Xu; Chaorun Liu; Yuanhong He; Xiaohang Liu, 2022, Sparkle: A High Efficient Sparse Matrix Multiplication Accelerator for Deep Learning, 2022 IEEE 40th International Conference on Computer Design (ICCD) https://ieeexplore.ieee.org/document/9978530
- Darshan C. Ganji, Saad Ashfaq, Ehsan Saboori, Sudhakar Sah, Saptarshi Mitra, MohammadHossein AskariHemmat, Alexander Hoffman, Ahmed Hassanien, Mathieu Léonardon, 18 Apr 2023, DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables, https://arxiv.org/abs/2304.09049
- T. Ogita, S. Rump, S. Oishi, 1 June 2005, Accurate Sum and Dot Product, SIAM J. Sci. Comput., DOI: 10.1137/030601818, Corpus ID: 10641044, PDF: http://www.oishi.info.waseda.ac.jp/~oishi/papers/OgRuOi05.pdf
- Gunnels JA, Henry GM, van de Geijn RA (2001) A family of high-performance matrix multiplication algorithms. Computational Science—ICCS 2001. Springer Berlin Heidelberg, pp 51–60. https://doi.org/10.1007/3-540-45545-0_15
- Sedoglavic, A. 2017, A non-commutative algorithm for multiplying 5x5 matrices using 99 multiplications. Preprint at https://arxiv.org/abs/1707.06860 (2017). https://arxiv.org/abs/1707.06860
- Burichenko, V. P., 2014, On symmetries of the Strassen algorithm. Preprint at https://arxiv.org/abs/1408.6273 (2014).
- Grochow, J. A. & Moore, C. 2017, Designing Strassen’s algorithm. Preprint at https://arxiv.org/abs/1708.09398 (2017). https://arxiv.org/abs/1708.09398
- Grochow, J. A. & Moore, C., 2016, Matrix multiplication algorithms from group orbits. Preprint at https://arxiv.org/abs/1612.01527 (2016).
- Elser, V., 2016, A network that learns Strassen multiplication. J. Mach. Learn. Res. 17, 3964–3976 (2016). https://arxiv.org/abs/1601.07227
- Tschannen, M., Khanna, A. & Anandkumar, A, 2018, StrassenNets: deep learning with a multiplication budget. In International Conference on Machine Learning 4985–4994 (PMLR, 2018). https://arxiv.org/abs/1712.03942
- Wagner, A. Z., 2021, Constructions in combinatorics via neural networks. Preprint at https://arxiv.org/abs/2104.14516 (2021). https://arxiv.org/abs/2104.14516
- Laderman, 1976, http://dx.doi.org/10.1090/S0002-9904-1976-13988-2
- DeepSpeed Team, Rangan Majumder, Andrey Proskurin, May 24, 2021, DeepSpeed: Accelerating large-scale model inference and training via system optimizations and compression, Microsoft Research Blog, https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/ (DeepSpeed uses various kernel fusion methods including for Softmax, LayerNorm, transpose, and GEMM.)
- Robert A. van de Geijn, Enrique S. Quintana-Ortí, December 2007, The Science of Programming Matrix Computations, First Edition, https://www.cs.utexas.edu/users/rvdg/tmp/TSoPMC.pdf
- Junki Park, Hyunsung Yoon, Daehyun Ahn, Jungwook Choi, and Jae-Joon Kim. 2020. OPTIMUS: OPTImized Matrix Multiplication Structure for Transformer neural network accelerator. Proceedings of Machine Learning and Systems 2 (2020), 363-378. https://proceedings.mlsys.org/paper_files/paper/2020/file/91ba7292e5388b90b58d0b839a7f19ec-Paper.pdf
- Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud, March 2021, Accelerating deep neural networks implementation: A survey, https://doi.org/10.1049/cdt2.12016 PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016
Strassen Matrix Multiplication
The Strassen method is based on a way to multiply 2x2 matrices using seven arithmetic multiplications rather than the usual eight. Applying this scheme recursively to 2x2 blocks of larger matrices yields a faster algorithm for general matrices, reducing the asymptotic complexity of matrix-matrix multiplication from O(n^3) to roughly O(n^2.81): the exponent drops from log base 2 of 8 (i.e., 3) to log base 2 of 7 (about 2.807).
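To make the 2x2 case concrete, the following minimal C++ sketch shows the seven Strassen products and the four output combinations. The entries are scalars here for simplicity; in a practical kernel they would be submatrix blocks, with the recursion falling back to an ordinary tiled GEMM below some cutoff size. This is an illustrative sketch of the textbook formulas, not production code.

    // One level of Strassen's 2x2 scheme: 7 multiplications instead of 8.
    // Entries are scalars here; in practice they are submatrix blocks.
    struct Mat2 { double a11, a12, a21, a22; };

    Mat2 strassen2x2(const Mat2& A, const Mat2& B) {
        double m1 = (A.a11 + A.a22) * (B.a11 + B.a22);
        double m2 = (A.a21 + A.a22) * B.a11;
        double m3 = A.a11 * (B.a12 - B.a22);
        double m4 = A.a22 * (B.a21 - B.a11);
        double m5 = (A.a11 + A.a12) * B.a22;
        double m6 = (A.a21 - A.a11) * (B.a11 + B.a12);
        double m7 = (A.a12 - A.a22) * (B.a21 + B.a22);
        Mat2 C;
        C.a11 = m1 + m4 - m5 + m7;  // = a11*b11 + a12*b21
        C.a12 = m3 + m5;            // = a11*b12 + a12*b22
        C.a21 = m2 + m4;            // = a21*b11 + a22*b21
        C.a22 = m1 - m2 + m3 + m6;  // = a21*b12 + a22*b22
        return C;
    }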
- Strassen, V. The asymptotic spectrum of tensors and the exponent of matrix multiplication. In 27th Annual Symposium on Foundations of Computer Science 49–54 (IEEE, 1986), https://ieeexplore.ieee.org/document/4568194
- Huang, J., Yu, C. D. & Geijn, R. A. V. D. Strassen’s algorithm reloaded on GPUs. ACM Trans. Math. Softw. 46, 1–22 (2020). https://dl.acm.org/doi/10.1145/3372419
- Tschannen, M., Khanna, A. & Anandkumar, A, StrassenNets: deep learning with a multiplication budget. In International Conference on Machine Learning 4985–4994 (PMLR, 2018). https://arxiv.org/abs/1712.03942
- Huang, J., Smith, T. M., Henry, G. M. & Van De Geijn, R. A., Strassen’s algorithm reloaded. In International Conference for High Performance Computing, Networking, Storage and Analysis 690–701 (IEEE, 2016). https://ieeexplore.ieee.org/document/7877137
- Burichenko, V. P., 2014, On symmetries of the Strassen algorithm. Preprint at https://arxiv.org/abs/1408.6273 (2014).
- Grochow, J. A. & Moore, C. 2017, Designing Strassen’s algorithm. Preprint at https://arxiv.org/abs/1708.09398 (2017). https://arxiv.org/abs/1708.09398
- Elser, V., 2016, A network that learns Strassen multiplication. J. Mach. Learn. Res. 17, 3964–3976 (2016). https://arxiv.org/abs/1601.07227
- Jianyu Huang, Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn, 3 May 2016, Implementing Strassen's Algorithm with BLIS, https://arxiv.org/abs/1605.01078
- Gianfranco Bilardi, Lorenzo De Stefani, 7 May 2016, The I/O complexity of Strassen's matrix multiplication with recomputation, https://arxiv.org/abs/1605.02224
- Christian Ikenmeyer, Vladimir Lysikov, 13 Feb 2019 (v2), Strassen's 2x2 matrix multiplication algorithm: A conceptual perspective, https://arxiv.org/abs/1708.08083
- Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn, 24 Aug 2018, Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs, https://arxiv.org/abs/1808.07984
- Chandan Misra, Sourangshu Bhattacharya, Soumya K. Ghosh, 22 Nov 2018 (v2), Stark: Fast and Scalable Strassen's Matrix Multiplication using Apache Spark, https://arxiv.org/abs/1811.07325
- Viviana Arrigoni, Annalisa Massini, 6 Feb 2019, Fast Strassen-based A^t A Parallel Multiplication, https://arxiv.org/abs/1902.02104
- Viviana Arrigoni, Filippo Maggioli, Annalisa Massini, Emanuele Rodolà, 25 Oct 2021, Efficiently Parallelizable Strassen-Based Multiplication of a Matrix by its Transpose, https://arxiv.org/abs/2110.13042
- Osman B. Guney, Suayb S. Arslan, 10 Oct 2022, Fault-Tolerant Strassen-Like Matrix Multiplication, https://arxiv.org/abs/2210.04441
- Paolo D'Alberto, 20 Dec 2023, Strassen's Matrix Multiplication Algorithm Is Still Faster, https://arxiv.org/abs/2312.12732
- Jean-Guillaume Dumas, Clément Pernet, Alexandre Sedoglavic, 28 Jun 2024 (v2), Strassen's algorithm is not optimally accurate, https://arxiv.org/abs/2402.05630
- Afzal Ahmad, Linfeng Du, Wei Zhang, 4 Jun 2024, Fast and Practical Strassen's Matrix Multiplication using FPGAs, https://arxiv.org/abs/2406.02088
- Tomonori Kouya, 7 Oct 2014, Accelerated Multiple Precision Matrix Multiplication using Strassen's Algorithm and Winograd's Variant, https://arxiv.org/abs/1410.1599
- Tomonori Kouya, 29 Oct 2015, Performance evaluation of multiple precision matrix multiplications using parallelized Strassen and Winograd algorithms https://arxiv.org/abs/1510.08642
- Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, Wei Zhou, 18 May 2009 (v5), Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm, https://arxiv.org/abs/0707.2347
- Jean-Guillaume Dumas, Victor Pan, 17 Dec 2016, Fast Matrix Multiplication and Symbolic Computation, https://arxiv.org/abs/1612.05766
Winograd Matrix Multiplication
Research papers on using the Winograd or the Coppersmith-Winograd algorithm:
- Coppersmith, D. & Winograd, S., Matrix multiplication via arithmetic progressions. In ACM Symposium on Theory of Computing 1–6 (ACM, 1987), https://dl.acm.org/doi/10.1145/28395.28396
- Lavin, A.; Gray, S. Fast Algorithms for Convolutional Neural Networks. Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 4013–4021, doi:10.1109/CVPR.2016.435 https://arxiv.org/abs/1509.09308 (Uses Winograd's algorithm.)
- Javier Fernandez-Marques, Paul N. Whatmough, Andrew Mundy, Matthew Mattina, 2020, Searching for winograd-aware quantized networks, Proceedings of the 3rd MLSys Conference, Austin, TX, USA, 2020. PDF: https://proceedings.mlsys.org/paper_files/paper/2020/file/678e209691cd37f145a5502695378bac-Paper.pdf
- Ruofan Wu, Feng Zhang, Jiawei Guan, Zhen Zheng, Xiaoyong Du, and Xipeng Shen. DREW: Efficient Winograd CNN Inference with Deep Reuse. In Proceedings of the ACM Web Conference 2022, pages 1807–1816, 2022. https://dl.acm.org/doi/10.1145/3485447.3511985 PDF: https://research.csc.ncsu.edu/picture/publications/papers/www2022.pdf (Combines the Winograd matrix multiplication algorithm with computation reuse.)
- F Zhang, R Wu, J Guan, Z Zheng, X Guo, 2023, Expanding the Edge: Enabling Efficient Winograd CNN Inference With Deep Reuse on Edge Device, IEEE Transactions on Knowledge and Data Engineering, Volume 35, Issue 10, 01 October 2023, https://ieeexplore.ieee.org/abstract/document/10106424/ (Further analysis of DREW, which combines Winograd optimizations with data reuse.)
- Zlateski, A., Jia, Z., Li, K., Durand, F., The anatomy of efficient fft and winograd convolutions on modern cpus. Proceedings of the ACM International Conference on Supercomputing (New York, NY, USA, 2019), ICS ’19, Association for Computing Machinery, p. 414–424. PDF: https://dl.acm.org/doi/pdf/10.1145/3330345.3330382
- Tomonori Kouya, 7 Oct 2014, Accelerated Multiple Precision Matrix Multiplication using Strassen's Algorithm and Winograd's Variant, https://arxiv.org/abs/1410.1599
- Tomonori Kouya, 29 Oct 2015, Performance evaluation of multiple precision matrix multiplications using parallelized Strassen and Winograd algorithms https://arxiv.org/abs/1510.08642
- François Le Gall, Florent Urrutia, 6 Nov 2017 (v2), Improved Rectangular Matrix Multiplication using Powers of the Coppersmith-Winograd Tensor, https://arxiv.org/abs/1708.05622
- Brice Boyer, Jean-Guillaume Dumas, Clément Pernet, Wei Zhou, 18 May 2009 (v5), Memory efficient scheduling of Strassen-Winograd's matrix multiplication algorithm, https://arxiv.org/abs/0707.2347
- Enrique Galvez, Feb 2024, A study of Convolutions for Efficient Inference of Deep Neural Networks on Embedded Processors, Master's Thesis, France, University de Lyon, https://largo.lip6.fr/~cassagnea/docs/reports/Galvez2024%20-%20A%20Study%20of%20Convolutions%20for%20Efficient%20Inference%20of%20Deep%20Neural%20Networks%20on%20Embedded%20Processors.pdf
- Protsenko, V., Kryzhanovskiy, V., Filippov, A. (2025). Quantization-Friendly Winograd Transformations for Convolutional Neural Networks. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15116. Springer, Cham. https://doi.org/10.1007/978-3-031-73636-0_11 https://link.springer.com/chapter/10.1007/978-3-031-73636-0_11
FFT Matrix Multiplication
Research papers using the Fast Fourier Transform (FFT) for matrix multiplication:
- Zlateski, A., Jia, Z., Li, K., Durand, F., The anatomy of efficient fft and winograd convolutions on modern cpus. Proceedings of the ACM International Conference on Supercomputing (New York, NY, USA, 2019), ICS ’19, Association for Computing Machinery, p. 414–424. PDF: https://dl.acm.org/doi/pdf/10.1145/3330345.3330382
- Alexandre Siron, Sean Molesky, 25 Jun 2024, A Split Fast Fourier Transform Algorithm for Block Toeplitz Matrix-Vector Multiplication, https://arxiv.org/abs/2406.17981
- Andrew Lavin, Scott Gray, 10 Nov 2015 (v2), Fast Algorithms for Convolutional Neural Networks, https://arxiv.org/abs/1509.09308
Approximate Matrix Multiplication
Approximate Matrix Multiplication (AMM) is a family of model optimization techniques that replace exact matrix multiplications with approximations, avoiding much of the cost of arithmetic multiplication at the price of some accuracy. These methods are usually distinct from quantization, are not restricted to particular subclasses of matrices, and draw on more advanced matrix theory.
Note that these algorithms operate at the high level of how matrices are multiplied with other matrices or with vectors (e.g., avoiding some vector dot products), whereas there are also low-level optimizations of the arithmetic operation of multiplying two numbers (see advanced mathematics). These two classes of research are not the same, and are largely orthogonal to each other.
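As one concrete member of the AMM family, the randomized Monte Carlo sampling algorithms of Drineas, Kannan, and Mahoney (cited in the list below) approximate C = A*B by summing only s of the K rank-one outer products A[:,k]*B[k,:], sampling column/row pairs with probability proportional to their norms and rescaling so that the estimate is unbiased. The rough C++ sketch below illustrates the idea under those assumptions; the function name and structure are ours, and edge cases (such as all-zero inputs) are ignored.

    #include <cmath>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Approximate C (n x m) = A (n x K) * B (K x m), row-major, by sampling s
    // of the K outer products A[:,k] * B[k,:] with probability p_k proportional
    // to ||A[:,k]|| * ||B[k,:]||, rescaling each sample by 1/(s * p_k).
    std::vector<double> approx_matmul(const std::vector<double>& A,
                                      const std::vector<double>& B,
                                      std::size_t n, std::size_t K, std::size_t m,
                                      std::size_t s, unsigned seed = 42) {
        // Sampling probabilities from column/row norms.
        std::vector<double> p(K, 0.0);
        double total = 0.0;
        for (std::size_t k = 0; k < K; ++k) {
            double na = 0.0, nb = 0.0;
            for (std::size_t i = 0; i < n; ++i) na += A[i * K + k] * A[i * K + k];
            for (std::size_t j = 0; j < m; ++j) nb += B[k * m + j] * B[k * m + j];
            p[k] = std::sqrt(na) * std::sqrt(nb);
            total += p[k];
        }
        for (std::size_t k = 0; k < K; ++k) p[k] /= total;

        std::mt19937 gen(seed);
        std::discrete_distribution<std::size_t> pick(p.begin(), p.end());

        std::vector<double> C(n * m, 0.0);
        for (std::size_t t = 0; t < s; ++t) {
            std::size_t k = pick(gen);
            double scale = 1.0 / (static_cast<double>(s) * p[k]);
            for (std::size_t i = 0; i < n; ++i)
                for (std::size_t j = 0; j < m; ++j)
                    C[i * m + j] += scale * A[i * K + k] * B[k * m + j];
        }
        return C;  // Unbiased estimate of A*B; variance shrinks as s grows.
    }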
Research papers on approximate matrix multiplication include:
- Davis Blalock, John Guttag, "Multiplying Matrices Without Multiplying", arXiv:2106.10860v1, June 2021, https://arxiv.org/abs/2106.10860
- Anthony Alford, "MIT Researchers Open-Source Approximate Matrix Multiplication Algorithm MADDNESS", InfoQ, October 5, 2021, https://www.infoq.com/news/2021/10/mit-matrix-multiplication/
- Anirban Dasgupta, Ravi Kumar, Tamás Sarlós, "A Sparse Johnson-Lindenstrauss Transform", In Proceedings of the forty-second ACM symposium on Theory of computing, pp.341–350, 2010, https://arxiv.org/abs/1004.4240
- Mairal, J., Bach, F., Ponce, J., and Sapiro, G., "Online dictionary learning for sparse coding". In Proceedings of the 26th annual international conference on machine learning, pp. 689–696, 2009.
- Brian Plancher, Camelia D. Brumar, Iulian Brumar, Lillian Pentecost, Saketh Rama, David Brooks, 2019, Application of Approximate Matrix Multiplication to Neural Networks and Distributed SLAM, 2019 IEEE High Performance Extreme Computing Conference (HPEC), DOI: 10.1109/HPEC.2019.8916468, https://ieeexplore.ieee.org/abstract/document/8916468 PDF: https://brianplancher.com/files/Applied_Approx_MM.pdf
- A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, “Matrix approximation and projective clustering via volume sampling,” in Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms. Society for Industrial and Applied Mathematics, 2006, pp.1117–1126, https://dl.acm.org/doi/10.5555/1109557.1109681
- P. Drineas, R. Kannan, and M. W. Mahoney, “Fast monte carlo algorithms for matrices I: Approximating matrix multiplication,” SIAM Journal on Computing, vol. 36, no. 1, pp. 132–157, 2006, https://epubs.siam.org/doi/10.1137/S0097539704442684, https://doi.org/10.1137/S0097539704442684
- P. Drineas, R. Kannan, and M. W. Mahoney, “Fast monte carlo algorithms for matrices ii: Computing a low-rank approximation to a matrix,” SIAM Journal on computing, vol. 36, no. 1, pp. 158–183, 2006, PDF: https://www.stat.berkeley.edu/~mmahoney/pubs/matrix2_SICOMP.pdf
- P. Drineas, R. Kannan, and M. W. Mahoney, “Fast monte carlo algorithms for matrices iii: Computing a compressed approximate matrix decomposition,” SIAM Journal on Computing, vol. 36, no. 1, pp. 184–206, 2006, PDF: https://www.stat.berkeley.edu/~mmahoney/pubs/matrix3_SICOMP.pdf
- A. S. Householder and G. Young, “Matrix approximation and latent roots,” The American Mathematical Monthly, vol. 45, no. 3, pp. 165–171, 1938, https://www.jstor.org/stable/2302980
- R. Pagh, “Compressed matrix multiplication,” ACM Transactions on Computation Theory (TOCT), vol. 5, no. 3, p. 9, 2013, https://arxiv.org/abs/1108.1320
- N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011, https://arxiv.org/abs/0909.4061
- T. Sarlos, “Improved approximation algorithms for large matrices via random projections,” in Foundations of Computer Science, 2006. FOCS’06. 47th Annual IEEE Symposium on. IEEE, 2006, pp. 143-152, https://ieeexplore.ieee.org/document/4031351
- G. H. Golub, A. Hoffman, and G. W. Stewart, “A generalization of the eckart-young-mirsky matrix approximation theorem,” Linear Algebra and its applications, vol. 88, pp. 317–327, 1987, https://www.sciencedirect.com/science/article/pii/0024379587901145
- A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha, “A generalized maximum entropy approach to bregman co-clustering and matrix approximation,” Journal of Machine Learning Research, vol. 8, no. Aug, pp. 1919–1986, 2007, https://dl.acm.org/doi/10.1145/1014052.1014111, PDF: https://www.cs.utexas.edu/users/inderjit/public_papers/kdd_bregman_coclustering.pdf
- M. Adelman and M. Silberstein, “Faster neural network training with approximate tensor operations,” arXiv preprint arXiv:1805.08079, 2018, https://arxiv.org/abs/1805.08079
- Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
- Zhanpeng Zeng, Karthikeyan Sankaralingam, Vikas Singh, 12 Mar 2024, IM-Unpack: Training and Inference with Arbitrarily Low Precision Integers, https://arxiv.org/abs/2403.07339v1 (Faster GEMM with low-bitwidth integer approximations.)
- Uzair Shah, Jens Schneider, Giovanni Pintore, Enrico Gobbetti, Mahmood Alzubaidi, Mowafa Househ, Marco Agus, 2024, EleViT: exploiting element-wise products for designing efficient and lightweight vision transformers, https://publications.crs4.it/pubdocs/2024/SSPGAHA24/cvprt4v2024-elevit.pdf
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- Piotr Indyk, Michael Kapralov, Kshiteej Sheth, Tal Wagner, 17 Jul 2024, Improved Algorithms for Kernel Matrix-Vector Multiplication, LCFM 2024, https://openreview.net/forum?id=7CCzyEtZXH https://openreview.net/pdf?id=7CCzyEtZXH
- Noah Amsel, Tyler Chen, Feyza Duman Keles, Diana Halikias, Cameron Musco, Christopher Musco, 26 Mar 2024 (v3), Fixed-sparsity matrix approximation from matrix-vector products, https://arxiv.org/abs/2402.09379
- SZ Lin, YC Chen, YH Chang, TW Kuo, HP Li, 2024, LUTIN: Efficient Neural Network Inference with Table Lookup, ISLPED ’24, August 5-7, 2024, Newport Beach, CA, USA, https://dl.acm.org/doi/pdf/10.1145/3665314.3670804
- Q. Deng, Y. Zhang, M. Zhang and J. Yang, "LAcc: Exploiting Lookup Table-based Fast and Accurate Vector Multiplication in DRAM-based CNN Accelerator," 2019 56th ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV, USA, 2019, pp. 1-6. https://ieeexplore.ieee.org/document/8806810 PDF: https://dl.acm.org/doi/pdf/10.1145/3316781.3317845
- Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, Dongsoo Lee, 31 Aug 2020 (v2), BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs, https://arxiv.org/abs/2005.09904
- Hongyaoxing Gu, 27 May 2024, LRAMM -- Low precision approximates GEMM via RSVD, https://arxiv.org/abs/2405.16917
- Yi Ren, Ruge Xu, Xinfei Guo, Weikang Qian, 27 Nov 2024, FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision Quantized DNNs--Down to 2 Bits! https://arxiv.org/abs/2411.18055
More AI Research
Read more about: