Aussie AI
AI Matrix Algebra
Last Updated 29 November, 2024
by David Spuler, Ph.D.
Neural networks run on matrices and vectors. The main primitive is matrix multiplication, which is built from large numbers of multiplications and additions; in fact, a matrix multiplication is really many vector dot product computations. Modern AI frameworks and research papers tend to use abbreviations for matrix multiplication, such as "MATMUL" operations and GEMM (General Matrix Multiplication).
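As a concrete illustration (a minimal sketch only, not how optimized GEMM kernels are actually written), every element of a matrix product is one vector dot product of a row against a column:

```python
import numpy as np

def dot(u, v):
    """Vector dot product: the innermost primitive of a MATMUL/GEMM."""
    return sum(x * y for x, y in zip(u, v))

def matmul(A, B):
    """Naive GEMM: element C[i][j] is the dot product of row i of A
    with column j of B, so an n x n MATMUL is n*n dot products."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[dot(A[i], [B[k][j] for k in range(inner)]) for j in range(cols)]
            for i in range(rows)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(A, B))         # [[19.0, 22.0], [43.0, 50.0]]
print(np.matmul(A, B))      # same result from the optimized library routine
```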
Various attempts to optimize models involve manipulating matrix algebra. Some of the research areas include:
- Matrix multiplication algorithms (e.g. Strassen, Winograd, FFT)
- Alternative and advanced types of matrices (e.g. Butterfly, Monarch)
- Advanced matrix algebra
- Approximate matrix multiplication algorithms
- Vector dot product optimizations
- Matrix/tensor decomposition (factorizing matrices into smaller sub-matrices)
- Low-rank matrices (smaller factors of matrices)
The individual multiplication operations between numbers inside a matrix can also be optimized:
- Alternatives to arithmetic multiplication (e.g. bitshifting, logarithmic arithmetic, adder-based and lookup-table schemes; see the sketch after this list)
- Faster algorithms for scalar arithmetic multiplication (mostly hardware acceleration algorithms for chip designers)
- Approximate scalar arithmetic multiplication (faster ways to multiply two numbers by allowing errors)
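As a toy example of the first idea (a sketch only; practical power-of-two and logarithmic quantization schemes are more involved), a weight that is constrained to be a power of two can be multiplied using an integer bitshift instead of a multiply instruction:

```python
def shift_mul(x: int, weight_log2: int) -> int:
    """Multiply x by the power-of-two weight 2**weight_log2 with a bitshift."""
    return x << weight_log2

def shift_dot(xs, weight_logs):
    """Dot product where every weight is a power of two, so all the
    multiplications become shifts and only additions remain."""
    return sum(shift_mul(x, w) for x, w in zip(xs, weight_logs))

# Weights 4, 2, 8 stored as exponents 2, 1, 3: 3*4 + 5*2 + 7*8 = 78
print(shift_dot([3, 5, 7], [2, 1, 3]))   # 78
```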
Matrix Algebra
Matrix algebra underpins most of AI model inference (and training). The term "tensor" in AI theory mainly refers to multi-dimensional matrices, and the multiplication algorithm is the standard one you learned in high school.
There are active areas of research that use matrix algebra in different ways to reduce the total number of multiplication operations. Some of the types of matrices studied include:
- Structured matrices (the general class)
- Submatrices
- Low-rank matrices
- Sparse matrix optimizations
- Monarch matrices
- Butterfly matrices
The theory of matrix algebra can be applied to neural networks, since matrix multiplication is at the core of inference and training. Model inference involves multiplying weight matrices against vectors of activations. Each such matrix-vector multiplication multiplies every row of the matrix against the vector, which is a vector dot product of two vectors. Underneath all of those vector iterations are the low-level multiplication operations on pairs of numbers, usually floating-point, but possibly integers in quantized models that use integer-only arithmetic.
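For example, here is a simplified sketch of the common int8 quantization pattern (not the kernel of any particular framework): the dot product accumulates integer products in a wide accumulator and applies floating-point scale factors only once at the end:

```python
def int8_dot(x_q, w_q, x_scale, w_scale):
    """Dot product of two int8-quantized vectors: integer multiply-adds
    into a wide accumulator, then a single float rescale at the end."""
    acc = 0                                   # stands in for an int32 accumulator
    for xq, wq in zip(x_q, w_q):
        acc += int(xq) * int(wq)
    return acc * x_scale * w_scale            # dequantize the integer result

# Quantized values approximating x = [0.5, -1.0] and w = [2.0, 3.0]
print(int8_dot([64, -128], [64, 96], 1 / 128, 1 / 32))   # -2.0 = 0.5*2.0 + (-1.0)*3.0
```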
General Research on Matrix Algebra Theory
Some of the papers on theory of matrix algebra include:
- Paul Cull, A Matrix Algebra for Neural Nets, https://link.springer.com/chapter/10.1007/978-1-4757-0555-3_43
- Xian-Da Zhang, A Matrix Algebra Approach to Artificial Intelligence, January 2020, DOI:10.1007/978-981-15-2770-8, ISBN: 978-981-15-2769-2, https://www.researchgate.net/publication/341581565_A_Matrix_Algebra_Approach_to_Artificial_Intelligence
- Hillar, C. J. & Lim, L.-H., Most tensor problems are NP-hard. J. ACM 60, 1–39 (2013) https://arxiv.org/abs/0911.1393
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, Google Research, https://arxiv.org/abs/2211.05102
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072 Code: https://github.com/spcl/substation
- C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian and B. Yuan, "PermDNN: Efficient compressed DNN architecture with permuted diagonal matrices", Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), pp. 189-202, Oct. 2018. https://arxiv.org/abs/2004.10936
- Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson, 10 Jun 2024, Compute Better Spent: Replacing Dense Layers with Structured Matrices, https://arxiv.org/abs/2406.06248
- Haque, S.A.; Choudhury, N.; Hossain, S. Matrix Multiplication with Diagonals: Structured Sparse Matrices and Beyond. In Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, Jinan, China, 17–19 June 2023; pp. 69–76. https://doi.org/10.1145/3606043.3606053
- Sardar Anisul Haque,Mohammad Tanvir Parvez, Shahadat Hossain, Jan 2024, GPU Algorithms for Structured Sparse Matrix Multiplication with Diagonal Storage Schemes, https://www.mdpi.com/1999-4893/17/1/31
- Josh Alman, Zhao Song, 6 Oct 2023, How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation, https://arxiv.org/abs/2310.04064
- Ishna Satyarth, Chao Yin, RuQing G. Xu, Devin A. Matthews, 15 Nov 2024, Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures, https://arxiv.org/abs/2411.09859
Advanced Types of Matrices
The simplest ideas are that smaller matrices (lower-rank) and sparse matrices can require less computation. Various advanced subtypes of matrices with specific constraints have properties that allow a reduced number of multiplications in evaluating matrix multiplication or matrix-vector multiplication.
Lower-Rank Submatrices: There are many theories that typical models are "over-parameterized", meaning their matrices are bigger than they need to be. The number of weights is quadratic in the dimensions of a matrix, so reducing to smaller, lower-rank matrices improves performance. There is much theory about finding the "smaller models" inside the "big models" whose accuracy is not much worse than that of the larger matrices. This is sometimes called "parameter factorization" of models. The non-mathematical technique of "distillation" is also related, as it finds a smaller model with similar accuracy. See Low-rank matrix factorization research.
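For instance (a sketch assuming the weight matrix has already been factorized), replacing an n x n weight matrix W with two low-rank factors A (n x r) and B (r x n) cuts a matrix-vector product from n*n down to 2*n*r multiplications:

```python
import numpy as np

n, r = 1024, 32
rng = np.random.default_rng(0)
A = rng.standard_normal((n, r))     # low-rank factors: W is represented as A @ B
B = rng.standard_normal((r, n))
x = rng.standard_normal(n)

y_dense = (A @ B) @ x               # as a dense n x n matrix: ~1,048,576 multiplications
y_factored = A @ (B @ x)            # factored form: 2*n*r = 65,536 multiplications
print(np.allclose(y_dense, y_factored))   # True: same result, ~16x fewer multiplies
```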
Sparse Matrices: The idea with sparse matrices is that if they have a lot of zero weights, then the inference algorithm is doing a lot of unnecessary multiplications by zero. Various algorithms aim to focus on only doing multiplication in the submatrices that have data. Sparsity is closely related to model pruning, and these matrix sparsity techniques can also be amplified using dynamic pruning of near-zero weights, to further increase the total number of zeros in the matrix. See Sparse matrices and sparsification.
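A minimal sketch of the mechanism (hand-rolled CSR storage; production sparse kernels are vectorized and hardware-aware): store only the nonzero weights and their column indices, so the per-row dot products skip every multiplication by zero:

```python
import numpy as np

def to_csr(W):
    """Convert a dense matrix to CSR form: nonzero values, their column
    indices, and the offset where each row starts."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product: only the nonzero weights are multiplied."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W[np.abs(W) < 1.5] = 0.0            # prune near-zero weights, leaving mostly zeros
x = rng.standard_normal(64)
print(np.allclose(W @ x, csr_matvec(*to_csr(W), x)))   # True
```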
Butterfly Matrices: These special matrices are one approach to using matrix algebra to achieve sparsity. Research includes the papers below; a minimal code sketch follows the list.
- Beidi Chen, Tri Dao and Chris Ré, Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models, Jan 17, 2022, https://hazyresearch.stanford.edu/blog/2022-01-17-Sparsity-3-Pixelated-Butterfly
- Keivan Alizadeh Vahid, Anish Prabhu, Ali Farhadi, Mohammad Rastegari Apr 2020, Butterfly Transform: An Efficient FFT Based Neural Architecture Design, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://ieeexplore.ieee.org/abstract/document/9157695/, https://arxiv.org/abs/1906.02256
- L Zheng, G Puy, E Riccietti, P Pérez, R Gribonval, Butterfly factorization by algorithmic identification of rank-one blocks, July 2023, https://arxiv.org/abs/2307.00820
- L Zheng, E Riccietti, R Gribonval, 2023, Efficient Identification of Butterfly Sparse Matrix Factorizations, SIAM Journal on Mathematics of Data ScienceVol. 5, Iss. 1 (2023), DOI 10.1137/22M1488727, https://epubs.siam.org/doi/abs/10.1137/22M1488727, https://arxiv.org/abs/2110.01230
- J Peca-Medlin, T Trogdon, 2023, Growth factors of random butterfly matrices and the stability of avoiding pivoting, SIAM Journal on Matrix Analysis and Applications, 2023, https://epubs.siam.org/doi/abs/10.1137/22M148762X, https://arxiv.org/abs/2203.15921
- D. Stott Parker, 1995, "Random butterfly transformations with applications in computational linear algebra", Technical report, UCLA Computer Science Dept, 1995. https://searchworks.stanford.edu/view/4640257
- D. Stott Parker, 1995, A randomizing butterfly transformation useful in block matrix computations, Technical report, UCLA Computer Science Dept. https://searchworks.stanford.edu/view/4640258
- Tatsuya Member and Member Kazuyoshi, "Bidirectional learning for neural network having butterfly structure", Systems and Computers in Japan, vol. 26, pp. 64-73, 04 1995. https://onlinelibrary.wiley.com/doi/abs/10.1002/scj.4690260407
- Li Yingzhou, Yang Haizhao, R. Martin Eileen, L. Ho Kenneth and Ying Lexing, "Butterfly factorization", Multiscale Modeling Simulation, vol. 13, pp. 714-732, 2015. https://arxiv.org/abs/1502.01379
- Dao Tri, Gu Albert, Eichhorn Matthew, Rudra Atri and Ré Christopher, Learning fast algorithms for linear transforms using butterfly factorizations, 2019. http://proceedings.mlr.press/v97/dao19a.html
- Y Li, X Cheng, J Lu, Apr 2020, Butterfly-Net: Optimal function representation based on convolutional neural networks, arXiv preprint arXiv:1805.07451, https://arxiv.org/abs/1805.07451
- Rui Lin, Jie Ran, King Hung Chiu, Graziano Chesi, Ngai Wong, Mar 2022, Deformable butterfly: A highly structured and sparse linear transform NeurIPS 2021, https://arxiv.org/abs/2203.13556, https://proceedings.neurips.cc/paper/2021/file/86b122d4358357d834a87ce618a55de0-Paper.pdf
- N Ailon, O Leibovitch, V Nair, July 2021, Sparse linear networks with a fixed butterfly structure: theory and practice, Uncertainty in Artificial Intelligence, UAI 2021, https://arxiv.org/abs/2007.08864 https://proceedings.mlr.press/v161/ailon21a/ailon21a.pdf
- Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré, Feb 2023, Simple Hardware-Efficient Long Convolutions for Sequence Modeling, https://arxiv.org/abs/2302.06646 (FlashButterfly algorithm.)
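The sketch referred to above (an illustrative radix-2 construction, not the exact parameterization of any one paper): a butterfly matrix is applied as log2(n) sparse stages, each mixing pairs of entries with independent 2x2 blocks, for O(n log n) multiplications instead of O(n^2):

```python
import numpy as np

def butterfly_stage(x, blocks, stride):
    """One butterfly stage: each index i is mixed with its partner
    i ^ stride by its own 2x2 block (n/2 blocks per stage)."""
    y = np.empty_like(x)
    pair = 0
    for i in range(len(x)):
        j = i ^ stride
        if i < j:
            a, b, c, d = blocks[pair].ravel()
            y[i] = a * x[i] + b * x[j]
            y[j] = c * x[i] + d * x[j]
            pair += 1
    return y

def butterfly_multiply(x, all_blocks):
    """Multiply x by a butterfly matrix given as log2(n) sparse factors:
    O(n log n) multiplications versus O(n^2) for a dense matrix."""
    y = x.astype(np.float64).copy()
    for s, blocks in enumerate(all_blocks):      # stride doubles each stage
        y = butterfly_stage(y, blocks, stride=1 << s)
    return y

# Example: n = 8 gives 3 stages, each with 4 random 2x2 blocks
n = 8
rng = np.random.default_rng(0)
stages = [rng.standard_normal((n // 2, 2, 2)) for _ in range(int(np.log2(n)))]
print(butterfly_multiply(rng.standard_normal(n), stages))
```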
Monarch Matrices: Monarch matrices are a superset of butterfly matrices, named after the orange-winged monarch butterfly. These special types of matrices exploit operations on block submatrices to reduce the overall computational cost of matrix multiplication.
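A minimal sketch of the block-diagonal idea (a Monarch-style two-factor product with a fixed permutation between the factors; the exact parameterization in the papers below differs): with sqrt(n) blocks of size sqrt(n), a matrix-vector product needs about n*sqrt(n) multiplications instead of n^2:

```python
import numpy as np

def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (stacked b x b blocks) by x."""
    b = blocks.shape[1]
    segs = x.reshape(-1, b)                      # one segment of x per block
    return np.einsum('kij,kj->ki', blocks, segs).reshape(-1)

def monarch_style_matvec(L_blocks, R_blocks, x):
    """Apply M = P^T L P R to x, where L and R are block-diagonal and P is
    the fixed permutation that transposes the sqrt(n) x sqrt(n) index grid."""
    b = int(np.sqrt(len(x)))
    y = block_diag_matvec(R_blocks, x)
    y = y.reshape(b, b).T.reshape(-1)            # permutation P
    y = block_diag_matvec(L_blocks, y)
    return y.reshape(b, b).T.reshape(-1)         # permutation P^T

# Example: n = 16, so each factor is four 4x4 blocks (128 multiplies, not 256)
rng = np.random.default_rng(1)
b = 4
L = rng.standard_normal((b, b, b))
R = rng.standard_normal((b, b, b))
print(monarch_style_matvec(L, R, rng.standard_normal(b * b)))
```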
- Dan Fu, Simran Arora, Chris Ré, Monarch Mixer: Revisiting BERT, Without Attention or MLPs, Jul 25, 2023, https://hazyresearch.stanford.edu/blog/2023-07-25-m2-bert (Monarch Matrices) (Code: https://github.com/HazyResearch/m2)
- Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré, Apr 2022, Monarch: Expressive Structured Matrices for Efficient and Accurate Training, Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. https://arxiv.org/abs/2204.00595 https://proceedings.mlr.press/v162/dao22a/dao22a.pdf
- Sunil Babu Melingi, Ramesh Kumar Mojjada, C. Tamizhselvan, R. Surender, S. Yazhinian., 2022, A self-adaptive monarch butterfly optimization (MBO) algorithm based improved deep forest neural network model for detecting and classifying brain stroke lesions, Research on Biomedical Engineering volume 38, pages647–660 (2022), https://link.springer.com/article/10.1007/s42600-022-00214-2
- Dan Fu, Simran Arora, Chris Ré, Jul 25, 2023, Monarch Mixer: A new model architecture for increased efficiency, Together AI blog, https://together.ai/blog/monarch-mixer, Code: https://github.com/HazyResearch/m2 (An implementation by Together AI of Stanford's Hazy Research AI engines using Monarch matrices.)
- Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson, 10 Jun 2024, Compute Better Spent: Replacing Dense Layers with Structured Matrices, https://arxiv.org/abs/2406.06248
Matrix/Tensor Factorization (Decomposition)
Methods that factorize or decompose larger matrices into smaller matrices (see low-rank matrices for optimization), or into the specific subtypes of matrices above, require special algorithms. Two of the main theoretical decomposition algorithms for tensors are CANDECOMP/PARAFAC (CP) decomposition and Tucker decomposition.
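As a small illustration of the CP form (a sketch only; actually fitting the factor matrices requires an algorithm such as alternating least squares), a rank-r CP decomposition represents a 3-way tensor as a sum of r outer products, stored as three small factor matrices:

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Rebuild a 3-way tensor from its CP factor matrices:
    T[i,j,k] = sum over r of A[i,r] * B[j,r] * C[k,r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# A rank-2 tensor of shape (4, 5, 6): stored as 2*(4+5+6) = 30 numbers
# in factored form, versus 4*5*6 = 120 numbers for the full tensor.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((6, 2))
print(cp_reconstruct(A, B, C).shape)   # (4, 5, 6)
```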
- Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer (2009), https://ieeexplore.ieee.org/abstract/document/5197422
- Defu Lian, Rui Liu, Yong Ge, Kai Zheng, Xing Xie, and Longbing Cao. 2017. Discrete content-aware matrix factorization. In SIGKDD. 325–334, https://dl.acm.org/doi/10.1145/3097983.3098008
- Andriy Mnih and Russ R Salakhutdinov. 2007. Probabilistic matrix factorization. NIPS 20 (2007), 1257–1264, https://dl.acm.org/doi/10.5555/2981562.2981720
- Carroll, J Douglas and Chang, Jih-Jie. Analysis of individual differences in multidimensional scaling via an n-way generalization of eckart-young decomposition. Psychometrika, 35(3):283–319, 1970. https://link.springer.com/article/10.1007/BF02310791 (CANDECOMP/PARAFAC decomposition/factorization.)
- Harshman, Richard A and Lundy, Margaret E. PARAFAC: Parallel factor analysis. Computational Statistics & Data Analysis, 18(1):39–72, 1994 https://www.sciencedirect.com/science/article/abs/pii/0167947394901325 (The P in CP decomposition/factorization.)
- Shashua, Amnon and Hazan, Tamir. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd international conference on Machine learning, pp. 792–799. ACM, 2005. PDF: https://icml.cc/imls/conferences/2005/proceedings/papers/100_NonNegative_ShashuaHazan.pdf (CANDECOMP/PARAFAC decomposition/factorization.)
- Tucker, Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3): 279–311, 1966 https://link.springer.com/article/10.1007/BF02289464 (Tucker decomposition.)
- De Lathauwer, Lieven, De Moor, Bart, and Vandewalle, Joos. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000. https://epubs.siam.org/doi/10.1137/S0895479896305696 (Tucker decomposition.)
- Kim, Y.-D. and Choi, S. Nonnegative Tucker decomposition. In Proceedings of the IEEE CVPR 2007 Workshop on Component Analysis Methods, Minneapolis, Minnesota, 2007. https://ieeexplore.ieee.org/document/4270403
- H Li, J Choi, Y Kwon, JH Ahn, Oct 2023, A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models, IEEE Computer Architecture Letters, https://ieeexplore.ieee.org/abstract/document/10285300/ (Tiled version of matrix multiplication with SVD factorization.)
- Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
- Chakshu Moar, 2024, Compressing Language Models using Low-Rank Decomposition and Characterizing the Accuracy- Efficiency Trade-offs, Master of Science Thesis, Electrical and Computer Engineering, University of California, Irvine, USA, https://escholarship.org/content/qt0t6967h4/qt0t6967h4.pdf
- Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen, 21 May 2024, Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression, https://arxiv.org/abs/2405.12591
- Chakshu Moar, Michael Pellauer, Hyoukjun Kwon, 10 May 2024, Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models, https://arxiv.org/abs/2405.06626
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- V. Abronin, A. Naumov, D. Mazur, D. Bystrov, K. Tsarova, Ar. Melnikov, I. Oseledets, S. Dolgov, R. Brasher, M. Perelshtein, 29 Jan 2024, TQCompressor: improving tensor decomposition methods in neural networks via permutations, https://arxiv.org/abs/2401.16367 Code: https://huggingface.co/tq-ag/TQCompressedGPT2 Code: https://github.com/terra-quantum-public/TQCompressedGPT2 (A permutation-based enhancement to the Kronecker decomposition method.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim, Nov 2023, LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, https://arxiv.org/abs/2311.12023
- Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, Guangyu Sun, Dec 2023, ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models, https://arxiv.org/abs/2312.05821 Code: https://github.com/hahnyuan/ASVD4LLM
- R Gribonval, T Mary, E Riccietti, 2023, Optimal quantization of rank-one matrices in floating-point arithmetic---with applications to butterfly factorizations https://inria.hal.science/hal-04125381/file/rank1_quant.pdf
- H Fan, T Chau, SI Venieris, R Lee, 2022, Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design https://ieeexplore.ieee.org/abstract/document/9923888/ https://arxiv.org/pdf/2209.09570
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019, https://arxiv.org/abs/1909.11942 Code: https://github.com/google-research/ALBERT
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- Jungi Lee, Wonbeom Lee, Jaewoong Sim, 16 Jun 2024, Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization, https://arxiv.org/abs/2406.12930 (Combining tensor decomposition and quantization with power-of-two scale factors.)
- Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
- Mirko Farina, Usman Ahmad, Ahmad Taha, Hussein Younes, Yusuf Mesbah, Xiao Yu, Witold Pedrycz, 2024, Sparsity in transformers: A systematic literature review, Neurocomputing, Volume 582, 14 May 2024, 127468, https://www.sciencedirect.com/science/article/abs/pii/S092523122400239X (General survey of sparsity methods, and techniques that create sparsity.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Zeyu Zhang, Haiying Shen, 7 Aug 2024, Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference, https://arxiv.org/abs/2408.04107
- Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin, 8 May 2024, Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, https://arxiv.org/abs/2405.05219 (Attention optimization using multiple low-rank matrices.)
- Sasindu Wijeratne, Rajgopal Kannan, Viktor Prasanna, 14 May 2024, Sparse MTTKRP Acceleration for Tensor Decomposition on GPU, https://arxiv.org/abs/2405.08470
- Tugba Torun, Eren Yenigul, Ameer Taweel, Didem Unat, 8 May 2024, A Sparse Tensor Generator with Efficient Feature Extraction, https://arxiv.org/abs/2405.04944 https://github.com/sparcityeu/feaTen https://github.com/sparcityeu/genTen
- Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna, 31 Mar 2024 (v2), Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition, https://arxiv.org/abs/2403.07953
- Jan Laukemann, Ahmed E. Helal, S. Isaac Geronimo Anderson, Fabio Checconi, Yongseok Soh, Jesmin Jahan Tithi, Teresa Ranadive, Brian J Gravelle, Fabrizio Petrini, Jee Choi, 11 Mar 2024, Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation, https://arxiv.org/abs/2403.06348
- Nishant Yadav, May 2024, Efficient K-Nearest Neighbor Search With Black-Box Neural Similarity Functions, Ph.D. Dissertation, Manning College of Information and Computer Sciences, University of Massachusetts Amherst, https://scholarworks.umass.edu/bitstreams/5572c0b9-cd96-46d8-983d-f877c9d0e22e/download
- Hongyaoxing Gu, 27 May 2024, LRAMM -- Low precision approximates GEMM via RSVD, https://arxiv.org/abs/2405.16917
- Ignacio Hounie, Charilaos Kanatsoulis, Arnuv Tandon, Alejandro Ribeiro, 5 Oct 2024, LoRTA: Low Rank Tensor Adaptation of Large Language Models, https://arxiv.org/abs/2410.04060
- Haoran Guan, Yuwei Fan, 9 Oct 2024, CholeskyQR for sparse matrices, https://arxiv.org/abs/2410.06525
- D.Breen, Oct 2024, Towards Sustainable CNNs: Tensor Decompositions for Green AI Solutions: Exploring Energy Consumption of Large CNNs, Master's Thesis, Systems and Control & Robotics, Delft University of Technology, https://repository.tudelft.nl/file/File_8208301f-51ef-4edf-bd12-d6ec3d5a8711
- Yubin Qin, Yang Wang, Zhiren Zhao, Xiaolong Yang, Yang Zhou, Shaojun Wei, Yang Hu, Shouyi Yin, 2024, MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition, 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), Year: 2024, Pages: 1032-1047, DOI Bookmark: 10.1109/ISCA59077.2024.00079, https://www.computer.org/csdl/proceedings-article/isca/2024/265800b032/1Z3pCEBnapO
- Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
- Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12 https://link.springer.com/chapter/10.1007/978-3-031-71253-1_12 (Speed up matrix operations with SVD and NMF via adaptive block sizing based on batching.)
- Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
- Ishna Satyarth, Chao Yin, RuQing G. Xu, Devin A. Matthews, 15 Nov 2024, Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures, https://arxiv.org/abs/2411.09859
Low-Rank Matrix Factorization
Matrix factorization (decomposition) can be used to find low-rank matrices that approximate the larger sets of weights.
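A minimal sketch using NumPy's SVD (published methods such as activation-aware or weighted variants refine this basic recipe): truncate the SVD of a weight matrix at rank r to obtain two smaller factors:

```python
import numpy as np

def low_rank_factors(W, r):
    """Best rank-r approximation of W (in the least-squares sense) via
    truncated SVD, returned as two factors A (m x r) and B (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]        # fold the singular values into the left factor
    B = Vt[:r, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))      # random weights compress poorly; trained
A, B = low_rank_factors(W, 32)           # weight matrices tend to have faster-decaying spectra
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(A.shape, B.shape, f"relative error {err:.3f}")
```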
- Genta Indra Winata, Andrea Madotto, Jamin Shin, Elham J Barezi, and Pascale Fung. 2019. On the Effectiveness of Low-Rank Matrix Factorization for LSTM Model Compression. arXiv preprint arXiv:1908.09982. https://arxiv.org/abs/1908.09982
- Ashish Khetan and Zohar Karnin. “schubert: Optimizing elements of bert”. arXiv preprint arXiv:2005.06628 (2020) https://arxiv.org/abs/2005.06628
- Jian Xue, Jinyu Li, and Yifan Gong. 2013. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pages 2365–2369. https://www.academia.edu/72568360/Restructuring_of_deep_neural_network_acoustic_models_with_singular_value_decomposition
- Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. 2018. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, pages 10988–10998. https://arxiv.org/abs/1806.06950
- Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. 2017. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7370–7379. PDF: https://openaccess.thecvf.com/content_cvpr_2017/papers/Yu_On_Compressing_Deep_CVPR_2017_paper.pdf
- Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth. “Pruning Redundant Mappings in Transformer Models via SpectralNormalized Identity Prior”. arXiv preprint arXiv:2010.01791 (2020) https://arxiv.org/abs/2010.01791
- Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. https://arxiv.org/abs/2303.10512, Code: https://github.com/QingruZhang/AdaLoRA
- Arnav Chavan, Zhuang Liu, Deepak K. Gupta, Eric P. Xing, and Zhiqiang Shen. One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning. CoRR, abs/2306.07967, 2023. https://arxiv.org/abs/2306.07967
- Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation, In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 3266–3279. Association for Computational Linguistics, 2023. https://arxiv.org/abs/2210.07558
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. https://arxiv.org/abs/2106.09685
- Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur. Semi-orthogonal low-rank matrix factorization for deep neural networks. In B. Yegnanarayana, editor, Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 3743– 3747. ISCA, 2018. PDF: https://www.isca-speech.org/archive/pdfs/interspeech_2018/povey18_interspeech.pdf
- Yerlan Idelbayev; Miguel Á. Carreira-Perpiñán, Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8046–8056. Computer Vision Foundation / IEEE, 2020. https://ieeexplore.ieee.org/document/9157223/
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches including low-rank matrices.)
Singular Value Decomposition (SVD)
SVD is a specific type of matrix decomposition that factors a matrix into singular vectors and singular values; truncating the smallest singular values yields a low-rank approximation. Research papers on SVD include:
- Zeyu Zhang, Haiying Shen, 7 Aug 2024, Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference, https://arxiv.org/abs/2408.04107
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
- Hongyaoxing Gu, 27 May 2024, LRAMM -- Low precision approximates GEMM via RSVD, https://arxiv.org/abs/2405.16917
- Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao 12 Aug 2024 (v3), A Survey on LoRA of Large Language Models, https://arxiv.org/abs/2407.11046 https://github.com/ZJU-LLMs/Awesome-LoRAs.git
- Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12 https://link.springer.com/chapter/10.1007/978-3-031-71253-1_12 (Speed up matrix operations with SVD and NMF via adaptive block sizing based on batching.)
- Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
- Shengwen Ding, Chenhui Hu, 24 Nov 2024, eFedLLM: Efficient LLM Inference Based on Federated Learning, https://arxiv.org/abs/2411.16003
Tucker Decomposition
Tucker decomposition factorizes a tensor into a small core tensor multiplied along each mode by a factor matrix; a code sketch follows the list below. Research papers on Tucker decomposition include:
- Chakshu Moar, 2024, Compressing Language Models using Low-Rank Decomposition and Characterizing the Accuracy- Efficiency Trade-offs, Master of Science Thesis, Electrical and Computer Engineering, University of California, Irvine, USA, https://escholarship.org/content/qt0t6967h4/qt0t6967h4.pdf
- Chakshu Moar, Michael Pellauer, Hyoukjun Kwon, 10 May 2024, Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models, https://arxiv.org/abs/2405.06626
- Tucker, Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3): 279–311, 1966 https://link.springer.com/article/10.1007/BF02289464 (Tucker decomposition.)
- De Lathauwer, Lieven, De Moor, Bart, and Vandewalle, Joos. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000. https://epubs.siam.org/doi/10.1137/S0895479896305696 (Tucker decomposition.)
- Kim, Y.-D. and Choi, S. Nonnegative Tucker decomposition. In Proceedings of the IEEE CVPR 2007 Workshop on Component Analysis Methods, Minneapolis, Minnesota, 2007. https://ieeexplore.ieee.org/document/4270403
- Federica Stolf, Antonio Canale, 15 Nov 2024, Bayesian Adaptive Tucker Decompositions for Tensor Factorization, https://arxiv.org/abs/2411.10218
- Matthew Pietrosanu, Bei Jiang, Linglong Kong, 13 Jun 2024, Oblivious subspace embeddings for compressed Tucker decompositions, https://arxiv.org/abs/2406.09387
- Tobias Weber, Jakob Dexl, David Rügamer, Michael Ingrisch, 18 Apr 2024 (v2), Post-Training Network Compression for 3D Medical Image Segmentation: Reducing Computational Efforts via Tucker Decomposition, https://arxiv.org/abs/2404.09683
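The code sketch mentioned above (a truncated higher-order SVD for a 3-way tensor, which is one standard way to compute a Tucker decomposition; it is not the specific method of any paper listed here):

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along the given mode (the mode-n unfolding)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    """Truncated HOSVD: one factor matrix per mode from the SVD of each
    unfolding, plus a small core tensor."""
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    core = np.einsum('abc,ai,bj,ck->ijk', T, U[0], U[1], U[2])
    return core, U

def tucker_reconstruct(core, U):
    """Multiply the core tensor back out along every mode."""
    return np.einsum('ijk,ai,bj,ck->abc', core, U[0], U[1], U[2])

rng = np.random.default_rng(0)
T = rng.standard_normal((8, 9, 10))
core, U = tucker_hosvd(T, ranks=(8, 9, 10))    # full ranks: exact; smaller ranks: compressed
print(np.allclose(T, tucker_reconstruct(core, U)))   # True
```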
Vector Dot Product Optimization
The computation of a vector dot product, also called "scalar product", is the basis of all AI operations. Matrix multiplications are everywhere, and a matrix multiply operation is just a series of dot product operations. Each element in the result of a matrix multiplication is computed via a vector dot product between two vectors: a row in one matrix, and a column in the other matrix.
Given the importance of vector dot products to the speed of AI, various attempts have been made to speed them up. Options include hardware-accelerated dot products, faster vector dot product algorithms, and the use of approximations.
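Many of the papers below target accuracy as much as raw speed. As a small sketch of that theme (the textbook Neumaier variant of compensated summation applied to a dot product, not the algorithm of any specific paper here):

```python
def dot_naive(xs, ys):
    """Plain dot product: one rounding error per accumulation step."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

def dot_compensated(xs, ys):
    """Dot product with Neumaier compensated summation: a running
    correction term captures the bits lost at each addition."""
    total, comp = 0.0, 0.0
    for x, y in zip(xs, ys):
        p = x * y
        t = total + p
        if abs(total) >= abs(p):
            comp += (total - t) + p    # low-order bits of p lost in the add
        else:
            comp += (p - t) + total    # low-order bits of total lost in the add
        total = t
    return total + comp

xs = [1e16, 1.0, -1e16, 1.0]
ys = [1.0, 1.0, 1.0, 1.0]
print(dot_naive(xs, ys))         # 1.0 -- one of the small terms is lost to rounding
print(dot_compensated(xs, ys))   # 2.0 -- the exact answer
```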
Research on dot product optimization includes:
- N Yamanaka, T Ogita, SM Rump, S Oishi, 2008, A parallel algorithm for accurate dot product, Parallel Computing, https://www.sciencedirect.com/science/article/pii/S016781910800032X, PDF: https://ogilab.w.waseda.jp/ogita/math/doc/2008_YaOgRuOi.pdf
- W Kamp, A Bainbridge-Smith, 2007, Multiply accumulate unit optimised for fast dot-product evaluation, 2007 International Conference on Field-Programmable Technology, https://ieeexplore.ieee.org/abstract/document/4439283/
- Chitta Ranjan May 9, 2019, Understanding the Kernel Trick with fundamentals, Towards Data Science https://towardsdatascience.com/truly-understanding-the-kernel-trick-1aeb11560769
- J Diffenderfer, D Osei-Kuffuor, H Menon, March 2021, A framework for error-bounded approximate computing, with an application to dot products, SIAM Journal on Scientific Computing, https://www.osti.gov/servlets/purl/1959416
- J Diffenderfer, D Osei-Kuffuor, H Menon, 2021, QDOT: Quantized dot product kernel for approximate high-performance computing, arXiv preprint arXiv:2105.00115, https://arxiv.org/abs/2105.00115
- Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres, 2018, Enhanced Floating-Point Sums, Dot Products, and Polynomial Values, In: Handbook of Floating-Point Arithmetic, pp. 163–192, https://link.springer.com/chapter/10.1007/978-3-319-76526-6_5
- NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf
- Lucas Klemmer; Saman Froehlich; Rolf Drechsler; Daniel Große, 2021, XbNN: Enabling CNNs on edge devices by approximate on-chip dot product encoding, 2021 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/document/9401780, PDF: https://agra.informatik.uni-bremen.de/doc/konf/2021_ISCAS_XBNN.pdf
- Y. Nievergelt, Scalar fused multiply-add instructions produce floating-point matrix arithmetic provably accurate to the penultimate digit, ACM Trans. Math. Softw., 29 (2003), pp. 27–48, https://dl.acm.org/doi/10.1145/641876.641878
- S Graillat, V Ménissier-Morain, 2012, Accurate summation, dot product and polynomial evaluation in complex floating point arithmetic, Information and Computation, Volume 216, July 2012, Pages 57-71, https://www.sciencedirect.com/science/article/pii/S0890540112000715
- AM Zaki, MH El-Shafey, AMB Eldin, 2010, A new architecture for accurate dot product of floating point numbers, The 2010 International Conference on Computer Engineering & Systems, https://ieeexplore.ieee.org/abstract/document/5674841/
- K He, R Barrio, L Chen, H Jiang, J Liu, T Gu, 2021, A Class of Fast and Accurate Multi-layer Block Summation and Dot Product Algorithms, IFIP International Conference on Network and Parallel Computing, NPC 2021: Network and Parallel Computing, pp. 64–75, https://link.springer.com/chapter/10.1007/978-3-030-93571-9_6
- S. Graillat, P. Langlois, N. Louvet, 15 September 2006, Choosing a twice more accurate dot product implementation, https://www.researchgate.net/publication/250769076_Choosing_a_Twice_More_Accurate_Dot_Product_Implementation, PDF: https://www-pequan.lip6.fr/~graillat/papers/icnaam06.pdf
- A Knofel, 1991, Fast hardware units for the computation of accurate dot products, Proceedings 10th IEEE Symposium on Computer Arithmetic, https://ieeexplore.ieee.org/document/145536, PDF: https://scholar.archive.org/work/cp6cgjq7g5enzfqoqtte2bb6k4/access/wayback/http://www.acsel-lab.com/arithmetic/papers/ARITH10/ARITH10_Knofel.pdf
More AI Research
Read more about:
- Low-rank matrices
- Sparse matrices
- Advanced AI Mathematics
- Zero-Multiplication Models
- Logarithmic Models
- Approximate Computing
- Inference Optimizations