Aussie AI
AI Matrix Algebra
Last Updated 29 November, 2024
by David Spuler, Ph.D.
Neural networks run on matrices and vectors. The main primitive is matrix multiplication, which is built from large numbers of multiplications and additions; in fact, a matrix multiplication is really many vector dot product computations. Modern AI frameworks and research papers tend to use abbreviations for matrix multiplication, such as "MATMUL" operations and GEMM (General Matrix Multiplication).
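As a concrete illustration (a minimal sketch only, not how optimized GEMM kernels are actually written), every element of a matrix product is one vector dot product of a row against a column:

```python
import numpy as np

def dot(u, v):
    """Vector dot product: the innermost primitive of a MATMUL/GEMM."""
    return sum(x * y for x, y in zip(u, v))

def matmul(A, B):
    """Naive GEMM: element C[i][j] is the dot product of row i of A
    with column j of B, so an n x n MATMUL is n*n dot products."""
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[dot(A[i], [B[k][j] for k in range(inner)]) for j in range(cols)]
            for i in range(rows)]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(matmul(A, B))         # [[19.0, 22.0], [43.0, 50.0]]
print(np.matmul(A, B))      # same result from the optimized library routine
```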
Various attempts to optimize models involve manipulating matrix algebra. Some of the research areas include:
- Matrix multiplication algorithms (e.g. Strassen, Winograd, FFT)
- Alternative and advanced types of matrices (e.g. Butterfly, Monarch)
- Advanced matrix algebra
- Approximate matrix multiplication algorithms
- Vector dot product optimizations
- Matrix/tensor decomposition (factorizing matrices into smaller sub-matrices)
- Low-rank matrices (smaller factors of matrices)
The individual multiplication operations between numbers inside a matrix can also be optimized:
- Alternatives to arithmetic multiplication (e.g. bitshifting, logarithmic arithmetic, adder-based and lookup-table schemes; see the sketch after this list)
- Faster algorithms for scalar arithmetic multiplication (mostly hardware acceleration algorithms for chip designers)
- Approximate scalar arithmetic multiplication (faster ways to multiply two numbers by allowing errors)
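As a toy example of the first idea (a sketch only; practical power-of-two and logarithmic quantization schemes are more involved), a weight that is constrained to be a power of two can be multiplied using an integer bitshift instead of a multiply instruction:

```python
def shift_mul(x: int, weight_log2: int) -> int:
    """Multiply x by the power-of-two weight 2**weight_log2 with a bitshift."""
    return x << weight_log2

def shift_dot(xs, weight_logs):
    """Dot product where every weight is a power of two, so all the
    multiplications become shifts and only additions remain."""
    return sum(shift_mul(x, w) for x, w in zip(xs, weight_logs))

# Weights 4, 2, 8 stored as exponents 2, 1, 3: 3*4 + 5*2 + 7*8 = 78
print(shift_dot([3, 5, 7], [2, 1, 3]))   # 78
```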
Matrix Algebra
Matrix algebra underpins most of AI model inference (and training). The term "tensor" in AI theory mainly refers to multi-dimensional matrices, and the multiplication algorithm is the standard one you learned in high school.
There are active areas of research that use matrix algebra in different ways to reduce the total number of multiplication operations. Some of the types of matrices studied include:
- Structured matrices (the general class)
- Submatrices
- Low-rank matrices
- Sparse matrix optimizations
- Monarch matrices
- Butterfly matrices
The theory of matrix algebra can be applied to neural networks, since matrix multiplication is at the core of inference and training. Model inference involves multiplying weight matrices against vectors of activations. Each such matrix-vector multiplication multiplies every row of the matrix against the vector, which is a vector dot product of two vectors. Underneath all of those vector iterations are the low-level multiplication operations on pairs of numbers, usually floating-point, but possibly integers in quantized models that use integer-only arithmetic.
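For example, here is a simplified sketch of the common int8 quantization pattern (not the kernel of any particular framework): the dot product accumulates integer products in a wide accumulator and applies floating-point scale factors only once at the end:

```python
def int8_dot(x_q, w_q, x_scale, w_scale):
    """Dot product of two int8-quantized vectors: integer multiply-adds
    into a wide accumulator, then a single float rescale at the end."""
    acc = 0                                   # stands in for an int32 accumulator
    for xq, wq in zip(x_q, w_q):
        acc += int(xq) * int(wq)
    return acc * x_scale * w_scale            # dequantize the integer result

# Quantized values approximating x = [0.5, -1.0] and w = [2.0, 3.0]
print(int8_dot([64, -128], [64, 96], 1 / 128, 1 / 32))   # -2.0 = 0.5*2.0 + (-1.0)*3.0
```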
General Research on Matrix Algebra Theory
Some of the papers on theory of matrix algebra include:
- Paul Cull, A Matrix Algebra for Neural Nets, https://link.springer.com/chapter/10.1007/978-1-4757-0555-3_43
- Xian-Da Zhang, A Matrix Algebra Approach to Artificial Intelligence, January 2020, DOI:10.1007/978-981-15-2770-8, ISBN: 978-981-15-2769-2, https://www.researchgate.net/publication/341581565_A_Matrix_Algebra_Approach_to_Artificial_Intelligence
- Hillar, C. J. & Lim, L.-H., Most tensor problems are NP-hard. J. ACM 60, 1–39 (2013) https://arxiv.org/abs/0911.1393
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, Google Research, https://arxiv.org/abs/2211.05102
- Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. Nov 2021, Data movement is all you need: A case study on optimizing transformers, Proceedings of Machine Learning and Systems, 3, 2021. https://arxiv.org/abs/2007.00072 Code: https://github.com/spcl/substation
- C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian and B. Yuan, "PermDNN: Efficient compressed DNN architecture with permuted diagonal matrices", Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), pp. 189-202, Oct. 2018. https://arxiv.org/abs/2004.10936
- Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson, 10 Jun 2024, Compute Better Spent: Replacing Dense Layers with Structured Matrices, https://arxiv.org/abs/2406.06248
- Haque, S.A.; Choudhury, N.; Hossain, S. Matrix Multiplication with Diagonals: Structured Sparse Matrices and Beyond. In Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, Jinan, China, 17–19 June 2023; pp. 69–76. https://doi.org/10.1145/3606043.3606053
- Sardar Anisul Haque,Mohammad Tanvir Parvez, Shahadat Hossain, Jan 2024, GPU Algorithms for Structured Sparse Matrix Multiplication with Diagonal Storage Schemes, https://www.mdpi.com/1999-4893/17/1/31
- Josh Alman, Zhao Song, 6 Oct 2023, How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation, https://arxiv.org/abs/2310.04064
- Ishna Satyarth, Chao Yin, RuQing G. Xu, Devin A. Matthews, 15 Nov 2024, Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures, https://arxiv.org/abs/2411.09859
Advanced Types of Matrices
The simplest ideas are that smaller matrices (lower-rank) and sparse matrices can require less computation. Various advanced subtypes of matrices with specific constraints have properties that allow a reduced number of multiplications in evaluating matrix multiplication or matrix-vector multiplication.
Lower-Rank Submatrices: There are many theories that typical models are "over-parameterized", meaning their matrices are bigger than they need to be. The number of weights is quadratic in the dimensions of a matrix, so reducing to smaller, lower-rank matrices improves performance. There is much theory about finding the "smaller models" inside the "big models" whose accuracy is not much worse than that of the larger matrices. This is sometimes called "parameter factorization" of models. The non-mathematical technique of "distillation" is also related, as it finds a smaller model with similar accuracy. See Low-rank matrix factorization research.
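For instance (a sketch assuming the weight matrix has already been factorized), replacing an n x n weight matrix W with two low-rank factors A (n x r) and B (r x n) cuts a matrix-vector product from n*n down to 2*n*r multiplications:

```python
import numpy as np

n, r = 1024, 32
rng = np.random.default_rng(0)
A = rng.standard_normal((n, r))     # low-rank factors: W is represented as A @ B
B = rng.standard_normal((r, n))
x = rng.standard_normal(n)

y_dense = (A @ B) @ x               # as a dense n x n matrix: ~1,048,576 multiplications
y_factored = A @ (B @ x)            # factored form: 2*n*r = 65,536 multiplications
print(np.allclose(y_dense, y_factored))   # True: same result, ~16x fewer multiplies
```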
Sparse Matrices: The idea with sparse matrices is that if they have a lot of zero weights, then the inference algorithm is doing a lot of unnecessary multiplications by zero. Various algorithms aim to focus on only doing multiplication in the submatrices that have data. Sparsity is closely related to model pruning, and these matrix sparsity techniques can also be amplified using dynamic pruning of near-zero weights, to further increase the total number of zeros in the matrix. See Sparse matrices and sparsification.
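A minimal sketch of the mechanism (hand-rolled CSR storage; production sparse kernels are vectorized and hardware-aware): store only the nonzero weights and their column indices, so the per-row dot products skip every multiplication by zero:

```python
import numpy as np

def to_csr(W):
    """Convert a dense matrix to CSR form: nonzero values, their column
    indices, and the offset where each row starts."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        for j, w in enumerate(row):
            if w != 0.0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product: only the nonzero weights are multiplied."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W[np.abs(W) < 1.5] = 0.0            # prune near-zero weights, leaving mostly zeros
x = rng.standard_normal(64)
print(np.allclose(W @ x, csr_matvec(*to_csr(W), x)))   # True
```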
Butterfly Matrices: These special matrices are one approach to using matrix algebra to achieve sparsity. Research includes the papers below; a minimal code sketch follows the list.
- Beidi Chen, Tri Dao and Chris Ré, Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models, Jan 17, 2022, https://hazyresearch.stanford.edu/blog/2022-01-17-Sparsity-3-Pixelated-Butterfly
- Keivan Alizadeh Vahid, Anish Prabhu, Ali Farhadi, Mohammad Rastegari Apr 2020, Butterfly Transform: An Efficient FFT Based Neural Architecture Design, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), https://ieeexplore.ieee.org/abstract/document/9157695/, https://arxiv.org/abs/1906.02256
- L Zheng, G Puy, E Riccietti, P Pérez, R Gribonval, Butterfly factorization by algorithmic identification of rank-one blocks, July 2023, https://arxiv.org/abs/2307.00820
- L Zheng, E Riccietti, R Gribonval, 2023, Efficient Identification of Butterfly Sparse Matrix Factorizations, SIAM Journal on Mathematics of Data ScienceVol. 5, Iss. 1 (2023), DOI 10.1137/22M1488727, https://epubs.siam.org/doi/abs/10.1137/22M1488727, https://arxiv.org/abs/2110.01230
- J Peca-Medlin, T Trogdon, 2023, Growth factors of random butterfly matrices and the stability of avoiding pivoting, SIAM Journal on Matrix Analysis and Applications, 2023, https://epubs.siam.org/doi/abs/10.1137/22M148762X, https://arxiv.org/abs/2203.15921
- D. Stott Parker, 1995, "Random butterfly transformations with applications in computational linear algebra", Technical report, UCLA Computer Science Dept, 1995. https://searchworks.stanford.edu/view/4640257
- D. Stott Parker, 1995, A randomizing butterfly transformation useful in block matrix computations, Technical report, UCLA Computer Science Dept. https://searchworks.stanford.edu/view/4640258
- Tatsuya Member and Member Kazuyoshi, "Bidirectional learning for neural network having butterfly structure", Systems and Computers in Japan, vol. 26, pp. 64-73, 04 1995. https://onlinelibrary.wiley.com/doi/abs/10.1002/scj.4690260407
- Li Yingzhou, Yang Haizhao, R. Martin Eileen, L. Ho Kenneth and Ying Lexing, "Butterfly factorization", Multiscale Modeling Simulation, vol. 13, pp. 714-732, 2015. https://arxiv.org/abs/1502.01379
- Dao Tri, Gu Albert, Eichhorn Matthew, Rudra Atri and Ré Christopher, Learning fast algorithms for linear transforms using butterfly factorizations, 2019. http://proceedings.mlr.press/v97/dao19a.html
- Y Li, X Cheng, J Lu, Apr 2020, Butterfly-Net: Optimal function representation based on convolutional neural networks, arXiv preprint arXiv:1805.07451, https://arxiv.org/abs/1805.07451
- Rui Lin, Jie Ran, King Hung Chiu, Graziano Chesi, Ngai Wong, Mar 2022, Deformable butterfly: A highly structured and sparse linear transform NeurIPS 2021, https://arxiv.org/abs/2203.13556, https://proceedings.neurips.cc/paper/2021/file/86b122d4358357d834a87ce618a55de0-Paper.pdf
- N Ailon, O Leibovitch, V Nair, July 2021, Sparse linear networks with a fixed butterfly structure: theory and practice, Uncertainty in Artificial Intelligence, UAI 2021, https://arxiv.org/abs/2007.08864 https://proceedings.mlr.press/v161/ailon21a/ailon21a.pdf
- Daniel Y. Fu, Elliot L. Epstein, Eric Nguyen, Armin W. Thomas, Michael Zhang, Tri Dao, Atri Rudra, Christopher Ré, Feb 2023, Simple Hardware-Efficient Long Convolutions for Sequence Modeling, https://arxiv.org/abs/2302.06646 (FlashButterfly algorithm.)
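The sketch referred to above (an illustrative radix-2 construction, not the exact parameterization of any one paper): a butterfly matrix is applied as log2(n) sparse stages, each mixing pairs of entries with independent 2x2 blocks, for O(n log n) multiplications instead of O(n^2):

```python
import numpy as np

def butterfly_stage(x, blocks, stride):
    """One butterfly stage: each index i is mixed with its partner
    i ^ stride by its own 2x2 block (n/2 blocks per stage)."""
    y = np.empty_like(x)
    pair = 0
    for i in range(len(x)):
        j = i ^ stride
        if i < j:
            a, b, c, d = blocks[pair].ravel()
            y[i] = a * x[i] + b * x[j]
            y[j] = c * x[i] + d * x[j]
            pair += 1
    return y

def butterfly_multiply(x, all_blocks):
    """Multiply x by a butterfly matrix given as log2(n) sparse factors:
    O(n log n) multiplications versus O(n^2) for a dense matrix."""
    y = x.astype(np.float64).copy()
    for s, blocks in enumerate(all_blocks):      # stride doubles each stage
        y = butterfly_stage(y, blocks, stride=1 << s)
    return y

# Example: n = 8 gives 3 stages, each with 4 random 2x2 blocks
n = 8
rng = np.random.default_rng(0)
stages = [rng.standard_normal((n // 2, 2, 2)) for _ in range(int(np.log2(n)))]
print(butterfly_multiply(rng.standard_normal(n), stages))
```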
Monarch Matrices: Monarch matrices are a superset of butterfly matrices, named after the orange-winged monarch butterfly. These special types of matrices exploit operations on block submatrices to reduce the overall computational cost of matrix multiplication.
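A minimal sketch of the block-diagonal idea (a Monarch-style two-factor product with a fixed permutation between the factors; the exact parameterization in the papers below differs): with sqrt(n) blocks of size sqrt(n), a matrix-vector product needs about n*sqrt(n) multiplications instead of n^2:

```python
import numpy as np

def block_diag_matvec(blocks, x):
    """Multiply a block-diagonal matrix (stacked b x b blocks) by x."""
    b = blocks.shape[1]
    segs = x.reshape(-1, b)                      # one segment of x per block
    return np.einsum('kij,kj->ki', blocks, segs).reshape(-1)

def monarch_style_matvec(L_blocks, R_blocks, x):
    """Apply M = P^T L P R to x, where L and R are block-diagonal and P is
    the fixed permutation that transposes the sqrt(n) x sqrt(n) index grid."""
    b = int(np.sqrt(len(x)))
    y = block_diag_matvec(R_blocks, x)
    y = y.reshape(b, b).T.reshape(-1)            # permutation P
    y = block_diag_matvec(L_blocks, y)
    return y.reshape(b, b).T.reshape(-1)         # permutation P^T

# Example: n = 16, so each factor is four 4x4 blocks (128 multiplies, not 256)
rng = np.random.default_rng(1)
b = 4
L = rng.standard_normal((b, b, b))
R = rng.standard_normal((b, b, b))
print(monarch_style_matvec(L, R, rng.standard_normal(b * b)))
```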
- Dan Fu, Simran Arora, Chris Ré, Monarch Mixer: Revisiting BERT, Without Attention or MLPs, Jul 25, 2023, https://hazyresearch.stanford.edu/blog/2023-07-25-m2-bert (Monarch Matrices) (Code: https://github.com/HazyResearch/m2)
- Tri Dao, Beidi Chen, Nimit Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, Christopher Ré, Apr 2022, Monarch: Expressive Structured Matrices for Efficient and Accurate Training, Proceedings of the 39 th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. https://arxiv.org/abs/2204.00595 https://proceedings.mlr.press/v162/dao22a/dao22a.pdf
- Sunil Babu Melingi, Ramesh Kumar Mojjada, C. Tamizhselvan, R. Surender, S. Yazhinian., 2022, A self-adaptive monarch butterfly optimization (MBO) algorithm based improved deep forest neural network model for detecting and classifying brain stroke lesions, Research on Biomedical Engineering volume 38, pages647–660 (2022), https://link.springer.com/article/10.1007/s42600-022-00214-2
- Dan Fu, Simran Arora, Chris Ré, Jul 25, 2023, Monarch Mixer: A new model architecture for increased efficiency, Together AI blog, https://together.ai/blog/monarch-mixer, Code: https://github.com/HazyResearch/m2 (An implementation by Together AI of Stanford's Hazy Research AI engines using Monarch matrices.)
- Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson, 10 Jun 2024, Compute Better Spent: Replacing Dense Layers with Structured Matrices, https://arxiv.org/abs/2406.06248
Matrix/Tensor Factorization (Decomposition)
Methods that factorize or decompose larger matrices into smaller matrices (see low-rank matrices for optimization), or into the specific subtypes of matrices above, require special algorithms. Two of the main theoretical decomposition algorithms for tensors are CANDECOMP/PARAFAC (CP) decomposition and Tucker decomposition.
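As a small illustration of the CP form (a sketch only; actually fitting the factor matrices requires an algorithm such as alternating least squares), a rank-r CP decomposition represents a 3-way tensor as a sum of r outer products, stored as three small factor matrices:

```python
import numpy as np

def cp_reconstruct(A, B, C):
    """Rebuild a 3-way tensor from its CP factor matrices:
    T[i,j,k] = sum over r of A[i,r] * B[j,r] * C[k,r]."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)

# A rank-2 tensor of shape (4, 5, 6): stored as 2*(4+5+6) = 30 numbers
# in factored form, versus 4*5*6 = 120 numbers for the full tensor.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 2))
B = rng.standard_normal((5, 2))
C = rng.standard_normal((6, 2))
print(cp_reconstruct(A, B, C).shape)   # (4, 5, 6)
```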
- Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer (2009), https://ieeexplore.ieee.org/abstract/document/5197422
- Defu Lian, Rui Liu, Yong Ge, Kai Zheng, Xing Xie, and Longbing Cao. 2017. Discrete content-aware matrix factorization. In SIGKDD. 325–334, https://dl.acm.org/doi/10.1145/3097983.3098008
- Andriy Mnih and Russ R Salakhutdinov. 2007. Probabilistic matrix factorization. NIPS 20 (2007), 1257–1264, https://dl.acm.org/doi/10.5555/2981562.2981720
- Carroll, J Douglas and Chang, Jih-Jie. Analysis of individual differences in multidimensional scaling via an n-way generalization of eckart-young decomposition. Psychometrika, 35(3):283–319, 1970. https://link.springer.com/article/10.1007/BF02310791 (CANDECOMP/PARAFAC decomposition/factorization.)
- Harshman, Richard A and Lundy, Margaret E. PARAFAC: Parallel factor analysis. Computational Statistics & Data Analysis, 18(1):39–72, 1994 https://www.sciencedirect.com/science/article/abs/pii/0167947394901325 (The P in CP decomposition/factorization.)
- Shashua, Amnon and Hazan, Tamir. Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd international conference on Machine learning, pp. 792–799. ACM, 2005. PDF: https://icml.cc/imls/conferences/2005/proceedings/papers/100_NonNegative_ShashuaHazan.pdf (CANDECOMP/PARAFAC decomposition/factorization.)
- Tucker, Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3): 279–311, 1966 https://link.springer.com/article/10.1007/BF02289464 (Tucker decomposition.)
- De Lathauwer, Lieven, De Moor, Bart, and Vandewalle, Joos. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000. https://epubs.siam.org/doi/10.1137/S0895479896305696 (Tucker decomposition.)
- Kim, Y.-D. and Choi, S. Nonnegative Tucker decomposition. In Proceedings of the IEEE CVPR 2007 Workshop on Component Analysis Methods, Minneapolis, Minnesota, 2007. https://ieeexplore.ieee.org/document/4270403
- H Li, J Choi, Y Kwon, JH Ahn, Oct 2023, A Hardware-Friendly Tiled Singular-Value Decomposition-Based Matrix Multiplication for Transformer-Based Models, IEEE Computer Architecture Letters, https://ieeexplore.ieee.org/abstract/document/10285300/ (Tiled version of matrix multiplication with SVD factorization.)
- Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
- Chakshu Moar, 2024, Compressing Language Models using Low-Rank Decomposition and Characterizing the Accuracy- Efficiency Trade-offs, Master of Science Thesis, Electrical and Computer Engineering, University of California, Irvine, USA, https://escholarship.org/content/qt0t6967h4/qt0t6967h4.pdf
- Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen, 21 May 2024, Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression, https://arxiv.org/abs/2405.12591
- Chakshu Moar, Michael Pellauer, Hyoukjun Kwon, 10 May 2024, Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models, https://arxiv.org/abs/2405.06626
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- V. Abronin, A. Naumov, D. Mazur, D. Bystrov, K. Tsarova, Ar. Melnikov, I. Oseledets, S. Dolgov, R. Brasher, M. Perelshtein, 29 Jan 2024, TQCompressor: improving tensor decomposition methods in neural networks via permutations, https://arxiv.org/abs/2401.16367 Code: https://huggingface.co/tq-ag/TQCompressedGPT2 Code: https://github.com/terra-quantum-public/TQCompressedGPT2 (A permutation-based enhancement to the Kronecker decomposition method.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim, Nov 2023, LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, https://arxiv.org/abs/2311.12023
- Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, Guangyu Sun, Dec 2023, ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models, https://arxiv.org/abs/2312.05821 Code: https://github.com/hahnyuan/ASVD4LLM
- R Gribonval, T Mary, E Riccietti, 2023, Optimal quantization of rank-one matrices in floating-point arithmetic---with applications to butterfly factorizations https://inria.hal.science/hal-04125381/file/rank1_quant.pdf
- H Fan, T Chau, SI Venieris, R Lee, 2022, Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design https://ieeexplore.ieee.org/abstract/document/9923888/ https://arxiv.org/pdf/2209.09570
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019, https://arxiv.org/abs/1909.11942 Code: https://github.com/google-research/ALBERT
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- Jungi Lee, Wonbeom Lee, Jaewoong Sim, 16 Jun 2024, Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization, https://arxiv.org/abs/2406.12930 (Combining tensor decomposition and quantization with power-of-two scale factors.)
- Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
- Mirko Farina, Usman Ahmad, Ahmad Taha, Hussein Younes, Yusuf Mesbah, Xiao Yu, Witold Pedrycz, 2024, Sparsity in transformers: A systematic literature review, Neurocomputing, Volume 582, 14 May 2024, 127468, https://www.sciencedirect.com/science/article/abs/pii/S092523122400239X (General survey of sparsity methods, and techniques that create sparsity.)
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Zeyu Zhang, Haiying Shen, 7 Aug 2024, Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference, https://arxiv.org/abs/2408.04107
- Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin, 8 May 2024, Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, https://arxiv.org/abs/2405.05219 (Attention optimization using multiple low-rank matrices.)
- Sasindu Wijeratne, Rajgopal Kannan, Viktor Prasanna, 14 May 2024, Sparse MTTKRP Acceleration for Tensor Decomposition on GPU, https://arxiv.org/abs/2405.08470
- Tugba Torun, Eren Yenigul, Ameer Taweel, Didem Unat, 8 May 2024, A Sparse Tensor Generator with Efficient Feature Extraction, https://arxiv.org/abs/2405.04944 https://github.com/sparcityeu/feaTen https://github.com/sparcityeu/genTen
- Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna, 31 Mar 2024 (v2), Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition, https://arxiv.org/abs/2403.07953
- Jan Laukemann, Ahmed E. Helal, S. Isaac Geronimo Anderson, Fabio Checconi, Yongseok Soh, Jesmin Jahan Tithi, Teresa Ranadive, Brian J Gravelle, Fabrizio Petrini, Jee Choi, 11 Mar 2024, Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation, https://arxiv.org/abs/2403.06348
- Nishant Yadav, May 2024, Efficient K-Nearest Neighbor Search With Black-Box Neural Similarity Functions, Ph.D. Dissertation, Manning College of Information and Computer Sciences, University of Massachusetts Amherst, https://scholarworks.umass.edu/bitstreams/5572c0b9-cd96-46d8-983d-f877c9d0e22e/download
- Hongyaoxing Gu, 27 May 2024, LRAMM -- Low precision approximates GEMM via RSVD, https://arxiv.org/abs/2405.16917
- Ignacio Hounie, Charilaos Kanatsoulis, Arnuv Tandon, Alejandro Ribeiro, 5 Oct 2024, LoRTA: Low Rank Tensor Adaptation of Large Language Models, https://arxiv.org/abs/2410.04060
- Haoran Guan, Yuwei Fan, 9 Oct 2024, CholeskyQR for sparse matrices, https://arxiv.org/abs/2410.06525
- D.Breen, Oct 2024, Towards Sustainable CNNs: Tensor Decompositions for Green AI Solutions: Exploring Energy Consumption of Large CNNs, Master's Thesis, Systems and Control & Robotics, Delft University of Technology, https://repository.tudelft.nl/file/File_8208301f-51ef-4edf-bd12-d6ec3d5a8711
- Yubin Qin, Yang Wang, Zhiren Zhao, Xiaolong Yang, Yang Zhou, Shaojun Wei, Yang Hu, Shouyi Yin, 2024, MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition, 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), Year: 2024, Pages: 1032-1047, DOI Bookmark: 10.1109/ISCA59077.2024.00079, https://www.computer.org/csdl/proceedings-article/isca/2024/265800b032/1Z3pCEBnapO
- Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
- Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12 https://link.springer.com/chapter/10.1007/978-3-031-71253-1_12 (Speed up matrix operations with SVD and NMF via adaptive block sizing based on batching.)
- Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
- Ishna Satyarth, Chao Yin, RuQing G. Xu, Devin A. Matthews, 15 Nov 2024, Skew-Symmetric Matrix Decompositions on Shared-Memory Architectures, https://arxiv.org/abs/2411.09859
Low-Rank Matrix Factorization
Matrix factorization (decomposition) can be used to find low-rank matrices that approximate the larger sets of weights.
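A minimal sketch using NumPy's SVD (published methods such as activation-aware or weighted variants refine this basic recipe): truncate the SVD of a weight matrix at rank r to obtain two smaller factors:

```python
import numpy as np

def low_rank_factors(W, r):
    """Best rank-r approximation of W (in the least-squares sense) via
    truncated SVD, returned as two factors A (m x r) and B (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :r] * S[:r]        # fold the singular values into the left factor
    B = Vt[:r, :]
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))      # random weights compress poorly; trained
A, B = low_rank_factors(W, 32)           # weight matrices tend to have faster-decaying spectra
err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(A.shape, B.shape, f"relative error {err:.3f}")
```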
- Genta Indra Winata, Andrea Madotto, Jamin Shin, Elham J Barezi, and Pascale Fung. 2019. On the Effectiveness of Low-Rank Matrix Factorization for LSTM Model Compression. arXiv preprint arXiv:1908.09982. https://arxiv.org/abs/1908.09982
- Ashish Khetan and Zohar Karnin. “schubert: Optimizing elements of bert”. arXiv preprint arXiv:2005.06628 (2020) https://arxiv.org/abs/2005.06628
- Jian Xue, Jinyu Li, and Yifan Gong. 2013. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pages 2365–2369. https://www.academia.edu/72568360/Restructuring_of_deep_neural_network_acoustic_models_with_singular_value_decomposition
- Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. 2018. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, pages 10988–10998. https://arxiv.org/abs/1806.06950
- Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. 2017. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7370–7379. PDF: https://openaccess.thecvf.com/content_cvpr_2017/papers/Yu_On_Compressing_Deep_CVPR_2017_paper.pdf
- Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth. “Pruning Redundant Mappings in Transformer Models via SpectralNormalized Identity Prior”. arXiv preprint arXiv:2010.01791 (2020) https://arxiv.org/abs/2010.01791
- Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. https://arxiv.org/abs/2303.10512, Code: https://github.com/QingruZhang/AdaLoRA
- Arnav Chavan, Zhuang Liu, Deepak K. Gupta, Eric P. Xing, and Zhiqiang Shen. One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning. CoRR, abs/2306.07967, 2023. https://arxiv.org/abs/2306.07967
- Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation, In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 3266–3279. Association for Computational Linguistics, 2023. https://arxiv.org/abs/2210.07558
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. https://arxiv.org/abs/2106.09685
- Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur. Semi-orthogonal low-rank matrix factorization for deep neural networks. In B. Yegnanarayana, editor, Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 3743– 3747. ISCA, 2018. PDF: https://www.isca-speech.org/archive/pdfs/interspeech_2018/povey18_interspeech.pdf
- Yerlan Idelbayev; Miguel Á. Carreira-Perpiñán, Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8046–8056. Computer Vision Foundation / IEEE, 2020. https://ieeexplore.ieee.org/document/9157223/
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches including low-rank matrices.)
Singular Value Decomposition (SVD)
SVD is a specific type of matrix decomposition that factors a matrix into singular vectors and singular values; truncating the smallest singular values yields a low-rank approximation. Research papers on SVD include:
- Zeyu Zhang, Haiying Shen, 7 Aug 2024, Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference, https://arxiv.org/abs/2408.04107
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
- Hongyaoxing Gu, 27 May 2024, LRAMM -- Low precision approximates GEMM via RSVD, https://arxiv.org/abs/2405.16917
- Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao 12 Aug 2024 (v3), A Survey on LoRA of Large Language Models, https://arxiv.org/abs/2407.11046 https://github.com/ZJU-LLMs/Awesome-LoRAs.git
- Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12 https://link.springer.com/chapter/10.1007/978-3-031-71253-1_12 (Speed up matrix operations with SVD and NMF via adaptive block sizing based on batching.)
- Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
- Shengwen Ding, Chenhui Hu, 24 Nov 2024, eFedLLM: Efficient LLM Inference Based on Federated Learning, https://arxiv.org/abs/2411.16003
Tucker Decomposition
Tucker decomposition factorizes a tensor into a small core tensor multiplied along each mode by a factor matrix; a code sketch follows the list below. Research papers on Tucker decomposition include:
- Chakshu Moar, 2024, Compressing Language Models using Low-Rank Decomposition and Characterizing the Accuracy- Efficiency Trade-offs, Master of Science Thesis, Electrical and Computer Engineering, University of California, Irvine, USA, https://escholarship.org/content/qt0t6967h4/qt0t6967h4.pdf
- Chakshu Moar, Michael Pellauer, Hyoukjun Kwon, 10 May 2024, Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models, https://arxiv.org/abs/2405.06626
- Tucker, Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3): 279–311, 1966 https://link.springer.com/article/10.1007/BF02289464 (Tucker decomposition.)
- De Lathauwer, Lieven, De Moor, Bart, and Vandewalle, Joos. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000. https://epubs.siam.org/doi/10.1137/S0895479896305696 (Tucker decomposition.)
- Kim, Y.-D. and Choi, S. Nonnegative Tucker decomposition. In Proceedings of the IEEE CVPR 2007 Workshop on Component Analysis Methods, Minneapolis, Minnesota, 2007. https://ieeexplore.ieee.org/document/4270403
- Federica Stolf, Antonio Canale, 15 Nov 2024, Bayesian Adaptive Tucker Decompositions for Tensor Factorization, https://arxiv.org/abs/2411.10218
- Matthew Pietrosanu, Bei Jiang, Linglong Kong, 13 Jun 2024, Oblivious subspace embeddings for compressed Tucker decompositions, https://arxiv.org/abs/2406.09387
- Tobias Weber, Jakob Dexl, David Rügamer, Michael Ingrisch, 18 Apr 2024 (v2), Post-Training Network Compression for 3D Medical Image Segmentation: Reducing Computational Efforts via Tucker Decomposition, https://arxiv.org/abs/2404.09683
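The code sketch mentioned above (a truncated higher-order SVD for a 3-way tensor, which is one standard way to compute a Tucker decomposition; it is not the specific method of any paper listed here):

```python
import numpy as np

def unfold(T, mode):
    """Matricize tensor T along the given mode (the mode-n unfolding)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    """Truncated HOSVD: one factor matrix per mode from the SVD of each
    unfolding, plus a small core tensor."""
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    core = np.einsum('abc,ai,bj,ck->ijk', T, U[0], U[1], U[2])
    return core, U

def tucker_reconstruct(core, U):
    """Multiply the core tensor back out along every mode."""
    return np.einsum('ijk,ai,bj,ck->abc', core, U[0], U[1], U[2])

rng = np.random.default_rng(0)
T = rng.standard_normal((8, 9, 10))
core, U = tucker_hosvd(T, ranks=(8, 9, 10))    # full ranks: exact; smaller ranks: compressed
print(np.allclose(T, tucker_reconstruct(core, U)))   # True
```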
Vector Dot Product Optimization
The computation of a vector dot product, also called "scalar product", is the basis of all AI operations. Matrix multiplications are everywhere, and a matrix multiply operation is just a series of dot product operations. Each element in the result of a matrix multiplication is computed via a vector dot product between two vectors: a row in one matrix, and a column in the other matrix.
Given the importance of vector dot products to the speed of AI, various attempts have been made to speed them up. Options include hardware-accelerated dot products, faster vector dot product algorithms, and the use of approximations.
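Many of the papers below target accuracy as much as raw speed. As a small sketch of that theme (the textbook Neumaier variant of compensated summation applied to a dot product, not the algorithm of any specific paper here):

```python
def dot_naive(xs, ys):
    """Plain dot product: one rounding error per accumulation step."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

def dot_compensated(xs, ys):
    """Dot product with Neumaier compensated summation: a running
    correction term captures the bits lost at each addition."""
    total, comp = 0.0, 0.0
    for x, y in zip(xs, ys):
        p = x * y
        t = total + p
        if abs(total) >= abs(p):
            comp += (total - t) + p    # low-order bits of p lost in the add
        else:
            comp += (p - t) + total    # low-order bits of total lost in the add
        total = t
    return total + comp

xs = [1e16, 1.0, -1e16, 1.0]
ys = [1.0, 1.0, 1.0, 1.0]
print(dot_naive(xs, ys))         # 1.0 -- one of the small terms is lost to rounding
print(dot_compensated(xs, ys))   # 2.0 -- the exact answer
```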
Research on dot product optimization includes:
- N Yamanaka, T Ogita, SM Rump, S Oishi, 2008, A parallel algorithm for accurate dot product, Parallel Computing, https://www.sciencedirect.com/science/article/pii/S016781910800032X, PDF: https://ogilab.w.waseda.jp/ogita/math/doc/2008_YaOgRuOi.pdf
- W Kamp, A Bainbridge-Smith, 2007, Multiply accumulate unit optimised for fast dot-product evaluation, 2007 International Conference on Field-Programmable Technology, https://ieeexplore.ieee.org/abstract/document/4439283/
- Chitta Ranjan May 9, 2019, Understanding the Kernel Trick with fundamentals, Towards Data Science https://towardsdatascience.com/truly-understanding-the-kernel-trick-1aeb11560769
- J Diffenderfer, D Osei-Kuffuor, H Menon, March 2021, A framework for error-bounded approximate computing, with an application to dot products, SIAM Journal on Scientific Computing, https://www.osti.gov/servlets/purl/1959416
- J Diffenderfer, D Osei-Kuffuor, H Menon, 2021, QDOT: Quantized dot product kernel for approximate high-performance computing, arXiv preprint arXiv:2105.00115, https://arxiv.org/abs/2105.00115
- Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefèvre, Guillaume Melquiond, Nathalie Revol, Serge Torres, 2018, Enhanced Floating-Point Sums, Dot Products, and Polynomial Values, In: Handbook of Floating-Point Arithmetic, pp. 163–192, https://link.springer.com/chapter/10.1007/978-3-319-76526-6_5
- NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2, PDF: https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf
- Lucas Klemmer; Saman Froehlich; Rolf Drechsler; Daniel Große, 2021, XbNN: Enabling CNNs on edge devices by approximate on-chip dot product encoding, 2021 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/document/9401780, PDF: https://agra.informatik.uni-bremen.de/doc/konf/2021_ISCAS_XBNN.pdf
- Y. Nievergelt, Scalar fused multiply-add instructions produce floating-point matrix arithmetic provably accurate to the penultimate digit, ACM Trans. Math. Softw., 29 (2003), pp. 27–48, https://dl.acm.org/doi/10.1145/641876.641878
- S Graillat, V Ménissier-Morain, 2012, Accurate summation, dot product and polynomial evaluation in complex floating point arithmetic, Information and Computation, Volume 216, July 2012, Pages 57-71, https://www.sciencedirect.com/science/article/pii/S0890540112000715
- AM Zaki, MH El-Shafey, AMB Eldin, 2010, A new architecture for accurate dot product of floating point numbers, The 2010 International Conference on Computer Engineering & Systems, https://ieeexplore.ieee.org/abstract/document/5674841/
- K He, R Barrio, L Chen, H Jiang, J Liu, T Gu, 2021, A Class of Fast and Accurate Multi-layer Block Summation and Dot Product Algorithms, IFIP International Conference on Network and Parallel Computing, NPC 2021: Network and Parallel Computing, pp. 64–75, https://link.springer.com/chapter/10.1007/978-3-030-93571-9_6
- S. Graillat, P. Langlois, N. Louvet, 15 September 2006, Choosing a twice more accurate dot product implementation, https://www.researchgate.net/publication/250769076_Choosing_a_Twice_More_Accurate_Dot_Product_Implementation, PDF: https://www-pequan.lip6.fr/~graillat/papers/icnaam06.pdf
- A Knofel, 1991, Fast hardware units for the computation of accurate dot products, Proceedings 10th IEEE Symposium on Computer Arithmetic, https://ieeexplore.ieee.org/document/145536, PDF: https://scholar.archive.org/work/cp6cgjq7g5enzfqoqtte2bb6k4/access/wayback/http://www.acsel-lab.com/arithmetic/papers/ARITH10/ARITH10_Knofel.pdf
More AI Research
Read more about:
- Low-rank matrices
- Sparse matrices
- Advanced AI Mathematics
- Zero-Multiplication Models
- Logarithmic Models
- Approximate Computing
- Inference Optimizations