Aussie AI

AI Matrix Algebra

  • Last Updated 29 November, 2024
  • by David Spuler, Ph.D.

Neural networks run on matrices and vectors. The main primitive is matrix multiplication, which involves a large number of multiplications and additions, and a matrix multiplication is really just a series of vector dot product computations. Modern AI frameworks and research papers tend to use abbreviations when talking about matrix multiplication, such as "MatMul" operations and "GEMM" (General Matrix Multiplication).
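
To make that connection concrete, here is a minimal, unoptimized sketch (Python with NumPy, illustrative only; real frameworks dispatch to heavily tuned GEMM kernels) showing that every element of a matrix multiplication result is a single vector dot product:

```python
import numpy as np

def naive_gemm(A, B):
    """Naive GEMM: every output element is one vector dot product."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            # Dot product of row i of A with column j of B.
            C[i, j] = np.dot(A[i, :], B[:, j])
    return C

A = np.random.rand(4, 3)
B = np.random.rand(3, 5)
assert np.allclose(naive_gemm(A, B), A @ B)
```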

Various attempts to optimize models involve manipulating matrix algebra. Several of these research areas are covered in the sections below.

The individual multiplication operations between numbers inside a matrix can also be optimized:

  • Alternatives to using arithmetic multiplication (e.g. bitshifting, logarithms, other fancy stuff; see the sketch after this list)
  • Faster algorithms for scalar arithmetic multiplication (mostly hardware acceleration algorithms for chip designers)
  • Approximate scalar arithmetic multiplication (faster ways to multiply two numbers by allowing errors)
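
As a toy illustration of the bitshift and approximation ideas above, the sketch below (Python, illustrative only, not any particular library's method) rounds a weight to the nearest power of two so that the multiply becomes a single shift, trading a small amount of error for a cheaper operation:

```python
import math

def shift_multiply(x: int, weight: float) -> int:
    """Approximate x * weight using a bitshift.

    The weight is rounded to the nearest power of two, 2**k,
    so the multiply becomes a left shift (or right shift) by k.
    Illustrative sketch only; assumes x is an integer and weight > 0.
    """
    k = round(math.log2(weight))   # nearest power-of-two exponent
    if k >= 0:
        return x << k              # multiply by 2**k
    else:
        return x >> (-k)           # divide by 2**(-k), truncating

print(shift_multiply(100, 3.7))    # exact: 370, approximation: 400 (uses 2**2)
print(shift_multiply(100, 0.23))   # exact: 23,  approximation: 25  (uses 2**-2)
```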

Matrix Algebra

Matrix algebra underpins most of AI model inference (and training). The term "tensor" in AI theory mainly refers to multi-dimensional matrices, and the multiplication algorithm is the standard one you learned in high school.

There are areas of active research that use matrix algebra in different ways to reduce the total number of multiplication operations. Several types of matrices with useful properties are covered in the sections below.

The theory of matrix algebra can be applied to neural networks, since matrix multiplication is at the core of inference and training. Inference of a model involves executing matrix multiplication against a vector of probabilities. Each matrix multiplication applies every row of the matrix to that vector, which is really computing the vector dot product of two vectors. Underneath all of those vector iterations are the low-level multiplication operations on pairs of numbers, usually floating-point, but possibly integers (in quantized models that use integer-only arithmetic).
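
A short sketch of that decomposition (Python with NumPy, plain loops for clarity): the matrix-vector product is one dot product per weight row, and each dot product bottoms out in scalar multiply-accumulate operations on pairs of numbers:

```python
import numpy as np

def matvec_as_dot_products(W, x):
    """Matrix-vector multiply: one dot product per row of W."""
    out = np.zeros(W.shape[0])
    for i, row in enumerate(W):
        # Each dot product is a loop of scalar multiply-accumulates.
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi
        out[i] = acc
    return out

W = np.random.rand(4, 8)   # weight matrix
x = np.random.rand(8)      # input vector
assert np.allclose(matvec_as_dot_products(W, x), W @ x)
```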

General Research on Matrix Algebra Theory

Some of the papers on theory of matrix algebra include:

Advanced Types of Matrices

The simplest ideas are that smaller matrices (lower-rank) and sparse matrices can require less computation. Various advanced subtypes of matrices with specific constraints have properties that allow a reduced number of multiplications in evaluating matrix multiplication or matrix-vector multiplication.

Lower-Rank Submatrices: There are many theories that typical models are "over-parameterized", which means their matrices are bigger than they need to be. The number of weights is quadratic in the dimensions of a matrix, so reducing to smaller, lower-rank matrices improves performance. There is much theory about finding the "smaller models" inside the "big models" that have accuracy not much worse than the larger matrices. This is sometimes called "parameter factorization" of models. The non-mathematical technique of "distillation" is also related, as it finds a smaller model with similar accuracy. See Low-rank matrix factorization research.

Sparse Matrices: The idea with sparse matrices is that if they have a lot of zero weights, then the inference algorithm is doing a lot of unnecessary multiplications by zero. Various algorithms aim to focus on only doing multiplication in the submatrices that have data. Sparsity is closely related to model pruning, and these matrix sparsity techniques can also be amplified using dynamic pruning of near-zero weights, to further increase the total number of zeros in the matrix. See Sparse matrices and sparsification.
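
A small sketch of the idea (Python/NumPy, using a hand-rolled CSR-style layout rather than a production sparse library): only weights above a pruning threshold are stored, so each row costs only as many multiplications as it has surviving nonzero weights:

```python
import numpy as np

def to_sparse_rows(W, threshold=0.0):
    """Keep only weights with |w| > threshold (prunes exact and near-zero weights)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in W:
        for j, w in enumerate(row):
            if abs(w) > threshold:
                values.append(w)
                col_indices.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_indices, dtype=int), np.array(row_ptr)

def sparse_matvec(values, col_indices, row_ptr, x):
    """Matrix-vector product that multiplies only the stored (nonzero) weights."""
    out = np.zeros(len(row_ptr) - 1)
    for i in range(len(out)):
        start, end = row_ptr[i], row_ptr[i + 1]
        out[i] = np.dot(values[start:end], x[col_indices[start:end]])
    return out

W = np.random.rand(4, 8)
W[W < 0.7] = 0.0                    # make the weight matrix mostly zeros
x = np.random.rand(8)
vals, cols, ptr = to_sparse_rows(W)
assert np.allclose(sparse_matvec(vals, cols, ptr, x), W @ x)
```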

Butterfly Matrices: These special matrices are one approach to using matrix algebra to achieve structured sparsity.

Monarch Matrices: Monarch matrices are a superset of butterfly matrices, named after the orange-winged monarch butterfly. These special types of matrices aim to exploit operations on submatrices to reduce the overall computational overhead of matrix multiplication.
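
The common building block in both cases is a block-diagonal factor: multiplying by it only touches the weights inside each block, so a product of a few such factors (plus permutations, in the Monarch construction) can stand in for a dense matrix with far fewer multiplications. Below is a rough sketch of just the block-diagonal step (Python/NumPy, illustrative only, not the full butterfly or Monarch algorithm):

```python
import numpy as np

def block_diag_matvec(blocks, x):
    """Multiply x by a block-diagonal matrix given as a list of square blocks."""
    out, offset = [], 0
    for B in blocks:
        n = B.shape[1]
        out.append(B @ x[offset:offset + n])
        offset += n
    return np.concatenate(out)

# A dense 16x16 matrix needs 256 multiplications per matrix-vector product;
# four 4x4 blocks need only 4 * 16 = 64.
blocks = [np.random.rand(4, 4) for _ in range(4)]
x = np.random.rand(16)
y = block_diag_matvec(blocks, x)

# Check against the explicit dense block-diagonal matrix.
dense = np.zeros((16, 16))
for i, B in enumerate(blocks):
    dense[4 * i:4 * i + 4, 4 * i:4 * i + 4] = B
assert np.allclose(y, dense @ x)
```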

Matrix/Tensor Factorization (Decomposition)

Methods that factorize or decompose a larger matrix into smaller matrices (see low-rank matrices for optimization), or into the specific matrix subtypes above, require special algorithms. Two of the main theoretical decomposition algorithms are CANDECOMP/PARAFAC decomposition (CP decomposition) and Tucker decomposition.
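
As a rough illustration, the sketch below (Python/NumPy, illustrative only; production code would normally use a tensor library) computes a truncated higher-order SVD, one standard way of obtaining a Tucker core tensor and per-mode factor matrices for a 3-dimensional tensor:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of unfold()."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def mode_product(T, M, mode):
    """Mode-n product: multiply matrix M into tensor T along axis `mode`."""
    new_shape = list(T.shape)
    new_shape[mode] = M.shape[0]
    return fold(M @ unfold(T, mode), mode, new_shape)

def hosvd(T, ranks):
    """Truncated HOSVD: Tucker factor matrices from the SVD of each unfolding."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = mode_product(core, U.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Rebuild the (approximate) full tensor from the Tucker core and factors."""
    T = core
    for mode, U in enumerate(factors):
        T = mode_product(T, U, mode)
    return T

X = np.random.rand(6, 7, 8)
core, factors = hosvd(X, ranks=(3, 3, 3))
approx = reconstruct(core, factors)   # low-rank Tucker approximation of X
print(core.shape, np.linalg.norm(X - approx) / np.linalg.norm(X))
```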

Low-Rank Matrix Factorization

Matrix factorization (decomposition) can be used to find low-rank matrices that approximate the larger sets of weights.
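
For example, a weight matrix W can be replaced by the product of two thinner matrices obtained from a truncated SVD, so a matrix-vector product costs roughly k*(m+n) multiplications instead of m*n (a minimal Python/NumPy sketch, assuming the weights really are close to low-rank):

```python
import numpy as np

def low_rank_factorize(W, k):
    """Approximate W (m x n) as A @ B with A (m x k) and B (k x n), via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]     # fold the top-k singular values into the left factor
    B = Vt[:k, :]
    return A, B

m, n, k = 64, 64, 10
W = np.random.rand(m, k) @ np.random.rand(k, n)   # a genuinely low-rank "weight" matrix
A, B = low_rank_factorize(W, k)
x = np.random.rand(n)

y_full = W @ x          # m*n     = 4096 multiplications
y_low  = A @ (B @ x)    # k*(m+n) = 1280 multiplications
print(np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))   # ~0 here, since rank(W) <= k
```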

  • Genta Indra Winata, Andrea Madotto, Jamin Shin, Elham J Barezi, and Pascale Fung. 2019. On the Effectiveness of Low-Rank Matrix Factorization for LSTM Model Compression. arXiv preprint arXiv:1908.09982. https://arxiv.org/abs/1908.09982
  • Ashish Khetan and Zohar Karnin. “schuBERT: Optimizing Elements of BERT”. arXiv preprint arXiv:2005.06628 (2020) https://arxiv.org/abs/2005.06628
  • Jian Xue, Jinyu Li, and Yifan Gong. 2013. Restructuring of deep neural network acoustic models with singular value decomposition. In Interspeech, pages 2365–2369. https://www.academia.edu/72568360/Restructuring_of_deep_neural_network_acoustic_models_with_singular_value_decomposition
  • Patrick Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh. 2018. GroupReduce: Block-wise low-rank approximation for neural language model shrinking. In Advances in Neural Information Processing Systems, pages 10988–10998. https://arxiv.org/abs/1806.06950
  • Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. 2017. On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7370–7379. PDF: https://openaccess.thecvf.com/content_cvpr_2017/papers/Yu_On_Compressing_Deep_CVPR_2017_paper.pdf
  • Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth. “Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior”. arXiv preprint arXiv:2010.01791 (2020) https://arxiv.org/abs/2010.01791
  • Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. https://arxiv.org/abs/2303.10512, Code: https://github.com/QingruZhang/AdaLoRA
  • Arnav Chavan, Zhuang Liu, Deepak K. Gupta, Eric P. Xing, and Zhiqiang Shen. One-for-All: Generalized LoRA for Parameter-Efficient Fine-tuning. CoRR, abs/2306.07967, 2023. https://arxiv.org/abs/2306.07967
  • Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation, In Andreas Vlachos and Isabelle Augenstein, editors, Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, pages 3266–3279. Association for Computational Linguistics, 2023. https://arxiv.org/abs/2210.07558
  • Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. https://arxiv.org/abs/2106.09685
  • Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, Mahsa Yarmohammadi, and Sanjeev Khudanpur. Semi-orthogonal low-rank matrix factorization for deep neural networks. In B. Yegnanarayana, editor, Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pages 3743–3747. ISCA, 2018. PDF: https://www.isca-speech.org/archive/pdfs/interspeech_2018/povey18_interspeech.pdf
  • Yerlan Idelbayev; Miguel Á. Carreira-Perpiñán, Low-Rank Compression of Neural Nets: Learning the Rank of Each Layer. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8046–8056. Computer Vision Foundation / IEEE, 2020. https://ieeexplore.ieee.org/document/9157223/
  • Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023 https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches including low-rank matrices.)

Singular Value Decomposition (SVD)

SVD is a specific type of matrix decomposition. Research papers on SVD:

Tucker Decomposition

Research papers on Tucker decomposition:

  • Chakshu Moar, 2024, Compressing Language Models using Low-Rank Decomposition and Characterizing the Accuracy-Efficiency Trade-offs, Master of Science Thesis, Electrical and Computer Engineering, University of California, Irvine, USA, https://escholarship.org/content/qt0t6967h4/qt0t6967h4.pdf
  • Chakshu Moar, Michael Pellauer, Hyoukjun Kwon, 10 May 2024, Characterizing the Accuracy-Efficiency Trade-off of Low-rank Decomposition in Language Models, https://arxiv.org/abs/2405.06626
  • Tucker, Ledyard R. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3): 279–311, 1966 https://link.springer.com/article/10.1007/BF02289464 (Tucker decomposition.)
  • De Lathauwer, Lieven, De Moor, Bart, and Vandewalle, Joos. A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000. https://epubs.siam.org/doi/10.1137/S0895479896305696 (Tucker decomposition.)
  • Kim, Y.-D. and Choi, S. Nonnegative Tucker decomposition. In Proceedings of the IEEE CVPR 2007 Workshop on Component Analysis Methods, Minneapolis, Minnesota, 2007. https://ieeexplore.ieee.org/document/4270403
  • Federica Stolf, Antonio Canale, 15 Nov 2024, Bayesian Adaptive Tucker Decompositions for Tensor Factorization, https://arxiv.org/abs/2411.10218
  • Matthew Pietrosanu, Bei Jiang, Linglong Kong, 13 Jun 2024, Oblivious subspace embeddings for compressed Tucker decompositions, https://arxiv.org/abs/2406.09387
  • Tobias Weber, Jakob Dexl, David Rügamer, Michael Ingrisch, 18 Apr 2024 (v2), Post-Training Network Compression for 3D Medical Image Segmentation: Reducing Computational Efforts via Tucker Decomposition, https://arxiv.org/abs/2404.09683

Vector Dot Product Optimization

The computation of a vector dot product, also called "scalar product", is the basis of all AI operations. Matrix multiplications are everywhere, and a matrix multiply operation is just a series of dot product operations. Each element in the result of a matrix multiplication is computed via a vector dot product between two vectors: a row in one matrix, and a column in the other matrix.

Given the importance of vector dot products to the speed of AI, various attempts have been made to speed them up. Options for speedup include hardware-accelerated dot products, faster vector dot product algorithms, and the use of approximations.
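
For illustration, the sketch below (Python/NumPy, illustrative only) compares a plain scalar loop against a vectorized library call (which dispatches to hardware-accelerated SIMD/BLAS code) and a crude approximation that keeps only the largest-magnitude terms:

```python
import numpy as np

def dot_scalar(a, b):
    """Plain loop: one multiply-accumulate per element pair."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def dot_vectorized(a, b):
    """Library call that dispatches to hardware-accelerated SIMD/BLAS code."""
    return np.dot(a, b)

def dot_approx_topk(a, b, k):
    """Crude approximation: multiply only the k largest-magnitude elements of a."""
    idx = np.argsort(-np.abs(a))[:k]
    return np.dot(a[idx], b[idx])

a = np.random.randn(1024)
b = np.random.randn(1024)
print(dot_scalar(a, b), dot_vectorized(a, b), dot_approx_topk(a, b, 256))
```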

Research on dot product optimization includes:
