Aussie AI

Low-Rank Matrices

Last Updated 24 June, 2025

by David Spuler, Ph.D.

Low-rank matrices are matrices with smaller dimensions (i.e. fewer rows or columns). One form of model compression is to use matrix techniques to replace the large weight matrices with smaller "low-rank" matrices. This makes the model faster, but sometimes trades off decreased accuracy.

There are various approaches to find smaller matrices to replace a full-sized matrix. One approach is simply to look for matrices that are similar to the large model, but smaller. Another approach is to use "sparsification" to add a lot of zeros to the matrices, such that a smaller model can more easily replace it. Yet another approach is to use matrix algebra to "factorize" (also called "decompose") the large matrix into one or more smaller matrices (see also AI matrix algebra).

One common low-rank matrix technique has become popular, possibly because it's been given a friendly name: LoRA is "Low-Rank Adaptation" of matrices. If the model has been quantized first, then it is QLoRA, for "Quantized LoRA".

Singular Value Decomposition (SVD)

SVD is one of the methods of factorizing matrices into smaller sub-matrices. Research on SVD includes:

Zeyu Zhang, Haiying Shen, 7 Aug 2024, Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference, https://arxiv.org/abs/2408.04107
Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
Hongyaoxing Gu, 27 May 2024, LRAMM -- Low precision approximates GEMM via RSVD, https://arxiv.org/abs/2405.16917
Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao 12 Aug 2024 (v3), A Survey on LoRA of Large Language Models, https://arxiv.org/abs/2407.11046 https://github.com/ZJU-LLMs/Awesome-LoRAs.git
Shi, J., Shi, C. (2025). Improve LLM Inference Performance with Matrix Decomposition Strategies. In: Shi, Z., Witbrock, M., Tian, Q. (eds) Intelligence Science V. ICIS 2024. IFIP Advances in Information and Communication Technology, vol 720. Springer, Cham. https://doi.org/10.1007/978-3-031-71253-1_12 https://link.springer.com/chapter/10.1007/978-3-031-71253-1_12 (Speed up matrix operations with SVD and NMF via adaptive block sizing based on batching.)
Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
Shengwen Ding, Chenhui Hu, 24 Nov 2024, eFedLLM: Efficient LLM Inference Based on Federated Learning, https://arxiv.org/abs/2411.16003
Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
Hong Yankun, Li Xing, Zhen Hui-Ling, Yu Xianzhi, Liu Wulong, Yuan Mingxuan, 21 Feb 2025, SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention, https://arxiv.org/abs/2502.15304
Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, Mi Zhang, 16 Mar 2025, SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression, https://arxiv.org/abs/2503.12340 https://github.com/AIoT-MLSys-Lab/SVD-LLM
Jiujun He, Huazhen Lin, 10 Jun 2025, Olica: Efficient Structured Pruning of Large Language Models without Retraining, https://arxiv.org/abs/2506.08436

Research on Low-Rank Matrices

Li, Y.; Yu, Y.; Zhang, Q.; Liang, C.; He, P.; Chen, W.; and Zhao, T. 2023. LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. In Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; and Scarlett, J., eds., Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 20336–20350. PMLR. https://arxiv.org/abs/2306.11222
Ma, X.; Fang, G.; and Wang, X. 2023. LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv:2305.11627. https://arxiv.org/abs/2305.11627 Code: https://github.com/horseee/LLM-Pruner (Pruning during training and LoRA.)
M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. BMVC, 2014, https://arxiv.org/abs/1405.3866, PDF: https://www.robots.ox.ac.uk/~vgg/publications/2014/Jaderberg14b/jaderberg14b.pdf
Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang and D. Shin, "Compression of deep convolutional neural networks for fast and low power mobile applications", arXiv:1511.06530, 2015. https://arxiv.org/abs/1511.06530 (Low-rank via Bayesian matrix factorization and Tucker decomposition.)
V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets and V. Lempitsky, "Speeding-up convolutional neural networks using fine-tuned CP-decomposition", arXiv:1412.6553, 2014. https://arxiv.org/abs/1412.6553
Ali Edalati, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J. Clark, Mehdi Rezagholizadeh, Dec 2022, KronA: Parameter Efficient Tuning with Kronecker Adapter, arXiv preprint arXiv:2212.10650, https://arxiv.org/abs/2212.10650 (Kronecker product for matrix decomposition.)
Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. 2022. DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation, arXiv preprint arXiv:2210.07558. https://arxiv.org/abs/2210.07558
Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. 2021. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035. https://arxiv.org/abs/2106.04647
Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, Sep 2023, LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 (Low-rank matrix attention allows up to 100k context windows.)
R Saha, V Srivastava, M Pilanci, 2023, Matrix Compression via Randomized Low Rank and Low Precision Factorization, 37th Conference on Neural Information Processing Systems (NeurIPS 2023), https://web.stanford.edu/~pilanci/papers/lplr.pdf
F Babiloni, T Tanay, J Deng, M Maggioni, S Zafeiriou, 2023, Factorized Dynamic Fully-Connected Layers for Neural Networks, ICCV workshop, https://openaccess.thecvf.com/content/ICCV2023W/RCV/papers/Babiloni_Factorized_Dynamic_Fully-Connected_Layers_for_Neural_Networks_ICCVW_2023_paper.pdf (Tensor decomposition into low-rank factors.)
Samuel Carreira, Tomás Marques, José Ribeiro, Carlos Grilo, Sep 2023, Revolutionizing Mobile Interaction: Enabling a 3 Billion Parameter GPT LLM on Mobile, arXiv preprint arXiv:2310.01434, https://browse.arxiv.org/abs/2310.01434 (LoRA on a mobile platform.)
Tamara G Kolda and Brett W Bader, 2009, Tensor Decompositions and Applications, SIAM Rev. 51, 3 (2009), 455–500, https://epubs.siam.org/doi/abs/10.1137/07070111X (Analysis of various algorithms for tensor decomposition.)
Stephan Rabanser, Oleksandr Shchur, Stephan Günnemann, Nov 2017, Introduction to Tensor Decompositions and their Applications in Machine Learning, https://browse.arxiv.org/pdf/1711.10781.pdf
Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. 2016. Compression of deep convolutional neural networks for fast and low power mobile applications. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1511.06530 (Uses Tucker decomposition and Bayesian matrix factorization algorithms.)
Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao, Oct 2023, LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models, https://arxiv.org/abs/2310.08659 (QLoRA for LLMs.)
Chakshu Moar, Michael Pellauer, Hyoukjun Kwon, 10 May 2024, Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models, https://arxiv.org/abs/2405.06626
You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
Davis, Andrew and Arel, Itamar. 2013. Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461, https://arxiv.org/abs/1312.4461
Y Hu, J Zhang, C Zhao, C Li, H Chen, 2023, Transformer Compression via Subspace Projection, arXiv preprint arXiv:2308.16475, https://arxiv.org/abs/2308.16475
Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. CoRR, abs/1710.09282, 2017. https://arxiv.org/abs/1710.09282
Shikai Qiu, Andres Potapczynski, Marc Finzi, Micah Goldblum, Andrew Gordon Wilson, 10 Jun 2024, Compute Better Spent: Replacing Dense Layers with Structured Matrices, https://arxiv.org/abs/2406.06248
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
Arnav Chavan, Nahush Lele, Deepak Gupta, Dec 2023, Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models https://arxiv.org/abs/2312.07046 Code: https://github.com/transmuteAI/trailmet/tree/main/trailmet/algorithms/llm-rom
S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear complexity,” CoRR, vol. abs/2006.04768, 2020. https://arxiv.org/abs/2006.04768 (Low-rank approximation of attention.)
Idelbayev, Y. and Carreira-Perpinan, M. A. (2020). Low-rank compression of neural nets: Learning the rank of each layer. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8046–8056. URL: https://openaccess.thecvf.com/content_CVPR_2020/html/Idelbayev_Low_Rank_Compression_of_Neural_Nets_Learning_the_Rank_of_Each_CVPR_2020_ paper.html
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861. URL: http://arxiv.org/abs/1704.04861.
Zhang, J., Lei, Q., and Dhillon, I. (2018). Stabilizing gradients for deep neural networks via efficient SVD parameterization. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5806–5814. PMLR. URL: http://proceedings.mlr.press/v80/zhang18g.html
K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
Zheng Qu, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, Yuan Xie, 2022, DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration, ASPLOS ’22, February 28 ś March 4, 2022, Lausanne, Switzerland, PDF: https://dl.acm.org/doi/pdf/10.1145/3503222.3507738
Ivan Markovsky, Aug 3, 2018, Low-Rank Approximation: Algorithms, Implementation, Applications (Communications and Control Engineering) Part of: Communications and Control Engineering (62 books), https://www.amazon.com/Low-Rank-Approximation-Implementation-Applications-Communications/dp/3319896199/
Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
Yunsheng Li, Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Lu Yuan, Zicheng Liu, Lei Zhang, Nuno Vasconcelos, MicroNet: Improving Image Recognition with Extremely Low FLOPs, 2021, https://ieeexplore.ieee.org/abstract/document/9857393 PDF: https://openaccess.thecvf.com/content/ICCV2021/papers/Li_MicroNet_Improving_Image_Recognition_With_Extremely_Low_FLOPs_ICCV_2021_paper.pdf
Yubin Qin, Yang Wang, Zhiren Zhao, Xiaolong Yang, Yang Zhou, Shaojun Wei, Yang Hu, Shouyi Yin, 2024, MECLA: Memory-Compute-Efficient LLM Accelerator with Scaling Sub-matrix Partition, 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), Year: 2024, Pages: 1032-1047, DOI Bookmark: 10.1109/ISCA59077.2024.00079, https://www.computer.org/csdl/proceedings-article/isca/2024/265800b032/1Z3pCEBnapO
Jiuxiang Gu, Yingyu Liang, Heshan Liu, Zhenmei Shi, Zhao Song, Junze Yin, 8 May 2024, Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers, https://arxiv.org/abs/2405.05219 (Attention optimization using multiple low-rank matrices.)
Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy, 10 Aug 2024, Eigen Attention: Attention in Low-Rank Space for KV Cache Compression, https://arxiv.org/abs/2408.05646
Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou, 23 Aug 2024, Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time, https://arxiv.org/abs/2408.13233 (Training using low-rank matrices to approximate attention.)
Josh Alman, Zhao Song, 9 May 2023 (v2), Fast Attention Requires Bounded Entries, https://arxiv.org/abs/2302.13214 (Low-rank matrices in attention for fast inference.)
Josh Alman, Zhao Song, 6 Oct 2023, How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation, https://arxiv.org/abs/2310.04064
Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu, 30 Jul 2024, Palu: Compressing KV-Cache with Low-Rank Projection, https://arxiv.org/abs/2407.21118 https://github.com/shadowpa0327/Palu
Sneha Mehta, Huzefa Rangwala, Naren Ramakrishnan, 10 Aug 2020 (v2), Low Rank Factorization for Compact Multi-Head Self-Attention, https://arxiv.org/abs/1912.00835
Ignacio Hounie, Charilaos Kanatsoulis, Arnuv Tandon, Alejandro Ribeiro, 5 Oct 2024, LoRTA: Low Rank Tensor Adaptation of Large Language Models, https://arxiv.org/abs/2410.04060
Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
Zebin Yang, Renze Chen, Taiqiang Wu, Ngai Wong, Yun Liang, Runsheng Wang, Ru Huang, Meng Li, 23 Oct 2024, MCUBERT: Memory-Efficient BERT Inference on Commodity Microcontrollers https://arxiv.org/abs/2410.17957
Elias Jääsaari, Ville Hyvönen, Teemu Roos, 24 Oct 2024, LoRANN: Low-Rank Matrix Factorization for Approximate Nearest Neighbor Search, https://arxiv.org/abs/2410.18926
Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu, 1 Nov 2024, V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM, https://arxiv.org/abs/2411.00915
Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Meyer Scetbon, James Hensman, 10 Dec 2024, Low-Rank Correction for Quantized LLMs, https://arxiv.org/abs/2412.07902
Kwangryeol Park, Seulki Lee, 12 Dec 2024, SMMF: Square-Matricized Momentum Factorization for Memory-Efficient Optimization, https://arxiv.org/abs/2412.08894 (Gradient optimizer Adam optimized using low-rank matrix factorization.)
Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen, 27 Dec 2024, A Survey on Large Language Model Acceleration based on KV Cache Management, https://arxiv.org/abs/2412.19442 (Huge survey of all KV cache optimization methods.)
Jingcheng Hu, Houyi Li, Yinmin Zhang, Zili Wang, Shuigeng Zhou, Xiangyu Zhang, Heung-Yeung Shum, 26 Dec 2024, Multi-matrix Factorization Attention, https://arxiv.org/abs/2412.19255
Menglin Yang, Jialin Chen, Yifei Zhang, Jiahong Liu, Jiasheng Zhang, Qiyao Ma, Harshit Verma, Qianru Zhang, Min Zhou, Irwin King, Rex Ying, 31 Dec 2024, Low-Rank Adaptation for Foundation Models: A Comprehensive Review, https://arxiv.org/abs/2501.00365 (Extensive survey of LoRA.)
Q Wang, S Shen, Jan 2025, Activation-Guided Low-Rank Parameter Adaptation for Efficient Model Fine-Tuning, IEEE Access, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10852296 (Modified LoRA algorithm using activations for weighting.)
Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, Mi Zhang, 16 Mar 2025, SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression, https://arxiv.org/abs/2503.12340 https://github.com/AIoT-MLSys-Lab/SVD-LLM
Yiping Ji, Hemanth Saratchandran, Cameron Gordon, Zeyu Zhang, Simon Lucey, 17 Mar 2025 (v5), Efficient Learning With Sine-Activated Low-rank Matrices, ICLR 2025, AIML, https://arxiv.org/abs/2403.19243

Aussie AI

Low-Rank Matrices

Singular Value Decomposition (SVD)

Research on Low-Rank Matrices

More AI Research

Quick Links

Product

New to Writing?

Writing Styles