Aussie AI
Model Compression
-
Last Updated 3 December, 2024
-
by David Spuler, Ph.D.
Model compression is the general class of AI optimizations that reduce the size of the model. The goal is two-fold: (a) size reduction: a smaller model that uses less memory and storage, and (b) latency optimization: faster inference on the more compact model.
Model compression techniques have been highly successful and are widely used, second only to hardware acceleration in their impact on the AI industry. The main model compression techniques are:
- Quantization
- Pruning
- Knowledge distillation
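As a minimal illustration of the first of these techniques, the C++ sketch below quantizes a block of FP32 weights down to INT8 with a single per-tensor scale, shrinking the weights by 4x. This is a simplified sketch only (real quantizers typically use per-channel scales and calibration data), and the names and structure are illustrative assumptions rather than any particular engine's API:

    // Minimal sketch: symmetric per-tensor INT8 weight quantization.
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct QuantizedTensor {
        std::vector<int8_t> q;  // quantized weights
        float scale;            // dequantize: w ~= q[i] * scale
    };

    QuantizedTensor quantize_int8(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        QuantizedTensor t;
        t.scale = (maxabs > 0.0f) ? (maxabs / 127.0f) : 1.0f;
        t.q.reserve(w.size());
        for (float x : w) {
            int v = (int)std::lround(x / t.scale);
            t.q.push_back((int8_t)std::clamp(v, -127, 127));
        }
        return t;  // 4 bytes per weight reduced to 1 byte, plus one scale
    }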
There are also various lesser-known model compression methods (one is illustrated in the sketch after this list):
- Low-rank factorization of matrices (tensor decomposition)
- Weight sharing
- Layer fusion
- Weight clustering
- Big-little architectures
- Logarithmic models
- Zero-multiplication models (e.g. adder networks)
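As a rough illustration of the last two items above, the following C++ sketch quantizes each weight to a signed power of two, so that the multiplications in a dot product become integer bit shifts. It assumes weight magnitudes of at most 1 (typical of normalized weights), and all names here are illustrative assumptions, not a production kernel:

    // Minimal sketch: logarithmic (power-of-two) weights giving
    // zero-multiplication inference: w ~= sign * 2^(-shift).
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Pow2Weight {
        int8_t sign;    // +1 or -1
        uint8_t shift;  // w ~= sign * 2^(-shift)
    };

    Pow2Weight encode_pow2(float w) {
        Pow2Weight p;
        p.sign = (w < 0.0f) ? -1 : 1;
        float a = std::fabs(w);
        // Round the log2 magnitude to the nearest integer shift, clamped to [0,15].
        int s = (a > 0.0f) ? (int)std::lround(-std::log2(a)) : 15;
        if (s < 0) s = 0;
        if (s > 15) s = 15;
        p.shift = (uint8_t)s;
        return p;
    }

    // Dot product over fixed-point activations: shifts replace multiplications.
    int32_t dot_pow2(const std::vector<Pow2Weight>& w, const std::vector<int32_t>& x) {
        int32_t acc = 0;
        for (std::size_t i = 0; i < w.size(); ++i) {
            int32_t term = x[i] >> w[i].shift;  // multiply by 2^(-shift) via shift
            acc += (w[i].sign < 0) ? -term : term;
        }
        return acc;
    }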
Survey Papers on Model Compression
General surveys that cover model compression include:
- Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang, A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023, https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches.)
- Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
- T Choudhary, V Mishra, A Goswami, 2020, A comprehensive survey on model compression and acceleration, Artificial Intelligence Review, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
- Y Cheng, D Wang, P Zhou, T Zhang, June 2020 (revised), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, https://arxiv.org/abs/1710.09282
- K Nan, S Liu, J Du, H Liu - Tsinghua Science and Technology, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762, PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
- Yu Cheng; Duo Wang; Pan Zhou; Tao Zhang, 2018, Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine (Volume 35, Issue 1, January 2018), https://ieeexplore.ieee.org/document/8253600
- G Menghani, 2023, Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
- L Deng, G Li, S Han, L Shi, Y Xie, 2020, Model compression and hardware acceleration for neural networks: A comprehensive survey, Proceedings of the IEEE (Volume 108, Issue 4, April 2020), https://ieeexplore.ieee.org/abstract/document/9043731
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- W Li, H Hacid, E Almazrouei, M Debbah, 2023, A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (Extensive survey related to optimizing on edge devices, including model compression.)
- A Jaiswal, Z Gan, X Du, B Zhang, Z Wang, Y Yang, Oct 2023, Compressing LLMs: The Truth is Rarely Pure and Never Simple, arXiv preprint arXiv:2310.01382, https://browse.arxiv.org/pdf/2310.01382.pdf
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Dong Liu, 3 Sep 2024, Contemporary Model Compression on Large Language Models Inference, https://arxiv.org/abs/2409.01990
- Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631, https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
Research on Model Compression (Generally)
Research papers on model compression:
- Canwen Xu, 2024, Efficient Natural Language Processing for Language Models, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/uc/item/9dv1k5xv PDF: https://escholarship.org/content/qt9dv1k5xv/qt9dv1k5xv.pdf?t=sc34ay (Evaluates several acceleration methods including early-exit, PEFT, and distillation.)
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole, 6 Apr 2024, What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models, https://arxiv.org/abs/2404.04759 (Shows that the big three of model compression work not just for compressing big LLMs, but also for making small models even smaller.)
- Seungtae Hong, Gunju Park, Jeong-Si Kim, 9 June 2024, Automated deep-learning model optimization framework for microcontrollers, https://doi.org/10.4218/etrij.2023-0522 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0522 (Framework for using quantization and pruning on microcontroller devices.)
- Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference https://openreview.net/pdf?id=93BaEweoRg (A method of training one large model, and then extracting many smaller sub-models from it, using FFNs with a subset of parameters; done statically this resembles a form of model compression, while elastic inference done dynamically is a type of adaptive inference.)
- W Li, H Hacid, E Almazrouei, M Debbah, 2023, A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748 (General survey of various model compression and other inference optimizations.)
- Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao, 5 Feb 2024. A Survey on Transformer Compression. https://arxiv.org/abs/2402.05964 (Model compression survey paper with focus on pruning, quantization, knowledge distillation, and efficient architecture design.)
- Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang, 27 Jan 2024, A Comprehensive Survey of Compression Algorithms for Language Models, https://arxiv.org/abs/2401.15347
- C Xu, J McAuley, 2023, A survey on model compression and acceleration for pretrained language models, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/26255/26027
- GC Marinó, A Petrini, D Malchiodi, M Frasca, 2023, Deep neural networks compression: A comparative survey and choice recommendations, https://www.sciencedirect.com/science/article/pii/S0925231222014643
- K Ramesh, A Chavan, S Pandit, 2023, A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
- K Nan, S Liu, J Du, H Liu, 2019, Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762 PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
- David Spuler, March 2024, Chapter 28. Deslugging AI Engines, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Ma W, Zhang Z, Xu Q and Chen W. An Automatic Scheme for Optimizing the Size of Deep Networks. Proceedings of the 2020 3rd International Conference on Signal Processing and Machine Learning. (21-27). https://doi.org/10.1145/3432291.3432293
- Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta, 24 Apr 2024 (v2), Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward, https://arxiv.org/abs/2402.01799 Code: https://github.com/nyunAI/Faster-LLM-Survey
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Xinji Mai, Zeng Tao, Junxiong Lin, Haoran Wang, Yang Chang, Yanlan Kang, Yan Wang, Wenqiang Zhang, 27 Jun 2024, From Efficient Multimodal Models to World Models: A Survey, https://arxiv.org/abs/2407.00118 (A survey of multimodal models with coverage of many optimization techniques.)
- Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Angie Boggust, Venkatesh Sivaraman, Yannick Assogba, Donghao Ren, Dominik Moritz, Fred Hohman, 6 Aug 2024, Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments, https://arxiv.org/abs/2408.03274
- Vladimir Malinovskii, Aug 5, 2024, The Evolution of Extreme LLM Compression: From QuIP to AQLM with PV-Tuning, https://medium.com/yandex/the-evolution-of-extreme-llm-compression-from-quip-to-aqlm-with-pv-tuning-19c44b91af96
- Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik, 30 May 2024 (v2), PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression, https://arxiv.org/abs/2405.14852 https://burlachenkok.github.io/PV-Tuning/
- Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Sophia R. Cunningham, Dominique Archambault, Austin Kung, May 2024, Efficient Training and Inference: Techniques for Large Language Models Using Llama, https://www.techrxiv.org/doi/full/10.36227/techrxiv.171651876.65094225/v1
- Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, 4 Jun 2024 (v2), APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, ICML 2024 Oral, https://arxiv.org/abs/2401.12200 https://github.com/ROIM1998/APT
- Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, and Xiaoyi Zhang. 2024. Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 648, 1–19. https://doi.org/10.1145/3613904.3642628 https://dl.acm.org/doi/full/10.1145/3613904.3642628
- Shaw Talebi, Aug 2024, Compressing Large Language Models (LLMs): Make LLMs 10X smaller without sacrificing performance https://towardsdatascience.com/compressing-large-language-models-llms-9f406eea5b5e
- David Spuler, March 2024, Model Compression, in Generative AI in C++, https://www.aussieai.com/book/ch28-model-compression
- David Spuler, September 2nd, 2024, 500+ LLM Inference Optimization Techniques, Aussie AI Blog, https://www.aussieai.com/blog/llm-inference-optimization
- Ummara Bibi, Mahrukh Mazhar, Dilshad Sabir, Muhammad Fasih Uddin Butt, Ali Hassan, Mustansar Ali Ghazanfar, Arshad Ali Khan, Wadood Abdul, 2024, Advances in Pruning and Quantization for Natural Language Processing, IEEE Access, doi: 10.1109/ACCESS.2024.3465631, https://ieeexplore.ieee.org/document/10685352 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10685352
- Elias Frantar, September, 2024, Compressing Large Neural Networks Algorithms, Systems and Scaling Laws, Ph.D. Thesis, Graduate School, Institute of Science and Technology, Austria, https://research-explorer.ista.ac.at/download/17485/17880/frantar_thesis_final.pdf
- Yongchang Hao, Yanshuai Cao, Lili Mou, 28 Oct 2024, NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks, https://arxiv.org/abs/2410.20650 (Applying data compression algorithms to the exponent of floating-point formats.)
- Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
- Arun Nanda, Sep 7, 2024, Reducing the Size of AI Models. Running large AI models on edge devices, https://towardsdatascience.com/reducing-the-size-of-ai-models-4ab4cfe5887a
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Chinmay Jog, Pangiam November 9, 2024, Here are 3 critical LLM compression strategies to supercharge AI performance, https://venturebeat.com/ai/here-are-3-critical-llm-compression-strategies-to-supercharge-ai-performance/
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
KV Caching and Model Compression
There are several analogous model compression optimizations for KV cache data (one simple form is sketched after this list). Read more about these KV cache research areas:
- KV cache compression
- KV cache sparsity
- KV cache quantization
- KV cache layer fusion
- KV cache layer pruning
- KV cache token pruning
- KV caching (overall)
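As one simple flavor of these ideas, the sketch below shows a sliding-window KV cache that bounds memory by evicting the oldest tokens' entries, which is a basic form of KV cache token pruning. The class and method names are illustrative assumptions, not any specific inference engine's API:

    // Minimal sketch: sliding-window KV cache, a simple form of KV token pruning.
    // Keeps at most `window` tokens; the oldest entries are evicted.
    #include <cstddef>
    #include <deque>
    #include <utility>
    #include <vector>

    class SlidingKVCache {
    public:
        explicit SlidingKVCache(std::size_t window) : window_(window) {}

        void append(std::vector<float> key, std::vector<float> value) {
            keys_.push_back(std::move(key));
            values_.push_back(std::move(value));
            if (keys_.size() > window_) {  // evict the oldest token's K/V
                keys_.pop_front();
                values_.pop_front();
            }
        }

        std::size_t size() const { return keys_.size(); }
        const std::vector<float>& key(std::size_t i) const { return keys_[i]; }
        const std::vector<float>& value(std::size_t i) const { return values_[i]; }

    private:
        std::size_t window_;
        std::deque<std::vector<float>> keys_, values_;
    };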
Data Compression
Data compression here refers to applying existing general-purpose bitstream compression algorithms to make LLMs smaller, using methods such as:
- Huffman coding
- Run-length encoding
- LZW compression
- Zip file formats
One particular case where such compression is highly relevant is sparse models: for example, run-length encoding can track the number of zeros between non-zero values, as in the sketch below.
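As a concrete sketch of that idea (with an illustrative encoding layout, not any standard format), the C++ code below run-length encodes the zeros in a sparse weight vector, storing each non-zero value together with the count of zeros preceding it:

    // Minimal sketch: run-length encoding of zeros in a sparse weight vector.
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct RLEEntry {
        uint32_t zero_run;  // number of zeros before this value
        float value;        // the non-zero value itself
    };

    std::vector<RLEEntry> rle_encode(const std::vector<float>& w) {
        std::vector<RLEEntry> out;
        uint32_t run = 0;
        for (float x : w) {
            if (x == 0.0f) {
                ++run;
            } else {
                out.push_back({run, x});
                run = 0;
            }
        }
        // Note: decoding a trailing run of zeros needs the original length.
        return out;
    }

    std::vector<float> rle_decode(const std::vector<RLEEntry>& enc, std::size_t n) {
        std::vector<float> w;
        w.reserve(n);
        for (const auto& e : enc) {
            w.insert(w.end(), e.zero_run, 0.0f);
            w.push_back(e.value);
        }
        w.resize(n, 0.0f);  // restore any trailing zeros
        return w;
    }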
Research papers on data compression with LLMs:
- Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia Tang, Gennady Pekhimenko, 2018, Gist: Efficient Data Encoding for Deep Neural Network Training, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/document/8416872 PDF: https://www.microsoft.com/en-us/research/uploads/prod/2018/04/fiddle-gist-isca18.pdf
- Arnav Chavan, Nahush Lele, Deepak Gupta, Dec 2023, Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models https://arxiv.org/abs/2312.07046 Code: https://github.com/transmuteAI/trailmet/tree/main/trailmet/algorithms/llm-rom
- Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari, 17 Jun 2024, Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference, https://arxiv.org/abs/2406.11674
- Chen, C, 2024, Hardware-software co-exploration and optimization for next-generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions to end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Master of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Song Han, Huizi Mao, William J. Dally, 15 Feb 2016 (v5), Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, https://arxiv.org/abs/1510.00149
- Rishikesh R. Gajjala, Shashwat Banchhor, Ahmed M. Abdelmoniem, Aritra Dutta, Marco Canini, and Panos Kalnis. 2020. Huffman Coding Based Encoding Techniques for Fast Distributed Deep Learning. In Proceedings of the 1st Workshop on Distributed Machine Learning (DistributedML'20). Association for Computing Machinery, New York, NY, USA, 21–27. https://doi.org/10.1145/3426745.3431334, https://dl.acm.org/doi/abs/10.1145/3426745.3431334
- C. Pal, S. Pankaj, W. Akram, A. Acharyya and D. Biswas, "Modified Huffman based compression methodology for Deep Neural Network Implementation on Resource Constrained Mobile Platforms," 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 2018, pp. 1-5, doi: 10.1109/ISCAS.2018.8351234, https://ieeexplore.ieee.org/abstract/document/8351234
- M. Chandra, "Data Bandwidth Reduction in Deep Neural Network SoCs using History Buffer and Huffman Coding," 2018 International Conference on Computing, Power and Communication Technologies (GUCON), Greater Noida, India, 2018, pp. 1-3, doi: 10.1109/GUCON.2018.8675036, https://ieeexplore.ieee.org/abstract/document/8675036
- Yi Ding, Weiwei Fang, Mengran Liu, Meng Wang, Yusong Cheng, Naixue Xiong, 2023, JMDC: A joint model and data compression system for deep neural networks collaborative computing in edge-cloud networks, Journal of Parallel and Distributed Computing, Volume 173, Pages 83-93, ISSN 0743-7315, https://doi.org/10.1016/j.jpdc.2022.11.008 https://www.sciencedirect.com/science/article/abs/pii/S0743731522002416
- Folino, F., Folino, G., Pisani, F.S. et al., 2024, Efficiently approaching vertical federated learning by combining data reduction and conditional computation techniques. J Big Data 11, 77 (2024). https://doi.org/10.1186/s40537-024-00933-6 https://link.springer.com/article/10.1186/s40537-024-00933-6 https://link.springer.com/content/pdf/10.1186/s40537-024-00933-6.pdf
- D. Becking et al., "Neural Network Coding of Difference Updates for Efficient Distributed Learning Communication," in IEEE Transactions on Multimedia, vol. 26, pp. 6848-6863, 2024, doi: 10.1109/TMM.2024.3357198, https://ieeexplore.ieee.org/abstract/document/10412190
- Yongchang Hao, Yanshuai Cao, Lili Mou, 28 Oct 2024, NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks, https://arxiv.org/abs/2410.20650 (Applying data compression algorithms to the exponent of floating-point formats.)
- Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu, 31 Oct 2024, BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments, https://arxiv.org/abs/2410.23918 https://github.com/xinghaow99/BitStack
- C. M. Rahman, M. E. Sobhani, A. T. Rodela and S. Shatabda, "An Enhanced Text Compression Approach Using Transformer-based Language Models," 2024 IEEE Region 10 Symposium (TENSYMP), New Delhi, India, 2024, pp. 1-6, doi: 10.1109/TENSYMP61132.2024.10752239. https://ieeexplore.ieee.org/abstract/document/10752239
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
More Model Compression Research
Read more about:
- Quantization
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning
- Logarithmic Models
- Inference Optimizations