Aussie AI

Sparsity Optimizations in LLMs

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

What is Sparsity?

Sparse matrices are model weight matrices that contain a large proportion of zeros. Various techniques are used to avoid performing multiplications on these zero values, thereby creating a more efficient model.

Sparsity of the weight matrices is called "static sparsity" because the zeroed weights do not change at runtime. In contrast, sparsity of the activations is called "dynamic sparsity" because it changes depending on the input.
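
For example, here is a minimal C++ sketch of the basic idea of skipping zero weights in a vector dot product. This is illustrative only; real inference kernels use compressed storage formats and vectorized GPU code rather than a scalar test on every weight.

    // Skip multiplications for zero weights in a dot product (illustrative sketch).
    float sparse_vecdot(const float* weights, const float* activations, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            if (weights[i] != 0.0f) {  // skip zero weights (static sparsity)
                sum += weights[i] * activations[i];
            }
        }
        return sum;
    }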

Static Sparsification

Various techniques can be used to "sparsify" a matrix, making the model more sparse. Sparsification is a form of model compression when done after training, but sparsity can also be added during training.

The simplest sparsification technique is "magnitude pruning" whereby small near-zero values are converted to zero. The result is a model with more efficient inference, at the cost of some accuracy. Another common technique is top-K pruning, and there are many other sparsification techniques.
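
As a rough illustration, a minimal C++ sketch of magnitude pruning might look like the following; in practice, sparsification is usually performed offline on the model file, often followed by fine-tuning to recover accuracy.

    #include <math.h>

    // Magnitude pruning sketch: zero out any weight whose magnitude is below a threshold.
    int magnitude_prune(float* weights, int n, float threshold)
    {
        int zeroed = 0;
        for (int i = 0; i < n; i++) {
            if (fabsf(weights[i]) < threshold) {
                weights[i] = 0.0f;  // small near-zero weight becomes exactly zero
                zeroed++;
            }
        }
        return zeroed;  // count of pruned weights (a measure of the added sparsity)
    }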

If a matrix has enough zeros in it, the odds are high that some rows and/or columns are all zeros (or near-zeros). In such cases, a smaller-dimension matrix can replace the full matrix without much loss of accuracy. This related optimization technique is called "low-rank matrices".
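
The low-rank replacement itself is usually computed with a matrix factorization (such as SVD), which is beyond a short example, but the simpler step of detecting a row that is entirely zero can be sketched as follows (assuming a row-major matrix layout):

    // Check whether one row of a row-major matrix is entirely zero,
    // meaning that the row contributes nothing and could be removed.
    bool row_is_all_zero(const float* matrix, int ncols, int row)
    {
        const float* r = &matrix[row * ncols];
        for (int j = 0; j < ncols; j++) {
            if (r[j] != 0.0f) return false;
        }
        return true;
    }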

Dynamic Sparsification

Dynamic sparsification creates zeros on the fly during inference, rather than zeroing weights in the model file ahead of time. There are several ways to do this dynamically:

  • Activation sparsity
  • Dynamic structural pruning (i.e., when pruning is used in adaptive inference techniques)
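
As a simple illustration of the first item, here is a minimal C++ sketch that creates activation zeros on the fly by thresholding small activation values at runtime; real methods are more sophisticated (e.g., relying on RELU outputs or learned predictors).

    #include <math.h>

    // Dynamic sparsification sketch: zero small activation values at runtime,
    // so that later computations can skip the zeroed positions.
    void sparsify_activations(float* activations, int n, float threshold)
    {
        for (int i = 0; i < n; i++) {
            if (fabsf(activations[i]) < threshold) {
                activations[i] = 0.0f;
            }
        }
    }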

Pruning and Sparsity

There is a very close association and much overlap between sparsification and pruning. After all, the effect of pruning is to set weights to zero, and doing enough of this produces sparsity. Hence, sparsification is essentially pruning a large number of weights.

Static pruning is the removal of weights from the model files. Magnitude pruning is unstructured pruning that removes individual weights regardless of structure. Static structured pruning involves removing whole structures, such as static layer pruning (removing entire layers of weights).

Dynamic pruning is an adaptive inference optimization performed at runtime. Dynamic unstructured pruning is not very useful, but there are many types of dynamic structured pruning; in fact, pruning can be applied along four different dimensions of the model.
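
To make the structured/unstructured distinction concrete, here is a minimal C++ sketch of structured pruning that zeroes a whole row of a weight matrix (e.g., one neuron), in contrast to the per-weight zeroing of magnitude pruning shown earlier; this is illustrative only.

    // Structured pruning sketch: zero an entire row of a row-major weight matrix
    // (e.g., removing one neuron), rather than scattered individual weights.
    void prune_row(float* matrix, int ncols, int row)
    {
        float* r = &matrix[row * ncols];
        for (int j = 0; j < ncols; j++) {
            r[j] = 0.0f;
        }
    }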

KV Caching and Sparsity

There are analogous sparsification optimizations for KV cache data. Research has shown that the K and V vectors are often sparse, because the underlying attention computations are usually sparse. Hence, KV sparsification can be a good way to reduce the in-memory size of the KV cache and thereby reduce its computation cost for faster inference.
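
As a hedged illustration of one possible approach (not the method of any particular paper below), a K or V vector could be stored in the cache as (index, value) pairs for only its significant entries:

    #include <math.h>
    #include <vector>

    // Sketch of a sparsified KV cache entry: keep only (index, value) pairs
    // for the larger-magnitude elements of a K or V vector.
    struct SparseEntry { int index; float value; };

    std::vector<SparseEntry> compress_kv_vector(const float* vec, int n, float threshold)
    {
        std::vector<SparseEntry> kept;
        for (int i = 0; i < n; i++) {
            if (fabsf(vec[i]) >= threshold) {
                kept.push_back({ i, vec[i] });  // keep only the significant entries
            }
        }
        return kept;  // smaller in-memory representation of the cached vector
    }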

Research on KV sparsity:

Sparse Attention

The attention computations are core to Transformer inference, and research has shown that they are often sparse. Hence, there is much research on "sparse attention" optimizations.

Sparse attention is somewhat related to fully deactivating attention heads (or neurons), as in attention head pruning, and to other types of pruning along the width dimension; see width pruning.
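
As one simple hedged example of the general idea (not any specific paper's algorithm), the attention scores for a query can be restricted to the top-k positions before the softmax, so that all other positions receive exactly zero attention weight:

    #include <algorithm>
    #include <functional>
    #include <limits>
    #include <vector>

    // Sparse attention sketch: keep only the k largest attention scores and mask
    // the rest to -infinity, so that they become zero after the softmax.
    void topk_mask_scores(std::vector<float>& scores, int k)
    {
        if ((int)scores.size() <= k) return;
        std::vector<float> sorted = scores;
        std::nth_element(sorted.begin(), sorted.begin() + (k - 1), sorted.end(),
                         std::greater<float>());
        float kth = sorted[k - 1];  // the k-th largest score
        for (float& s : scores) {
            if (s < kth) {
                s = -std::numeric_limits<float>::infinity();  // masked out
            }
        }
    }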

Research papers on sparse attention:

  • Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
  • Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
  • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
  • Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • M Pagliardini, D Paliotta, M Jaggi, F Fleuret, 2023, Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention, https://openreview.net/pdf?id=UINHuKeWUa
  • Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
  • Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
  • Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
  • 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
  • S Dai, H Genc, R Venkatesan, B Khailany, 2023 Efficient Transformer Inference with Statically Structured Sparse Attention, https://ieeexplore.ieee.org/abstract/document/10247993
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
  • Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang, 14 Jun 2024, HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning, https://arxiv.org/abs/2406.09827 (Sparse attention using the top-k features and a tree-based structure.)
  • Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
  • Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
  • Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
  • Bokyeong Yoon; Ah-Hyun Lee; Jinsung Kim; Gordon Euhyun Mo, 9 July 2024, Exploring Attention Sparsity to Accelerate Transformer Training on GPUs, IEEE Access (Early Access), DOI: 10.1109/ACCESS.2024.3425638, https://ieeexplore.ieee.org/document/10589623
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Minh Lenhat, Viet Anh Nguyen, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy, 10 Aug 2024, SAMSA: Efficient Transformer for Many Data Modalities, https://arxiv.org/abs/2408.05391 https://github.com/HySonLab/SAMSA
  • Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
  • Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
  • Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
  • Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang, 21 Feb 2024, Do Efficient Transformers Really Save Computation? https://arxiv.org/abs/2402.13934
  • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
  • Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
  • Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466

Feed-Forward Network Sparsity

FFN sparsity means limiting sparsification to the feed-forward network (FFN) modules within each model layer. There is a close relationship between FFN sparsity and FFN pruning optimizations.
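
For instance, here is a minimal sketch (assuming a hypothetical layer structure, not any particular framework's API) that applies magnitude-style sparsification only to the two FFN weight matrices of a layer, leaving the attention weights dense:

    #include <math.h>

    // Hypothetical FFN weight structure, for illustration only.
    struct FFNWeights {
        float* w1; int w1_count;  // first FFN weight matrix (flattened)
        float* w2; int w2_count;  // second FFN weight matrix (flattened)
    };

    // Apply magnitude-based sparsification to the FFN weights only.
    void sparsify_ffn(FFNWeights& ffn, float threshold)
    {
        float* mats[2] = { ffn.w1, ffn.w2 };
        int counts[2] = { ffn.w1_count, ffn.w2_count };
        for (int m = 0; m < 2; m++) {
            for (int i = 0; i < counts[m]; i++) {
                if (fabsf(mats[m][i]) < threshold) {
                    mats[m][i] = 0.0f;
                }
            }
        }
    }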

Research on FFN sparsity:

Activation Sparsity

Activation sparsity refers to zeros arising dynamically in the "activations" computed during inference, which can be detected and exploited at runtime. It is a particular type of "dynamic sparsity" optimization (other types include optimizations that dynamically remove model data, such as dynamic structural pruning).

There is a close relationship between "activation sparsity" and pruning along the same dimension; see embedding-dimension pruning optimizations.
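
As a hedged illustration of why activation zeros help, here is a minimal C++ sketch of a matrix-vector product that skips the entire weight column for any activation that is zero (e.g., zeroed by RELU); real kernels restructure the memory accesses for efficiency.

    // Exploit dynamic activation sparsity: for a zero activation, the whole
    // corresponding column of the (row-major) weight matrix can be skipped.
    void matvec_skip_zero_activations(const float* weights, const float* activations,
                                      float* out, int nrows, int ncols)
    {
        for (int i = 0; i < nrows; i++) out[i] = 0.0f;
        for (int j = 0; j < ncols; j++) {
            float a = activations[j];
            if (a == 0.0f) continue;  // skip the entire column
            for (int i = 0; i < nrows; i++) {
                out[i] += weights[i * ncols + j] * a;
            }
        }
    }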

Research on activation sparsity:

  • Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Uses a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
  • Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun, 27 Feb 2024 (v2), ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models, https://arxiv.org/abs/2402.13516 (Increases activation sparsity by using RELU and other techniques.)
  • Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
  • Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath, Raghu Meka, 26 Jun 2024, Learning Neural Networks with Sparse Activations, https://arxiv.org/abs/2406.17989
  • James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun, 26 Aug 2024, Training-Free Activation Sparsity in Large Language Models, https://arxiv.org/abs/2408.14690
  • Cody Wild, Jesper Anderson, 10 Jul 2024, Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers, https://arxiv.org/abs/2407.07848
  • Xiaolong Yu, Cong Tian, 30 May 2024, Dual sparse training framework: inducing activation map sparsity via Transformed ℓ1 regularization, https://arxiv.org/abs/2405.19652
  • Rongyu Zhang, Aosong Cheng, Yulin Luo, Gaole Dai, Huanrui Yang, Jiaming Liu, Ran Xu, Li Du, Yuan Du, Yanbing Jiang, Shanghang Zhang, 26 May 2024, Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation, https://arxiv.org/abs/2405.16486 https://github.com/RoyZry98/MoASE-Pytorch
  • Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, David Kappel, Anand Subramoney, 1 May 2024, Weight Sparsity Complements Activity Sparsity in Neuromorphic Language Models, https://arxiv.org/abs/2405.00433
  • Andreas Müller, Erwin Quiring, 27 Mar 2024, The Impact of Uniform Inputs on Activation Sparsity and Energy-Latency Attacks in Computer Vision, https://arxiv.org/abs/2403.18587
  • Ilan Price, Nicholas Daultry Ball, Samuel C.H. Lam, Adam C. Jones, Jared Tanner, 25 Feb 2024, Deep Neural Network Initialization with Sparsity Inducing Activations, https://arxiv.org/abs/2402.16184
  • Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, Anand Subramoney, 7 Dec 2023 (v2), Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference, https://arxiv.org/abs/2311.07625
  • Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar, 6 Oct 2023, ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, https://arxiv.org/abs/2310.04564
  • Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
  • Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
  • Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
  • Neural Magic, 2024, DeepSparse: Sparsity-aware deep learning inference runtime for CPUs, https://github.com/neuralmagic/deepsparse https://neuralmagic.com/deepsparse/
  • Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
  • Chaojun Xiao, Zhengyan Zhang, Chenyang Song, Dazhi Jiang, Feng Yao, Xu Han, Xiaozhi Wang, Shuo Wang, Yufei Huang, Guanyu Lin, Yingfa Chen, Weilin Zhao, Yuge Tu, Zexuan Zhong, Ao Zhang, Chenglei Si, Khai Hao Moo, Chenyang Zhao, Huimin Chen, Yankai Lin, Zhiyuan Liu, Jingbo Shang, Maosong Sun, Sep 2024, Configurable Foundation Models: Building LLMs from a Modular Perspective, https://arxiv.org/pdf/2409.02877
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen, 23 Oct 2024, CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation, https://arxiv.org/abs/2410.18311 https://wangqinsi1.github.io/coreinfer_page/
  • Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun, 4 Nov 2024, Sparsing Law: Towards Large Language Models with Greater Activation Sparsity, https://arxiv.org/abs/2411.02335
  • Jiho Shin, Hoeseok Yang, Youngmin Yi, 19 Nov 2024, SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference, https://arxiv.org/abs/2411.12692
  • Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin, 3 Dec 2024 (v2), Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification, https://arxiv.org/abs/2412.00876 https://github.com/Osilly/dynamic_llava (Sparsification of the context in vision model.)
  • Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, 10 Dec 2024 (v2), Mixture of Hidden-Dimensions Transformer, https://arxiv.org/abs/2412.05644

Sparse Matrix Multiplication

There are special optimizations made possible by sparsity in the MatMul/GEMM kernels, such as compressed storage formats that store and multiply only the nonzero values; a minimal sketch appears below.

Research on sparse matrix computations:
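
Here is a minimal sketch of the classic CSR (compressed sparse row) format and its matrix-vector multiply, in which only the stored nonzero weights are ever multiplied; production sparse GEMM kernels are far more elaborate (blocking, vectorization, GPU sparse tensor cores).

    #include <vector>

    // Compressed sparse row (CSR) storage: only the nonzero values are stored.
    struct CsrMatrix {
        std::vector<float> values;  // nonzero values, row by row
        std::vector<int> colidx;    // column index of each nonzero
        std::vector<int> rowptr;    // offset of each row in values[] (size nrows+1)
    };

    // Sparse matrix-vector multiply: y = M * x, touching only the nonzeros.
    void csr_matvec(const CsrMatrix& m, const float* x, float* y, int nrows)
    {
        for (int i = 0; i < nrows; i++) {
            float sum = 0.0f;
            for (int k = m.rowptr[i]; k < m.rowptr[i + 1]; k++) {
                sum += m.values[k] * x[m.colidx[k]];
            }
            y[i] = sum;
        }
    }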

Block Sparsity

Research on block-level sparsity:

  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
  • Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
  • Cong Guo; Fengchen Xue; Jingwen Leng; Yuxian Qiu, May 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533
  • Paolo D'Alberto, Taehee Jeong, Akshai Jain, Shreyas Manjunath, Mrinal Sarmah, Samuel Hsu, Yaswanth Raparti, Nitesh Pipralia, 12 Jul 2024, Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators, https://arxiv.org/abs/2407.09453
  • Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu, 20 Aug 2024, LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models, https://arxiv.org/abs/2408.10631 https://github.com/YupengSu/LLM-Barber
  • Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
  • Kuo-Wei Chang, Tian-Sheuan Chang, 2 May 2022, VSCNN: Convolution Neural Network Accelerator With Vector Sparsity, https://arxiv.org/abs/2205.02271
  • Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
  • Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang, 18 Oct 2024 (v2), SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs, https://arxiv.org/abs/2410.13276

Vector Sparsity

Vector sparsity is similar to block sparsity, but only along a single dimension; see the sketch after the reference list below.

Research on vector-level sparsity:

  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, Eie: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, 2016, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
  • Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
  • Kuo-Wei Chang, Tian-Sheuan Chang, 2 May 2022, VSCNN: Convolution Neural Network Accelerator With Vector Sparsity, https://arxiv.org/abs/2205.02271
  • M. Zhu, T. Zhang, Z. Gu and Y. Xie, "Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs", Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, pp. 359-371, Oct. 2019. https://dl.acm.org/doi/pdf/10.1145/3352460.3358269 (Vector-wise sparsity.)
  • Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
  • Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka, 29 Aug 2024, PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation, https://arxiv.org/abs/2408.16246
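
Several of the papers above concern N:M structured sparsity (e.g., 2:4 on GPU sparse tensor cores), which constrains small groups of weights along one dimension. Here is a minimal hedged sketch of enforcing an N:M pattern in software; hardware-accelerated versions use a dedicated compressed storage format.

    #include <math.h>
    #include <algorithm>
    #include <vector>

    // Enforce N:M sparsity: within each group of m consecutive weights,
    // keep the n largest-magnitude values and zero the rest.
    void enforce_n_of_m(float* weights, int count, int n, int m)
    {
        for (int base = 0; base + m <= count; base += m) {
            std::vector<int> idx(m);
            for (int i = 0; i < m; i++) idx[i] = i;
            std::sort(idx.begin(), idx.end(), [&](int a, int b) {
                return fabsf(weights[base + a]) > fabsf(weights[base + b]);
            });
            for (int i = n; i < m; i++) {
                weights[base + idx[i]] = 0.0f;  // zero the (m - n) smallest in the group
            }
        }
    }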

Tensor Sparsity

Research on sparse tensors:

SLIDE (Sparse Hashing for Back-Propagation in Training)

Research papers on SLIDE:

Dynamic Sparsity Research

Papers on dynamic sparsity include:

General Research on Sparsity Techniques

More AI Research

Read more about: