Aussie AI

Sparsity Optimizations in LLMs

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

What is Sparsity?

Sparse matrices are model weight matrices that contain a large proportion of zeros. Various techniques are used to avoid performing multiplications on these zero values, thereby creating a more efficient model.

Sparsity of the weight matrices is called "static sparsity" because the zeroed weights do not change at runtime. In contrast, sparsity of the activations is called "dynamic sparsity" because it changes depending on the input.
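
For example, here is a minimal C++ sketch of the basic idea of skipping zero weights in a vector dot product. This is illustrative only; real inference kernels use compressed storage formats and vectorized GPU code rather than a scalar test on every weight.

    // Skip multiplications for zero weights in a dot product (illustrative sketch).
    float sparse_vecdot(const float* weights, const float* activations, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            if (weights[i] != 0.0f) {  // skip zero weights (static sparsity)
                sum += weights[i] * activations[i];
            }
        }
        return sum;
    }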

Static Sparsification

Various techniques can be used to "sparsify" a matrix, making the model more sparse. Sparsification is a form of model compression when done after training, but sparsity can also be added during training.

The simplest sparsification technique is "magnitude pruning" whereby small near-zero values are converted to zero. The result is a model with more efficient inference, at the cost of some accuracy. Another common technique is top-K pruning, and there are many other sparsification techniques.
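
As a rough illustration, a minimal C++ sketch of magnitude pruning might look like the following; in practice, sparsification is usually performed offline on the model file, often followed by fine-tuning to recover accuracy.

    #include <math.h>

    // Magnitude pruning sketch: zero out any weight whose magnitude is below a threshold.
    int magnitude_prune(float* weights, int n, float threshold)
    {
        int zeroed = 0;
        for (int i = 0; i < n; i++) {
            if (fabsf(weights[i]) < threshold) {
                weights[i] = 0.0f;  // small near-zero weight becomes exactly zero
                zeroed++;
            }
        }
        return zeroed;  // count of pruned weights (a measure of the added sparsity)
    }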

If a matrix has enough zeros in it, the odds are high that some rows and/or columns are all zeros (or near-zeros). In such cases, a smaller-dimension matrix can replace the full matrix without much loss of accuracy. This related optimization technique is called "low-rank matrices".
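
The low-rank replacement itself is usually computed with a matrix factorization (such as SVD), which is beyond a short example, but the simpler step of detecting a row that is entirely zero can be sketched as follows (assuming a row-major matrix layout):

    // Check whether one row of a row-major matrix is entirely zero,
    // meaning that the row contributes nothing and could be removed.
    bool row_is_all_zero(const float* matrix, int ncols, int row)
    {
        const float* r = &matrix[row * ncols];
        for (int j = 0; j < ncols; j++) {
            if (r[j] != 0.0f) return false;
        }
        return true;
    }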

Dynamic Sparsification

Dynamic sparsification creates zeros on the fly during inference, rather than zeroing weights in the model file ahead of time. There are several ways to do this dynamically:

  • Activation sparsity
  • Dynamic structural pruning (i.e., when pruning is used in adaptive inference techniques)
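
As a simple illustration of the first item, here is a minimal C++ sketch that creates activation zeros on the fly by thresholding small activation values at runtime; real methods are more sophisticated (e.g., relying on RELU outputs or learned predictors).

    #include <math.h>

    // Dynamic sparsification sketch: zero small activation values at runtime,
    // so that later computations can skip the zeroed positions.
    void sparsify_activations(float* activations, int n, float threshold)
    {
        for (int i = 0; i < n; i++) {
            if (fabsf(activations[i]) < threshold) {
                activations[i] = 0.0f;
            }
        }
    }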

Pruning and Sparsity

There is a very close association and much overlap between sparsification and pruning. After all, the effect of pruning is to set weights to zero, and doing enough of this produces sparsity. Hence, sparsification is essentially pruning a large number of weights.

Static pruning is the removal of weights from the model files. Magnitude pruning is unstructured pruning that removes individual weights regardless of structure. Static structured pruning involves removing whole structures, such as static layer pruning (removing entire layers of weights).

Dynamic pruning is an adaptive inference optimization performed at runtime. Dynamic unstructured pruning is not very useful, but there are many types of dynamic structured pruning; in fact, pruning can be applied along four different dimensions of the model.
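
To make the structured/unstructured distinction concrete, here is a minimal C++ sketch of structured pruning that zeroes a whole row of a weight matrix (e.g., one neuron), in contrast to the per-weight zeroing of magnitude pruning shown earlier; this is illustrative only.

    // Structured pruning sketch: zero an entire row of a row-major weight matrix
    // (e.g., removing one neuron), rather than scattered individual weights.
    void prune_row(float* matrix, int ncols, int row)
    {
        float* r = &matrix[row * ncols];
        for (int j = 0; j < ncols; j++) {
            r[j] = 0.0f;
        }
    }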

KV Caching and Sparsity

There are analogous sparsification optimizations for KV cache data. Research has shown that the K and V vectors are often sparse, because the underlying attention computations are usually sparse. Hence, KV sparsification can be a good way to reduce the in-memory size of the KV cache and thereby reduce its computation cost for faster inference.
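
As a hedged illustration of one possible approach (not the method of any particular paper below), a K or V vector could be stored in the cache as (index, value) pairs for only its significant entries:

    #include <math.h>
    #include <vector>

    // Sketch of a sparsified KV cache entry: keep only (index, value) pairs
    // for the larger-magnitude elements of a K or V vector.
    struct SparseEntry { int index; float value; };

    std::vector<SparseEntry> compress_kv_vector(const float* vec, int n, float threshold)
    {
        std::vector<SparseEntry> kept;
        for (int i = 0; i < n; i++) {
            if (fabsf(vec[i]) >= threshold) {
                kept.push_back({ i, vec[i] });  // keep only the significant entries
            }
        }
        return kept;  // smaller in-memory representation of the cached vector
    }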

Research on KV sparsity:

Sparse Attention

The attention computations are core to Transformer inference, and research has shown that they are often sparse. Hence, there is much research on "sparse attention" optimizations.

Sparse attention is somewhat related to fully deactivating attention heads (or neurons), as in attention head pruning, and to other types of pruning along the width dimension; see width pruning.
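
As one simple hedged example of the general idea (not any specific paper's algorithm), the attention scores for a query can be restricted to the top-k positions before the softmax, so that all other positions receive exactly zero attention weight:

    #include <algorithm>
    #include <functional>
    #include <limits>
    #include <vector>

    // Sparse attention sketch: keep only the k largest attention scores and mask
    // the rest to -infinity, so that they become zero after the softmax.
    void topk_mask_scores(std::vector<float>& scores, int k)
    {
        if ((int)scores.size() <= k) return;
        std::vector<float> sorted = scores;
        std::nth_element(sorted.begin(), sorted.begin() + (k - 1), sorted.end(),
                         std::greater<float>());
        float kth = sorted[k - 1];  // the k-th largest score
        for (float& s : scores) {
            if (s < kth) {
                s = -std::numeric_limits<float>::infinity();  // masked out
            }
        }
    }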

Research papers on sparse attention:

  • Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
  • Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
  • Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
  • Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • M Pagliardini, D Paliotta, M Jaggi, F Fleuret, 2023, Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention, https://openreview.net/pdf?id=UINHuKeWUa
  • Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
  • Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
  • Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
  • Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
  • 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
  • S Dai, H Genc, R Venkatesan, B Khailany, 2023 Efficient Transformer Inference with Statically Structured Sparse Attention, https://ieeexplore.ieee.org/abstract/document/10247993
  • Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
  • Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
  • Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang, 14 Jun 2024, HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning, https://arxiv.org/abs/2406.09827 (Sparse attention using the top-k features and a tree-based structure.)
  • Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
  • Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
  • Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
  • Bokyeong Yoon; Ah-Hyun Lee; Jinsung Kim; Gordon Euhyun Mo, 9 July 2024, Exploring Attention Sparsity to Accelerate Transformer Training on GPUs, IEEE Access (Early Access), DOI: 10.1109/ACCESS.2024.3425638, https://ieeexplore.ieee.org/document/10589623
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Minh Lenhat, Viet Anh Nguyen, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy, 10 Aug 2024, SAMSA: Efficient Transformer for Many Data Modalities, https://arxiv.org/abs/2408.05391 https://github.com/HySonLab/SAMSA
  • Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
  • Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
  • Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
  • Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang, 21 Feb 2024, Do Efficient Transformers Really Save Computation? https://arxiv.org/abs/2402.13934
  • Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
  • Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
  • Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466

Feed-Forward Network Sparsity

FFN sparsity means limiting sparsification to the feed-forward network (FFN) modules within each model layer. There is a close relationship between FFN sparsity and FFN pruning optimizations.
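
For instance, here is a minimal sketch (assuming a hypothetical layer structure, not any particular framework's API) that applies magnitude-style sparsification only to the two FFN weight matrices of a layer, leaving the attention weights dense:

    #include <math.h>

    // Hypothetical FFN weight structure, for illustration only.
    struct FFNWeights {
        float* w1; int w1_count;  // first FFN weight matrix (flattened)
        float* w2; int w2_count;  // second FFN weight matrix (flattened)
    };

    // Apply magnitude-based sparsification to the FFN weights only.
    void sparsify_ffn(FFNWeights& ffn, float threshold)
    {
        float* mats[2] = { ffn.w1, ffn.w2 };
        int counts[2] = { ffn.w1_count, ffn.w2_count };
        for (int m = 0; m < 2; m++) {
            for (int i = 0; i < counts[m]; i++) {
                if (fabsf(mats[m][i]) < threshold) {
                    mats[m][i] = 0.0f;
                }
            }
        }
    }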

Research on FFN sparsity:

Activation Sparsity

Activation sparsity refers to zeros arising dynamically in the "activations" computed during inference, which can be detected and exploited at runtime. It is a particular type of "dynamic sparsity" optimization (other types include optimizations that dynamically remove model data, such as dynamic structural pruning).

There is a close relationship between "activation sparsity" and pruning along the same dimension; see embedding-dimension pruning optimizations.
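
As a hedged illustration of why activation zeros help, here is a minimal C++ sketch of a matrix-vector product that skips the entire weight column for any activation that is zero (e.g., zeroed by RELU); real kernels restructure the memory accesses for efficiency.

    // Exploit dynamic activation sparsity: for a zero activation, the whole
    // corresponding column of the (row-major) weight matrix can be skipped.
    void matvec_skip_zero_activations(const float* weights, const float* activations,
                                      float* out, int nrows, int ncols)
    {
        for (int i = 0; i < nrows; i++) out[i] = 0.0f;
        for (int j = 0; j < ncols; j++) {
            float a = activations[j];
            if (a == 0.0f) continue;  // skip the entire column
            for (int i = 0; i < nrows; i++) {
                out[i] += weights[i * ncols + j] * a;
            }
        }
    }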

Research on activation sparsity:

  • Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Uses a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
  • Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun, 27 Feb 2024 (v2), ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models, https://arxiv.org/abs/2402.13516 (Increases activation sparsity by using RELU and other techniques.)
  • Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
  • Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath, Raghu Meka, 26 Jun 2024, Learning Neural Networks with Sparse Activations, https://arxiv.org/abs/2406.17989
  • James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun, 26 Aug 2024, Training-Free Activation Sparsity in Large Language Models, https://arxiv.org/abs/2408.14690
  • Cody Wild, Jesper Anderson, 10 Jul 2024, Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers, https://arxiv.org/abs/2407.07848
  • Xiaolong Yu, Cong Tian, 30 May 2024, Dual sparse training framework: inducing activation map sparsity via Transformed ℓ1 regularization, https://arxiv.org/abs/2405.19652
  • Rongyu Zhang, Aosong Cheng, Yulin Luo, Gaole Dai, Huanrui Yang, Jiaming Liu, Ran Xu, Li Du, Yuan Du, Yanbing Jiang, Shanghang Zhang, 26 May 2024, Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation, https://arxiv.org/abs/2405.16486 https://github.com/RoyZry98/MoASE-Pytorch
  • Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, David Kappel, Anand Subramoney, 1 May 2024, Weight Sparsity Complements Activity Sparsity in Neuromorphic Language Models, https://arxiv.org/abs/2405.00433
  • Andreas Müller, Erwin Quiring, 27 Mar 2024, The Impact of Uniform Inputs on Activation Sparsity and Energy-Latency Attacks in Computer Vision, https://arxiv.org/abs/2403.18587
  • Ilan Price, Nicholas Daultry Ball, Samuel C.H. Lam, Adam C. Jones, Jared Tanner, 25 Feb 2024, Deep Neural Network Initialization with Sparsity Inducing Activations, https://arxiv.org/abs/2402.16184
  • Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, Anand Subramoney, 7 Dec 2023 (v2), Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference, https://arxiv.org/abs/2311.07625
  • Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar, 6 Oct 2023, ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, https://arxiv.org/abs/2310.04564
  • Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
  • Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
  • Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
  • Neural Magic, 2024, DeepSparse: Sparsity-aware deep learning inference runtime for CPUs, https://github.com/neuralmagic/deepsparse https://neuralmagic.com/deepsparse/
  • Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
  • Chaojun Xiao, Zhengyan Zhang, Chenyang Song, Dazhi Jiang, Feng Yao, Xu Han, Xiaozhi Wang, Shuo Wang, Yufei Huang, Guanyu Lin, Yingfa Chen, Weilin Zhao, Yuge Tu, Zexuan Zhong, Ao Zhang, Chenglei Si, Khai Hao Moo, Chenyang Zhao, Huimin Chen, Yankai Lin, Zhiyuan Liu, Jingbo Shang, Maosong Sun, Sep 2024, Configurable Foundation Models: Building LLMs from a Modular Perspective, https://arxiv.org/pdf/2409.02877
  • Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
  • Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen, 23 Oct 2024, CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation, https://arxiv.org/abs/2410.18311 https://wangqinsi1.github.io/coreinfer_page/
  • Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun, 4 Nov 2024, Sparsing Law: Towards Large Language Models with Greater Activation Sparsity, https://arxiv.org/abs/2411.02335
  • Jiho Shin, Hoeseok Yang, Youngmin Yi, 19 Nov 2024, SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference, https://arxiv.org/abs/2411.12692
  • Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin, 3 Dec 2024 (v2), Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification, https://arxiv.org/abs/2412.00876 https://github.com/Osilly/dynamic_llava (Sparsification of the context in vision model.)
  • Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, 10 Dec 2024 (v2), Mixture of Hidden-Dimensions Transformer, https://arxiv.org/abs/2412.05644

Sparse Matrix Multiplication

There are special optimizations made possible by sparsity in the MatMul/GEMM kernels, such as compressed storage formats that store and multiply only the nonzero values; a minimal sketch appears below.

Research on sparse matrix computations:
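
Here is a minimal sketch of the classic CSR (compressed sparse row) format and its matrix-vector multiply, in which only the stored nonzero weights are ever multiplied; production sparse GEMM kernels are far more elaborate (blocking, vectorization, GPU sparse tensor cores).

    #include <vector>

    // Compressed sparse row (CSR) storage: only the nonzero values are stored.
    struct CsrMatrix {
        std::vector<float> values;  // nonzero values, row by row
        std::vector<int> colidx;    // column index of each nonzero
        std::vector<int> rowptr;    // offset of each row in values[] (size nrows+1)
    };

    // Sparse matrix-vector multiply: y = M * x, touching only the nonzeros.
    void csr_matvec(const CsrMatrix& m, const float* x, float* y, int nrows)
    {
        for (int i = 0; i < nrows; i++) {
            float sum = 0.0f;
            for (int k = m.rowptr[i]; k < m.rowptr[i + 1]; k++) {
                sum += m.values[k] * x[m.colidx[k]];
            }
            y[i] = sum;
        }
    }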

Block Sparsity

Research on block-level sparsity:

  • Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
  • Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
  • Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
  • Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
  • Cong Guo; Fengchen Xue; Jingwen Leng; Yuxian Qiu, May 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533
  • Paolo D'Alberto, Taehee Jeong, Akshai Jain, Shreyas Manjunath, Mrinal Sarmah, Samuel Hsu, Yaswanth Raparti, Nitesh Pipralia, 12 Jul 2024, Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators, https://arxiv.org/abs/2407.09453
  • Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu, 20 Aug 2024, LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models, https://arxiv.org/abs/2408.10631 https://github.com/YupengSu/LLM-Barber
  • Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
  • Kuo-Wei Chang, Tian-Sheuan Chang, 2 May 2022, VSCNN: Convolution Neural Network Accelerator With Vector Sparsity, https://arxiv.org/abs/2205.02271
  • Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
  • Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang, 18 Oct 2024 (v2), SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs, https://arxiv.org/abs/2410.13276

Vector Sparsity

Vector sparsity is similar to block sparsity, but only along a single dimension; see the sketch after the reference list below.

Research on vector-level sparsity:

  • S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, Eie: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, 2016, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
  • Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
  • Kuo-Wei Chang, Tian-Sheuan Chang, 2 May 2022, VSCNN: Convolution Neural Network Accelerator With Vector Sparsity, https://arxiv.org/abs/2205.02271
  • M. Zhu, T. Zhang, Z. Gu and Y. Xie, "Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs", Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, pp. 359-371, Oct. 2019. https://dl.acm.org/doi/pdf/10.1145/3352460.3358269 (Vector-wise sparsity.)
  • Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
  • Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka, 29 Aug 2024, PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation, https://arxiv.org/abs/2408.16246
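
Several of the papers above concern N:M structured sparsity (e.g., 2:4 on GPU sparse tensor cores), which constrains small groups of weights along one dimension. Here is a minimal hedged sketch of enforcing an N:M pattern in software; hardware-accelerated versions use a dedicated compressed storage format.

    #include <math.h>
    #include <algorithm>
    #include <vector>

    // Enforce N:M sparsity: within each group of m consecutive weights,
    // keep the n largest-magnitude values and zero the rest.
    void enforce_n_of_m(float* weights, int count, int n, int m)
    {
        for (int base = 0; base + m <= count; base += m) {
            std::vector<int> idx(m);
            for (int i = 0; i < m; i++) idx[i] = i;
            std::sort(idx.begin(), idx.end(), [&](int a, int b) {
                return fabsf(weights[base + a]) > fabsf(weights[base + b]);
            });
            for (int i = n; i < m; i++) {
                weights[base + idx[i]] = 0.0f;  // zero the (m - n) smallest in the group
            }
        }
    }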

Tensor Sparsity

Research on sparse tensors:

SLIDE (Sparse Hashing for Back-Propagation in Training)

Research papers on SLIDE:

Dynamic Sparsity Research

Papers on dynamic sparsity include:

General Research on Sparsity Techniques

More AI Research

Read more about: