Aussie AI
Sparsity Optimizations in LLMs
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
What is Sparsity?
Sparse matrices are model weight matrices that contain a large proportion of zero values. Various techniques avoid performing multiplications on these zeros, yielding a more efficient model.
Sparsity of the weight matrices is called "static sparsity" because the zeroed weights do not change at runtime. In contrast, sparsity of the activations is called "dynamic sparsity" because it changes depending on the input.
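As a minimal illustration of the basic idea (not any particular library's kernel), a dot product can simply test for zero weights and skip those multiplications; real sparse kernels go further and avoid even storing or loading the zeros:

    #include <cstddef>

    // Skip multiplications where the weight is zero (static sparsity).
    float sparse_dot_product(const float* weights, const float* activations, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i) {
            if (weights[i] != 0.0f) {      // zero weight contributes nothing
                sum += weights[i] * activations[i];
            }
        }
        return sum;
    }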
Static Sparsification
Various techniques can be used to "sparsify" a matrix, making the model more sparse. When applied after training, this is a form of model compression, but sparsity can also be introduced during training.
The simplest sparsification technique is "magnitude pruning" whereby small near-zero values are converted to zero. The result is a model with more efficient inference, at the cost of some accuracy. Another common technique is top-K pruning, and there are many other sparsification techniques.
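A minimal sketch of magnitude pruning in C++, assuming an in-place update of a flat weight array; the threshold value is an illustrative placeholder, not a recommended setting:

    #include <cmath>
    #include <cstddef>

    // Zero out weights whose magnitude is below the threshold (in place).
    void magnitude_prune(float* weights, std::size_t n, float threshold) {
        for (std::size_t i = 0; i < n; ++i) {
            if (std::fabs(weights[i]) < threshold) {
                weights[i] = 0.0f;   // creates static sparsity, at some accuracy cost
            }
        }
    }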
If a matrix has enough zeros in it, the odds are high that some rows and/or columns are all zeros (or near-zeros). In such cases, a smaller-dimension matrix can replace the full matrix without much loss of accuracy. This related optimization technique is called "low-rank matrices".
Dynamic Sparsification
Dynamic sparsification creates zeros on the fly at runtime, rather than zeroing weights in the stored model. There are several ways to do this dynamically:
- Activation sparsity
- Dynamic structural pruning (i.e., when pruning is used in adaptive inference techniques)
Pruning and Sparsity
There is a very close association and much overlap between sparsification and pruning. After all, the effect of pruning is to set weights to zero, and doing enough of this produces sparsity. In other words, sparsification is essentially large-scale pruning of weights.
Static pruning is the removal of weights from the model files. Magnitude pruning is unstructured pruning that removes individual weights regardless of their position. Static structured pruning involves removing whole structures, such as static layer pruning (removing entire layers of weights).
Dynamic pruning is an adaptive inference optimization performed at runtime. Dynamic unstructured pruning is not very useful, but there are many types of dynamic structured pruning. In fact, there are four dimensions of pruning (a depthwise example is sketched after the list below):
- Depthwise: early exiting, layer skipping, depth pruning, etc.
- Widthwise: attention head pruning, FFN pruning, width pruning.
- Lengthwise: input token pruning, prompt compression, length pruning, etc.
- Model dimension: embedding-dimension pruning.
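As one illustration of the depthwise dimension, here is a hedged sketch of dynamic layer skipping (early exit); the Layer type and the run_layer and confident_enough helpers are hypothetical placeholders for whatever the inference engine provides:

    #include <vector>

    struct Layer { /* weights for one Transformer layer (placeholder) */ };

    // Hypothetical helpers assumed to be provided by the inference engine.
    std::vector<float> run_layer(const Layer& layer, const std::vector<float>& hidden);
    bool confident_enough(const std::vector<float>& hidden);

    // Depthwise dynamic pruning (early exit): stop once the exit criterion fires,
    // so the remaining layers are skipped for this input.
    std::vector<float> run_layers_with_early_exit(std::vector<float> hidden,
                                                  const std::vector<Layer>& layers) {
        for (const Layer& layer : layers) {
            hidden = run_layer(layer, hidden);
            if (confident_enough(hidden)) break;
        }
        return hidden;
    }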
KV Caching and Sparsity
There are analogous sparsification optimizations for KV cache data. Research has shown that the K and V vectors are often sparse, because the attention computation itself is usually sparse (only a few tokens receive significant attention). Hence, KV sparsification can be a good way to reduce the in-memory size of the KV cache and thereby its computation cost for faster inference (an illustrative sketch appears after the list below). Read more about these KV cache research areas:
- KV cache sparsity
- KV cache compression
- KV caching (overall)
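The sketch below illustrates the general idea of KV cache sparsification by keeping only the top-k entries ranked by an importance score (e.g., accumulated attention weight); it is not the algorithm of any specific paper, and the data layout and scoring are assumptions:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct KVEntry {
        std::vector<float> key;
        std::vector<float> value;
        float importance = 0.0f;   // e.g., accumulated attention score (assumption)
    };

    // Keep only the 'keep' most important entries; the rest are evicted.
    // Note: this sketch does not preserve the original token order of survivors.
    void sparsify_kv_cache(std::vector<KVEntry>& cache, std::size_t keep) {
        if (cache.size() <= keep) return;
        std::nth_element(cache.begin(), cache.begin() + keep, cache.end(),
                         [](const KVEntry& a, const KVEntry& b) {
                             return a.importance > b.importance;
                         });
        cache.resize(keep);   // drop the lower-importance entries
    }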
Research on KV sparsity:
- Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
- Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time, 2023. http://arxiv.org/abs/2305.17118 (Reduces the size of the KV cache by limiting storage to only pivotal tokens.)
- H Jiang, Q Wu, CY Lin, Y Yang, L Qiu, Oct 2023, LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, arXiv preprint arXiv:2310.05736, https://arxiv.org/pdf/2310.05736.pdf, Code: https://aka.ms/LLMLingua (Discusses token pruning reducing size of KV cache.)
- S Ren, Q Jia, KQ Zhu, arXiv preprint arXiv:2310.08152, Context Compression for Auto-regressive Transformers with Sentinel Tokens, Oct 2023, https://arxiv.org/pdf/2310.08152.pdf, Code: https://github.com/DRSY/KV_Compression
- G Xiao, Y Tian, B Chen, S Han, M Lewis, Sep 2023, Efficient Streaming Language Models with Attention Sinks, arXiv preprint arXiv:2309.17453, https://arxiv.org/abs/2309.17453 (Sliding window KV caching.)
- Qiaozhi He, Zhihua Wu, 28 Apr 2024, Efficient LLM Inference with Kcache, https://arxiv.org/abs/2404.18057 (Splits the KV cache into a KCache stored in HBM and a Vcache stored in CPU memory. The requests for the V cache are limited by filtering after attention based on the Softmax scaled top-N results of the QK matrix multiplication, thereby pruning many of the V cache memory loads and corresponding calculations.)
- Zihao Wang, Shaoduo Gan, 7 Apr 2024, SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget, https://arxiv.org/abs/2404.04793 Code: https://github.com/hetailang/squeezeattention (Optimization of the KV cache along the two dimensions of layers and input sequence.)
- Youpeng Zhao, Di Wu, Jun Wang, 26 Mar 2024, ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching, https://arxiv.org/abs/2403.17312 (Improved memory management of the cache for KV caching during autoregressive inference with prioritization of tokens based on sparse window attention, and managing caching versus recomputation.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Zhenyu Zhang, Shiwei Liu, Runjin Chen, Bhavya Kailkhura, Beidi Chen, Atlas Wang, 2024, Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, https://proceedings.mlsys.org/paper_files/paper/2024/hash/bbb7506579431a85861a05fff048d3e1-Abstract-Conference.html https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf https://github.com/VITA-Group/Q-Hitter
- Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 28 Oct 2024, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465 https://github.com/bytedance/ShadowKV
- Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He, 30 Oct 2024, BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference, https://arxiv.org/abs/2410.23079 https://github.com/JunqiZhao888/buzz-llm
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C.S. Lui, Haibo Chen, 4 Dec 2024, Unifying KV Cache Compression for Large Language Models with LeanKV, https://arxiv.org/abs/2412.03131 (KV cache compression via mixed-precision KV quantization, token-specific KV pruning, and KV sparsity. Also uses a KV paging method similar to paged attention.)
Sparse Attention
The attention computations are core to Transformer inference, and research has shown that they are often sparse. Hence, there is much research on "sparse attention" optimizations.
Sparse attention is somewhat related to fully deactivating attention heads (or neurons), as in attention head pruning and other types of pruning along the width dimension; see width pruning.
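The following is a simplified sketch of top-k sparse attention for a single query vector, where the softmax and weighted sum run only over the k highest-scoring keys; it illustrates the general idea rather than any specific published kernel:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    std::vector<float> topk_sparse_attention(
            const std::vector<float>& q,                    // query, size d
            const std::vector<std::vector<float>>& keys,    // n keys, each size d
            const std::vector<std::vector<float>>& values,  // n values, each size d
            std::size_t k) {
        const std::size_t n = keys.size(), d = q.size();
        std::vector<float> out(d, 0.0f);
        if (n == 0 || k == 0) return out;
        k = std::min(k, n);

        // Score every key (one row of QK^T), scaled by sqrt(d).
        std::vector<float> scores(n);
        for (std::size_t i = 0; i < n; ++i)
            scores[i] = std::inner_product(q.begin(), q.end(), keys[i].begin(), 0.0f)
                        / std::sqrt(static_cast<float>(d));

        // Keep only the indices of the k largest scores (the "sparse" part).
        std::vector<std::size_t> idx(n);
        std::iota(idx.begin(), idx.end(), 0);
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
            [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });

        // Softmax over the selected scores only; unselected positions get zero weight.
        const float max_s = scores[idx[0]];
        std::vector<float> w(k);
        float denom = 0.0f;
        for (std::size_t j = 0; j < k; ++j) {
            w[j] = std::exp(scores[idx[j]] - max_s);
            denom += w[j];
        }

        // Weighted sum of the selected value vectors.
        for (std::size_t j = 0; j < k; ++j)
            for (std::size_t t = 0; t < d; ++t)
                out[t] += (w[j] / denom) * values[idx[j]][t];
        return out;
    }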
Research papers on sparse attention:
- Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele, 4 Jun 2024, Loki: Low-Rank Keys for Efficient Sparse Attention, https://arxiv.org/abs/2406.02542 (Sparsification of the KV cache values with a focus on the key vectors.)
- Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
- Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen, 18 Apr 2024, TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding, https://arxiv.org/abs/2404.11912 Code: https://github.com/Infini-AI-Lab/TriForce (Improves issues with long context in the use of speculative decoding, such as small model context sizes and KV cache memory usage bottlenecks for Llama2-7B-128K model on an A100 and RTX 4090 GPU.)
- Junbo Qiao, Wei Li, Haizhen Xie, Hanting Chen, Yunshuai Zhou, Zhijun Tu, Jie Hu, Shaohui Lin, 9 Apr 2024, LIPT: Latency-aware Image Processing Transformer, https://arxiv.org/abs/2404.06075
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- M Pagliardini, D Paliotta, M Jaggi, F Fleuret, 2023, Fast Attention Over Long Sequences With Dynamic Sparse Flash Attention, https://openreview.net/pdf?id=UINHuKeWUa
- Yunpeng Huang, Jingwei Xu, Zixu Jiang, Junyu Lai, Zenan Li, Yuan Yao, Taolue Chen, Lijuan Yang, Zhou Xin, Xiaoxing Ma, Nov 2023, Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey, https://arxiv.org/abs/2311.12351 Project: https://github.com/Strivin0311/long-llms-learning
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
- Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- Iz Beltagy, Matthew E. Peters, Arman Cohan, Dec 2020, Longformer: The Long-Document Transformer, arXiv preprint arXiv:2004.05150 (2020). https://arxiv.org/abs/2004.05150
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating Long Sequences with Sparse Transformers. https://arxiv.org/abs/1904.10509
- 3 Feb 2024, Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models, Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi, https://arxiv.org/abs/2402.02244 (A survey of processing long context length using methods such as positional encoding and approximate attention including Softmax-free attention.)
- S Dai, H Genc, R Venkatesan, B Khailany, 2023 Efficient Transformer Inference with Statically Structured Sparse Attention, https://ieeexplore.ieee.org/abstract/document/10247993
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, June 2022. https://arxiv.org/abs/2205.14135 Code: https://github.com/HazyResearch/flash-attention (The original FlashAttention version 1, now superseded by FlashAttention 2, which uses tiling and memory-aware kernels to optimize attention.)
- Vgel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang, 14 Jun 2024, HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning, https://arxiv.org/abs/2406.09827 (Sparse attention using the top-k features and a tree-based structure.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim, 28 Jun 2024, InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management, https://arxiv.org/abs/2406.19707 (KV caching optimization using salient token pruning for the attention layer.)
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- Bokyeong Yoon; Ah-Hyun Lee; Jinsung Kim; Gordon Euhyun Moon, 9 July 2024, Exploring Attention Sparsity to Accelerate Transformer Training on GPUs, IEEE Access (Early Access), DOI: 10.1109/ACCESS.2024.3425638, https://ieeexplore.ieee.org/document/10589623
- Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Minh Lenhat, Viet Anh Nguyen, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy, 10 Aug 2024, SAMSA: Efficient Transformer for Many Data Modalities, https://arxiv.org/abs/2408.05391 https://github.com/HySonLab/SAMSA
- Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
- Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr, July 2024, SparQ Attention: Bandwidth-Efficient LLM Inference, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:42558-42583, 2024, https://proceedings.mlr.press/v235/ribar24a.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/ribar24a/ribar24a.pdf
- Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
- Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang, 21 Feb 2024, Do Efficient Transformers Really Save Computation? https://arxiv.org/abs/2402.13934
- Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. In Advances in Neural Information Processing Systems (NeurIPS), 2021. https://arxiv.org/abs/2110.15343 (Attention optimization using both sparse attention and low-rank matrix attention.)
- Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
- Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, 7 Oct 2024, Differential Transformer, https://arxiv.org/abs/2410.05258
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
Feed-Forward Network Sparsity
FFN sparsity restricts sparsification to the FFN (feed-forward network) modules within each model layer. There is a close relationship between FFN sparsity and FFN pruning optimizations.
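A hedged sketch of structured FFN sparsity, assuming a per-neuron mask that marks which FFN neurons have been pruned to zero; the shapes, the ReLU activation, and the mask representation are assumptions for illustration:

    #include <cstddef>
    #include <vector>

    // x: input of size d; W1: h x d (first FFN layer, one row per neuron);
    // W2: d x h (second FFN layer, one column per neuron); active: size h mask.
    std::vector<float> sparse_ffn(const std::vector<float>& x,
                                  const std::vector<std::vector<float>>& W1,
                                  const std::vector<std::vector<float>>& W2,
                                  const std::vector<bool>& active) {
        const std::size_t d = x.size(), h = W1.size();
        std::vector<float> out(d, 0.0f);
        for (std::size_t j = 0; j < h; ++j) {
            if (!active[j]) continue;               // pruned neuron: skip both matmuls
            float a = 0.0f;
            for (std::size_t i = 0; i < d; ++i)     // first linear layer (row j of W1)
                a += W1[j][i] * x[i];
            if (a <= 0.0f) continue;                // ReLU: a zero activation adds nothing
            for (std::size_t i = 0; i < d; ++i)     // second linear layer (column j of W2)
                out[i] += W2[i][j] * a;
        }
        return out;
    }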
Research on FFN sparsity:
- Lilian Weng, January 10, 2023, Large Transformer Model Inference Optimization, https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- Jie Tang; Shuai Wang; Song Chen; Yi Kang, May 2024, DP-FFN: Block-Based Dynamic Pooling for Accelerating Feed-Forward Layers in Transformers, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/abstract/document/10558119
- Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
- Zhiyang Chen; Yousong Zhu; Zhaowen Li; Fan Yang et al., The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 4130-4134, doi: 10.1109/ICASSP48485.2024.10447756, https://ieeexplore.ieee.org/abstract/document/10447756
Activation Sparsity
Activation sparsity refers to exploiting zeros in the "activations" computed dynamically during inference. It is a particular type of "dynamic sparsity" optimization (other types dynamically remove model data, such as dynamic structural pruning).
There is a close relationship between "activation sparsity" and pruning along the same dimension; see embedding-dimension pruning optimizations.
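A minimal sketch of exploiting dynamic activation sparsity, assuming a row-major weight layout: the nonzero activation indices are gathered once, and only those columns participate in the next matrix-vector product:

    #include <cstddef>
    #include <vector>

    // W is row-major, out_dim x in_dim; act is the post-ReLU activation vector.
    std::vector<float> matvec_sparse_activations(const std::vector<std::vector<float>>& W,
                                                 const std::vector<float>& act) {
        const std::size_t out_dim = W.size();
        std::vector<float> y(out_dim, 0.0f);

        // Gather the indices of nonzero activations (the dynamic sparsity pattern).
        std::vector<std::size_t> nz;
        for (std::size_t i = 0; i < act.size(); ++i)
            if (act[i] != 0.0f) nz.push_back(i);

        // Multiply only the columns that correspond to nonzero activations.
        for (std::size_t r = 0; r < out_dim; ++r)
            for (std::size_t i : nz)
                y[r] += W[r][i] * act[i];
        return y;
    }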
Research on activation sparsity:
- Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Uses a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
- Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun, 27 Feb 2024 (v2), ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models, https://arxiv.org/abs/2402.13516 (Increases activation sparsity by using RELU and other techniques.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath, Raghu Meka, 26 Jun 2024, Learning Neural Networks with Sparse Activations, https://arxiv.org/abs/2406.17989
- James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun, 26 Aug 2024, Training-Free Activation Sparsity in Large Language Models, https://arxiv.org/abs/2408.14690
- Cody Wild, Jesper Anderson, 10 Jul 2024, Uncovering Layer-Dependent Activation Sparsity Patterns in ReLU Transformers, https://arxiv.org/abs/2407.07848
- Xiaolong Yu, Cong Tian, 30 May 2024, Dual sparse training framework: inducing activation map sparsity via Transformed ℓ1 regularization, https://arxiv.org/abs/2405.19652
- Rongyu Zhang, Aosong Cheng, Yulin Luo, Gaole Dai, Huanrui Yang, Jiaming Liu, Ran Xu, Li Du, Yuan Du, Yanbing Jiang, Shanghang Zhang, 26 May 2024, Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation, https://arxiv.org/abs/2405.16486 https://github.com/RoyZry98/MoASE-Pytorch
- Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, David Kappel, Anand Subramoney, 1 May 2024, Weight Sparsity Complements Activity Sparsity in Neuromorphic Language Models, https://arxiv.org/abs/2405.00433
- Andreas Müller, Erwin Quiring, 27 Mar 2024, The Impact of Uniform Inputs on Activation Sparsity and Energy-Latency Attacks in Computer Vision, https://arxiv.org/abs/2403.18587
- Ilan Price, Nicholas Daultry Ball, Samuel C.H. Lam, Adam C. Jones, Jared Tanner, 25 Feb 2024, Deep Neural Network Initialization with Sparsity Inducing Activations, https://arxiv.org/abs/2402.16184
- Rishav Mukherji, Mark Schöne, Khaleelulla Khan Nazeer, Christian Mayr, Anand Subramoney, 7 Dec 2023 (v2), Activity Sparsity Complements Weight Sparsity for Efficient RNN Inference, https://arxiv.org/abs/2311.07625
- Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar, 6 Oct 2023, ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, https://arxiv.org/abs/2310.04564
- Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
- Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- Neural Magic, 2024, DeepSparse: Sparsity-aware deep learning inference runtime for CPUs, https://github.com/neuralmagic/deepsparse https://neuralmagic.com/deepsparse/
- Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
- Chaojun Xiao, Zhengyan Zhang, Chenyang Song, Dazhi Jiang, Feng Yao, Xu Han, Xiaozhi Wang, Shuo Wang, Yufei Huang, Guanyu Lin, Yingfa Chen, Weilin Zhao, Yuge Tu, Zexuan Zhong, Ao Zhang, Chenglei Si, Khai Hao Moo, Chenyang Zhao, Huimin Chen, Yankai Lin, Zhiyuan Liu, Jingbo Shang, Maosong Sun, Sep 2024, Configurable Foundation Models: Building LLMs from a Modular Perspective, https://arxiv.org/pdf/2409.02877
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen, 23 Oct 2024, CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation, https://arxiv.org/abs/2410.18311 https://wangqinsi1.github.io/coreinfer_page/
- Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun, 4 Nov 2024, Sparsing Law: Towards Large Language Models with Greater Activation Sparsity, https://arxiv.org/abs/2411.02335
- Jiho Shin, Hoeseok Yang, Youngmin Yi, 19 Nov 2024, SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference, https://arxiv.org/abs/2411.12692
- Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin, 3 Dec 2024 (v2), Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification, https://arxiv.org/abs/2412.00876 https://github.com/Osilly/dynamic_llava (Sparsification of the context in vision model.)
- Yilong Chen, Junyuan Shang, Zhengyu Zhang, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang, 10 Dec 2024 (v2), Mixture of Hidden-Dimensions Transformer, https://arxiv.org/abs/2412.05644
Sparse Matrix Multiplication
Sparsity enables special optimizations in the MatMul/GEMM kernels; a minimal CSR matrix-vector product is sketched after the list of papers below. Research on sparse matrix computations:
- Y Yang, JS Emer, D Sanchez, 2024, Trapezoid: A Versatile Accelerator for Dense and Sparse Matrix Multiplications, MIT, https://yang-yifan.github.io/papers/isca24_trapezoid.pdf
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song, 19 Sep 2023, Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, https://arxiv.org/abs/2309.10285 Code: https://github.com/AlibabaResearch/flash-llm (Unstructured pruning on tensor cores in GPUs with sparse MatMul optimizations.)
- Hongyaoxing Gu, 11 Mar 2024, A method for accelerating low precision operations by sparse matrix multiplication, https://arxiv.org/abs/2403.06924v1
- Haque, S.A.; Choudhury, N.; Hossain, S. Matrix Multiplication with Diagonals: Structured Sparse Matrices and Beyond. In Proceedings of the 2023 7th International Conference on High Performance Compilation, Computing and Communications, Jinan, China, 17–19 June 2023; pp. 69–76. https://doi.org/10.1145/3606043.3606053
- Sardar Anisul Haque,Mohammad Tanvir Parvez, Shahadat Hossain, Jan 2024, GPU Algorithms for Structured Sparse Matrix Multiplication with Diagonal Storage Schemes, https://www.mdpi.com/1999-4893/17/1/31
- D. Mukunoki, M. Kawai and T. Imamura, 2023, Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor, 2023 IEEE 16th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Singapore, 2023, pp. 608-615, doi: 10.1109/MCSoC60832.2023.00094, https://ieeexplore.ieee.org/abstract/document/10387875
- Jianhua Gao, Weixing Ji, Fangli Chang, Shiyu Han, Bingxin Wei, Zeming Liu, Yizhuo Wang, 11 Jul 2023 (v3), A Systematic Survey of General Sparse Matrix-Matrix Multiplication, https://arxiv.org/abs/2002.11273 https://dl.acm.org/doi/abs/10.1145/3571157
- Helin Cheng, Wenxuan Li, Yuechen Lu, and Weifeng Liu. 2023. HASpGEMM: Heterogeneity-Aware Sparse General Matrix-Matrix Multiplication on Modern Asymmetric Multicore Processors. In Proceedings of the 52nd International Conference on Parallel Processing (ICPP '23). Association for Computing Machinery, New York, NY, USA, 807–817. https://doi.org/10.1145/3605573.3605611 https://dl.acm.org/doi/abs/10.1145/3605573.3605611
- Chunxu Lin, Wensheng Luo, Yixiang Fang, Chenhao Ma, Xilin Liu, and Yuchi Ma. 2024. On Efficient Large Sparse Matrix Chain Multiplication. Proc. ACM Manag. Data 2, 3, Article 156 (June 2024), 27 pages. https://doi.org/10.1145/3654959 https://dl.acm.org/doi/abs/10.1145/3654959
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594
- NVIDIA, 2024, cuSparse, https://docs.nvidia.com/cuda/cusparse/index.html
- Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
- Zhang, H., Ma, W., Yuan, W. et al. Mixed-precision block incomplete sparse approximate preconditioner on Tensor core. CCF Trans. HPC 6, 54–67 (2024). https://doi.org/10.1007/s42514-023-00165-9 https://link.springer.com/article/10.1007/s42514-023-00165-9
- Mohammad Mahdi Salehi Dezfuli, Kazem Cheshmi, 28 Jun 2024, Improving Locality in Sparse and Dense Matrix Multiplications, https://arxiv.org/abs/2407.00243
- A. Haan, D. T. Popovici, K. Sen, C. Iacu and A. Cheung, 2024, "To Tile or not to Tile, That is the Question," 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), San Francisco, CA, USA, 2024, pp. 449-458, doi: 10.1109/IPDPSW63119.2024.00096, https://ieeexplore.ieee.org/abstract/document/10596518
- Kaige Zhang, Xiaoyan Liu, Hailong Yang, Tianyu Feng, Xinyu Yang, Yi Liu, Zhongzhi Luan, and Depei Qian. 2024. Jigsaw: Accelerating SpMM with Vector Sparsity on Sparse Tensor Core. In Proceedings of the 53rd International Conference on Parallel Processing (ICPP '24). Association for Computing Machinery, New York, NY, USA, 1124–1134. https://doi.org/10.1145/3673038.3673108 https://dl.acm.org/doi/abs/10.1145/3673038.3673108
- Bobby Yan, Alexander J. Root, Trevor Gale, David Broman, Fredrik Kjolstad, 20 Jun 2024 (v2), Scorch: A Library for Sparse Deep Learning, https://arxiv.org/abs/2405.16883
- Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad, 21 Aug 2024, Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication, https://arxiv.org/abs/2408.11988
- Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
- Noah Amsel, Tyler Chen, Feyza Duman Keles, Diana Halikias, Cameron Musco, Christopher Musco, 26 Mar 2024 (v3), Fixed-sparsity matrix approximation from matrix-vector products, https://arxiv.org/abs/2402.09379
- Peiming Liu, Alexander J Root, Anlunxu, Yinyig Li, Fredrik Kjolstad, Aart C. Bik, 2024, Compiler Support for Sparse Tensor Convolutions, https://rootjalex.github.io/publications/oopsla2024-spconv.pdf
- Pranav Dangi, Zhenyu Bai, Dhananjaya Wijerathne, Rohan Juneja, 2024, ZeD: A Generalized Accelerator for Variably Sparse Matrix Computations in ML, https://pranavdangi.github.io/papers/PACT24.pdf
- Valentin Isaac–Chassande, Adrian Evans, Yves Durand, and Frédéric Rousseau. 2024. Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey. ACM Trans. Archit. Code Optim. 21, 2, Article 27 (June 2024), 26 pages. https://doi.org/10.1145/3640542 https://dl.acm.org/doi/full/10.1145/3640542
- Anton Lokhmotov, 17 Nov 2015 (v2), GEMMbench: a framework for reproducible and collaborative benchmarking of matrix multiplication, https://arxiv.org/abs/1511.03742
- Xiaobo Lu, Jianbin Fang, Lin Peng, Chun Huang, Zidong Du, Yongwei Zhao, and Zheng Wang. 2024. Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product. ACM Trans. Archit. Code Optim. Just Accepted (August 2024). https://doi.org/10.1145/3688612 https://dl.acm.org/doi/abs/10.1145/3688612
- Patrik Okanovic, Grzegorz Kwasniewski, Paolo Sylos Labini, Maciej Besta, Flavio Vella, Torsten Hoefler, 21 Aug 2024, High Performance Unstructured SpMM Computation Using Tensor Cores, https://arxiv.org/abs/2408.11551
- Takuma Yamaguchi and Federico Busato, Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores, Mar 19, 2021, https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/
- OpenAI, December 6, 2017, Block-sparse GPU kernels, https://openai.com/index/block-sparse-gpu-kernels/ https://cdn.openai.com/blocksparse/blocksparsepaper.pdf https://github.com/openai/blocksparse
- Zijing Gu, 26 Jul 2020, Optimizing Block-Sparse Matrix Multiplications on CUDA with TVM, https://arxiv.org/abs/2007.13055
- R. L. Castro, D. Andrade and B. B. Fraguela, "STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning," in IEEE Access, vol. 12, pp. 70581-70599, 2024, doi: 10.1109/ACCESS.2024.3402326. https://ieeexplore.ieee.org/abstract/document/10534045 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10534045
- Agniv Sharma, Jonas Geiping, 24 Sep 2024 (v2), Efficiently Dispatching Flash Attention For Partially Filled Attention Masks, https://arxiv.org/abs/2409.15097 (Optimizing Flash attention for sparse attention data.)
- Jianhua Gao, Bingjie Liu, Weixing Ji, Hua Huang, 9 Apr 2024, A Systematic Literature Survey of Sparse Matrix-Vector Multiplication, https://arxiv.org/abs/2404.06047
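As mentioned above, here is a minimal CSR (compressed sparse row) matrix-vector product sketch: only the stored nonzeros are multiplied, so zero weights cost neither memory nor arithmetic. Production kernels add blocking, vectorization, and hardware-specific sparse formats.

    #include <cstddef>
    #include <vector>

    struct CSRMatrix {
        std::size_t rows = 0;
        std::vector<float> values;          // nonzero values
        std::vector<std::size_t> col_index; // column of each nonzero
        std::vector<std::size_t> row_start; // size rows+1; nonzeros of row r are
                                            // values[row_start[r] .. row_start[r+1])
    };

    std::vector<float> csr_matvec(const CSRMatrix& A, const std::vector<float>& x) {
        std::vector<float> y(A.rows, 0.0f);
        for (std::size_t r = 0; r < A.rows; ++r)
            for (std::size_t k = A.row_start[r]; k < A.row_start[r + 1]; ++k)
                y[r] += A.values[k] * x[A.col_index[k]];
        return y;
    }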
Block Sparsity
Research on block-level sparsity:
- Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, Daniel Haziza, xFormers: A modular and hackable Transformer modelling library, 2022, Facebook Research, Code: https://github.com/facebookresearch/xformers
- Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
- Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
- Lee, E., Han, Y., Moon, G.E. (2024). Accelerated Block-Sparsity-Aware Matrix Reordering for Leveraging Tensor Cores in Sparse Matrix-Multivector Multiplication. In: Carretero, J., Shende, S., Garcia-Blas, J., Brandic, I., Olcoz, K., Schreiber, M. (eds) Euro-Par 2024: Parallel Processing. Euro-Par 2024. Lecture Notes in Computer Science, vol 14803. Springer, Cham. https://doi.org/10.1007/978-3-031-69583-4_1 https://link.springer.com/chapter/10.1007/978-3-031-69583-4_1
- Cong Guo; Fengchen Xue; Jingwen Leng; Yuxian Qiu, May 2024, Accelerating Sparse DNNs Based on Tiled GEMM, IEEE Transactions on Computers, vol. 73, no. 5, pp. 1275-1289, May 2024, doi: 10.1109/TC.2024.3365942, https://ieeexplore.ieee.org/abstract/document/10436533
- Paolo D'Alberto, Taehee Jeong, Akshai Jain, Shreyas Manjunath, Mrinal Sarmah, Samuel Hsu, Yaswanth Raparti, Nitesh Pipralia, 12 Jul 2024, Weight Block Sparsity: Training, Compilation, and AI Engine Accelerators, https://arxiv.org/abs/2407.09453
- Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu, 20 Aug 2024, LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models, https://arxiv.org/abs/2408.10631 https://github.com/YupengSu/LLM-Barber
- Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
- Kuo-Wei Chang, Tian-Sheuan Chang, 2 May 2022, VSCNN: Convolution Neural Network Accelerator With Vector Sparsity, https://arxiv.org/abs/2205.02271
- Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie, 19 Sep 2024, CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs, https://arxiv.org/abs/2409.12490
- Yizhao Gao, Zhichen Zeng, Dayou Du, Shijie Cao, Hayden Kwok-Hay So, Ting Cao, Fan Yang, Mao Yang, 18 Oct 2024 (v2), SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs, https://arxiv.org/abs/2410.13276
Vector Sparsity
Vector sparsity is similar to block sparsity, but only along a single dimension. Research on vector-level sparsity:
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, Eie: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, 2016, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
- Seungmin Yu, Xiaodie Yi, Hayun Lee, Dongkun Shin, 30 Jul 2024, Toward Efficient Permutation for Hierarchical N:M Sparsity on GPUs, https://arxiv.org/abs/2407.20496
- Kuo-Wei Chang, Tian-Sheuan Chang, 2 May 2022, VSCNN: Convolution Neural Network Accelerator With Vector Sparsity, https://arxiv.org/abs/2205.02271
- M. Zhu, T. Zhang, Z. Gu and Y. Xie, "Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs", Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, pp. 359-371, Oct. 2019. https://dl.acm.org/doi/pdf/10.1145/3352460.3358269 (Vector-wise sparsity.)
- Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
- Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka, 29 Aug 2024, PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation, https://arxiv.org/abs/2408.16246
Tensor Sparsity
Research on sparse tensors:
- David Spuler, March 2024, Chapter 23. Tensors, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Bobby Yan, Alexander J. Root, Trevor Gale, David Broman, Fredrik Kjolstad, 20 Jun 2024 (v2), Scorch: A Library for Sparse Deep Learning, https://arxiv.org/abs/2405.16883
- Philipp Trunschke, Anthony Nouy, Martin Eigel, 13 Oct 2023, Weighted sparsity and sparse tensor networks for least squares approximation, https://arxiv.org/abs/2310.08942
- Kyle Deeds, Willow Ahrens, Magda Balazinska, Dan Suciu, 29 Aug 2024 (v2), Galley: Modern Query Optimization for Sparse Tensor Programs, https://arxiv.org/abs/2408.14706
- Junjing Zheng, Xinyu Zhang, Weidong Jiang, 24 Jul 2024, Sparse Tensor PCA via Tensor Decomposition for Unsupervised Feature Selection, https://arxiv.org/abs/2407.16985
- Sasindu Wijeratne, Rajgopal Kannan, Viktor Prasanna, 14 May 2024, Sparse MTTKRP Acceleration for Tensor Decomposition on GPU, https://arxiv.org/abs/2405.08470
- Tugba Torun, Eren Yenigul, Ameer Taweel, Didem Unat, 8 May 2024, A Sparse Tensor Generator with Efficient Feature Extraction, https://arxiv.org/abs/2405.04944 https://github.com/sparcityeu/feaTen https://github.com/sparcityeu/genTen
- Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna, 31 Mar 2024 (v2), Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition, https://arxiv.org/abs/2403.07953
- Jan Laukemann, Ahmed E. Helal, S. Isaac Geronimo Anderson, Fabio Checconi, Yongseok Soh, Jesmin Jahan Tithi, Teresa Ranadive, Brian J Gravelle, Fabrizio Petrini, Jee Choi, 11 Mar 2024, Accelerating Sparse Tensor Decomposition Using Adaptive Linearized Representation, https://arxiv.org/abs/2403.06348
- Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
- David Spuler, March 2024, Sparse Tensors, in Generative AI in C++, https://www.aussieai.com/book/ch23-sparse-tensors
- Peiming Liu, Alexander J Root, Anlunxu, Yinyig Li, Fredrik Kjolstad, Aart C. Bik, 2024, Compiler Support for Sparse Tensor Convolutions, https://rootjalex.github.io/publications/oopsla2024-spconv.pdf
SLIDE (Sparse Hashing for Back-Propagation in Training)
Research papers on SLIDE:
- Beidi Chen, Tharun Medini, James Farwell, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, 1 Mar 2020 (v2), SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems, https://arxiv.org/abs/1903.03129
- Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Anshumali Shrivastava, 2021, Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More, Part of Proceedings of Machine Learning and Systems 3 (MLSys 2021), https://proceedings.mlsys.org/paper_files/paper/2021/hash/de4086ad4276d895be8ef25ec03c964b-Abstract.html https://proceedings.mlsys.org/paper_files/paper/2021/file/de4086ad4276d895be8ef25ec03c964b-Paper.pdf
- Minghao Yan, Nicholas Meisburger, Tharun Medini, Anshumali Shrivastava, 29 Jan 2022, Distributed SLIDE: Enabling Training Large Neural Networks on Low Bandwidth and Simple CPU-Clusters via Model Parallelism and Sparsity, https://arxiv.org/abs/2201.12667
- Z. Pan, F. Zhang, H. Li, C. Zhang, X. Du and D. Deng, "G-SLIDE: A GPU-Based Sub-Linear Deep Learning Engine via LSH Sparsification," in IEEE Transactions on Parallel and Distributed Systems, vol. 33, no. 11, pp. 3015-3027, 1 Nov. 2022, doi: 10.1109/TPDS.2021.3132493. https://ieeexplore.ieee.org/abstract/document/9635657
- S. Ko, A. Rucker, Y. Zhang, P. Mure and K. Olukotun, "Accelerating SLIDE: Exploiting Sparsity on Accelerator Architectures," 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), Lyon, France, 2022, pp. 663-670, doi: 10.1109/IPDPSW55747.2022.00116. https://ieeexplore.ieee.org/abstract/document/9835529
- Eslam Al-Sobh, Prof. Mahmoud Alshbool, Dr. Yaser Jararweh, Prof. Moath Jarrah, July 2024, Empirical Study To Compare The Performance Of Novel CPU Implementation Of Deep Learning Algorithms With GPU-Based Implementations, https://doi.org/10.21203/rs.3.rs-4625052/v1 https://www.researchsquare.com/article/rs-4625052/v1 https://assets-eu.researchsquare.com/files/rs-4625052/v1_covered_d2a74f37-a02e-45ed-98ab-b104064ee4ed.pdf?c=1721744350
Dynamic Sparsity Research
Papers on dynamic sparsity include:
- D. Kim, J. Ahn and S. Yoo, "ZeNA: Zero-aware neural network accelerator", IEEE Des. Test, vol. 35, no. 1, pp. 39-46, Feb. 2018. https://ieeexplore.ieee.org/document/8013151 (Dynamic sparsity.)
- H. T. Kung, B. McDanel and S. Q. Zhang, "Adaptive tiling: Applying fixed-size systolic arrays to sparse convolutional neural networks", Proc. 24th Int. Conf. Pattern Recognit. (ICPR), pp. 1006-1011, Aug. 2018. https://ieeexplore.ieee.org/document/8545462 (Dynamic sparsity.)
- Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, "AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Uses a "greedy interleaving" algorithm for processing sparse matrices to avoid zero multiplications.)
- Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji, Oct 2023, Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs, https://arxiv.org/abs/2310.08915
- H Fan, SI Venieris, A Kouris, ND Lane, Oct 2023 Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads, arXiv preprint arXiv:2310.11096, https://arxiv.org/pdf/2310.11096.pdf
- S Tan, Y Shen, Z Chen, A Courville, C Gan, Oct 2023, Sparse Universal Transformer, arXiv preprint arXiv:2310.07096, https://arxiv.org/pdf/2310.07096.pdf
- Maurizio Capra, Beatrice Bussolino, Alberto Marchisio, Guido Masera, Maurizio Martina, Muhammad Shafique, 2020, Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead, https://ieeexplore.ieee.org/iel7/6287639/6514899/09269334.pdf, https://arxiv.org/abs/2012.11233
- Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, Henri E. Bal, 2023, Optimization techniques for GPU programming, ACM Computing Surveys, Volume 55, Issue 11, Article No. 239, pp 1–81, https://dl.acm.org/doi/abs/10.1145/3570638, PDF: https://dl.acm.org/doi/pdf/10.1145/3570638 (Extensive survey of software optimizations to improve GPU latency and throughput.)
- A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, “Accelerating sparse deep neural networks,” arXiv preprint arXiv:2104.08378, 2021. https://arxiv.org/abs/2104.08378
- Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
- Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, Jun Zhu, 16 Apr 2024, SparseDM: Toward Sparse Efficient Diffusion Models, https://arxiv.org/abs/2404.10445
- M Sponner, B Waschneck, A Kumar , 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys,, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Y.-H. Chen, T.-J. Yang, J. Emer and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices", IEEE J. Emerging Sel. Topics Circuits Syst., vol. 9, no. 2, pp. 292-308, Jun. 2019. https://arxiv.org/abs/1807.07928
- Mirko Farina, Usman Ahmad, Ahmad Taha, Hussein Younes, Yusuf Mesbah, Xiao Yu, Witold Pedrycz, 2024, Sparsity in transformers: A systematic literature review, Neurocomputing, Volume 582, 14 May 2024, 127468, https://www.sciencedirect.com/science/article/abs/pii/S092523122400239X (General survey of sparsity methods, and techniques that create sparsity.)
- Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun, 27 Feb 2024 (v2), ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models, https://arxiv.org/abs/2402.13516 (Increases activation sparsity by using RELU and other techniques.)
- Ningxin Zheng, Huiqiang Jiang, Quanlu Zhang, Zhenhua Han, Lingxiao Ma, Yuqing Yang, Fan Yang, Chengruidong Zhang, Lili Qiu, Mao Yang, Lidong Zhou, October 2023, PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation, SOSP '23: Proceedings of the 29th Symposium on Operating Systems Principles, Pages 331–347, https://doi.org/10.1145/3600006.3613139 https://dl.acm.org/doi/abs/10.1145/3600006.3613139 (Deep learning compiler for dynamic sparsity.)
- Lu Yin, Gen Li, Meng Fang, Li Shen, Tianjin Huang, Zhangyang Wang, Vlado Menkovski, Xiaolong Ma, Mykola Pechenizkiy, Shiwei Liu, 10 Nov 2023 (v2), Dynamic Sparsity Is Channel-Level Sparsity Learner, 37th Conference on Neural Information Processing Systems (NeurIPS 2023), https://arxiv.org/abs/2305.19454 Code: https://github.com/luuyin/chase
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu, 2 Jul 2024, MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, https://arxiv.org/abs/2407.02490 Code: https://aka.ms/MInference
- Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
- Liu, Zichang. April 2024, Dynamic Sparsity for Efficient Machine Learning, Ph.D. Thesis, Rice University, Houston, Texas, 31532859, https://www.proquest.com/openview/e8536826101c752bd8618ce95292a711/1
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
- Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
- Jordan Dotzel, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, Sep 2024, Opportunities for Post-Training Dynamic Layer Sparsity in Large Vision and Language Models, https://openaccess.thecvf.com/content/CVPR2024W/ELVM/papers/Dotzel_Opportunities_for_Post-Training_Dynamic_Layer_Sparsity_in_Large_Vision_and_CVPRW_2024_paper.pdf (Layerwise dynamic sparsity for vision models.)
- Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen, 28 Oct 2024, ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference, https://arxiv.org/abs/2410.21465 https://github.com/bytedance/ShadowKV
- Nasib Ullah, Erik Schultheis, Mike Lasby, Yani Ioannou, Rohit Babbar, 6 Nov 2024 (v2), Navigating Extremes: Dynamic Sparsity in Large Output Space, https://arxiv.org/abs/2411.03171
General Research on Sparsity Techniques
- Li, Y.; Yu, Y.; Zhang, Q.; Liang, C.; He, P.; Chen, W.; and Zhao, T. 2023. LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation. In Krause, A.; Brunskill, E.; Cho, K.; Engelhardt, B.; Sabato, S.; and Scarlett, J., eds., Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, 20336–20350. PMLR. https://arxiv.org/abs/2306.11222
- Frantar, E.; and Alistarh, D. 2023. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774. https://arxiv.org/abs/2301.00774
- X. Dai, H. Yin and N. K. Jha, "NeST: A neural network synthesis tool based on a grow-and-prune paradigm", IEEE Trans. Comput., vol. 68, no. 10, pp. 1487-1497, Oct. 2019. https://arxiv.org/abs/1711.02017
- S. Cao et al., "Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity", Proc. Int. Symp. Field-Prog. Gate Arrays, pp. 63-72, 2019. PDF: https://wencongxiao.github.io/res/fpga19/FPGA19.pdf
- W. Wen, C. Wu, Y. Wang, Y. Chen and H. Li, "Learning structured sparsity in deep neural networks", Proc. Adv. Neural Inf. Process. Syst., vol. 29, pp. 2074-2082, 2016. https://arxiv.org/abs/1608.03665
- M. Zhu, T. Zhang, Z. Gu and Y. Xie, "Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern GPUs", Proc. 52nd Annu. IEEE/ACM Int. Symp. Microarchitecture, pp. 359-371, Oct. 2019. https://dl.acm.org/doi/pdf/10.1145/3352460.3358269 (Vector-wise sparsity.)
- Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally, "Exploring the regularity of sparse structure in convolutional neural networks", arXiv:1705.08922, 2017. https://arxiv.org/abs/1705.08922
- H. Wang, Q. Zhang, Y. Wang, L. Yu and H. Hu, "Structured pruning for efficient ConvNets via incremental regularization", Proc. Int. Joint Conf. Neural Netw. (IJCNN), pp. 1-8, Jul. 2019. https://openreview.net/pdf?id=S1e_xM7_iQ
- S. Narang, E. Elsen, G. Diamos and S. Sengupta, "Exploring sparsity in recurrent neural networks", arXiv:1704.05119, 2017. https://arxiv.org/abs/1704.05119
- Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li, Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, William J. Dally, "ESE: Efficient speech recognition engine with sparse LSTM on FPGA", Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays (FPGA), pp. 75-84, 2017. https://arxiv.org/abs/1612.00694
- M. Zhu and S. Gupta, "To prune or not to prune: Exploring the efficacy of pruning for model compression", arXiv:1710.01878, 2017. https://arxiv.org/abs/1710.01878
- S. Zhang et al., "Cambricon-X: An accelerator for sparse neural networks", Proc. Int. Symp. Microarchitecture, pp. 1-12, 2016. https://ieeexplore.ieee.org/document/7783723
- Z.-G. Liu, P. N. Whatmough and M. Mattina, "Systolic tensor array: An efficient structured-sparse GEMM accelerator for mobile CNN inference", IEEE Comput. Archit. Lett., vol. 19, no. 1, pp. 34-37, Jan. 2020. https://arxiv.org/abs/2005.08098
- Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, Federico Corradi, Alejandro Linares-Barranco, Shih-Chii Liu, Tobi Delbruck, "NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps", IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644-656, Mar. 2019. https://arxiv.org/abs/1706.01406
- Y. Lu, C. Wang, L. Gong, X. Zhou, SparseNN: a performance-efficient accelerator for large-scale sparse neural networks, Int. J. Parallel Program. 46 (4) (2018) 648–659. https://arxiv.org/abs/1711.01263
- J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N.E. Jerger, A. Moshovos, Cnvlutin: ineffectual-neuron-free deep neural network computing, in: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 1–13. https://ieeexplore.ieee.org/document/7551378
- S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, Y. Chen, Cambricon-x: an accelerator for sparse neural networks, in: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, 2016, p. 20. https://ieeexplore.ieee.org/document/7783723
- A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, W.J. Dally, SCNN: an accelerator for compressed-sparse convolutional neural networks, in: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, 2017, pp. 27–40. https://arxiv.org/abs/1708.04485
- Xu Sun, Xuancheng Ren, Shuming Ma, and Houfeng Wang. 2017. meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. In International Conference on Machine Learning. 3299–3308. https://arxiv.org/abs/1706.06197 Code: https://github.com/lancopku/meProp (Structural sparsification.)
- Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS). 2074–2082. https://arxiv.org/abs/1608.03665 (Structural sparsification.)
- P. Grigoras, P. Burovskiy, E. Hung, and W. Luk. Accelerating SpMV on FPGAs by compressing nonzero values. In International Symposium on Field Programmable Gate Arrays, pages 64–67, 2015. https://ieeexplore.ieee.org/document/7160041 (Sparse multiplication of non-zero values.)
- S Liu, Z Wang, 2023, Ten lessons we have learned in the new "sparseland": A short handbook for sparse neural network researchers, arXiv preprint arXiv:2302.02596, https://arxiv.org/abs/2302.02596
- Enmao Diao, Ganghua Wang, Jiawei Zhan, Yuhong Yang, Jie Ding, Vahid Tarokh, Aug 2023, Pruning Deep Neural Networks from a Sparsity Perspective, https://arxiv.org/abs/2302.05601
- B Yoon, Y Han, GE Moon, 2023, SPION: Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling, arXiv preprint arXiv:2309.12578, https://arxiv.org/pdf/2309.12578.pdf
- Roberto L. Castro, Andrei Ivanov, Diego Andrade, Tal Ben-Nun, Basilio B. Fraguela, Torsten Hoefler, Oct 2023, VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores, https://browse.arxiv.org/abs/2310.02065
- A Jaiswal, Z Gan, X Du, B Zhang, Z Wang, Y Yang, Oct 2023, Compressing LLMs: The Truth is Rarely Pure and Never Simple, arXiv preprint arXiv:2310.01382, https://browse.arxiv.org/pdf/2310.01382.pdf
- Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar, Oct 2023, ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, https://arxiv.org/abs/2310.04564 (Recommends reinstating the simpler ReLU rather than GELU or SiLU, with a focus on inference efficiency.)
- Zichang Liu, April 2024, Ph.D. Thesis, Rice University, Houston, Texas, https://repository.rice.edu/server/api/core/bitstreams/a089344e-6f6b-44d2-a1c3-6cef2c303e86/content (Using sparsity to compress the KV cache for long context windows.)
- Yubin Qin; Yang Wang; Dazheng Deng; Xiaolong Yang, Ayaka: A Versatile Transformer Accelerator With Low-Rank Estimation and Heterogeneous Dataflow, https://ieeexplore.ieee.org/abstract/document/10530252 (Cross layer random prediction to allow sparsification of attention and linear layers.)
- Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz, 6 May 2024, Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment, https://arxiv.org/abs/2405.03594 (High sparsity on Llama2 models.)
- Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren, 7 Jun 2024, MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter, https://arxiv.org/abs/2406.04984 Code: https://github.com/CURRENTF/MEFT
- Ganesh Jawahar, April 2024, Methods for design of efficient on-device natural language processing architectures, Ph.D. thesis, Computer Science, The University of British Columbia (Vancouver) https://open.library.ubc.ca/media/download/pdf/24/1.0441384/4
- Kafeng Wang, Jianfei Chen, He Li, Zhenpeng Mi, Jun Zhu, 16 Apr 2024, SparseDM: Toward Sparse Efficient Diffusion Models, https://arxiv.org/abs/2404.10445
- Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini, 12 Apr 2024, CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models, https://arxiv.org/abs/2404.08763 (Sparsity with dynamic control over the thresholds, with an effect similar to intra-model MoE; see the activation-sparsity sketch after this reference list.)
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London, https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- Panjie Qi; Edwin Hsing-Mean Sha; Qingfeng Zhuge; Hongwu Peng; Shaoyi Hua, 2021, Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization, 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD), https://ieeexplore.ieee.org/document/9643586
- Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti, Mar 2023, Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers, CVPR 2023, https://arxiv.org/abs/2303.13755 https://openaccess.thecvf.com/content/CVPR2023/papers/Wei_Sparsifiner_Learning_Sparse_Instance-Dependent_Attention_for_Efficient_Vision_Transformers_CVPR_2023_paper.pdf
- Rahul Chand, Yashoteja Prabhu, Pratyush Kumar, 20 Dec 2023, DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization, https://arxiv.org/abs/2312.13211
- Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
- Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun, Dec 2019, Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection, https://arxiv.org/abs/1912.11637
- Georgios Georgiadis. 2019. Accelerating Convolutional Neural Networks via Activation Map Compression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 7085–7095. https://arxiv.org/abs/1812.04056
- Bunyodbek Ibrokhimov, Cheonghwan Hur, and Sanggil Kang. 2020. Effective node selection technique towards sparse learning. APPLIED INTELLIGENCE (2020), https://dl.acm.org/doi/abs/10.1007/s10489-020-01720-5
- Zehao Huang and Naiyan Wang. 2018. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV). 304–320. https://arxiv.org/abs/1707.01213
- Shiyao Xu; Jingfei Jiang; Jinwei Xu; Chaorun Liu; Yuanhong He; Xiaohang Liu, 2022, Sparkle: A High Efficient Sparse Matrix Multiplication Accelerator for Deep Learning, 2022 IEEE 40th International Conference on Computer Design (ICCD) https://ieeexplore.ieee.org/document/9978530
- C. Deng, S. Liao, Y. Xie, K. K. Parhi, X. Qian and B. Yuan, "PermDNN: Efficient compressed DNN architecture with permuted diagonal matrices", Proc. 51st Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO), pp. 189-202, Oct. 2018. https://arxiv.org/abs/2004.10936
- Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)
- C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, “DeltaRNN: A power-efficient recurrent neural network accelerator,” in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 21–30. PDF: https://dl.acm.org/doi/pdf/10.1145/3174243.3174261
- Gale, T., Elsen, E., and Hooker, S., The state of sparsity in deep neural networks, arXiv preprint arXiv:1902.09574, 2019, https://arxiv.org/abs/1902.09574
- Kwon, W., Kim, S., Mahoney, M. W., Hassoun, J., Keutzer, K., and Gholami, A., 2022, A fast post-training pruning framework for transformers, arXiv preprint arXiv:2204.09656, https://arxiv.org/abs/2204.09656
- Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré, Feb 2023, Hyena Hierarchy: Towards Larger Convolutional Language Models, https://arxiv.org/abs/2302.10866
- Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen, 11 Jun 2024 (v2), Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters, https://arxiv.org/abs/2406.05955
- Splash: Sparse Flash Attention, 2024, https://github.com/google/jax/blob/main/jax/experimental/pallas/ops/tpu/splash_attention/splash_attention_kernel.py
- Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret, 1 Jun 2023, Faster Causal Attention Over Large Sequences Through Sparse Flash Attention, https://arxiv.org/abs/2306.01160
- Mingxuan He, Mithuna Thottethodi, T.N. Vijaykumar, 6 Apr 2024, Efficient Sparse Processing-in-Memory Architecture (ESPIM) for Machine Learning Inference, https://arxiv.org/abs/2404.04708
- Mirko Farina, Usman Ahmad, Ahmad Taha, Hussein Younes, Yusuf Mesbah, Xiao Yu, Witold Pedrycz, 2024, Sparsity in transformers: A systematic literature review, Neurocomputing, Volume 582, 14 May 2024, 127468, https://www.sciencedirect.com/science/article/abs/pii/S092523122400239X (General survey of sparsity methods, and techniques that create sparsity.)
- Reece Shuttleworth, CHARACTERIZING SPARSITY IN TRANSFORMERS https://reeceshuttle.me/assets/9.58-Final-Project-Report.pdf Code: https://github.com/reeceshuttle/958
- Jianlei Yang, Jiacheng Liao, Fanding Lei, Meichen Liu, Junyi Chen, Lingkun Long, Han Wan, Bei Yu, Weisheng Zhao, Nov 2023, TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices, https://arxiv.org/abs/2311.01759
- Jun Liu; Guohao Dai; Hao Xia; Lidong Guo; Xiangsheng Shi; Jiaming Xu; Nov 2023, TSTC: Two-Level Sparsity Tensor Core Enabling both Algorithm Flexibility and Hardware Efficiency, 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), https://ieeexplore.ieee.org/abstract/document/10323775 (Managing sparse tensors efficiently by using two-level data structures that allows granular control of sparsity.)
- Eunji Kwon; Jongho Yoon; Seokhyeong Kang, Dec 2023, Mobile Transformer Accelerator Exploiting Various Line Sparsity and Tile-Based Dynamic Quantization, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10375766
- Luca Dordoni, Dec 2023, Sparsification of deep neural network via ternary quantization, Masters Thesis, POLITECNICO DI TORINO, Italy, https://webthesis.biblio.polito.it/29424/1/tesi.pdf
- Trevor Gale, Matei Zaharia, Cliff Young, Erich Elsen, Aug 2020, Sparse GPU Kernels for Deep Learning, https://arxiv.org/abs/2006.10901
- Ziheng Wang, Aug 2020, SparseRT: Accelerating Unstructured Sparsity on GPUs for Deep Learning Inference, https://arxiv.org/abs/2008.11849
- Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, Shiwei Liu, Oct 2023, Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity, https://arxiv.org/abs/2310.05175
- Shashank Verma and Neal Vaidya, Mastering LLM Techniques: Inference Optimization, Nov 17, 2023, NVIDIA Technical Blog, https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- Anonymous, ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models, OpenReview, https://openreview.net/pdf?id=osoWxY8q2E
- Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Lukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, and Jonni Kanerva. Sparse is enough in scaling transformers. In Advances in Neural Information Processing Systems, 2021. https://openreview.net/forum?id=-b5OSCydOMe. https://arxiv.org/abs/2111.12763
- M Piórczyński, F Szatkowski, K Bałazy, B Wójcik, 2023, Exploiting Transformer Activation Sparsity with Dynamic Inference https://arxiv.org/pdf/2310.04361.pdf
- KAA Fuad, L Chen, 2023, A Survey on Sparsity Exploration in Transformer-Based Accelerators https://www.mdpi.com/2079-9292/12/10/2299
- Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using a neuron-cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for the prefill versus decoding phases.)
- Zehao Huang. 2018. Data-Driven Sparse Structure Selection for Deep Neural Networks. Papers with Code. https://paperswithcode.com/paper/data-driven-sparse-structure-selection-for
- M. A. Nasution, D. Chahyati and M. I. Fanany, "Faster R-CNN with structured sparsity learning and Ristretto for mobile environment", Proc. Int. Conf. Adv. Comput. Sci. Inf. Syst. (ICACSIS), pp. 309-314, Oct. 2017. https://ieeexplore.ieee.org/document/8355051
- Yun Zhu, Jia-Chen Gu, Caitlin Sikora, Ho Ko, Yinxiao Liu, Chu-Cheng Lin, Lei Shu, Liangchen Luo, Lei Meng, Bang Liu, Jindong Chen, 25 May 2024, Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection, https://arxiv.org/abs/2405.16178
- Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin, 2019, Adaptive attention span in transformers. CoRR, abs/1905.07799, 2019, http://arxiv.org/abs/1905.07799.
- Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019, Augmenting self-attention with persistent memory. CoRR, abs/1907.01470, 2019. http://arxiv.org/abs/1907.01470
- Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. https://openai.com/blog/sparse-transformers, 2019, https://arxiv.org/abs/1904.10509
- Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve Jegou. Large memory layers with product keys. CoRR, abs/1907.05242, 2019. http://arxiv.org/abs/1907.05242
- Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman, 9 Feb 2024 (v2), SliceGPT: Compress Large Language Models by Deleting Rows and Columns, Microsoft Research, https://arxiv.org/abs/2401.15024 Code: https://github.com/microsoft/TransformerCompression (Pruning of matrices effectively prunes along the width dimension and the "fourth" internal dimension of embeddings using techniques such as low-rank matrix factorization.)
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Donghyeon Joo, Ramyad Hadidi, Soheil Feizi, Bahar Asgari, 17 Jun 2024, Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference, https://arxiv.org/abs/2406.11674
- Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, William Fedus, 29 Apr 2022 (v2), ST-MoE: Designing Stable and Transferable Sparse Expert Models, https://arxiv.org/abs/2202.08906
- Chen, C., 2024, Hardware-software co-exploration and optimization for next-generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of single non-linear functions to end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
- Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna, 19 Jun 2024, SDQ: Sparse Decomposed Quantization for LLM Inference, https://arxiv.org/abs/2406.13868 (Combining sparsity and quantization.)
- Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu, 24 Jun 2024, Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers, https://arxiv.org/abs/2406.16747 (Sparse KV cache for memory-efficient decoding on long contexts by selecting KV pairs of salient tokens.)
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang, 16 Jul 2024, Scaling Diffusion Transformers to 16 Billion Parameters, https://arxiv.org/abs/2407.11633 Project: https://github.com/feizc/DiT-MoE
- Ghadeer Jaradat, Mohammed Tolba, Ghada Alsuhli, Hani Saleh, Mahmoud Al-Qutayri, Thanos Stouraitis, Baker Mohammad, 7 Jul 2024, Hybrid Dynamic Pruning: A Pathway to Efficient Transformer Inference, https://arxiv.org/abs/2407.12893
- Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song, 19 Sep 2023, Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity, https://arxiv.org/abs/2309.10285 Code: https://github.com/AlibabaResearch/flash-llm (Unstructured pruning on tensor cores in GPUs with sparse MatMul optimizations.)
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, 4 Jan 2024 (v2), LLM in a flash: Efficient Large Language Model Inference with Limited Memory, https://arxiv.org/abs/2312.11514 (Storing model parameters in flash memory on phones.)
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Peter Belcak, Roger Wattenhofer, Aug 2024, UltraSparseBERT: 99% Conditionally Sparse Language Modelling, Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 104–108, August 11-16, 2024, https://aclanthology.org/2024.acl-short.10.pdf
- Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
- Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng, 11 Aug 2024, Post-Training Sparse Attention with Double Sparsity, https://arxiv.org/abs/2408.07092 Code: https://github.com/andy-yang-1/DoubleSparse (Combined token-level sparse attention with reduced KV data accesses.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, https://arxiv.org/abs/2312.00678
- Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, 22 Aug 2024, A Tighter Complexity Analysis of SparseGPT, https://arxiv.org/abs/2408.12151
- Pai Zeng, Zhenyu Ning, Jieru Zhao, Weihao Cui, Mengwei Xu, Liwei Guo, Xusheng Chen, Yizhou Shan, 27 May 2024 (v2), The CAP Principle for LLM Serving: A Survey of Long-Context Large Language Model Serving, https://arxiv.org/abs/2405.11299
- Kai Yang, Jan Ackermann, Zhenyu He, Guhao Feng, Bohang Zhang, Yunzhen Feng, Qiwei Ye, Di He, Liwei Wang, 21 Feb 2024, Do Efficient Transformers Really Save Computation? https://arxiv.org/abs/2402.13934
- James Liu, Pragaash Ponnusamy, Tianle Cai, Han Guo, Yoon Kim, Ben Athiwaratkun, 26 Aug 2024, Training-Free Activation Sparsity in Large Language Models, https://arxiv.org/abs/2408.14690
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Amir Basic, 2024, Sparsification with Variational Dropout, Master’s thesis, Data Science, Department of Mathematics, Faculty of Mathematics and Natural Sciences, University of Oslo, Norway, https://www.duo.uio.no/bitstream/handle/10852/112199/1/Amir_Basic_Masteroppgave.pdf
- Hongyu Wang, Shuming Ma, Ruiping Wang, Furu Wei, 24 Jul 2024 (v3), Q-Sparse: All Large Language Models can be Fully Sparsely-Activated, https://arxiv.org/abs/2407.10969
- Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, Atul Prakash, 3 Jun 2024 (v3), Learn To be Efficient: Build Structured Sparsity in Large Language Models, https://arxiv.org/abs/2402.06126
- My Social, May 17, 2024, Sparse Llama: Revolutionizing LLMs with 70% Sparsity, https://medium.com/aimonks/sparse-llama-revolutionizing-llms-with-70-sparsity-e6e9664f38e1
- Cerebras, May 15, 2024, Introducing Sparse Llama: 70% Smaller, 3x Faster, Full Accuracy, https://cerebras.ai/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy
- Neural Magic, 2024, Sparse Foundational Llama 2 Models, https://docs.neuralmagic.com/llms/models/sparse-foundational-llama-2/
- Jaxpruner: A Concise Library for Sparsity Research, Joo Hyung Lee, Wonpyo Park, Nicole Elyse Mitchell, Jonathan Pilault, Johan Samir Obando Ceron, Han-Byul Kim, Namhoon Lee, Elias Frantar, Yun Long, Amir Yazdanbakhsh, Woohyun Han, Shivani Agrawal, Suvinay Subramanian, Xin Wang, Sheng-Chun Kao, Xingyao Zhang, Trevor Gale, Aart J.C. Bik, Milen Ferev, Zhonglin Han, Hong-Seok Kim, Yann Dauphin, Gintare Karolina Dziugaite, Pablo Samuel Castro, Utku Evci, Conference on Parsimony and Learning, PMLR 234:515-528, 2024. https://proceedings.mlr.press/v234/lee24a.html https://proceedings.mlr.press/v234/lee24a/lee24a.pdf https://openreview.net/forum?id=H2rCZCfXkS https://openreview.net/pdf?id=H2rCZCfXkS
- Simla Burcu Harma, Ayan Chakraborty, Elizaveta Kostenok, Danila Mishin, Dongho Ha, Babak Falsafi, Martin Jaggi, Ming Liu, Yunho Oh, Suvinay Subramanian, Amir Yazdanbakhsh, 31 May 2024, Effective Interplay between Sparsity and Quantization: From Theory to Practice, https://arxiv.org/abs/2405.20935
- Krisna Pinasthika, Blessius Sheldo Putra Laksono, Riyandi Banovbi Putera Irsal, Syifa Hukma Shabiyya, Novanto Yudistira, 11 Sep 2023, SparseSwin: Swin Transformer with Sparse Transformer Block, https://arxiv.org/abs/2309.05224 https://www.sciencedirect.com/science/article/abs/pii/S0925231224002042
- Zhang, H., Ma, W., Yuan, W. et al. Mixed-precision block incomplete sparse approximate preconditioner on Tensor core. CCF Trans. HPC 6, 54–67 (2024). https://doi.org/10.1007/s42514-023-00165-9 https://link.springer.com/article/10.1007/s42514-023-00165-9
- Bobby Yan, Alexander J. Root, Trevor Gale, David Broman, Fredrik Kjolstad, 20 Jun 2024 (v2), Scorch: A Library for Sparse Deep Learning, https://arxiv.org/abs/2405.16883
- Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
- Yuzong Chen, Jian Meng, Jae-sun Seo, Mohamed S. Abdelfattah, 8 Sep 2024, BBS: Bi-directional Bit-level Sparsity for Deep Learning Acceleration, https://arxiv.org/abs/2409.05227
- Jordan Dotzel, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, Sep 2024, Opportunities for Post-Training Dynamic Layer Sparsity in Large Vision and Language Models, https://openaccess.thecvf.com/content/CVPR2024W/ELVM/papers/Dotzel_Opportunities_for_Post-Training_Dynamic_Layer_Sparsity_in_Large_Vision_and_CVPRW_2024_paper.pdf (Layerwise dynamic sparsity for vision models.)
- Y. Jin, R. Zhong, S. Long and J. Zhai, "Efficient Inference for Pruned CNN Models on Mobile Devices With Holistic Sparsity Alignment," in IEEE Transactions on Parallel and Distributed Systems, doi: 10.1109/TPDS.2024.3462092. https://ieeexplore.ieee.org/document/10682058 https://www.computer.org/csdl/journal/td/5555/01/10682058/20jHtbSkOJO https://doi.ieeecomputersociety.org/10.1109/TPDS.2024.3462092
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4, https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Elias Frantar, September, 2024, Compressing Large Neural Networks Algorithms, Systems and Scaling Laws, Ph.D. Thesis, Graduate School, Institute of Science and Technology, Austria, https://research-explorer.ista.ac.at/download/17485/17880/frantar_thesis_final.pdf
- Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai, 6 Oct 2024, Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective, https://arxiv.org/abs/2410.04466
- Juan Pablo Muñoz, Jinjie Yuan, Nilesh Jain, 1 Oct 2024, SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models, https://arxiv.org/abs/2410.03750 https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning
- Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang, 9 Oct 2024 (v2), SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference, https://arxiv.org/abs/2410.04417 https://github.com/Gumpest/SparseVLMs
- C. Zhang et al., "DSTC: Dual-Side Sparsity Tensor Core for DNNs Acceleration on Modern GPU Architectures," in IEEE Transactions on Computers, doi: 10.1109/TC.2024.3475814. https://ieeexplore.ieee.org/abstract/document/10709841 (Sparse kernels in hardware.)
- Kang Zhao, Tao Yuan, Han Bao, Zhenfeng Su, Chang Gao, Zhaofeng Sun, Zichen Liang, Liping Jing, Jianfei Chen, 21 Oct 2024, Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs, https://arxiv.org/abs/2410.16135
- Pu Zhao, Fei Sun, Xuan Shen, Pinrui Yu, Zhenglun Kong, Yanzhi Wang, Xue Lin, 21 Oct 2024, Pruning Foundation Models for High Accuracy without Retraining, https://arxiv.org/abs/2410.15567 https://github.com/piuzha/APT
- Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen, 23 Oct 2024, CoreInfer: Accelerating Large Language Model Inference with Semantics-Inspired Adaptive Sparse Activation, https://arxiv.org/abs/2410.18311 https://wangqinsi1.github.io/coreinfer_page/
- Wenxuan Huang, Zijie Zhai, Yunhang Shen, Shaoshen Cao, Fei Zhao, Xiangfeng Xu, Zheyu Ye, Shaohui Lin, 3 Dec 2024 (v2), Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification, https://arxiv.org/abs/2412.00876 https://github.com/Osilly/dynamic_llava (Sparsification of the vision-language context in a multimodal model.)
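A recurring theme in the papers above (e.g., ReLU Strikes Back, CATS, Q-Sparse, CHESS) is dynamic activation sparsity: zeroing small intermediate activations at runtime and skipping the weight rows they would have multiplied. The snippet below is a minimal NumPy sketch of that general idea only; the single-vector FFN, ReLU nonlinearity, and fixed 0.05 threshold are illustrative assumptions, not a reproduction of any cited method.

    import numpy as np

    def ffn_dense(x, w_up, w_down):
        # Reference dense FFN block: relu(x @ w_up) @ w_down.
        h = np.maximum(x @ w_up, 0.0)
        return h @ w_down

    def ffn_activation_sparse(x, w_up, w_down, threshold=0.05):
        # Dynamic activation sparsity: recompute the mask for every input and
        # skip down-projection rows whose activation magnitude is small.
        h = np.maximum(x @ w_up, 0.0)
        active = np.abs(h) > threshold
        return h[active] @ w_down[active, :]

    rng = np.random.default_rng(0)
    d_model, d_ffn = 64, 256
    x = rng.standard_normal(d_model)
    w_up = rng.standard_normal((d_model, d_ffn)) / np.sqrt(d_model)
    w_down = rng.standard_normal((d_ffn, d_model)) / np.sqrt(d_ffn)

    dense_out = ffn_dense(x, w_up, w_down)
    sparse_out = ffn_activation_sparse(x, w_up, w_down)
    print("max abs deviation from dense output:", float(np.max(np.abs(dense_out - sparse_out))))

Note that a boolean gather in NumPy does not by itself beat a dense matrix multiply; the cited systems get real speedups by combining calibrated (often per-channel) thresholds with sparse GPU or NPU kernels that avoid loading the skipped weight rows at all.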
More AI Research
Read more about:
- Magnitude pruning
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning