Aussie AI

Feed-Forward Network Pruning

  • Last Updated 3 December, 2024
  • by David Spuler, Ph.D.

What is FFN Pruning?

The Feed-Forward Network (FFN) is a fundamental architectural component of the Transformer. The FFN has a good reputation as a hard worker, always doing lots of computations. And yet, there are unkind people in this world who want to throw it away.
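
For concreteness, here is a minimal sketch of a standard Transformer FFN block in PyTorch (the dimensions and names are illustrative, not from any particular model):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Standard Transformer FFN: expand, apply activation, project back down."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)     # expansion matrix multiplication (plus bias)
        self.act = nn.GELU()                   # activation between the two linear layers
        self.down = nn.Linear(d_ff, d_model)   # projection matrix multiplication (plus bias)

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```

Each technique in the list below targets part of this block: the two matrix multiplications, the bias additions, the activation function, or the whole block at once.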

FFN optimization techniques include:

  • FFN block pruning (i.e., removing the FFN/MLP components entirely)
  • FFN approximation
  • MatMul optimizations
  • Bias pruning (i.e., removing only the bias weights)
  • FFN integer-only methods (i.e., quantization)
  • FFN-specific sparsity methods (see also LLM sparsification methods)
  • Skip connection pruning (removing the addition of residuals)
  • Bilinear blocks (i.e., removing the activation function between the two layers of the FFN)

FFN Pruning Research

In addition to attention head pruning, some research papers have examined pruning the Feed-Forward Network (FFN) components of the Transformer architecture. In this context, FFN pruning refers to component-wise removal of the FFN, rather than pruning of the weights inside the matrix multiplications that make up the FFN. Various papers report that the entire FFN component can be removed from the decoder without significant accuracy degradation, which is a form of structured pruning or block pruning.
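
As a rough illustration (a hypothetical sketch, not any specific paper's method), block-pruning the FFN amounts to dropping the whole FFN sublayer from a layer, leaving only the residual path:

```python
import torch.nn as nn

class DecoderLayerFFN(nn.Module):
    """FFN sublayer of a Transformer layer, optionally block-pruned (removed entirely)."""
    def __init__(self, d_model=512, d_ff=2048, keep_ffn=True):
        super().__init__()
        self.keep_ffn = keep_ffn
        if keep_ffn:
            self.norm = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )

    def forward(self, x):
        # (The attention sublayer of the full Transformer layer is omitted here.)
        if self.keep_ffn:
            return x + self.ffn(self.norm(x))   # normal FFN sublayer with residual
        return x                                 # pruned: the sublayer becomes an identity
```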

Research papers on FFN pruning include:

  • Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime Neural Pruning. In Advances in Neural Information Processing Systems (NeurIPS). https://dl.acm.org/doi/10.5555/3294771.3294979, https://papers.nips.cc/paper/2017/hash/a51fb975227d6640e4fe47854476d133-Abstract.html
  • Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning. In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/2005.00561, https://arxiv.org/abs/2005.00561
  • Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth, Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior, Oct 2020 https://arxiv.org/abs/2010.01791
  • François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush, Block Pruning For Faster Transformers, 2021, https://arxiv.org/abs/2109.04838
  • Zhengyuan Liu; Nancy F. Chen, Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-10 June 2023, https://ieeexplore.ieee.org/abstract/document/10096717
  • Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
  • Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, and Ilya Chatsviorkin. 2020. Efficient inference for neural machine translation. CoRR, abs/2010.02416. https://arxiv.org/abs/2010.02416
  • Bapna, A., Arivazhagan, N., and Firat, O., Controlling computation versus quality for neural sequence models. ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106 (Conditionally controls FFN units and other model components.)
  • Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving Transformer Models by Reordering their Sublayers. In Proceedings of ACL. Online, 2996–3005. https://doi.org/10.18653/v1/2020.acl-main.270, https://arxiv.org/abs/1911.03864 (Alternates layers of attention heads and FFN units, effectively pruning attention heads and FFN components from some layers.)
  • Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer. arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (A different Transformer architecture that removes attention heads and replaces the FFN with a simpler version.)
  • Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the Sub-layer Functionalities of Transformer Decoder. In Findings of EMNLP. Online, 4799–4811. https://doi.org/10.18653/v1/2020.findings-emnlp.432, https://arxiv.org/abs/2010.02648 (Removes the FFN from the decoder.)
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of FFN components; see "Section 5.3.3 Dropping FFN Layers".)
  • Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting Self-attention with Persistent Memory. arXiv:1907.01470 https://arxiv.org/abs/1907.01470 (Proposed architecture includes removing the FFN component.)
  • Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite Transformer with Long-Short Range Attention. In Proceedings of ICLR. https://openreview.net/forum?id=ByeMPlHKPH, https://arxiv.org/abs/2004.11886, Code: https://github.com/mit-han-lab/lite-transformer (Makes the FFN smaller, attention heads bigger.)
  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2012.14913 (Explores in depth how FFNs work, with relevance to removing redundant FFN units.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf (Did not remove the FFN, but reduced its size from 4 times the hidden size down to 8/3 of the hidden size.)
  • Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive skipping scheme that drops about 30% of FFN calculations, and avoids KV caching problems by skipping only the FFN computations within layers.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzed layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference, https://openreview.net/pdf?id=93BaEweoRg, https://arxiv.org/abs/2310.07707 (Extracts smaller sub-models from a large model based on a subset of FFN parameters, which is analogous to FFN pruning or FFN approximation.)
  • Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Optimizes the FFN by splitting its units into two categories, those that fire commonly and those used rarely, in both ReLU and non-ReLU models; effectively this prunes a subset of the FFN.)
  • Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982 (Survey paper with some coverage of FFN pruning.)
  • Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
  • Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switches and skips sub-layer components such as attention heads and FFNs, and also skips input tokens, with decisions based on allocating computation resources.)
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Shwai He, Guoheng Sun, Zheyu Shen, Ang Li, 22 Jun 2024, What Matters in Transformers? Not All Attention is Needed, https://arxiv.org/abs/2406.15786 https://github.com/Shwai-He/LLM-Drop
  • Anonymous authors, 2024, A deeper look at depth pruning of LLMs, https://openreview.net/pdf?id=9B7ayWclwN
  • Zirui Liu, Qingquan Song, Qiang Charles Xiao, Sathiya Keerthi Selvaraj, Rahul Mazumder, Aman Gupta, Xia Hu, 8 Jan 2024, FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference, https://arxiv.org/abs/2401.04044
  • Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi, 28 May 2024, FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models, https://arxiv.org/abs/2405.18218
  • Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
  • Zhiyang Chen; Yousong Zhu; Zhaowen Li; Fan Yang et al., The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 4130-4134, doi: 10.1109/ICASSP48485.2024.10447756, https://ieeexplore.ieee.org/abstract/document/10447756
  • Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
  • Mustafa Shukor, Matthieu Cord, 12 Oct 2024, Skipping Computations in Multimodal LLMs, https://arxiv.org/abs/2410.09454 https://github.com/mshukor/ima-lmms

FFN Approximation

The FFN can be approximated in various ways. For example, the matrix multiplications can be approximated using LoRA and other low-rank factorization methods. Quantization of the FFN weights into smaller integer types is also a form of approximation (see quantization). A sketch of the low-rank idea appears after the references below.

  • Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 14 Feb 2024, HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference, https://arxiv.org/abs/2402.09360 (Attempts to estimate the output of top-k decoding, so as to prune computation along two dimensions earlier in inference.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
  • S Peng, F Yang, N Sun, S Chen, Y Jiang, A Pan, Oct 2023, Exploring Post-Training Quantization of Protein Language Models, arXiv preprint arXiv:2310.19624, https://arxiv.org/abs/2310.19624
  • Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
  • Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
  • Zhiyang Chen; Yousong Zhu; Zhaowen Li; Fan Yang et al., The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 4130-4134, doi: 10.1109/ICASSP48485.2024.10447756, https://ieeexplore.ieee.org/abstract/document/10447756
  • Suman Sapkota, 21 Oct 2024, Metric as Transform: Exploring beyond Affine Transform for Interpretable Neural Network, https://arxiv.org/abs/2410.16159
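
As a hedged sketch of the low-rank idea (the class name and rank choice are illustrative, not taken from the papers above), a large FFN weight matrix can be replaced by the product of two thin matrices:

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Approximate a d_in x d_out weight matrix by two thin factors of rank r."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.a = nn.Linear(d_in, rank, bias=False)   # d_in -> rank
        self.b = nn.Linear(rank, d_out)              # rank -> d_out
        # Parameter count: rank * (d_in + d_out) instead of d_in * d_out.

    def forward(self, x):
        return self.b(self.a(x))

# e.g., replace the FFN expansion matmul (4096 -> 16384) with a rank-256 approximation
approx_up = LowRankLinear(4096, 16384, rank=256)
```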

FFN Sparsity

Sparsification of the FFN weights is another optimization. It usually falls under the general category of whole-model sparsity (see LLM sparsity research), but FFN-only sparsification has also been examined.
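
A minimal sketch of FFN-only magnitude sparsification (the parameter-name filter and sparsity level are illustrative assumptions):

```python
import torch

@torch.no_grad()
def sparsify_ffn(model, sparsity=0.5):
    """Zero out the smallest-magnitude weights, but only in FFN linear layers."""
    for name, param in model.named_parameters():
        if "ffn" in name and param.dim() == 2:        # assumes FFN weights are named "ffn.*"
            k = max(1, int(param.numel() * sparsity))
            threshold = param.abs().flatten().kthvalue(k).values
            param.mul_(param.abs() > threshold)       # apply the sparsity mask in place
```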

FFN Parameter Sharing

The FFN weights can be shared in a "fused FFN" optimization, which has a similar effect to FFN pruning. Both methods are types of model compression that reduce the storage of FFN weights; a sketch of weight sharing appears after the references below. Research papers include:

  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
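
For illustration (a hypothetical structure, not the code of any paper above), a single FFN module can be instantiated once and reused by every layer, so its weights are stored only once:

```python
import torch.nn as nn

d_model, d_ff, n_layers = 512, 2048, 12

# One FFN object shared across all layers: its weights are stored once.
shared_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
)

class SharedFFNLayer(nn.Module):
    def __init__(self, ffn):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # per-layer parameters stay separate
        self.ffn = ffn                      # reference to the shared FFN weights

    def forward(self, x):
        return x + self.ffn(self.norm(x))

layers = nn.ModuleList(SharedFFNLayer(shared_ffn) for _ in range(n_layers))
```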

FFN Optimization Research

FFNs can be optimized using any of the various matrix multiplication optimizations, and there are also some FFN-specific optimizations, such as reducing the intermediate dimension (a sketch follows the references below). Research papers include:

  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
  • Jie Tang; Shuai Wang; Song Chen; Yi Kang, May 2024, DP-FFN: Block-Based Dynamic Pooling for Accelerating Feed-Forward Layers in Transformers, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/abstract/document/10558119
  • Amin Aminifar, Baichuan Huang, Azra Abtahi, Amir Aminifar, May 2024, Lightweight Inference for Forward-Forward Algorithm, https://whubaichuan.github.io/data/LightFF.pdf
  • Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
  • Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
  • Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv, 28 Nov 2024, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs, NVIDIA Research, https://arxiv.org/abs/2411.19146 (This is NAS on a vast scale, with a search space of size 10^138, because the optimization is applied at block-level granularity to the attention and FFN subcomponents of each layer.)
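
As one simple FFN-specific optimization, the intermediate dimension can be reduced below the conventional 4x expansion, as in the reduced ratio reported for Qwen above. A rough sketch (real models often pair this with a gated FFN variant, which is omitted here):

```python
import torch.nn as nn

d_model = 4096
d_ff_standard = 4 * d_model            # the conventional 4x expansion
d_ff_reduced = (8 * d_model) // 3      # the reduced 8/3 ratio noted for Qwen

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff_reduced),
    nn.SiLU(),
    nn.Linear(d_ff_reduced, d_model),
)
# Per-token FFN cost scales with d_model * d_ff, so 8/3 versus 4 cuts roughly a third.
```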

Bias Pruning

A typical FFN performs its matrix multiplications and then adds an extra "bias" vector. Bias pruning is the removal of these bias vectors from the Feed-Forward Network (FFN), which partially prunes the FFN and reduces its cost. The benefit in speed is only incremental, since bias addition is a linear operation (adding a vector of biases), whereas the matrix-vector multiplications are quadratic.
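
A minimal sketch: in PyTorch, pruning the FFN biases is simply constructing the linear layers with bias=False, which leaves the quadratic matrix multiplications untouched:

```python
import torch.nn as nn

d_model, d_ff = 512, 2048

# FFN with the bias vectors pruned: only the two weight matrices remain.
ffn_no_bias = nn.Sequential(
    nn.Linear(d_model, d_ff, bias=False),   # removes the d_ff-sized bias addition
    nn.GELU(),
    nn.Linear(d_ff, d_model, bias=False),   # removes the d_model-sized bias addition
)
```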

Research papers that examine removing bias vectors from FFN components:

Integer FFN Modules

The use of integers in FFNs usually falls under quantization, which typically applies to both the attention weights and the FFN weights. Hence, integer FFNs or "quantized FFNs" are not a common optimization on their own. However, it is possible to quantize only the FFN weights to integers, while leaving the attention matrices in FP32 (i.e., mixed-precision quantization).
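
As a hedged sketch (using PyTorch's dynamic quantization for brevity; the module layout is an illustrative assumption), the FFN submodule alone can be quantized to int8 while the attention projections remain FP32:

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Toy Transformer layer: an attention projection plus an FFN (illustrative only)."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.attn_proj = nn.Linear(d_model, d_model)   # left in FP32
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

layer = ToyLayer()
# Quantize only the FFN weights to int8; the attention matrices stay FP32.
layer.ffn = torch.ao.quantization.quantize_dynamic(
    layer.ffn, {nn.Linear}, dtype=torch.qint8
)
```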

Alternatives to FFN Modules

There are various ways to approximate the normal FFN modules, or to replace them with alternatives entirely. See also zero-multiplication models for a more general area of research.

Skip Connection Pruning and Optimization

Skip connections, or residual connections, were introduced to solve the vanishing gradient problem during training. However, they add an extra computation step, and some recent research has examined removing skip connections, or other optimizations that modify them. Note that the efficiency gain from removing skip connections is incremental, since a residual is a vector or matrix addition, which costs far less than matrix-vector or matrix-matrix multiplications (see also matrix multiplication optimizations).
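
An illustrative sketch of the difference (the flag is hypothetical, not a specific paper's method): the residual is a single vector addition per sublayer, which is why removing it yields only a small saving relative to the matrix multiplications:

```python
import torch.nn as nn

class FFNSublayer(nn.Module):
    """FFN sublayer with an optional residual (skip) connection."""
    def __init__(self, d_model=512, d_ff=2048, use_skip=True):
        super().__init__()
        self.use_skip = use_skip
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        out = self.ffn(self.norm(x))
        return x + out if self.use_skip else out   # pruned skip: drop one cheap vector add
```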

Research papers include:
