Aussie AI

Feed-Forward Network Pruning

  • Last Updated 30 January, 2025
  • by David Spuler, Ph.D.

What is FFN Pruning?

The Feed-Forward Network (FFN) is a fundamental architectural component of the Transformer. The FFN has a good reputation as a hard worker, always doing lots of computations. And yet, there are unkind people in this world who want to throw it away.

FFN optimization techniques include:

  • FFN block pruning — removing the FFN/MLP components entirely.
  • FFN approximation
  • MatMul optimizations
  • Bias pruning (i.e., removing only the bias weights)
  • FFN integer-only methods (i.e., quantization)
  • FFN-specific sparsity methods (see also LLM sparsification methods)
  • Skip connection pruning (removing the addition of residuals)
  • Bilinear blocks (i.e., removing the activation function between the two layers of the FFN)
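
For reference, a standard Transformer FFN block is just two linear projections with a non-linear activation in between, plus optional bias vectors; the techniques listed above modify or remove parts of this structure. Below is a minimal PyTorch-style sketch, where the dimensions (d_model=512, 4x expansion) are illustrative assumptions rather than fixed requirements.

    import torch
    import torch.nn as nn

    class FeedForward(nn.Module):
        """Standard two-layer Transformer FFN: expand, activate, project back."""
        def __init__(self, d_model=512, expansion=4, bias=True):
            super().__init__()
            d_ff = expansion * d_model                        # typically 4x the model dimension
            self.fc1 = nn.Linear(d_model, d_ff, bias=bias)    # first "Wx+b"
            self.act = nn.GELU()                              # non-linearity (RELU, GELU, or a gated variant)
            self.fc2 = nn.Linear(d_ff, d_model, bias=bias)    # second "Wx+b"

        def forward(self, x):
            return self.fc2(self.act(self.fc1(x)))

    x = torch.randn(1, 16, 512)                               # (batch, tokens, d_model)
    print(FeedForward()(x).shape)                             # torch.Size([1, 16, 512])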

FFN Pruning Research

In addition to attention head pruning, some research papers have examined pruning of the Feed-Forward Network (FFN) components in the Transformer architecture. In this context, FFN pruning refers to the component-wise removal of the FFN, rather than pruning of the weights in the MatMuls that form the FFN. Various papers report that the entire FFN component can be removed from the decoder without significant accuracy degradation. This is a type of structured pruning or block pruning, where a whole FFN block is dropped at once.
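
As a simple illustration of what block-level FFN pruning looks like at inference time, the sketch below uses a hypothetical simplified layer (not any specific paper's method) in which the FFN sublayer is dropped entirely from selected layers, so the residual stream passes through unchanged.

    import torch
    import torch.nn as nn

    class TransformerLayer(nn.Module):
        """Simplified pre-norm Transformer layer with an optional (prunable) FFN sublayer."""
        def __init__(self, d_model=512, n_heads=8, keep_ffn=True):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)
            # FFN block pruning: the whole MLP sublayer is removed from this layer.
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
            ) if keep_ffn else None

        def forward(self, x):
            h = self.norm1(x)
            a, _ = self.attn(h, h, h)
            x = x + a                                  # attention sublayer with residual
            if self.ffn is not None:                   # pruned layers skip the FFN entirely
                x = x + self.ffn(self.norm2(x))
            return x

    x = torch.randn(2, 10, 512)
    pruned = TransformerLayer(keep_ffn=False)          # this layer's FFN has been pruned
    print(pruned(x).shape)                             # torch.Size([2, 10, 512])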

Research papers on FFN pruning include:

  • Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime Neural Pruning. In Advances in Neural Information Processing Systems (NeurIPS). https://dl.acm.org/doi/10.5555/3294771.3294979, https://papers.nips.cc/paper/2017/hash/a51fb975227d6640e4fe47854476d133-Abstract.html
  • Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning. In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/2005.00561, https://arxiv.org/abs/2005.00561
  • Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth, Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior, Oct 2020 https://arxiv.org/abs/2010.01791
  • François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush, Block Pruning For Faster Transformers, 2021, https://arxiv.org/abs/2109.04838
  • Zhengyuan Liu; Nancy F. Chen, Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-10 June 2023, https://ieeexplore.ieee.org/abstract/document/10096717
  • Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
  • Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, and Ilya Chatsviorkin. 2020. Efficient inference for neural machine translation. CoRR, abs/2010.02416. https://arxiv.org/abs/2010.02416
  • Bapna, A., Arivazhagan, N., and Firat, O., Controlling computation versus quality for neural sequence models. ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106 (Conditionally controls FFN units and other model components.)
  • Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving Transformer Models by Reordering their Sublayers. In Proceedings of ACL. Online, 2996–3005. https://doi.org/10.18653/v1/2020.acl-main.270, https://arxiv.org/abs/1911.03864 (Alternates layers of attention heads and FFN units, effectively pruning attention heads and FFN components from some layers.)
  • Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer. arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (Different Transformer architecture that includes removing attention heads and replaces the FFN with a simpler version.)
  • Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the Sub-layer Functionalities of Transformer Decoder. In Findings of EMNLP. Online, 4799–4811. https://doi.org/10.18653/v1/2020.findings-emnlp.432, https://arxiv.org/abs/2010.02648 (Removes the FFN from the decoder.)
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of FFN components; see "Section 5.3.3 Dropping FFN Layers".)
  • Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting Self-attention with Persistent Memory. arXiv:1907.01470 https://arxiv.org/abs/1907.01470 (Proposed architecture includes removing the FFN component.)
  • Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite Transformer with Long-Short Range Attention. In Proceedings of ICLR. https://openreview.net/forum?id=ByeMPlHKPH, https://arxiv.org/abs/2004.11886, Code: https://github.com/mit-han-lab/lite-transformer (Makes the FFN smaller, attention heads bigger.)
  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2012.14913 (Explores in depth how FFNs work, with relevance to removing redundant FFN units.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf (Did not remove the FFN, but reduced the size of the FFN down from 4 times hidden size to 8/3 of the hidden size.)
  • Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer-skipping scheme that drops about 30% of FFN calculations; avoids KV caching problems by skipping only the FFN computations within layers.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Master's Thesis, Data Science and Machine Learning, University College London, https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 11 Oct 2023, MatFormer: Nested Transformer for Elastic Inference, https://arxiv.org/abs/2310.07707, https://openreview.net/pdf?id=93BaEweoRg (Extracts smaller sub-models from a large model based on a subset of FFN parameters, which is analogous to FFN pruning or FFN approximation.)
  • Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
  • Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982 (Survey paper with some coverage of FFN pruning.)
  • Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
  • Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping sub-layer components such as attention heads, FFNs, or input token skipping, using decisions made based on allocating computation resources.)
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Shwai He, Guoheng Sun, Zheyu Shen, Ang Li, 22 Jun 2024, What Matters in Transformers? Not All Attention is Needed, https://arxiv.org/abs/2406.15786 https://github.com/Shwai-He/LLM-Drop
  • Anonymous authors, 2024, A deeper look at depth pruning of LLMs, https://openreview.net/pdf?id=9B7ayWclwN
  • Zirui Liu, Qingquan Song, Qiang Charles Xiao, Sathiya Keerthi Selvaraj, Rahul Mazumder, Aman Gupta, Xia Hu, 8 Jan 2024, FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference, https://arxiv.org/abs/2401.04044
  • Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi, 28 May 2024, FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models, https://arxiv.org/abs/2405.18218
  • Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
  • Zhiyang Chen; Yousong Zhu; Zhaowen Li; Fan Yang et al., The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 4130-4134, doi: 10.1109/ICASSP48485.2024.10447756, https://ieeexplore.ieee.org/abstract/document/10447756
  • Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
  • Mustafa Shukor, Matthieu Cord, 12 Oct 2024, Skipping Computations in Multimodal LLMs, https://arxiv.org/abs/2410.09454 https://github.com/mshukor/ima-lmms
  • Zhuomin He, Yizhen Yao, Pengfei Zuo, Bin Gao, Qinya Li, Zhenzhe Zheng, Fan Wu, 4 Jan 2025, AdaSkip: Adaptive Sublayer Skipping for Accelerating Long-Context LLM Inference, https://arxiv.org/abs/2501.02336 (Optimally skipping sublayer components in FFN and attention during prefill and decoding phases.)
  • Jaehoon Heo, Adiwena Putra, Jieon Yoon, Sungwoong Yune, Hangyeol Lee, Ji-Hoon Kim, Joo-Young Kim, 10 Jan 2025, EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models, https://arxiv.org/abs/2501.05680
  • Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, this optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU) with linear functions, then combining the two matrices using matrix-multiplication associativity.)

FFN Approximation

The FFN can be approximated in various ways. For example, there are various ways to approximate matrix multiplications, such as LoRA and other low-rank factorization methods. Quantization of the FFN weights is also a type of approximation using smaller integers (see quantization).
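
As a concrete example, one generic way to approximate an FFN weight matrix is a truncated SVD (low-rank factorization), replacing one large matrix with two thin ones. The sketch below is illustrative only; the rank of 64 is an arbitrary assumption, and real methods choose the rank per layer based on accuracy targets.

    import numpy as np

    d_model, d_ff, rank = 512, 2048, 64        # rank 64 is an illustrative choice

    W = np.random.randn(d_ff, d_model)         # one FFN weight matrix (e.g., the first layer)
    U, S, Vt = np.linalg.svd(W, full_matrices=False)

    # Keep only the top-r singular values: W ~ A @ B, with A (d_ff x r) and B (r x d_model).
    A = U[:, :rank] * S[:rank]
    B = Vt[:rank, :]

    x = np.random.randn(d_model)
    y_full = W @ x                             # original:   d_ff * d_model multiplications
    y_low  = A @ (B @ x)                       # factorized: rank * (d_ff + d_model) multiplications

    print(np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))   # relative approximation error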

  • Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 14 Feb 2024, HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference, https://arxiv.org/abs/2402.09360 (Attempts to estimate the output of top-k decoding, so as to prune computations on two dimensions earlier in the inference computations.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
  • S Peng, F Yang, N Sun, S Chen, Y Jiang, A Pan, Oct 2023, Exploring Post-Training Quantization of Protein Language Models, arXiv preprint arXiv:2310.19624, https://arxiv.org/abs/2310.19624
  • Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
  • Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
  • Zhiyang Chen; Yousong Zhu; Zhaowen Li; Fan Yang et al., The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 4130-4134, doi: 10.1109/ICASSP48485.2024.10447756, https://ieeexplore.ieee.org/abstract/document/10447756
  • Suman Sapkota, 21 Oct 2024, Metric as Transform: Exploring beyond Affine Transform for Interpretable Neural Network, https://arxiv.org/abs/2410.16159
  • Vincent-Pierre Berges, Barlas Oguz, December 12, 2024, Memory Layers at Scale, Meta, https://ai.meta.com/research/publications/memory-layers-at-scale/ https://github.com/facebookresearch/memory (Augmention of an LLM with an additional key-value associative memory, by replacing some FFNs with a "memory layer".)
  • Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, this optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU) with linear functions, then combining the two matrices using matrix-multiplication associativity.)

FFN Sparsity

Sparsification of the weights in the FFN is another optimization: many of the FFN weights are set to zero, so that the corresponding multiplications can be skipped (given suitable sparse kernels). This usually falls under the general category of whole-model sparsity (see LLM sparsity research), but there is also FFN-only sparsification.
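
A minimal sketch of FFN-only sparsification appears below, assuming simple magnitude pruning: the smallest-magnitude weights in just the two FFN matrices are zeroed, while the attention weights stay dense. Real methods use more sophisticated pruning criteria and need sparse kernels to turn the zeros into an actual speedup.

    import numpy as np

    def magnitude_prune(W, sparsity=0.5):
        """Zero out the smallest-magnitude fraction of the weights in W."""
        threshold = np.quantile(np.abs(W), sparsity)
        return np.where(np.abs(W) >= threshold, W, 0.0)

    # Apply only to the two FFN matrices of a layer; attention weights are left untouched.
    W1 = np.random.randn(2048, 512)    # first FFN matrix
    W2 = np.random.randn(512, 2048)    # second FFN matrix
    W1_sparse, W2_sparse = magnitude_prune(W1), magnitude_prune(W2)

    print("Nonzero fraction:", np.count_nonzero(W1_sparse) / W1_sparse.size)   # about 0.5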

FFN Parameter Sharing

The FFN weights can be shared across layers in a "fused FFN" optimization, which has a similar effect to FFN pruning. Both methods are types of model compression that reduce the storage of FFN weights; a simple weight-sharing sketch appears after the reference list below. Research papers include:

  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
  • Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, this optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU) with linear functions, then combining the two matrices using matrix-multiplication associativity.)
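
Below is a minimal sketch of cross-layer FFN weight sharing, assuming the simplest possible scheme where one FFN module is reused by every layer; the papers above differ in which layers share weights and whether small per-layer adapters are added.

    import torch
    import torch.nn as nn

    d_model, n_layers = 512, 6

    # A single FFN module instance, referenced by every layer (its weights are stored once).
    shared_ffn = nn.Sequential(
        nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
    )

    layers = nn.ModuleList(
        nn.ModuleDict({"norm": nn.LayerNorm(d_model), "ffn": shared_ffn})
        for _ in range(n_layers)
    )

    x = torch.randn(1, 8, d_model)
    for layer in layers:                       # every layer applies the same FFN weights
        x = x + layer["ffn"](layer["norm"](x))

    # PyTorch de-duplicates shared parameters, so the FFN is counted once, not n_layers times.
    print(sum(p.numel() for p in layers.parameters()))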

FFN Optimization Research

FFNs can be optimized using any of the various matrix multiplication optimizations, and there are also some other FFN-specific optimizations. Research papers include:

  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
  • Jie Tang; Shuai Wang; Song Chen; Yi Kang, May 2024, DP-FFN: Block-Based Dynamic Pooling for Accelerating Feed-Forward Layers in Transformers, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/abstract/document/10558119
  • Amin Aminifar, Baichuan Huang, Azra Abtahi, Amir Aminifar, May 2024, Lightweight Inference for Forward-Forward Algorithm, https://whubaichuan.github.io/data/LightFF.pdf
  • Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
  • Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
  • Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv, 28 Nov 2024, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs, NVIDIA Research, https://arxiv.org/abs/2411.19146 (This is dynamic NAS on a vast scale in a search space of size 10^138, because the optimization is applied with low granularity to each block in attention and FFN subcomponents of each layer.)
  • Jiacheng Liu, Peng Tang, Wenfeng Wang, Yuhang Ren, Xiaofeng Hou, Pheng-Ann Heng, Minyi Guo, Chao Li, 18 Dec 2024, A Survey on Inference Optimization Techniques for Mixture of Experts Models, https://arxiv.org/abs/2412.14219 (Broad survey of MoE inference optimization from hardware to model compression to expert parallelism.)
  • Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen, 17 Jan 2025, Accelerating Large Language Models through Partially Linear Feed-Forward Network, https://arxiv.org/abs/2501.10054 (Inspired by constant folding, this optimization merges the two MatMuls in an FFN by approximating the intervening non-linear activation function (e.g., RELU or GELU) with linear functions, then combining the two matrices using matrix-multiplication associativity.)
  • Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Hanwei Xu, Hao Yang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J.L. Cai, Jian Liang, et al. (additional authors not shown), 7 May 2024 (v1), last revised 19 Jun 2024 (v5), DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model, DeepSeek-AI, https://arxiv.org/abs/2405.04434 (Introduces Multi-head Latent Attention (MLA) and FFN optimizations, amongst other advances.)

Bias Pruning

Bias pruning is the removal of the extra vector addition of a "bias vector" in the FFN computations. A typical FFN computes "Wx+b" twice, where W is a weight matrix, x is the activations vector, and b is the bias vector. Hence, an FFN does a matrix multiplication, adds the extra "bias" vector, applies an activation function (e.g., RELU or SwiGLU), and then repeats the "Wx+b" operation a second time. Bias pruning removes the bias vectors from the Feed-Forward Network (FFN), so the FFN is partially pruned and has a reduced cost. This is only an incremental benefit in speed and cost, since bias addition is only a linear operation (adding a vector of biases), whereas the matrix-vector multiplications are quadratic.
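
A minimal sketch of bias pruning, assuming a standard two-layer FFN: the bias-free version simply constructs the linear layers without bias vectors, which removes the two vector additions while leaving the quadratic matrix multiplications unchanged.

    import torch.nn as nn

    def make_ffn(d_model=512, d_ff=2048, bias=True):
        """Two-layer FFN computing act(W1*x + b1) then W2*h + b2; bias=False prunes b1 and b2."""
        return nn.Sequential(
            nn.Linear(d_model, d_ff, bias=bias),
            nn.ReLU(),
            nn.Linear(d_ff, d_model, bias=bias),
        )

    with_bias = sum(p.numel() for p in make_ffn(bias=True).parameters())
    no_bias   = sum(p.numel() for p in make_ffn(bias=False).parameters())
    print(with_bias - no_bias)   # 2048 + 512 = 2560 parameters saved (a tiny fraction of the total)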

Research papers that examine removing bias vectors from FFN components:

  • Noam Shazeer, Feb 2020, GLU Variants Improve Transformer, https://arxiv.org/pdf/2002.05202.pdf
  • Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter Liu, 2019, Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, https://arxiv.org/abs/1910.10683 (T5 FFN code has no biases.)
  • Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Found gains from disabling biases in linear layers.)
  • Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. 2019. Understanding and Improving Layer Normalization. In Proceedings of NeurIPS. 4383–4393. https://proceedings.neurips.cc/paper/2019/hash/2f4fe03d77724a7217006e5d16728874-Abstract.html, https://arxiv.org/abs/1911.07013 (Found that biases were not needed.)
  • Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf (Biases were removed from FFNs, but were added to QKV attention.)
  • Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. https://arxiv.org/abs/2204.02311 (Google Palm architecture removed all biases from dense kernels and layer norms.)
  • Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
  • Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli, 19 Dec 2024 (v2), Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, https://arxiv.org/abs/2412.13663 (Encoder-only BERT model updated with modern optimizations including Flash Attention, bias removal, RoPE, pre-norm, GeGLU (a GELU variant), hybrid local-global attention, and zero-padding removal.)
  • Tong Xiao, Jingbo Zhu, 16 Jan 2025, Foundations of Large Language Models, https://arxiv.org/abs/2501.09223 (Huge 230 page paper on many topics such as training, prompting, alignment, and long context.)

Integer FFN Modules

The use of integers in FFNs usually falls under quantization, which typically applies to both the attention weights and the FFN weights. Hence, integer FFNs or "quantized FFNs" are not a common optimization on their own. However, it is possible to quantize only the FFN weights to integers, while leaving the attention matrices in FP32 (i.e., mixed-precision quantization).
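
A minimal sketch of FFN-only weight quantization appears below, assuming simple symmetric per-tensor int8 quantization applied to just the FFN matrices while the attention weights stay in floating point. Production schemes use per-channel scales, calibration data, and real integer kernels rather than this simulated version.

    import numpy as np

    def quantize_int8(W):
        """Symmetric per-tensor quantization: int8 weights plus one floating-point scale."""
        scale = np.max(np.abs(W)) / 127.0
        W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
        return W_q, scale

    def dequantize(W_q, scale):
        return W_q.astype(np.float32) * scale

    W_ffn = np.random.randn(2048, 512).astype(np.float32)    # FFN weights: quantized to int8
    W_attn = np.random.randn(512, 512).astype(np.float32)    # attention weights: left in FP32

    W_q, scale = quantize_int8(W_ffn)
    x = np.random.randn(512).astype(np.float32)
    error = np.max(np.abs(W_ffn @ x - dequantize(W_q, scale) @ x))
    print(error)                                              # small quantization error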

Alternatives to FFN Modules

There are various ways to approximate the normal FFN modules, or to replace them with alternative structures. See also zero-multiplication models for a more general area of research.

Skip Connection Pruning and Optimization

Skip connections, or residual connections, were introduced to solve the vanishing gradient problem during training. However, they add an extra computation, and some recent research has examined removing skip connections, or optimizing them by modifying how they are computed. Note that the efficiency gain from removing skip connections is incremental, since a skip connection is a vector or matrix addition, which costs less than matrix-vector or matrix-matrix multiplications (see also matrix multiplication optimizations).
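
As a simple illustration (not any particular paper's method), the sketch below contrasts an FFN sublayer with and without its residual connection; removing the skip saves only one elementwise vector addition per token, which is why the gain is incremental.

    import torch
    import torch.nn as nn

    d_model = 512
    ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
    norm = nn.LayerNorm(d_model)

    x = torch.randn(1, 8, d_model)

    y_residual = x + ffn(norm(x))    # standard: the skip connection adds the input back in
    y_pruned = ffn(norm(x))          # skip connection removed: one vector addition saved

    print(y_residual.shape, y_pruned.shape)   # both torch.Size([1, 8, 512])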

Research papers include:

More Research on Pruning Types

More AI Research

Read more about: