Aussie AI

Feed-Forward Network Pruning

  • Last Updated 3 December, 2024
  • by David Spuler, Ph.D.

What is FFN Pruning?

The Feed-Forward Network (FFN) is a fundamental architectural component of the Transformer. The FFN has a good reputation as a hard worker, always doing lots of computations. And yet, there are unkind people in this world who want to throw it away.
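
For concreteness, here is a minimal sketch of a standard Transformer FFN block in PyTorch (the dimensions and names are illustrative, not from any particular model):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Standard Transformer FFN: expand, apply activation, project back down."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)     # expansion matrix multiplication (plus bias)
        self.act = nn.GELU()                   # activation between the two linear layers
        self.down = nn.Linear(d_ff, d_model)   # projection matrix multiplication (plus bias)

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```

Each technique in the list below targets part of this block: the two matrix multiplications, the bias additions, the activation function, or the whole block at once.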

FFN optimization techniques include:

  • FFN block pruning (i.e., removing the FFN/MLP components entirely)
  • FFN approximation
  • MatMul optimizations
  • Bias pruning (i.e., removing only the bias weights)
  • FFN integer-only methods (i.e., quantization)
  • FFN-specific sparsity methods (see also LLM sparsification methods)
  • Skip connection pruning (removing the addition of residuals)
  • Bilinear blocks (i.e., removing the activation function between the two layers of the FFN)

FFN Pruning Research

In addition to attention head pruning, some research papers have examined pruning the Feed-Forward Network (FFN) components of the Transformer architecture. In this context, FFN pruning refers to component-wise removal of the FFN, rather than pruning of the weights inside the matrix multiplications that make up the FFN. Various papers report that the entire FFN component can be removed from the decoder without significant accuracy degradation, which is a form of structured pruning or block pruning.
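
As a rough illustration (a hypothetical sketch, not any specific paper's method), block-pruning the FFN amounts to dropping the whole FFN sublayer from a layer, leaving only the residual path:

```python
import torch.nn as nn

class DecoderLayerFFN(nn.Module):
    """FFN sublayer of a Transformer layer, optionally block-pruned (removed entirely)."""
    def __init__(self, d_model=512, d_ff=2048, keep_ffn=True):
        super().__init__()
        self.keep_ffn = keep_ffn
        if keep_ffn:
            self.norm = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )

    def forward(self, x):
        # (The attention sublayer of the full Transformer layer is omitted here.)
        if self.keep_ffn:
            return x + self.ffn(self.norm(x))   # normal FFN sublayer with residual
        return x                                 # pruned: the sublayer becomes an identity
```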

Research papers on FFN pruning include:

  • Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. 2017. Runtime Neural Pruning. In Advances in Neural Information Processing Systems (NeurIPS). https://dl.acm.org/doi/10.5555/3294771.3294979, https://papers.nips.cc/paper/2017/hash/a51fb975227d6640e4fe47854476d133-Abstract.html
  • Sai Prasanna, Anna Rogers, and Anna Rumshisky. 2020. When BERT Plays the Lottery, All Tickets Are Winning. In Conference on Empirical Methods in Natural Language Processing (EMNLP). arXiv:cs.CL/2005.00561, https://arxiv.org/abs/2005.00561
  • Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth, Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior, Oct 2020 https://arxiv.org/abs/2010.01791
  • François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush, Block Pruning For Faster Transformers, 2021, https://arxiv.org/abs/2109.04838
  • Zhengyuan Liu; Nancy F. Chen, Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-10 June 2023, https://ieeexplore.ieee.org/abstract/document/10096717
  • Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
  • Yi-Te Hsu, Sarthak Garg, Yi-Hsiu Liao, and Ilya Chatsviorkin. 2020. Efficient inference for neural machine translation. CoRR, abs/2010.02416. https://arxiv.org/abs/2010.02416
  • Bapna, A., Arivazhagan, N., and Firat, O., Controlling computation versus quality for neural sequence models. ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106 (Conditionally controls FFN units and other model components.)
  • Ofir Press, Noah A. Smith, and Omer Levy. 2020. Improving Transformer Models by Reordering their Sublayers. In Proceedings of ACL. Online, 2996–3005. https://doi.org/10.18653/v1/2020.acl-main.270, https://arxiv.org/abs/1911.03864 (Alternates layers of attention heads and FFN units, effectively pruning attention heads and FFN components from some layers.)
  • Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2020. DeLighT: Very Deep and Light-weight Transformer. arXiv:2008.00623 https://arxiv.org/abs/2008.00623 (A different Transformer architecture that removes attention heads and replaces the FFN with a simpler version.)
  • Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the Sub-layer Functionalities of Transformer Decoder. In Findings of EMNLP. Online, 4799–4811. https://doi.org/10.18653/v1/2020.findings-emnlp.432, https://arxiv.org/abs/2010.02648 (Removes the FFN from the decoder.)
  • Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. AI Open, 2022. https://arxiv.org/abs/2106.04554 (Survey paper with some analysis of FFN components; see "Section 5.3.3 Dropping FFN Layers".)
  • Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, and Armand Joulin. 2019. Augmenting Self-attention with Persistent Memory. arXiv:1907.01470 https://arxiv.org/abs/1907.01470 (Proposed architecture includes removing the FFN component.)
  • Zhanghao Wu, Zhijian Liu, Ji Lin, Yujun Lin, and Song Han. 2020. Lite Transformer with Long-Short Range Attention. In Proceedings of ICLR. https://openreview.net/forum?id=ByeMPlHKPH, https://arxiv.org/abs/2004.11886, Code: https://github.com/mit-han-lab/lite-transformer (Makes the FFN smaller, attention heads bigger.)
  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2012.14913 (Explores in depth how FFNs work, with relevance to removing redundant FFN units.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Alibaba Qwen Team, Sep 2023, Qwen Technical Report, https://arxiv.org/pdf/2309.16609.pdf (Did not remove the FFN, but reduced its size from 4 times the hidden size down to 8/3 of the hidden size.)
  • Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive skipping scheme that drops about 30% of FFN calculations, and avoids KV caching problems by skipping only the FFN computations within layers.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzed layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
  • Devvrit, Sneha Kudugunta, Aditya Kusupati, Tim Dettmers, Kaifeng Chen, Inderjit Dhillon, Yulia Tsvetkov, Hannaneh Hajishirzi, Sham Kakade, Ali Farhadi, Prateek Jain, 2024, MatFormer: Nested Transformer for Elastic Inference, https://openreview.net/pdf?id=93BaEweoRg, https://arxiv.org/abs/2310.07707 (Extracts smaller sub-models from a large model based on a subset of FFN parameters, which is analogous to FFN pruning or FFN approximation.)
  • Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Optimizes the FFN by splitting its units into two categories, those that fire commonly and those used rarely, in both ReLU and non-ReLU models; effectively this prunes a subset of the FFN.)
  • Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982 (Survey paper with some coverage of FFN pruning.)
  • Megan Flynn, Alexander Wang, Dean Edward Alvarez, Christopher De Sa, Anil Damle, 29 May 2024, STAT: Shrinking Transformers After Training, https://arxiv.org/abs/2406.00061
  • Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switches and skips sub-layer components such as attention heads and FFNs, and also skips input tokens, with decisions based on allocating computation resources.)
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
  • Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
  • Shwai He, Guoheng Sun, Zheyu Shen, Ang Li, 22 Jun 2024, What Matters in Transformers? Not All Attention is Needed, https://arxiv.org/abs/2406.15786 https://github.com/Shwai-He/LLM-Drop
  • Anonymous authors, 2024, A deeper look at depth pruning of LLMs, https://openreview.net/pdf?id=9B7ayWclwN
  • Zirui Liu, Qingquan Song, Qiang Charles Xiao, Sathiya Keerthi Selvaraj, Rahul Mazumder, Aman Gupta, Xia Hu, 8 Jan 2024, FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference, https://arxiv.org/abs/2401.04044
  • Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi, 28 May 2024, FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models, https://arxiv.org/abs/2405.18218
  • Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
  • Zhiyang Chen; Yousong Zhu; Zhaowen Li; Fan Yang et al., The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 4130-4134, doi: 10.1109/ICASSP48485.2024.10447756, https://ieeexplore.ieee.org/abstract/document/10447756
  • Junhui He, Shangyu Wu, Weidong Wen, Chun Jason Xue, Qingan Li, 2 Sep 2024, CHESS: Optimizing LLM Inference via Channel-Wise Thresholding and Selective Sparsification, https://arxiv.org/abs/2409.01366
  • Mustafa Shukor, Matthieu Cord, 12 Oct 2024, Skipping Computations in Multimodal LLMs, https://arxiv.org/abs/2410.09454 https://github.com/mshukor/ima-lmms

FFN Approximation

The FFN can be approximated in various ways. For example, the matrix multiplications can be approximated using LoRA and other low-rank factorization methods. Quantization of the FFN weights into smaller integer types is also a form of approximation (see quantization). A sketch of the low-rank idea appears after the references below.

  • Yashas Samaga B L, Varun Yerram, Chong You, Srinadh Bhojanapalli, Sanjiv Kumar, Prateek Jain, Praneeth Netrapalli, 14 Feb 2024, HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference, https://arxiv.org/abs/2402.09360 (Attempts to estimate the output of top-k decoding, so as to prune computation along two dimensions earlier in inference.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
  • S Peng, F Yang, N Sun, S Chen, Y Jiang, A Pan, Oct 2023, Exploring Post-Training Quantization of Protein Language Models, arXiv preprint arXiv:2310.19624, https://arxiv.org/abs/2310.19624
  • Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
  • Yanjun Zhao, Tian Zhou, Chao Chen, Liang Sun, Yi Qian, Rong Jin, 8 Feb 2024, Sparse-VQ Transformer: An FFN-Free Framework with Vector Quantization for Enhanced Time Series Forecasting, https://arxiv.org/abs/2402.05830
  • Zhiyang Chen; Yousong Zhu; Zhaowen Li; Fan Yang et al., The Devil is in Details: Delving Into Lite FFN Design for Vision Transformers, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Korea, Republic of, 2024, pp. 4130-4134, doi: 10.1109/ICASSP48485.2024.10447756, https://ieeexplore.ieee.org/abstract/document/10447756
  • Suman Sapkota, 21 Oct 2024, Metric as Transform: Exploring beyond Affine Transform for Interpretable Neural Network, https://arxiv.org/abs/2410.16159
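
As a hedged sketch of the low-rank idea (the class name and rank choice are illustrative, not taken from the papers above), a large FFN weight matrix can be replaced by the product of two thin matrices:

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Approximate a d_in x d_out weight matrix by two thin factors of rank r."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.a = nn.Linear(d_in, rank, bias=False)   # d_in -> rank
        self.b = nn.Linear(rank, d_out)              # rank -> d_out
        # Parameter count: rank * (d_in + d_out) instead of d_in * d_out.

    def forward(self, x):
        return self.b(self.a(x))

# e.g., replace the FFN expansion matmul (4096 -> 16384) with a rank-256 approximation
approx_up = LowRankLinear(4096, 16384, rank=256)
```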

FFN Sparsity

Sparsification of the FFN weights is another optimization. It usually falls under the general category of whole-model sparsity (see LLM sparsity research), but FFN-only sparsification has also been examined.
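
A minimal sketch of FFN-only magnitude sparsification (the parameter-name filter and sparsity level are illustrative assumptions):

```python
import torch

@torch.no_grad()
def sparsify_ffn(model, sparsity=0.5):
    """Zero out the smallest-magnitude weights, but only in FFN linear layers."""
    for name, param in model.named_parameters():
        if "ffn" in name and param.dim() == 2:        # assumes FFN weights are named "ffn.*"
            k = max(1, int(param.numel() * sparsity))
            threshold = param.abs().flatten().kthvalue(k).values
            param.mul_(param.abs() > threshold)       # apply the sparsity mask in place
```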

FFN Parameter Sharing

The FFN weights can be shared in a "fused FFN" optimization, which has a similar effect to FFN pruning. Both methods are types of model compression that reduce the storage of FFN weights; a sketch of weight sharing appears after the references below. Research papers include:

  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
  • Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
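
For illustration (a hypothetical structure, not the code of any paper above), a single FFN module can be instantiated once and reused by every layer, so its weights are stored only once:

```python
import torch.nn as nn

d_model, d_ff, n_layers = 512, 2048, 12

# One FFN object shared across all layers: its weights are stored once.
shared_ffn = nn.Sequential(
    nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
)

class SharedFFNLayer(nn.Module):
    def __init__(self, ffn):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # per-layer parameters stay separate
        self.ffn = ffn                      # reference to the shared FFN weights

    def forward(self, x):
        return x + self.ffn(self.norm(x))

layers = nn.ModuleList(SharedFFNLayer(shared_ffn) for _ in range(n_layers))
```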

FFN Optimization Research

FFNs can be optimized using any of the various matrix multiplication optimizations, and there are also some FFN-specific optimizations, such as reducing the intermediate dimension (a sketch follows the references below). Research papers include:

  • 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
  • Xiuying Wei, Skander Moalla, Razvan Pascanu, Caglar Gulcehre, 24 Jun 2024, Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers, https://arxiv.org/abs/2406.16450 Code: https://github.com/CLAIRE-Labo/StructuredFFN/tree/main
  • Jie Tang; Shuai Wang; Song Chen; Yi Kang, May 2024, DP-FFN: Block-Based Dynamic Pooling for Accelerating Feed-Forward Layers in Transformers, 2024 IEEE International Symposium on Circuits and Systems (ISCAS), https://ieeexplore.ieee.org/abstract/document/10558119
  • Amin Aminifar, Baichuan Huang, Azra Abtahi, Amir Aminifar, May 2024, Lightweight Inference for Forward-Forward Algorithm, https://whubaichuan.github.io/data/LightFF.pdf
  • Nils Graef, Matthew Clapp, Andrew Wasielewski, 12 Jul 2024, Flash normalization: fast RMSNorm for LLMs, https://arxiv.org/abs/2407.09577 Code: https://huggingface.co/open-machine/FlashNorm
  • Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai, 16 Oct 2024, EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference, https://arxiv.org/abs/2410.12247
  • Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv, 28 Nov 2024, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs, NVIDIA Research, https://arxiv.org/abs/2411.19146 (This is NAS on a vast scale, with a search space of size 10^138, because the optimization is applied at block-level granularity to the attention and FFN subcomponents of each layer.)
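
As one simple FFN-specific optimization, the intermediate dimension can be reduced below the conventional 4x expansion, as in the reduced ratio reported for Qwen above. A rough sketch (real models often pair this with a gated FFN variant, which is omitted here):

```python
import torch.nn as nn

d_model = 4096
d_ff_standard = 4 * d_model            # the conventional 4x expansion
d_ff_reduced = (8 * d_model) // 3      # the reduced 8/3 ratio noted for Qwen

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff_reduced),
    nn.SiLU(),
    nn.Linear(d_ff_reduced, d_model),
)
# Per-token FFN cost scales with d_model * d_ff, so 8/3 versus 4 cuts roughly a third.
```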

Bias Pruning

A typical FFN performs its matrix multiplications and then adds an extra "bias" vector. Bias pruning is the removal of these bias vectors from the Feed-Forward Network (FFN), which partially prunes the FFN and reduces its cost. The benefit in speed is only incremental, since bias addition is a linear operation (adding a vector of biases), whereas the matrix-vector multiplications are quadratic.
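
A minimal sketch: in PyTorch, pruning the FFN biases is simply constructing the linear layers with bias=False, which leaves the quadratic matrix multiplications untouched:

```python
import torch.nn as nn

d_model, d_ff = 512, 2048

# FFN with the bias vectors pruned: only the two weight matrices remain.
ffn_no_bias = nn.Sequential(
    nn.Linear(d_model, d_ff, bias=False),   # removes the d_ff-sized bias addition
    nn.GELU(),
    nn.Linear(d_ff, d_model, bias=False),   # removes the d_model-sized bias addition
)
```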

Research papers that examine removing bias vectors from FFN components:

Integer FFN Modules

The use of integers in FFNs usually falls under quantization, which typically applies to both the attention weights and the FFN weights. Hence, integer FFNs or "quantized FFNs" are not a common optimization on their own. However, it is possible to quantize only the FFN weights to integers, while leaving the attention matrices in FP32 (i.e., mixed-precision quantization).
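
As a hedged sketch (using PyTorch's dynamic quantization for brevity; the module layout is an illustrative assumption), the FFN submodule alone can be quantized to int8 while the attention projections remain FP32:

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Toy Transformer layer: an attention projection plus an FFN (illustrative only)."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.attn_proj = nn.Linear(d_model, d_model)   # left in FP32
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

layer = ToyLayer()
# Quantize only the FFN weights to int8; the attention matrices stay FP32.
layer.ffn = torch.ao.quantization.quantize_dynamic(
    layer.ffn, {nn.Linear}, dtype=torch.qint8
)
```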

Alternatives to FFN Modules

There are various ways to approximate the normal FFN modules, or to replace them with alternatives entirely. See also zero-multiplication models for a more general area of research.

Skip Connection Pruning and Optimization

Skip connections, or residual connections, were introduced to solve the vanishing gradient problem during training. However, they add an extra computation step, and some recent research has examined removing skip connections, or other optimizations that modify them. Note that the efficiency gain from removing skip connections is incremental, since a residual is a vector or matrix addition, which costs far less than matrix-vector or matrix-matrix multiplications (see also matrix multiplication optimizations).
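
An illustrative sketch of the difference (the flag is hypothetical, not a specific paper's method): the residual is a single vector addition per sublayer, which is why removing it yields only a small saving relative to the matrix multiplications:

```python
import torch.nn as nn

class FFNSublayer(nn.Module):
    """FFN sublayer with an optional residual (skip) connection."""
    def __init__(self, d_model=512, d_ff=2048, use_skip=True):
        super().__init__()
        self.use_skip = use_skip
        self.norm = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        out = self.ffn(self.norm(x))
        return x + out if self.use_skip else out   # pruned skip: drop one cheap vector add
```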

Research papers include:
