Aussie AI

Layer Fusion

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

Layer fusion is closely related to weight sharing (parameter sharing), which is one common form of fusion: two layers with the same structure are made identical by giving them the same weights. Parameter sharing reduces the data size of the model, improving storage utilization, but it does not necessarily reduce the number of arithmetic operations. However, Transformer inference is often memory-bound, so having fewer parameters can also reduce latency, depending on the hardware architecture.
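
As an illustration of how parameter sharing reduces memory but not compute, here is a minimal C++ sketch, assuming hypothetical LayerWeights and Layer structures (not from any particular framework), in which two layers alias the same weight buffer:

    #include <memory>
    #include <vector>

    // One layer's parameters (hypothetical minimal structure).
    struct LayerWeights {
        std::vector<float> ffn_weights;   // flattened FFN weight matrix
    };

    // Each layer holds a shared pointer, so several layers can alias one weight set.
    struct Layer {
        std::shared_ptr<LayerWeights> weights;
    };

    int main() {
        std::vector<Layer> layers(6);
        for (auto& l : layers)
            l.weights = std::make_shared<LayerWeights>();

        // "Fuse" layers 3 and 4 by sharing one parameter set:
        // one layer's worth of storage is saved, but both layers still execute.
        layers[4].weights = layers[3].weights;
        return 0;
    }

The forward pass still runs six layers' worth of matrix multiplications; only the parameter storage shrinks, although on memory-bound hardware that alone can improve latency.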

Another type of layer fusion is merging two different kinds of operations into a single kernel, such as combining a matmul with LayerNorm (i.e. a "fused LayerNorm"). This combines their programmatic algorithms (rather than combining their data into shared weights), which improves computation speed but does not change the model's storage size. See kernel operator fusion research.
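
As a rough illustration of this kind of kernel-level fusion, the sketch below (plain C++, not any particular library's kernel; LayerNorm gain and bias parameters omitted) normalizes each matmul output row while it is still hot in cache, avoiding a separate full pass over memory for the LayerNorm:

    #include <cmath>
    #include <vector>

    // Fused matmul + LayerNorm: x is [rows x k], w is [k x n], out is [rows x n].
    void fused_matmul_layernorm(const float* x, const float* w, float* out,
                                int rows, int k, int n, float eps = 1e-5f) {
        std::vector<float> row(n);
        for (int r = 0; r < rows; ++r) {
            // Matmul for one output row.
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (int i = 0; i < k; ++i)
                    sum += x[r * k + i] * w[i * n + j];
                row[j] = sum;
            }
            // LayerNorm applied to the same row before it leaves cache.
            float mean = 0.0f;
            for (int j = 0; j < n; ++j) mean += row[j];
            mean /= n;
            float var = 0.0f;
            for (int j = 0; j < n; ++j) var += (row[j] - mean) * (row[j] - mean);
            var /= n;
            float inv_std = 1.0f / std::sqrt(var + eps);
            for (int j = 0; j < n; ++j)
                out[r * n + j] = (row[j] - mean) * inv_std;
        }
    }

An unfused implementation would write the full matmul result to a temporary buffer and then re-read it for normalization; fusion trades that extra memory round trip for a slightly more complex kernel.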

Layers can also be removed rather than fused. Layer fusion does not necessarily reduce the number of layers executed, but it can be combined with layer-removal strategies; see layer pruning, layer skipping, and early exiting layers. Removing layers is also called "depth pruning," and a common Transformer architecture based on layer pruning is the "shallow decoder" architecture. There is also FFN pruning, which removes the FFN from some layers.
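
The following C++ sketch (with hypothetical Layer, Hidden, and confidence placeholders rather than any real API) shows how a decoder loop can combine static layer pruning, FFN skipping, and early exit; these are orthogonal to fusing weights across the layers that remain:

    #include <vector>

    // Hidden state for one token position (placeholder).
    struct Hidden {
        std::vector<float> v;
    };

    // A Transformer layer; the forward pass body is omitted for brevity.
    struct Layer {
        bool pruned = false;   // marked as removed by static layer/depth pruning
        Hidden forward(const Hidden& h, bool skip_ffn) const {
            Hidden out = h;    // attention + (optionally skipped) FFN would go here
            (void)skip_ffn;
            return out;
        }
    };

    // Placeholder confidence score, e.g., from an early-exit classifier.
    static float confidence(const Hidden& h) {
        return h.v.empty() ? 0.0f : h.v[0];
    }

    // Decoder stack combining layer pruning, FFN skipping, and early exit.
    Hidden run_layers(const std::vector<Layer>& layers, Hidden h,
                      float exit_threshold, bool skip_ffn) {
        for (const auto& layer : layers) {
            if (layer.pruned)
                continue;                    // depth pruning: layer removed entirely
            h = layer.forward(h, skip_ffn);  // FFN pruning/skipping inside the layer
            if (confidence(h) > exit_threshold)
                break;                       // early exit: stop at a shallow depth
        }
        return h;
    }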

Research on Layer Fusion

Research papers on layer fusion include:

  • Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  • Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
  • Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Subformer: Exploring weight sharing for parameter efficiency in generative transformers. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4081–4090, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2101.00234 (Parameter sharing across layers.)
  • Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In International Conference on Learning Representations. https://arxiv.org/abs/1807.03819 (Optimizes Transformers with weight sharing and other techniques.)
  • Sho Takase and Shun Kiyono. 2023. Lessons on parameter sharing across layers in transformers. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 78–90, Toronto, Canada (Hybrid). Association for Computational Linguistics. https://arxiv.org/abs/2104.06022
  • Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations. In Proceedings of ICLR. https://arxiv.org/abs/1909.11942 (Parameter sharing across layers in the BERT Transformer architecture.)
  • Raj Dabre and Atsushi Fujita. 2019. Recurrent stacking of layers for compact neural machine translation models. Proceedings of AAAI, 33:6292–6299. https://arxiv.org/abs/1807.05353 (Parameter sharing across layers of a Transformer.)
  • Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer. In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads.)
  • Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied Transformers: Neural machine translation with shared encoder and decoder. In Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):5466–5473, Honolulu, USA. https://aaai.org/ojs/index.php/AAAI/article/view/4487 PDF: https://taoqin.github.io/papers/tiedT.AAAI2019.pdf
  • Osorio, J.; Armejach, A.; Petit, E.; Henry, G.; Casas, M., A BF16 FMA is All You Need for DNN Training. IEEE Trans. Emerg. Top. Comput. 2022, 10, 1302–1314. http://dx.doi.org/10.1109/TETC.2022.3187770, https://ieeexplore.ieee.org/document/9823406 (Special fused operators to allow full training using BF16 number representations.)
  • Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
  • M. Alwani, H. Chen, M. Ferdman and P. A. Milder, "Fused-layer CNN accelerators", 49th Annual IEEE/ACM International Symposium on Microarchitecture MICRO 2016, pp. 22:1-22:12, October 15-19, 2016, https://doi.org/10.1109/MICRO.2016.7783725, https://ieeexplore.ieee.org/document/7783725
  • E. Georganas, S. Avancha, K. Banerjee, D. Kalamkar, G. Henry, H. Pabst, and A. Heinecke, “Anatomy of high-performance deep learning convolutions on simd architectures,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’18. IEEE Press, 2018, https://arxiv.org/abs/1808.05567 (Investigates layer fusion and sublayer fusion, i.e. kernel fusion.)
  • L Waeijen, S Sioutas, M Peemen, M Lindwer, 2021, ConvFusion: A model for layer fusion in convolutional neural networks, IEEE Access (Volume: 9), https://ieeexplore.ieee.org/abstract/document/9646923/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09646923.pdf (Analysis of loop tiling, loop reordering, data flow, recomputation, and layer fusion.)
  • Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
  • Haoyi Wu, Kewei Tu, 17 May 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Uses the KV cache of only the final layer as the KV cache for all other layers, or alternatively uses the cache from only a few layers, possibly with a few standard layers as "warmup layers". This idea is conceptually similar to "propagation" of the KV cache in early exit methods, or to layer fusion of weights.)
  • Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme that drops about 30% of FFN computations. Avoids KV caching problems by skipping only the FFN computations within layers.)
  • Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, Google Research, https://arxiv.org/abs/2211.05102
  • Y Hu, J Zhang, C Zhao, C Li, H Chen, 2023, Transformer Compression via Subspace Projection, arXiv preprint arXiv:2308.16475, https://arxiv.org/abs/2308.16475
  • Raj Dabre, Raphael Rubino, and Atsushi Fujita. 2020. Balancing cost and benefit with tied-multi transformers. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 24–34, Online. Association for Computational Linguistics. https://arxiv.org/abs/2002.08614 (Chooses the number of encoder and decoder layers based on the input; dynamic layer pruning.)
  • Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
  • Meng Wang; Liang Qian; Na Meng; Yusong Cheng; Weiwei Fang, Nov 2023, Model Parallelism Optimization for Distributed DNN Inference on Edge Devices, 2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), https://ieeexplore.ieee.org/abstract/document/10391646 (Distributes inference across multiple edge devices at the layer level, with further optimization using layer fusion.)
  • William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981
  • Yilong Chen, Linhao Zhang, Junyuan Shang, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
  • NVIDIA, 2023, NVIDIA FasterTransformer, https://github.com/NVIDIA/FasterTransformer
  • Z Gong, H Ji, Y Yao, CW Fletcher, CJ Hughes, 2022, Graphite: optimizing graph neural networks on CPUs through cooperative software-hardware techniques, https://dl.acm.org/doi/abs/10.1145/3470496.3527403 https://dl.acm.org/doi/pdf/10.1145/3470496.3527403
  • Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, https://arxiv.org/abs/2405.14366 (Compresses the KV cache on the depth dimension of layers, analogous to layer fusion.)
  • Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
  • David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • S. Sun, Y. Cheng, Z. Gan, J. Liu, Patient knowledge distillation for BERT model compression, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4322–4331. URL: https://www.aclweb.org/anthology/D19-1441. doi:10.18653/v1/D19-1441.
  • J Yang, Y Yin, L Yang, S Ma, H Huang, 2022, Gtrans: Grouping and fusing transformer layers for neural machine translation, https://arxiv.org/pdf/2207.14467
  • Y Zheng, L Lin, Z Lai, B Wang, S Liu, B Fu, 2023, Layer-wise Representation Fusion for Compositional Generalization, https://arxiv.org/abs/2307.10799
  • Ruisi Cai, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Makes one model act in different ways by making it "elastic" in its parameters, effectively slimming it via techniques such as layer fusion in the MLPs and MHA attention heads.)
  • Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro, 19 Jun 2024 (v2), TroL: Traversal of Layers for Large Language and Vision Models, https://arxiv.org/abs/2406.12246 https://arxiv.org/pdf/2406.12246 (Re-traverses some of the layers, which achieves higher accuracy from the same size model without extra memory.)
  • Francesco Daghero, Alessio Burrello, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari, 18 Jun 2024, Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices, SAMOS2024 conference, https://arxiv.org/abs/2406.12478 Code: https://github.com/eml-eda/depthwise-separable-fusion
  • Jiachen Jiang, Jinxin Zhou, Zhihui Zhu, 20 Jun 2024, On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier, https://arxiv.org/abs/2406.14479 (Using layer similarity for early exit classifiers, which is also related to layer fusion.)
  • Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Xi Chen, Cunhang Fan, Zhao Lv, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui, 24 Jun 2024, Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging, https://arxiv.org/abs/2406.16330
  • Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similarity in model layers.)
  • Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 27 Jun 2024 (v2), MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, Meta Research, https://arxiv.org/abs/2402.14905 Code: https://github.com/facebookresearch/MobileLLM
  • Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen, 3 Jul 2024, DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs. https://arxiv.org/abs/2407.11030
  • Jinuk Kim, Marwa El Halabi, Mingi Ji, Hyun Oh Song, July 2024, LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:23825-23842, 2024, https://proceedings.mlr.press/v235/kim24c.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/kim24c/kim24c.pdf Code: https://github.com/snu-mllab/LayerMerge
  • Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
  • Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 22 Sep 2024, EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models, https://arxiv.org/abs/2409.14595
  • Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
  • Anonymous authors, Oct 2024, Forget the Data and Fine-Tuning! Just Fold the Network to Compress, https://openreview.net/pdf?id=W2Wkp9MQsF
  • Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan, 23 Oct 2024, Value Residual Learning For Alleviating Attention Concentration In Transformers, https://arxiv.org/abs/2410.17897
  • Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster, 28 Oct 2024, Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA, https://arxiv.org/abs/2410.20672
  • David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar, 31 Oct 2024, Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance, https://arxiv.org/abs/2410.23668
  • Y Zhou, C Zhou, W Xie, X Wang, J Chen, Z Ni, J Li, 2024, The Benefits in Shallow: Merge Decoding Across Large Language Model Layers. In: Wong, D.F., Wei, Z., Yang, M. (eds) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science(), vol 15360. Springer, Singapore. https://doi.org/10.1007/978-981-97-9434-8_30 https://link.springer.com/chapter/10.1007/978-981-97-9434-8_30
  • Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
  • Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
  • Seul-Ki Yeom, Tae-Ho Kim, 3 Dec 2024, UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices, https://arxiv.org/abs/2412.02344 (A shared attention matrix generalizes MHA, with fused attention matrices across layers.)

KV Caching and Layer Fusion

There are analogous optimization techniques for the KV cache data. KV caching stores a per-layer set of key and value tensors, and many optimizations have been researched, including sharing or compressing the KV cache across layers (e.g., the Layer-Condensed KV Cache, MLKV, Cross-Layer Attention, and MiniCache papers above), which is analogous to layer fusion of weights.
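
As a simple illustration of cross-layer KV cache sharing, here is a C++ sketch in which several layers map to one shared cache entry; the structures and the layer-to-entry mapping are illustrative assumptions, not the design of any specific paper:

    #include <vector>

    struct KVEntry {
        std::vector<float> keys;    // [tokens x head_dim], flattened
        std::vector<float> values;  // [tokens x head_dim], flattened
    };

    struct KVCache {
        std::vector<KVEntry> entries;    // fewer entries than layers
        std::vector<int> layer_to_entry; // maps each layer to a shared entry

        KVEntry& for_layer(int layer) { return entries[layer_to_entry[layer]]; }
    };

    int main() {
        // 6 layers but only 2 distinct KV entries: layers 0-2 share entry 0
        // and layers 3-5 share entry 1, shrinking KV memory by roughly 3x.
        KVCache cache;
        cache.entries.resize(2);
        cache.layer_to_entry = {0, 0, 0, 1, 1, 1};

        KVEntry& kv = cache.for_layer(4);  // layer 4 reads and writes entry 1
        (void)kv;
        return 0;
    }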
