Layer Fusion

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Layer fusion is a type of weight sharing (parameter sharing), where two layers are made identical by giving them the same weights. Training a multi-layer model creates a different set of weights for each layer. However, some layers can end up with weight matrices that are very close to each other. Hence, one of those layers can simply have its own weight matrix thrown away and reuse the other layer's weights instead. This is conceptually the same as running the same layer twice.
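
As a minimal sketch of the idea (using a hypothetical toy "layer" that merely scales each activation by a weight, not the book's real Transformer code), the C++ below has two layer objects pointing at the same weight vector, so only one copy of the parameters is stored and running both layers is the same as running the shared layer twice:

    // Layer fusion sketch: two layers share one weight vector.
    // Toy per-element "layer" for illustration only.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Layer {
        const std::vector<float>* weights;  // shared weights, not owned

        void forward(std::vector<float>& activations) const {
            for (std::size_t i = 0; i < activations.size(); ++i) {
                activations[i] *= (*weights)[i];  // toy element-wise "layer"
            }
        }
    };

    int main() {
        std::vector<float> shared_weights = {0.5f, 2.0f, 1.5f, 0.25f};

        // Both layers reference the same weights, so only one copy is stored.
        Layer layer1{&shared_weights};
        Layer layer2{&shared_weights};

        std::vector<float> act = {1.0f, 1.0f, 1.0f, 1.0f};
        layer1.forward(act);  // equivalent to running the shared layer twice
        layer2.forward(act);

        for (float a : act) std::printf("%g ", a);
        std::printf("\n");
        return 0;
    }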

Parameter sharing reduces the data size of the model, improving storage utilization and in-memory size, but it does not necessarily reduce the number of arithmetic operations. However, Transformer inference can be memory-bound, so transferring fewer parameters from memory can also reduce latency on some architectures.
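
As a rough illustration of the storage arithmetic (the layer count and FFN size below are made-up numbers, not figures from this chapter), sharing one FFN weight set across every layer stores those parameters once instead of once per layer:

    // Back-of-the-envelope savings from sharing one FFN weight set across layers.
    // The layer count and parameter count are illustrative, not real model sizes.
    #include <cstdio>

    int main() {
        const long long num_layers = 24;
        const long long ffn_params_per_layer = 4LL * 1024 * 1024;

        long long unshared = num_layers * ffn_params_per_layer;  // one copy per layer
        long long shared = ffn_params_per_layer;                 // single shared copy

        std::printf("Unshared FFN parameters: %lld\n", unshared);
        std::printf("Shared FFN parameters:   %lld\n", shared);
        std::printf("Saved: %lld (%.1f%%)\n", unshared - shared,
                    100.0 * (double)(unshared - shared) / (double)unshared);
        return 0;
    }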

Layer fusion and kernel fusion are two different optimizations. Whereas layer fusion merges two layers at a high level by sharing their weights, kernel fusion merges two lower-level operations into one. For example, two Transformer components can be merged into a single kernel, such as merging a MatMul with LayerNorm (i.e. “fused LayerNorm”). Kernel fusion combines the two operations' programmatic algorithms, and thereby improves computation speed, but it doesn't change the model's storage size, whereas layer fusion does compress the model.
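
To make the contrast concrete, here is a minimal kernel fusion sketch on toy element-wise operations (not a real fused LayerNorm): two separate loops become one loop, halving the passes over memory while leaving the stored weights untouched:

    // Kernel fusion sketch: two element-wise kernels vs. one fused kernel.
    // Toy operations only; a real example would fuse a MatMul with LayerNorm.
    #include <vector>

    // Unfused path: two separate passes over the data (two kernels).
    void add_bias(std::vector<float>& x, float bias) {
        for (float& v : x) v += bias;
    }
    void scale(std::vector<float>& x, float s) {
        for (float& v : x) v *= s;
    }

    // Fused path: one pass doing both operations (one kernel).
    void add_bias_then_scale_fused(std::vector<float>& x, float bias, float s) {
        for (float& v : x) v = (v + bias) * s;
    }

    int main() {
        std::vector<float> a = {1.0f, 2.0f, 3.0f};
        std::vector<float> b = a;

        add_bias(a, 1.0f);   // unfused: first pass over memory
        scale(a, 2.0f);      // unfused: second pass over memory

        add_bias_then_scale_fused(b, 1.0f, 2.0f);  // fused: single pass

        return (a == b) ? 0 : 1;  // both paths compute identical results
    }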

Research papers on layer fusion:

  1. Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  2. Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation, In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes “shared layers” with shared decoder FFN weights.)
  3. Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Subformer: Exploring weight sharing for parameter efficiency in generative transformers, In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4081–4090, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2101.00234 (Parameter sharing across layers.)
  4. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers, In International Conference on Learning Representations. https://arxiv.org/abs/1807.03819 (Optimizes Transformers with weight sharing and other ways.)
  5. Sho Takase and Shun Kiyono. 2023. Lessons on parameter sharing across layers in transformers, In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 78–90, Toronto, Canada (Hybrid). Association for Computational Linguistics. https://arxiv.org/abs/2104.06022
  6. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations, In Proceedings of ICLR. https://arxiv.org/abs/1909.11942 (Parameter sharing across layers in the BERT Transformer architecture.)
  7. Raj Dabre and Atsushi Fujita. 2019. Recurrent stacking of layers for compact neural machine translation models, Proceedings of AAAI, 33:6292–6299. https://arxiv.org/abs/1807.05353 (Parameter sharing across layers of a Transformer.)
  8. Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer, In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads.)
  9. Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied transformers: Neural machine translation with shared encoder and decoder, Proceedings of AAAI, 33(01):5466–5473. PDF: https://taoqin.github.io/papers/tiedT.AAAI2019.pdf
  10. J. Osorio, A. Armejach, E. Petit, G. Henry, and M. Casas, 2022, A BF16 FMA is All You Need for DNN Training, IEEE Trans. Emerg. Top. Comput. 2022, 10, 1302–1314. http://dx.doi.org/10.1109/TETC.2022.3187770, https://ieeexplore.ieee.org/document/9823406 (Special fused operators to allow full training using BF16 number representations.)
  11. Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
  12. M. Alwani, H. Chen, M. Ferdman and P. A. Milder, 2016, Fused-layer CNN accelerators, 49th Annual IEEE/ACM International Symposium on Microarchitecture MICRO 2016, pp. 22:1-22:12, October 15-19, 2016, https://doi.org/10.1109/MICRO.2016.7783725, https://ieeexplore.ieee.org/document/7783725
  13. E. Georganas, S. Avancha, K. Banerjee, D. Kalamkar, G. Henry, H. Pabst, and A. Heinecke, 2018, Anatomy of high-performance deep learning convolutions on simd architectures, in Accepted to Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’18. IEEE Press, 2018, https://arxiv.org/abs/1808.05567 (Investigates layer fusion and sublayer fusion, i.e. kernel fusion.)
  14. L. Waeijen, S. Sioutas, M. Peemen, M. Lindwer, 2021, ConvFusion: A model for layer fusion in convolutional neural networks, IEEE Access (Volume: 9), https://ieeexplore.ieee.org/abstract/document/9646923/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09646923.pdf (Analysis of loop tiling, loop reordering, data flow, recomputation, and layer fusion.)

For more research on layer fusion, refer to https://www.aussieai.com/research/layer-fusion.

 
