Parameter and Weight Sharing

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Parameter sharing, also called “weight sharing”, is the re-use of the same parameters by different structures of the Transformer. Parameters can be shared between attention heads or across feed-forward networks. There are fewer weights to store than in the original model, because some are shared.

Parameter sharing and pruning are similar techniques, both being forms of model compression, but they are not the same. For example, consider the layers. Each layer of the default Transformer typically has its own set of weights for each structure. When the same set of weights is used across multiple layers, this is called layer fusion, and is conceptually similar to layer pruning. However, note that layer pruning reduces the number of layers that are executed, whereas layerwise parameter sharing does not, although the two ideas can be combined.
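
As a rough illustration of the data layout (a hypothetical sketch, not code from any particular engine), the snippet below builds a model in which groups of layers point to one shared weight block: every layer still executes, but only a fraction of the weight blocks are actually stored.

    // Layerwise parameter sharing sketch (hypothetical names and types).
    // All layers still execute, but each group of layers points to one
    // shared weight block instead of owning a private copy.
    #include <memory>
    #include <vector>

    struct LayerWeights {                      // weights for one Transformer layer
        std::vector<float> wq, wk, wv, wo;     // attention projections
        std::vector<float> w1, w2;             // feed-forward network weights
    };

    struct TransformerLayer {
        std::shared_ptr<const LayerWeights> weights;  // possibly shared with other layers
        // forward() would read *weights; the number of layers executed is unchanged
    };

    // Build a model where every group of 'group_size' layers shares one weight
    // block, e.g. 12 layers with group_size 3 stores only 4 blocks instead of 12.
    std::vector<TransformerLayer> build_shared_model(int num_layers, int group_size) {
        std::vector<TransformerLayer> layers(num_layers);
        std::shared_ptr<const LayerWeights> current;
        for (int i = 0; i < num_layers; ++i) {
            if (i % group_size == 0)           // start a new shared block
                current = std::make_shared<LayerWeights>();
            layers[i].weights = current;       // re-use the current block
        }
        return layers;
    }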

Parameter sharing reduces the total number of weights to be stored, thereby reducing model size. The model file is smaller and model loading is faster. However, the bigger gain is that Transformers are often memory-bound rather than CPU-bound, so the reduced volume of data transferred from memory can also reduce latency and improve inference throughput.
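
For a concrete sense of the size reduction, here is a toy calculation (illustrative numbers only, not measurements from any real model): sharing one weight block across each group of k layers stores roughly L/k blocks instead of L.

    // Illustrative parameter-count calculation for layerwise sharing
    // (example numbers only, not from a real model).
    #include <cstdio>

    int main() {
        const long long num_layers = 32;                 // layers executed at inference
        const long long params_per_layer = 200000000LL;  // weights per layer (example)
        const long long group_size = 4;                  // layers sharing one weight block

        long long baseline = num_layers * params_per_layer;                // 6.4 billion
        long long num_blocks = (num_layers + group_size - 1) / group_size; // ceil(L/k) = 8
        long long shared = num_blocks * params_per_layer;                  // 1.6 billion

        std::printf("baseline: %lld parameters\n", baseline);
        std::printf("shared:   %lld parameters (%.0f%% of baseline)\n",
                    shared, 100.0 * shared / baseline);                    // 25%
        return 0;
    }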

Training time can also be improved by parameter sharing, as there are fewer parameters to train. Obviously, this architecture requires a non-standard extension to the normal Transformer training algorithms.

Parameter sharing has mainly appeared in the literature as “layer fusion”, an alternative to layer pruning. However, there is conceptually no reason why any other type of structured pruning could not be modified to use parameter sharing for that structure: share weights instead of pruning them. Sharing FFN weights is similar to FFN pruning, and sharing attention head weights is akin to head pruning. None of these parameter sharing approaches is arithmetically faster, since they perform the same number of operations as the original model, but they reduce model size and memory usage, which also improves speed on a memory-bound engine. These other types of “structured parameter sharing” seem an interesting, under-researched area in need of contributions.
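
As one example of what structured parameter sharing could look like for attention heads, the sketch below (hypothetical types and names) gives each head its own query projection but points all heads at a single shared key/value projection, loosely in the spirit of multi-query attention (Shazeer, 2019, listed below), so the K/V weights are stored only once.

    // Sketch of shared attention-head projections (hypothetical layout).
    // Each head keeps its own query weights, but every head points to a
    // single shared key/value projection, so K/V weights are stored once.
    #include <memory>
    #include <vector>

    struct Projection {
        std::vector<float> w;   // flattened weight matrix
    };

    struct AttentionHead {
        std::unique_ptr<Projection> wq;          // per-head query projection
        std::shared_ptr<const Projection> wkv;   // key/value projection shared by all heads
    };

    std::vector<AttentionHead> build_heads(int num_heads, int d_model, int d_head) {
        auto shared_kv = std::make_shared<Projection>();
        shared_kv->w.resize(2ull * d_model * d_head);   // K and V packed into one block
        std::vector<AttentionHead> heads(num_heads);
        for (auto& h : heads) {
            h.wq = std::make_unique<Projection>();
            h.wq->w.resize(1ull * d_model * d_head);    // unique Q weights per head
            h.wkv = shared_kv;                          // same K/V block for every head
        }
        return heads;
    }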

Research papers on parameter sharing:

  1. Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  2. Qian Lou, Ting Hua, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. 2022. Dictformer: Tiny transformer with shared dictionary, In International Conference on Learning Representations. https://sra.samsung.com/publications/dictformer-tiny-transformer-with-shared-dictionary/ (Effectively shares parameters by using dictionary lookups.)
  3. Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories, In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2012.14913 (Explores how FFNs work in depth, with relevance to sharing FFN weights.)
  4. Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation, In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes “shared layers” with shared decoder FFN weights.)
  5. Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Subformer: Exploring weight sharing for parameter efficiency in generative transformers, In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4081–4090, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2101.00234 (Parameter sharing across layers.)
  6. Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models, In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.04010 (Detailed analysis finding redundancy in 85% of parameters, with relevance to pruning and sharing.)
  7. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal Transformers, In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1807.03819, https://openreview.net/forum?id=HyzdRiR9Y7 (Optimizes Transformers with weight sharing and other techniques.)
  8. Sho Takase and Shun Kiyono. 2023. Lessons on parameter sharing across layers in transformers, In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 78–90, Toronto, Canada (Hybrid). Association for Computational Linguistics. https://arxiv.org/abs/2104.06022
  9. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for self-supervised learning of language representations, In Proceedings of ICLR. https://arxiv.org/abs/1909.11942 (Parameter sharing across layers in the BERT Transformer architecture.)
  10. Raj Dabre and Atsushi Fujita. 2019. Recurrent stacking of layers for compact neural machine translation models, Proceedings of AAAI, 33:6292–6299. https://arxiv.org/abs/1807.05353 (Parameter sharing across layers of a Transformer.)
  11. Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer, In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads.)
  12. Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied transformers: Neural machine translation with shared encoder and decoder, Proceedings of AAAI, 33(01):5466–5473. PDF: https://taoqin.github.io/papers/tiedT.AAAI2019.pdf
  13. Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, Nov 2021, A Survey on Green Deep Learning, https://arxiv.org/abs/2111.05193 (Contains several sections surveying weight sharing.)
  14. Chu, X.; Zhang, B.; Xu, R., 2021, FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search, In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12219–12228. http://dx.doi.org/10.1109/ICCV48922.2021.01202, https://arxiv.org/abs/1907.01845 (NAS in the context of weight sharing architectures.)
  15. Aich, S.; Yamazaki, M.; Taniguchi, Y.; Stavness, I., 2020, Multi-Scale Weight Sharing Network for Image Recognition, Pattern Recognit. Lett. 2020, 131, 348–354. http://dx.doi.org/10.1016/j.patrec.2020.01.011, https://arxiv.org/abs/2001.02816
  16. Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
  17. M Mary Shanthi Rani, P Chitra, S Lakshmanan, M Kalpana Devi, R Sangeetha, S Nithya, 2022, DeepCompNet: A novel neural net model compression architecture, Comput Intell Neurosci. 2022 Feb 22;2022:2213273. https://pubmed.ncbi.nlm.nih.gov/35242176/, https://www.hindawi.com/journals/cin/2022/2213273/ (Combines quantization and pruning with weight sharing.)
  18. Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp, In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. https://arxiv.org/abs/1902.00751
  19. X Wang, P Guo, Y Zhang, 2023, Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer, ECML PKDD 2023: Machine Learning and Knowledge Discovery in Databases: Research Track pp 309–325, https://arxiv.org/abs/2201.05887 (Attention optimization method that uses weight sharing.)
  20. Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150 (Multi-query attention shares KV tensors across multiple attention heads.)
  21. C Fu, 2023, Machine Learning Algorithm and System Co-design for Hardware Efficiency, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/content/qt52q368p3/qt52q368p3.pdf
  22. S Tan, Y Shen, Z Chen, A Courville, C Gan, Oct 2023, Sparse Universal Transformer, arXiv preprint arXiv:2310.07096, https://arxiv.org/pdf/2310.07096.pdf

For more research on parameter sharing, see https://www.aussieai.com/research/parameter-sharing and also on layer fusion at https://www.aussieai.com/research/layer-fusion.

 
