Aussie AI
Parameter Sharing
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
What is Parameter Sharing?
Parameter sharing, also called "weight sharing", is the use of the same parameters by different structures of the Transformer. Parameter sharing and pruning are similar techniques, both being forms of model compression, but they are not the same. Pruning avoids doing some computations, whereas parameter sharing still does all the computations, but with shared parameters (reducing the total number of stored weights).
Each layer of the Transformer typically has its own set of weights for each structure. When the same set of weights is used across multiple layers, this is a type of layer fusion, and is conceptually similar to layer pruning. However, note that layer pruning reduces the number of layers that are executed, whereas layerwise parameter sharing does not (although the two ideas can be combined).
Parameter sharing reduces the total number of weights to be stored, thereby reducing model size. Since the operation of loading the model into memory for arithmetic operations is itself costly (sometimes called "overhead"), and Transformers are sometimes memory-bound rather than CPU-bound (i.e, in the decoding phase), sharing the data can also sometimes reduce latency and improve inference throughput, even though it is not actually reducing the number of computations.
Training time can also be improved by parameter sharing, as there are fewer parameters to train. Obviously, this architecture requires a non-standard extension to the normal Transformer training algorithms.
Types of Parameter Sharing
Parameters can be shared for structures such as:
- Layer fusion (sharing all weights in a layer or "layerwise parameter sharing")
- Attention head fusion (a type of "width-wise parameter sharing")
- Feed-forward network (FFN) parameter sharing
- Lengthwise parameter sharing (see token pruning and token merging)
- KV cache data layer fusion
A variant is more granular weight sharing limited to subcomponents within a layer, rather than every weight in a layer. If the FFN weights are shared, this is similar to FFN pruning. Similarly, sharing attention head weights is akin to head pruning.
The ideas of layer fusion can also be used for the KV cache data layers in the method of KV cache layer fusion, as a type of KV cache compression that reduces the size of the KV cache. It is also possible to fuse the KV heads, using weight sharing on the width dimension.
Whole Model Weight Sharing
Large-scale parameter sharing can occur for an entire set of weights in an LLM. The following techniques aren't really considered to be types of parameter sharing, and yet, they really should be!
- Multi-LoRA architectures (they share an entire set of weights of a big model).
- Self-speculative decoding (the drafter and verifier share model weights).
- Submodels (many-in-one)
- Mixture-of-experts (where it's within a large model).
Layer Fusion
Layer fusion is the sharing of weights for entire layers of a model. See also layer fusion and also layer pruning. Research papers include:
- Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov, 19 May 2024, Your Transformer is Secretly Linear, https://arxiv.org/abs/2405.12250 (Replacing model layers in the decoder with linear approximations.)
- Haoyi Wu, Kewei Tu, 17 May 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Use the KV cache for only the final layer as the KV cache for all other layers, or alternatively, use only the cache from a few layers, also possibly using a few standard layers as "warmup layers". This idea is conceptuatlly similar to "propagation" of the KV cache in early exit methods or to layer fusion of weights.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme for drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
- Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, Jeff Dean, Nov 2022, Efficiently Scaling Transformer Inference, Google Research, https://arxiv.org/abs/2211.05102
- Osorio, J.; Armejach, A.; Petit, E.; Henry, G.; Casas, M., A BF16 FMA is All You Need for DNN Training. IEEE Trans. Emerg. Top. Comput. 2022, 10, 1302–1314. http://dx.doi.org/10.1109/TETC.2022.3187770 https://ieeexplore.ieee.org/document/9823406 (Special fused operators to allow full training using BF16 number representations.)
- Y Hu, J Zhang, C Zhao, C Li, H Chen, 2023, Transformer Compression via Subspace Projection, arXiv preprint arXiv:2308.16475, https://arxiv.org/abs/2308.16475
- Raj Dabre, Raphael Rubino, and Atsushi Fujita. 2020. Balancing cost and benefit with tied-multi transformers. In Proceedings of the Fourth Workshop on Neural Generation and Translation, pages 24–34, Online. Association for Computational Linguistics. https://arxiv.org/abs/2002.08614 (Choose number of layers for encoder and decoder based on input; dynamic layer pruning)
- Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
- Meng Wang; Liang Qian; Na Meng; Yusong Cheng; Weiwei Fang, Nov 2023, Model Parallelism Optimization for Distributed DNN Inference on Edge Devices, 2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP), https://ieeexplore.ieee.org/abstract/document/10391646 (Distributes inference across multiple edge devices at the layer level, with further optimization using layer fusion.)
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981
- Chen, Yilong ; Zhang, Linhao ; Shang, Junyuan ; Zhang, Zhenyu ; Liu, Tingwen ; Wang, Shuohuan ; Sun, Yu, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
- NVIDIA, 2023, NVIDIA FasterTransformer, https://github.com/NVIDIA/FasterTransformer
- Z Gong, H Ji, Y Yao, CW Fletcher, CJ Hughes, 2022, Graphite: optimizing graph neural networks on CPUs through cooperative software-hardware techniques, https://dl.acm.org/doi/abs/10.1145/3470496.3527403 https://dl.acm.org/doi/pdf/10.1145/3470496.3527403
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, https://arxiv.org/abs/2405.14366 (Compresses the KV cache on the depth dimension of layers, analogous to layer fusion.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied Transformers: Neural machine translation with shared encoder and decoder. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5466–5473, Honolulu, USA. https://aaai.org/ojs/index.php/AAAI/article/view/4487
- S. Sun, Y. Cheng, Z. Gan, J. Liu, Patient knowledge distillation for BERT model compression, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4322–4331. URL: https://www.aclweb.org/anthology/D19-1441. doi:10.18653/v1/D19-1441.
- J Yang, Y Yin, L Yang, S Ma, H Huang, 2022, Gtrans: Grouping and fusing transformer layers for neural machine translation, https://arxiv.org/pdf/2207.14467
- Y Zheng, L Lin, Z Lai, B Wang, S Liu, B Fu, 2023, Layer-wise Representation Fusion for Compositional Generalization, https://arxiv.org/abs/2307.10799
- Ruisi Cai1, Saurav Muralidharan, Greg Heinrich, Hongxu Yin, Zhangyang Wang, Jan Kautz, Pavlo Molchanov, 2024, FLEXTRON: Many-in-One Flexible Large Language Model, https://openreview.net/pdf?id=9vKRhnflAs (Using one model to act in different ways by making it "elastic" with parameters, effectively using slimming via techniques such as layer fusion in MLPs and MHA Attention Heads.)
- Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro, 19 Jun 2024 (v2), TroL: Traversal of Layers for Large Language and Vision Models, https://arxiv.org/abs/2406.12246 https://arxiv.org/pdf/2406.12246 (To achieve higher accuracy, this model re-traverses some of the layers, which achieves higher model accuracy from the same size model without more memory.)
- Francesco Daghero, Alessio Burrello, Massimo Poncino, Enrico Macii, Daniele Jahier Pagliari, 18 Jun 2024, Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices, SAMOS2024 conference, https://arxiv.org/abs/2406.12478 Code: https://github.com/eml-eda/depthwise-separable-fusion
- Jiachen Jiang, Jinxin Zhou, Zhihui Zhu, 20 Jun 2024, On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier, https://arxiv.org/abs/2406.14479 (Using layer similarity for early exit classifiers, which is also related to layer fusion.)
- Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Xi Chen, Cunhang Fan, Zhao Lv, Zhiying Tu, Dianhui Chu, Bo Li, Dianbo Sui, 24 Jun 2024, Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging, https://arxiv.org/abs/2406.16330
- Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similariy in model layers.)
- Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 27 Jun 2024 (v2), MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, Meta Research, https://arxiv.org/abs/2402.14905 Code: https://github.com/facebookresearch/MobileLLM
- Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, Tianlong Chen, 3 Jul 2024, DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs. https://arxiv.org/abs/2407.11030
- Jinuk Kim, Marwa El Halabi, Mingi Ji, Hyun Oh Song, July 2024, LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:23825-23842, 2024, https://proceedings.mlr.press/v235/kim24c.html PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/kim24c/kim24c.pdf Code: https://github.com/snu-mllab/LayerMerge
- Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
- Hossein Rajabzadeh, Aref Jafari, Aman Sharma, Benyamin Jami, Hyock Ju Kwon, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh, 22 Sep 2024, EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models, https://arxiv.org/abs/2409.14595
- Douglas C. Youvan, September 27, 2024, Building and Running Large-Scale Language Models: The Infrastructure and Techniques Behind GPT-4 , https://www.researchgate.net/profile/Douglas-Youvan/publication/384398902_Building_and_Running_Large-Scale_Language_Models_The_Infrastructure_and_Techniques_Behind_GPT-4/links/66f6f4d3906bca2ac3d20e68/Building-and-Running-Large-Scale-Language-Models-The-Infrastructure-and-Techniques-Behind-GPT-4.pdf
- Anonymous authors, Oct 2024, Forget the Data and Fine-Tuning! Just Fold the Network to Compress, https://openreview.net/pdf?id=W2Wkp9MQsF
- Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan, 23 Oct 2024, Value Residual Learning For Alleviating Attention Concentration In Transformers, https://arxiv.org/abs/2410.17897
- Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster, 28 Oct 2024, Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA, https://arxiv.org/abs/2410.20672
- David Koeplinger, Darshan Gandhi, Pushkar Nandkar, Nathan Sheeley, Matheen Musaddiq, Leon Zhang, Reid Goodbar, Matthew Shaffer, Han Wang, Angela Wang, Mingran Wang, Raghu Prabhakar, 31 Oct 2024, Kernel Looping: Eliminating Synchronization Boundaries for Peak Inference Performance, https://arxiv.org/abs/2410.23668
- Y Zhou, C Zhou, W Xie, X Wang, J Chen, Z Ni, J Li, 2024, The Benefits in Shallow: Merge Decoding Across Large Language Model Layers. In: Wong, D.F., Wei, Z., Yang, M. (eds) Natural Language Processing and Chinese Computing. NLPCC 2024. Lecture Notes in Computer Science(), vol 15360. Springer, Singapore. https://doi.org/10.1007/978-981-97-9434-8_30 https://link.springer.com/chapter/10.1007/978-981-97-9434-8_30
- Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen, 11 Nov 2024, Anchor Attention, Small Cache: Code Generation with Large Language Models, https://arxiv.org/abs/2411.06680
- Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu, 21 Nov 2024, FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers, https://arxiv.org/abs/2411.14507
- Seul-Ki Yeom, Tae-Ho Kim, 3 Dec 2024, UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices, https://arxiv.org/abs/2412.02344 (Shared attention matrix generalizes MHA with fused attention matrixes across layers.)
KV Cache Layer Fusion
Layer fusion applied to KV cache data is another type of parameter sharing within the KV data. Read more about KV caching optimizations.
Research papers on KV cache layer fusion:
- Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, Bohan Zhuang, 23 May 2024, MiniCache: KV Cache Compression in Depth Dimension for Large Language Models, https://arxiv.org/abs/2405.14366 (Compresses the KV cache on the depth dimension of layers, analogous to layer fusion.)
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Haoyi Wu, Kewei Tu, 4 Jun 2024 (v2), Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Only computes the KV cache of some layers.)
- Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi, 19 Jul 2024, LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference, https://arxiv.org/abs/2407.14057
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- AIModels.FYI, 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://www.aimodels.fyi/papers/arxiv/layer-condensed-kv-cache-efficient-inference-large
- Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao, 13 Aug 2024 (v3), Keep the Cost Down: A Review on Methods to Optimize LLM' s KV-Cache Consumption, https://arxiv.org/abs/2407.18003 https://github.com/zcli-charlie/Awesome-KV%20Cache
- Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
- Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao, 18 Sep 2024, Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview, https://arxiv.org/abs/2409.11650 (Extensive survey of quantization from the basics to SOTA approaches, with also some coverage of knowledge distillation and KV cache compression.)
- Shashank Rajput, Ying Sheng, Sean Owen, Vitaliy Chiley, 23 Sep 2024, Inference-Friendly Models With MixAttention, https://arxiv.org/abs/2409.15012 (Attention optimization with sliding window attention and KV cache layer fusion, inspired by the approach of Character AI.)
- Aurick Qiao, Zhewei Yao, Samyam Rajbhandari, Yuxiong He, 4 Oct 2024, SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation, https://arxiv.org/abs/2410.03960
- You Wu, Haoyi Wu, Kewei Tu, 18 Oct 2024, A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference, https://arxiv.org/abs/2410.14442
- Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
- Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, Zhenzhong Lan, 23 Oct 2024, Value Residual Learning For Alleviating Attention Concentration In Transformers, https://arxiv.org/abs/2410.17897
- Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen, 24 Oct 2024, KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing, https://arxiv.org/abs/2410.18517
- Xingwu Sun, Yanfeng Chen, Yiqing Huang, Ruobing Xie, Jiaqi Zhu, Kai Zhang, Shuaipeng Li, Zhen Yang, Jonny Han, Xiaobo Shu, Jiahao Bu, (and many more authors), 4 Nov 2024, Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent, https://arxiv.org/abs/2411.02265 https://github.com/Tencent/Hunyuan-Large https://huggingface.co/tencent/Tencent-Hunyuan-Large
- Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
- AI: Alan Wake, Albert Wang, Bei Chen, C.X. Lv, Chao Li, Chengen Huang, Chenglin Cai, Chujie Zheng, Daniel Cooper, Ethan Dai, Fan Zhou, Feng Hu, Heng Ji, Howard Qiu, Jiangcheng Zhu, Jun Tian, Katherine Su, Lihuan Zhang, Liying Li, Ming Song, Mou Li, Peng Liu, Qichen Hu, Shawn Wang, Shijun Zhou, Shiyong Li, Tianhang Zhu, Wen Xie, Xiang He, Xiaobo Chen, Xiaohui Hu, Xiaoyi Ren, Xinyao Niu, Yanpeng Li, Yongke Zhao, Yongzhen Luo, Yuchi Xu, Yuxuan Sha, Zhaodong Yan, Zhiyuan Liu, Zirui Zhang, 3 Dec 2024 (v2), Yi-Lightning Technical Report, https://arxiv.org/abs/2412.01253 https://platform.lingyiwanwu.com/ (MoE architecture with model expert routing optimizations, also with hybrid global-local attention and fused layers in the KV caching.)
- Da Ma, Lu Chen, Situo Zhang, Yuxun Miao, Su Zhu, Zhi Chen, Hongshen Xu, Hanqi Li, Shuai Fan, Lei Pan, Kai Yu, 3 Dec 2024, Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity, https://arxiv.org/abs/2412.02252
Fused Head Attention Research
Research papers on widthwise parameter sharing via fused attention heads:
- Jielin Jiang, Hongxiang Xu, Xiaolong Xu, Yan Cui, Jintao Wu. Transformer-Based Fused Attention Combined with CNNs for Image Classification. Neural Processing Letters (2023). https://doi.org/10.1007/s11063-023-11402-1 https://link.springer.com/article/10.1007/s11063-023-11402-1
- Xiang Zhang and Lijun Yin, Mar 2022, Multi-Modal Learning for AU Detection Based on Multi-Head Fused Transformers, https://arxiv.org/pdf/2203.11441.pdf
- NVIDIA, 2023, fused_attn.h: Enums and functions for fused attention, https://nvidia.github.io/TransformerEngine/api/c/fused_attn.html
- Chen, Yilong ; Zhang, Linhao ; Shang, Junyuan ; Zhang, Zhenyu ; Liu, Tingwen ; Wang, Shuohuan ; Sun, Yu, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
- Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao, 23 Jun 2024 (v2), Efficient LLM inference solution on Intel GPU, https://arxiv.org/abs/2401.05391 (Disaggregated the KV cache between prefill and decoding tokens, since theh KV cache size is known for prefill, thereby reducing memory fragmentation, and also applying kernel fusion to several modules include the scaled dot product attention.)
- Agarwal, Saurabh, Aug 2024, Minimizing Data Movement in Machine Learning Systems, Ph.D. Thesis, Computer Sciences, University of Wisconsin--Madison, https://digital.library.wisc.edu/1711.dl/MKLIYRPB24A5R9D https://search.library.wisc.edu/digital/AMKLIYRPB24A5R9D PDF: https://asset.library.wisc.edu/1711.dl/QXSTVAIXECHQA8L/R/file-62b54.pdf?dl https://www.proquest.com/openview/c1ae2a92106d7ec681a7296cd163e0c1/1 (Dataflow optimization in training and also "clustered head attention" for memory-efficient inference, an extension of multi-head attention similar to layer-wise head fusion/pruning.)
- Janek Haberer, Ali Hojjat, Olaf Landsiedel, 26 Sep 2024, HydraViT: Stacking Heads for a Scalable ViT, https://arxiv.org/abs/2409.17978 https://github.com/ds-kiel/HydraViT
- Anonymous authors, Oct 2024, Forget the Data and Fine-Tuning! Just Fold the Network to Compress, https://openreview.net/pdf?id=W2Wkp9MQsF
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, Pavlo Molchanov, 20 Nov 2024, Hymba: A Hybrid-head Architecture for Small Language Models, https://arxiv.org/abs/2411.13676
KV Fused Head Research
Research papers on KV head fusion or merging:
- Hao Yu, Zelan Yang, Shen Li, Yong Li, Jianxin Wu, 11 Jun 2024, Effectively Compress KV Heads for LLM, https://arxiv.org/abs/2406.07056
- Zhen Yang, J.N.Han, Kan Wu, Ruobing Xie, An Wang, Xingwu Sun, Zhanhui Kang, 20 Oct 2024, Lossless KV Cache Compression to 2%, https://arxiv.org/abs/2410.15252
- Yao Yao, Zuchao Li, Hai Zhao, 21 May 2024, SirLLM: Streaming Infinite Retentive LLM, https://arxiv.org/abs/2405.12528 (Low-rank decomposition to compress KV cache heads.)
- Isaac Rehg, 7 Oct 2024 (v2), KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head, https://arxiv.org/abs/2410.00161
- Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao, 28 Oct 2024 (v2), Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, https://arxiv.org/abs/2410.19258
- Yuxiang Huang, Binhang Yuan, Xu Han, Chaojun Xiao, Zhiyuan Liu, 2 Oct 2024, Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads, https://arxiv.org/abs/2410.01805
- Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji, 13 Jun 2024, MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding, https://arxiv.org/abs/2406.09297 Code: https://github.com/zaydzuhri/pythia-mlkv (Extends cross-head KV sharing in MQA to also share KV data between layers, analogous to layer fusion of weights.)
FFN Parameter Sharing
The FFN weights can be shared in a "fused FFN" optimization, similar to FFN pruning. Research papers include:
- Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
- Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
- Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama (Shared FFN layers, similar to pruning several FFNs, for on-mobile small model execution.)
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
General Research on Parameter Sharing
There have been many attempts to speed up models using parameter sharing:
- Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
- Qian Lou, Ting Hua, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. 2022. Dictformer: Tiny transformer with shared dictionary. In International Conference on Learning Representations. https://sra.samsung.com/publications/dictformer-tiny-transformer-with-shared-dictionary/ (Effectively shares parameters by using dictionary lookups.)
- Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2012.14913 (Explores how FFN's work in depth, with relevance to sharing FFN weights.)
- Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes "shared layers" with shared decoder FFN weights.)
- Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Subformer: Exploring weight sharing for parameter efficiency in generative transformers. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4081–4090, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2101.00234 (Parameter sharing across layers.)
- Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.04010 (Detailed analysis finding redundancy in 85% of parameters, with relevance to pruning and sharing.)
- Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In International Conference on Learning Representations. https://arxiv.org/abs/1807.03819 (Optimizes Transformers with weight sharing and other ways.)
- Sho Takase and Shun Kiyono. 2023. Lessons on parameter sharing across layers in transformers. In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 78–90, Toronto, Canada (Hybrid). Association for Computational Linguistics. https://arxiv.org/abs/2104.06022
- Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations. In Proceedings of ICLR. https://arxiv.org/abs/1909.11942 (Parameter sharing across layers in the BERT Transformer architecture.)
- Raj Dabre and Atsushi Fujita. 2019. Recurrent stacking of layers for compact neural machine translation models. Proceedings of AAAI, 33:6292–6299. https://arxiv.org/abs/1807.05353 (Parameter sharing across layers of a Transformer.)
- Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer. In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads.)
- Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied transformers: Neural machine translation with shared encoder and decoder. Proceedings of AAAI, 33(01):5466–5473. PDF: https://taoqin.github.io/papers/tiedT.AAAI2019.pdf
- Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Contains several sections surveying weight sharing.)
- Chu, X.; Zhang, B.; Xu, R. FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12219–12228. http://dx.doi.org/10.1109/ICCV48922.2021.01202, https://arxiv.org/abs/1907.01845 (NAS in the context of weight sharing architectures.)
- Aich, S.; Yamazaki, M.; Taniguchi, Y.; Stavness, I., Multi-Scale Weight Sharing Network for Image Recognition. Pattern Recognit. Lett. 2020, 131, 348–354. http://dx.doi.org/10.1016/j.patrec.2020.01.011, https://arxiv.org/abs/2001.02816
- Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
- M Mary Shanthi Rani, P Chitra, S Lakshmanan, M Kalpana Devi, R Sangeetha, S Nithya, 2022, DeepCompNet: A novel neural net model compression architecture, Comput Intell Neurosci. 2022 Feb 22;2022:2213273. https://pubmed.ncbi.nlm.nih.gov/35242176/, https://www.hindawi.com/journals/cin/2022/2213273/ (Combines quantization and pruning with weight sharing.)
- Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. https://arxiv.org/abs/1902.00751
- X Wang, P Guo, Y Zhang, 2023, Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer, ECML PKDD 2023: Machine Learning and Knowledge Discovery in Databases: Research Track pp 309–325, https://arxiv.org/abs/2201.05887 (Attention optimization method that uses weight sharing.)
- Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150 (Multi-query attention shares KV tensors across multiple attention heads.)
- Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal Transformers. In Proceedings of ICLR. https://openreview.net/forum?id=HyzdRiR9Y7, PDF: https://openreview.net/pdf?id=HyzdRiR9Y7
- C Fu, 2023, Machine Learning Algorithm and System Co-design for Hardware Efficiency, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/content/qt52q368p3/qt52q368p3.pdf
- S Tan, Y Shen, Z Chen, A Courville, C Gan, Oct 2023, Sparse Universal Transformer, arXiv preprint arXiv:2310.07096, https://arxiv.org/pdf/2310.07096.pdf
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, https://arxiv.org/abs/2310.12072 (Speculatively executing multiple future tokens in parallel to the current token, by using multiple tokens with high probability from the early layers of inference of the current token in the model. This allows multiple speculations of the autoregressive inference of the next token to start before the current token is finished.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- David R. So, Wojciech Mańke, Hanxiao Liu, Zihang Dai, Noam Shazeer, Quoc V. Le, Jan 2022, Primer: Searching for Efficient Transformers for Language Modeling, https://arxiv.org/abs/2109.08668
- Hesen Chen, Ming Lin, Xiuyu Sun, Qian Qi, Hao Li, and Rong Jin. 2019. Muffnet: Multi-layer feature federation for mobile deep learning. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. https://ieeexplore.ieee.org/document/9022559 PDF: https://openaccess.thecvf.com/content_ICCVW_2019/papers/CEFRL/Chen_MuffNet_Multi-Layer_Feature_Federation_for_Mobile_Deep_Learning_ICCVW_2019_paper.pdf
- Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
- Rene Bidart, Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks, 2023, Ph.D. thesis, University of Waterloo, https://uwspace.uwaterloo.ca/bitstream/handle/10012/19682/Bidart_Rene.pdf?sequence=1
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, 2016, Eie: Efficient inference engine on compressed deep neural network, in Proceedings of the 43rd International Symposium on Computer Architecture, ser. ISCA ’16. Piscataway, NJ, USA, IEEE Press, 2016, pp. 243–254, https://doi.org/10.1109/ISCA.2016.30 https://arxiv.org/abs/1602.01528
- Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
- Salar Shakibhamedan, Amin Aminifar, Nima TaheriNejad, Axel Jantsch, 2024, EASE: Energy Optimization through Adaptation — A Review of Runtime Energy-Aware Approximate Deep Learning Algorithms, https://eclectx.org/Publications/2024_M13.pdf (Survey paper on techniques for adaptive inference with a focus on approximations of inference, including loop performance, stochastic algorithms, approximate arithmetic, quantization, pruning and low-rank.)
- C Hooper, S Kim, H Mohammadzadeh, H Genc, Oct 2023, SPEED: Speculative Pipelined Execution for Efficient Decoding https://arxiv.org/pdf/2310.12072.pdf
- Chen, Yilong ; Zhang, Linhao ; Shang, Junyuan ; Zhang, Zhenyu ; Liu, Tingwen ; Wang, Shuohuan ; Sun, Yu, June 2024, DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion, https://arxiv.org/abs/2406.06567 https://ui.adsabs.harvard.edu/abs/2024arXiv240606567C/abstract
- David Spuler, March 2024, Chapter 46. Structured Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Songwei Liu, Chao Zeng, Lianqiang Li, Chenqian Yan, Lean Fu, Xing Mei, Fangmin Chen, 1 Jul 2024, FoldGPT: Simple and Effective Large Language Model Compression Scheme, https://arxiv.org/abs/2407.00928 (Identifies block-level similariy in model layers.)
- Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 27 Jun 2024 (v2), MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, Meta Research, https://arxiv.org/abs/2402.14905 Code: https://github.com/facebookresearch/MobileLLM
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Bingli Liao, Danilo Vasconcellos Vargas, 13 Jul 2024, Beyond KV Caching: Shared Attention for Efficient LLMs, https://arxiv.org/abs/2407.12866 (Layerwise weight sharing in attention.)
- Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling, 26 Aug 2024, On-Device Language Models: A Comprehensive Review, https://arxiv.org/abs/2409.00088 https://github.com/NexaAI/Awesome-LLMs-on-device https://www.nexaai.com/models
- Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster, 28 Oct 2024, Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA, https://arxiv.org/abs/2410.20672
- Seul-Ki Yeom, Tae-Ho Kim, 3 Dec 2024, UniForm: A Reuse Attention Mechanism Optimized for Efficient Vision Transformers on Edge Devices, https://arxiv.org/abs/2412.02344 (Shared attention matrix generalizes MHA with fused attention matrixes across layers.)
More AI Research
Read more about:
- Layer fusion
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Normalization pruning
- Length pruning
- Width pruning
- Channel pruning
- « Research Home