Aussie AI

46. Structured Pruning

Book Excerpt from "Generative AI in C++"

by David Spuler, Ph.D.

“Do, or do not. There is no try.”

— Yoda, The Empire Strikes Back, 1980.

What is Structured Pruning?

Structured pruning is the removal of whole “structures” in a model. For example, “layer pruning” removes whole layers, and “attention head pruning” removes attention heads. This is different from unstructured pruning, which removes the smaller weights wherever they happen to be in the model, but many of the goals are the same:

Smaller model (model compression)
Reduced memory usage
Faster inference

Structured pruning differs from unstructured pruning (e.g. magnitude pruning) in that we don't care about the values of the weights. All of the weights in a pruned structure are removed, regardless of their magnitude.

That being said, we might analyze the weights to decide which structure to prune, in some types of structured pruning algorithms. So, the value of the weights may be considered in the pruning decision, but once we've decided to prune a particular structure from the model, then all of its weights are gone.

However, generally speaking, most of the research papers use more sophisticated decision making. The number of zero or tiny weights in a structure is a fixed static metric that isn't very useful. If doing static structural pruning, it is more powerful to instrument tests of inference execution, so as to determine which of the structures are being most used in determining inference results, and pruning any structures that aren't pulling their weight. For dynamic structural pruning, there are various algorithms to decide which structures to skip for a particular user query.
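For reference, the simple weight-based metric mentioned above can be sketched in a few lines of C++. This is only an illustration of the weak static baseline, not a recommended method. The `Structure` type, the function names, and the choice of an L1 norm as the score are all assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical representation: each structure (e.g. one attention head or
// one layer) is simply a flat array of its weights.
using Structure = std::vector<float>;

// Sum of absolute weight values (L1 norm) for one structure.
float l1_norm(const Structure& s) {
    float sum = 0.0f;
    for (float w : s) sum += std::fabs(w);
    return sum;
}

// Returns the index of the structure with the smallest L1 norm, which is
// the pruning candidate under this simplistic static metric.
std::size_t pick_structure_to_prune(const std::vector<Structure>& structures) {
    assert(!structures.empty());
    std::size_t best = 0;
    float best_score = l1_norm(structures[0]);
    for (std::size_t i = 1; i < structures.size(); ++i) {
        float score = l1_norm(structures[i]);
        if (score < best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```

Once a structure is chosen, all of its weights are removed, whatever their individual magnitudes. A usage-based method would replace `l1_norm` with a score measured during instrumented inference runs.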

Why Structured Pruning?

As we saw in Chapter 33, the downside of unstructured magnitude pruning was that it was inherently unpredictable which weights would be zero. This motivates “structured pruning” where we prune whole structures in the model, such as layers, filters, or channels. In structured pruning, we always know which parts of the model will be zeroed. It is a much more controlled type of pruning in a sense.

Smaller and Faster. There is no problem with storing a smaller model or running faster with structured pruning. If we remove a whole layer, as in layer pruning, then we simply don't store that entire layer in the model file. The speedup from structured pruning is also relatively easy to obtain, and proportional to what we've pruned. For example, with layer pruning, we simply don't run an entire layer at runtime. Changes to the Transformer's runtime inference algorithm are actually quite minor to implement.
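To show how minor the change to the inference loop can be, here is a minimal sketch of a layer loop that skips statically pruned layers. The `Layer` callback type and the `pruned` flag vector are hypothetical names for this example. A real engine would more likely omit pruned layers from the model file entirely, rather than carry a flag for them.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<float>;
using Layer = std::function<Vec(const Vec&)>;  // hypothetical layer interface

// Runs the layer stack, skipping any layer flagged as pruned.
// pruned[i] == true means layer i was removed by static layer pruning.
Vec run_layers(const std::vector<Layer>& layers,
               const std::vector<bool>& pruned,
               Vec activations) {
    assert(layers.size() == pruned.size());
    for (std::size_t i = 0; i < layers.size(); ++i) {
        if (pruned[i]) continue;  // pruned layer: no computation at all
        activations = layers[i](activations);
    }
    return activations;
}
```

If one layer out of three is flagged, only two layer computations run, so the saving is proportional to the number of layers pruned.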

Disadvantages. The downside of structured pruning is that it lacks flexibility, and the model cannot always overcome the loss of the pruned structures by re-training (i.e. post-pruning fine-tuning). There are various mitigations whereby we choose not to prune the most important structures. For example, research shows that the first layers of a Transformer are usually more important in doing the main analysis, whereas the final few layers do the finessing. However, even with such mitigations, the model inherently has fewer options to re-train itself to compensate for the removed weights. Hence, any type of structured pruning may make the model smaller and run faster, but also less accurate. Nevertheless, various types of structured pruning in research papers have achieved impressive results in terms of speedup with minimal accuracy degradation.

Combined structured and unstructured pruning. It is theoretically possible to combine both structured and unstructured pruning, since they are mostly orthogonal techniques. Structured pruning removes whatever structure is being pruned (with a dozen options!), and the remaining weights in the non-pruned structures could also be subjected to unstructured magnitude pruning, by simply zeroing any tiny fractional weights. I really don't recall having seen this yet in the research papers, but it may have been done, or someone might want to try it.
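A minimal sketch of the combination might look like the following, assuming each layer is just a flat vector of weights. The function names, the `remove_layer` flags, and the fixed threshold are illustrative choices for this example, not something taken from a particular paper or framework.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unstructured step: zero every weight whose magnitude is below threshold.
// Returns the number of weights that were newly zeroed.
std::size_t magnitude_prune(std::vector<float>& weights, float threshold) {
    assert(threshold >= 0.0f);
    std::size_t zeroed = 0;
    for (float& w : weights) {
        if (w != 0.0f && std::fabs(w) < threshold) {
            w = 0.0f;
            ++zeroed;
        }
    }
    return zeroed;
}

// Hypothetical combined pass: drop whole layers flagged in remove_layer
// (structured), then magnitude-prune the layers that remain (unstructured).
std::vector<std::vector<float>> combined_prune(
        const std::vector<std::vector<float>>& layers,
        const std::vector<bool>& remove_layer,
        float threshold) {
    assert(layers.size() == remove_layer.size());
    std::vector<std::vector<float>> kept;
    for (std::size_t i = 0; i < layers.size(); ++i) {
        if (remove_layer[i]) continue;            // whole layer is gone
        kept.push_back(layers[i]);
        magnitude_prune(kept.back(), threshold);  // zero the tiny weights
    }
    return kept;
}
```

The two steps do not interact: the structured step ignores weight values, and the unstructured step only looks at the weights that survived.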

Types of Structured Pruning

Pick a structure, any structure. Open up the standard vanilla Transformer research paper (Vaswani et al., 2017) and find the diagram of the architecture. Close your eyes, and poke your finger somewhere in that diagram. Open your eyes again. I can show you research papers on pruning of whatever structure you're pointing at, and sometimes hundreds of papers (e.g. early exit).

There's an odd thing, though: none of those types of structured pruning have gone mainstream. The vast majority of pruning capabilities in open source frameworks are simply for training-based unstructured pruning, such as magnitude pruning or movement pruning. I find this surprising since several of the structured pruning techniques show significant efficiency gains with modest loss of model accuracy.

The main types of structured pruning with significant research papers are:

Layer pruning
Early exit (i.e., dynamic layer pruning)
Attention head pruning
Channel pruning
Filter pruning
Token pruning

Some of the less commonly pruned Transformer components include:

Bias pruning
Embeddings pruning
FFN pruning
Normalization pruning
Softmax pruning
Positional encoding pruning

Did I miss any?

There are also some other notable techniques with the same goal of reducing the total number of weights, with some similarity to pruning:

Parameter sharing and layer fusion
Low-rank matrices

Smaller matrices have fewer weights, so another technique is to cut weights by using smaller matrices. Advanced matrix algebra can be used to factorize the large matrices into smaller “low-rank” matrices, with fewer rows and columns (hence, fewer weights). This idea applied to tensors is called “tensor decomposition.”
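As a rough illustration of the saving, suppose an m-by-n weight matrix W is approximated by the product of an m-by-r matrix A and an r-by-n matrix B, where r is much smaller than m and n. The sketch below computes W*x as A*(B*x) without ever forming W, and counts the stored weights in each form. The `Matrix` type and function names are made up for this example, and how A and B are obtained (e.g. by a factorization algorithm) is outside its scope.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Row-major dense matrix (hypothetical minimal type for this sketch).
struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;  // rows * cols elements
};

// y = M * x
Vec matvec(const Matrix& m, const Vec& x) {
    assert(x.size() == m.cols && m.data.size() == m.rows * m.cols);
    Vec y(m.rows, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r)
        for (std::size_t c = 0; c < m.cols; ++c)
            y[r] += m.data[r * m.cols + c] * x[c];
    return y;
}

// Low-rank form: W (m x n) is approximated by A (m x r) times B (r x n),
// so W*x is computed as A*(B*x).
Vec low_rank_matvec(const Matrix& a, const Matrix& b, const Vec& x) {
    assert(a.cols == b.rows);
    return matvec(a, matvec(b, x));
}

// Weights stored by the full matrix versus the two factors.
std::size_t full_params(std::size_t m, std::size_t n) { return m * n; }
std::size_t low_rank_params(std::size_t m, std::size_t n, std::size_t r) {
    return r * (m + n);
}
```

For example, with m = n = 4096 and an assumed rank of r = 64, the full matrix stores 16,777,216 weights, whereas the two factors store 524,288, which is 32 times fewer.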

Dynamic Structured Pruning

Dynamic pruning refers to pruning of network weights, links, or entire layers at runtime during inference. This differs from “static pruning” that is done offline during training, or in a post-training optimization, to create a modified model. The types of dynamic pruning may include:

Dynamic depth pruning: Skipping of inference of entire layers of the model using an “early exit” of the inference loop. See also depth pruning, layer pruning, layer skipping, layer fusion, and shallow decoders.
Dynamic width pruning: Dynamically reducing the “width” of the model based on the input. See width pruning, attention head pruning, channel pruning, filter pruning.
Dynamic length pruning: Adaptive to the input to modify internal dimensions related to tokens, embeddings, etc. See length pruning, token pruning, embeddings pruning, autoregressive algorithms.

Note that all types of dynamic pruning suffer some extra inference cost in the calculations that decide whether to prune or not. The hope is that the benefit of pruning will exceed the cost of decision logic. For example, choosing an “early exit” criterion for layer pruning will require extra computation at each layer, which is hopefully recouped by skipping layers often enough.
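The trade-off can be seen in a small sketch of an early-exit loop. Here the exit decision is a caller-supplied `Confidence` callback compared against a threshold. Both are placeholders for this example, since the actual exit criterion varies between research papers.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Vec = std::vector<float>;
using Layer = std::function<Vec(const Vec&)>;         // hypothetical layer interface
using Confidence = std::function<float(const Vec&)>;  // hypothetical exit metric

struct EarlyExitResult {
    Vec activations;
    std::size_t layers_run;
};

// Runs layers in order, stopping as soon as the confidence metric reaches
// the threshold. The metric is evaluated after every layer that runs.
EarlyExitResult run_with_early_exit(const std::vector<Layer>& layers,
                                    const Confidence& confidence,
                                    float threshold,
                                    Vec activations) {
    assert(static_cast<bool>(confidence));
    std::size_t run = 0;
    for (const Layer& layer : layers) {
        activations = layer(activations);
        ++run;
        // Extra decision cost, paid at every layer whether or not we exit.
        if (confidence(activations) >= threshold) break;
    }
    return {std::move(activations), run};
}
```

If the threshold is rarely reached, every layer still runs and the confidence calls are pure overhead. The optimization only pays off when enough queries exit early.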

Triple Axis Pruning

Structured pruning methods are often categorized according to the crosswise dimension of the model that they aim to reduce. Weights can be structurally pruned across the three major axes of the models: depth, width, and length.

Depth pruning. The weights are pruned by removing layers to make the model “shallower”. Techniques include layer pruning, inference loop early exit, and “shallow decoder” Transformer architectures. Note that choosing the model meta-parameter of the number of layers via neural architecture search (NAS) is conceptually very similar to static layer pruning. Similarly, dynamic early exit with a decision condition based only on a fixed number of layers (e.g. always exit after 10 layers) is effectively static layer pruning, but with wasted storage space for unused layers of weights. See Chapter 47 for more details on depth pruning, such as early exit inference and layer pruning.

Width pruning. The fanning out of incoming embeddings data across multiple attention heads or internal neural nodes is the “width” of the model. Width pruning is sometimes called “thinning” or “slimming” of the model (see slimmable networks). Width pruning strategies include: attention head pruning, filter pruning, channel pruning. See Chapter 48 for more on width pruning.

Length pruning. The third dimension of the model is actually the model size, which decides the fixed size of vectors (embeddings) that propagate through the width and depth of the model. Note that choosing the meta-parameters of embedding size and context window (e.g. via NAS) is conceptually similar to static length pruning. Length pruning strategies include token pruning and embeddings pruning. Also related is autoregression research. Of the three axes, length pruning has had the least research. Note that “length” is mainly applicable to text transformers. In vision transformers, the third dimension is the image, or patches of the image. See Chapter 49 for more on length pruning.

Dual pruning. Dual pruning is the combination of width pruning and depth pruning. Depth pruning involves pruning or skipping the layers of the model, such as in layer pruning, early exiting or the shallow decoder architecture. Width pruning techniques include attention head pruning, slimmable networks, channel pruning, filter pruning, and other strategies. Some papers also describe combinations of multiple pruning strategies as “hybrid pruning”.

Research papers on dual pruning:

Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard G. Baraniuk, Zhangyang Wang, and Yingyan Lin. 2020, Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference, 2020, JSTSP, https://arxiv.org/abs/1907.04523
X. Xu, M. S. Park, and C. Brick, 2018, Hybrid pruning: Thinner sparse networks for fast inference on edge devices, in ICLR, 2018, https://arxiv.org/abs/1811.00482
Wenhan Xia, Hongxu Yin, Xiaoliang Dai, and Niraj K Jha, 2021, Fully dynamic inference with deep neural networks, 2021, IEEE Transactions on Emerging Topics in Computing, https://arxiv.org/abs/2007.15151
Ali Ehteshami Bejnordi and Ralf Krestel, 2020, Dynamic channel and layer gating in convolutional neural networks, In KI, https://dl.acm.org/doi/10.1007/978-3-030-58285-2_3
Fang Yu, Kun Huang, Meng Wang, Yuan Cheng, Wei Chu, Li Cui, 2022, Width & Depth Pruning for Vision Transformers, Vol. 36 No. 3: AAAI-22 Technical Tracks 3 / AAAI Technical Track on Computer Vision III, DOI: https://doi.org/10.1609/aaai.v36i3.20222, https://ojs.aaai.org/index.php/AAAI/article/view/20222, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/20222/19981
Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. 2020. DynaBERT: Dynamic BERT with Adaptive Width and Depth, arXiv preprint arXiv:2004.04037 (Oct 2020), https://arxiv.org/abs/2004.04037, Code: https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/DynaBERT
H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han. 2020, Once for all: Train one network and specialize it for efficient deployment, In International Conference on Learning Representations, 2020. https://arxiv.org/abs/1908.09791
H. Wang, Z. Wu, Z. Liu, H. Cai, L. Zhu, C. Gan, and S. Han. 2020, HAT: Hardware-Aware Transformers for Efficient Natural Language Processing, In Annual Meeting of the Association for Computational Linguistics, 2020. https://aclanthology.org/2020.acl-main.686/
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. 2014, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550, 2014. https://arxiv.org/abs/1412.6550
Mojtaba Valipour, Mehdi Rezagholizadeh, Hossein Rajabzadeh, Marzieh Tahaei, Boxing Chen, Ali Ghodsi, 2023, SortedNet, a Place for Every Network and Every Network in its Place: Towards a Generalized Solution for Training Many-in-One Neural Networks, https://arxiv.org/abs/2309.00255 (Generalization of multi-dimensional pruning, by training a large neural network with many sub-networks across different width and depth dimensions.)
T Hu, C Meinel, H Yang, 2023, Flexible BERT with Width-and Depth-dynamic Inference, 2023 International Joint Conference on Neural Networks (IJCNN), https://ieeexplore.ieee.org/abstract/document/10191515/ (A 2023 version of BERT that does dual pruning with early exit and width gating.)
Y Liu, Z Lin, F Yuan, 2021, Rosita: Refined bert compression with integrated techniques, The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), https://arxiv.org/abs/2103.11367, Code: https://github.com/llyx97/Rosita (Pruning on multiple dimensions of layer, FFN outputs, and embeddings, also combined with distillation.)
Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference, In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html (Early exit of layers and network selection.)
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017, Mobilenets: Efficient convolutional neural networks for mobile vision applications (2017), https://doi.org/10.48550/ARXIV.1704.04861, https://arxiv.org/abs/1704.04861 (Combines depthwise separable convolutions and thinning at each layer.)
Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. 2016, Learning structured sparsity in deep neural networks, In NIPS 2016. https://github.com/wenwei202/caffe/tree/scnn (Combined filter and layer pruning.)
Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., and Doermann, D. S. 2019, Towards optimal structured CNN pruning via generative adversarial learning, In CVPR, 2019. https://arxiv.org/abs/1903.09291 (Similar to a combined filter and layer pruning algorithm.)
Zehao Huang and Naiyan Wang, 2018, Data-driven sparse structure selection for deep neural networks, in ECCV, 2018. https://arxiv.org/abs/1707.01213, Code: https://github.com/huangzehao/sparse-structure-selection (Not typical width-depth pruning, but a combined pruning that uses sparsification to force weight structures to zero, allowing pruning of whole structures.)
Tianli Zhao, Xi Sheryl Zhang, Wentao Zhu, Jiaxing Wang, Sen Yang, Ji Liu, Jian Cheng, Nov 2021, Joint Channel and Weight Pruning for Model Acceleration on Mobile Devices, https://arxiv.org/abs/2110.08013
H Litao, X Fei, G Xiaoyang, Y Tingting, June 2023, Research on Model Compression Method Based on Deep Separable Convolutional and Pruning, Available at SSRN, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4478190 (Somewhat related to dual pruning; also adds quantization.)
Xiaoying Zhi, Varun Babbar, Pheobe Sun, Fran Silavong, Ruibo Shi, Sean Moran, Sep 2023, A New Baseline for Green AI: Finding the Optimal Sub-Network via Layer and Channel Pruning, https://arxiv.org/abs/2302.10798 (Layer and channel pruning combined.)

See research papers on dual pruning (hybrid pruning) at https://www.aussieai.com/research/dual-pruning

Triple Pruning. Research papers on triple axis pruning:

W. Wang, M. Chen, S. Zhao, L. Chen, J. Hu, H. Liu, D. Cai, X. He, and W. Liu, 2020, Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework, https://arxiv.org/abs/2010.04879
J Guo, J Liu, D Xu, 2021, JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing, IEEE Transactions on Circuits and Systems for Video Technology (Volume 32, Issue 6, June 2022), https://ieeexplore.ieee.org/abstract/document/9516010/
H Kong, X Luo, S Huai, D Liu, 2023, EMNAPE: Efficient Multi-Dimensional Neural Architecture Pruning for Edge AI, 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), https://ieeexplore.ieee.org/document/10137122, https://dr.ntu.edu.sg/bitstream/10356/167488/2/camera-ready-DATE.pdf (Triple-pruning algorithm with comparison to various dual pruning algorithms.)
Z Hou, SY Kung, 2022, Multi-dimensional model compression of vision transformer, Conference on Multimedia and Expo (ICME), https://arxiv.org/pdf/2201.00043 (Pruning of attention heads, neurons, and sequence dimensions jointly.)
Zechun Liu, Xiangyu Zhang, Zhiqiang Shen, Zhe Li, Yichen Wei, Kwang-Ting Cheng, Jian Sun, Sep 2021, Joint Multi-Dimension Pruning via Numerical Gradient Update, https://arxiv.org/abs/2005.08931
Zechun Liu, Xiangyu Zhang, Zhiqiang Shen, Zhe Li, Yichen Wei, Kwang-Ting Cheng, and Jian Sun, 2020, Joint multi-dimension pruning, arXiv preprint arXiv:2005.08931, https://arxiv.org/abs/2005.08931v1
Z Hou, SY Kung, 2022, Multi-Dimensional Vision Transformer Compression via Dependency Guided Gaussian Process Search, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), https://ieeexplore.ieee.org/document/9857488, PDF: https://openaccess.thecvf.com/content/CVPR2022W/EVW/papers/Hou_Multi-Dimensional_Vision_Transformer_Compression_via_Dependency_Guided_Gaussian_Process_Search_CVPRW_2022_paper.pdf

For more research on triple pruning, see https://www.aussieai.com/research/triple-pruning.

Quadruple pruning? Is there a way to do four? For example, can you combine layer, width, token, and model dimension pruning? I haven't yet seen a research paper with this.

Hybrid pruning. There are various hybrid pruning strategies, where pruning is combined with other non-pruning optimizations such as quantization. You can even combine structured and unstructured pruning, by doing structured pruning of a particular part of the model (e.g. a layer), and then unstructured weight pruning (magnitude pruning) of the structures that remain, such that their low-value unimportant weights are also pruned.

Vector-Level Pruning

There is an intermediate type of pruning between low-level unstructured magnitude pruning of random weights and structured pruning of components at a high-level. This is a research-only technique so far, but involves looking at the vector level. The idea is to skip individual vector dot product computations.

It is also possible to prune at the lower level of “sub-vector pruning” where we look at the vector sub-segments that are sent to the GPU in parallel. If our model size is 4096, then we might not be sending all 4096 vector elements to the GPU at once, but they are split into sub-vectors. If we can skip an entire sub-vector computation often enough, that's a win.

Static vector pruning. Obviously, we can prune an entire vector or sub-vector if unstructured magnitude pruning happens to prune a vector to all-zeros. And at high sparsity of 80% or 90%, this will occur sometimes. This type of “static vector pruning” optimization can be detected offline in analysis of the weights, or as an optimized node in an ML compiler.
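A minimal sketch of this idea, assuming the weight matrix is stored as a list of row vectors: an offline pass flags the rows that are entirely zero, and the runtime matrix-vector product skips the dot product for those rows. The function names and the optional counter parameter are hypothetical, added only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Offline analysis: mark each weight row (vector) that is entirely zero.
std::vector<bool> find_zero_rows(const std::vector<Vec>& rows) {
    std::vector<bool> is_zero(rows.size(), true);
    for (std::size_t r = 0; r < rows.size(); ++r) {
        for (float w : rows[r]) {
            if (w != 0.0f) { is_zero[r] = false; break; }
        }
    }
    return is_zero;
}

// Runtime: matrix-vector product that skips rows flagged as all-zero.
// If dot_products_done is non-null, it receives the number of dot
// products actually computed.
Vec matvec_skip_zero_rows(const std::vector<Vec>& rows,
                          const std::vector<bool>& is_zero,
                          const Vec& x,
                          std::size_t* dot_products_done = nullptr) {
    assert(rows.size() == is_zero.size());
    Vec y(rows.size(), 0.0f);
    std::size_t done = 0;
    for (std::size_t r = 0; r < rows.size(); ++r) {
        if (is_zero[r]) continue;  // result is known to be zero
        assert(rows[r].size() == x.size());
        float sum = 0.0f;
        for (std::size_t c = 0; c < x.size(); ++c) sum += rows[r][c] * x[c];
        y[r] = sum;
        ++done;
    }
    if (dot_products_done) *dot_products_done = done;
    return y;
}
```

The same flagging pass could be applied to sub-vectors instead of whole rows by splitting each row into fixed-size segments.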

Dynamic vector pruning. Researchers have gone beyond this simplistic method with dynamic vector pruning. There are various ways to dynamically determine which vectors are not having any effect, and prune them from current or future computations. This optimization involves detecting vectors that result in zero dot products, or near-zero results. Also possible is “negative skipping” where we prune vectors that often result in negative dot product values, if they would then be zeroed by the ReLU activation function. These ideas are promising and there remains much research opportunity here. See Chapter 50 for papers on zero skipping and negative skipping.
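One way to picture the “often result in negative” part is a profiling pass over sample inputs, sketched below. This is an assumption-laden illustration rather than any published algorithm: it ignores bias terms, it assumes ReLU follows the dot product directly, and the function names and the idea of a calibration sample set are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

float dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// For each weight row, the fraction of sample inputs that give a negative
// dot product, which ReLU would then turn into zero.
std::vector<float> negative_fraction(const std::vector<Vec>& rows,
                                     const std::vector<Vec>& samples) {
    assert(!samples.empty());
    std::vector<float> frac(rows.size(), 0.0f);
    for (std::size_t r = 0; r < rows.size(); ++r) {
        std::size_t negatives = 0;
        for (const Vec& x : samples) {
            if (dot(rows[r], x) < 0.0f) ++negatives;
        }
        frac[r] = static_cast<float>(negatives) /
                  static_cast<float>(samples.size());
    }
    return frac;
}
```

Rows with a fraction close to 1.0 are the candidates for negative skipping. Any real method would also need to measure the accuracy impact of skipping them.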

Parameter and Weight Sharing

Parameter sharing, also called “weight sharing”, is the re-use of the same parameters by different structures of the Transformer. Parameters can be shared for attention heads and feed-forward networks. There are fewer weights to store than the original model, because some are shared.

Parameter sharing and pruning are similar techniques, both being forms of model compression, but they are not the same. For example, consider the layers. Each layer of the default Transformer typically has its own set of weights for each structure. When the same set of weights is used across multiple layers, this is called layer fusion, and is conceptually similar to layer pruning. However, note that layer pruning reduces the number of layers that are executed, whereas layerwise parameter sharing does not, although the two ideas can be combined.

Parameter sharing reduces the total number of weights to be stored, thereby reducing model size. The model file is smaller and model loading is faster with reduced overhead. However, the bigger gain is that Transformers are often memory-bound rather than CPU-bound, so reduced cost of data transfers from memory can also reduce latency and improve inference throughput.
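The difference between sharing and pruning can be sketched with layers that hold pointers to their weights. In this hypothetical example, every layer slot points at the same `LayerWeights` object, so all layers still execute but only one set of weights is stored. The types and function names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <set>
#include <vector>

struct LayerWeights {
    std::vector<float> w;
};

// Each layer slot holds a pointer; shared layers point at the same object.
using Model = std::vector<std::shared_ptr<const LayerWeights>>;

// Builds a model in which every layer shares one set of weights.
Model make_shared_model(std::size_t num_layers, std::size_t weights_per_layer) {
    assert(num_layers > 0);
    std::shared_ptr<const LayerWeights> shared = std::make_shared<LayerWeights>(
        LayerWeights{std::vector<float>(weights_per_layer, 0.0f)});
    return Model(num_layers, shared);
}

// Counts weights actually stored, counting each distinct object once.
std::size_t stored_weights(const Model& model) {
    std::set<const LayerWeights*> seen;
    std::size_t total = 0;
    for (const auto& layer : model) {
        if (seen.insert(layer.get()).second) total += layer->w.size();
    }
    return total;
}

// Sharing does not reduce the number of layers executed at inference.
std::size_t layers_executed(const Model& model) { return model.size(); }
```

With 12 layers of 1,000 weights each, this model executes 12 layers but stores only 1,000 weights, whereas pruning 11 of the layers would store the same 1,000 weights and execute just one layer.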

Training time can also be improved by parameter sharing, as there are fewer parameters to train. Obviously, this architecture requires a non-standard extension to the normal Transformer training algorithms.

Parameter sharing has mainly been seen in the literature with “layer fusion” as an alternative to layer pruning. However, there is conceptually no reason why any other type of structured pruning could not be modified to use parameter sharing for that structure. Share weights instead of pruning weights. If the FFN weights are shared, this is similar to FFN pruning. Similarly, sharing attention head weights is akin to head pruning. None of these parameter sharing approaches is faster arithmetically, since they perform an identical number of operations, but they reduce model size and memory usage, which also improves speed on a memory-bound engine. These other types of “structured parameter sharing” seem an interesting under-researched area in need of contributions.

Research papers on parameter sharing:

Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
Qian Lou, Ting Hua, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. 2022. Dictformer: Tiny transformer with shared dictionary, In International Conference on Learning Representations. https://sra.samsung.com/publications/dictformer-tiny-transformer-with-shared-dictionary/ (Effectively shares parameters by using dictionary lookups.)
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories, In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2012.14913 (Explores how FFN's work in depth, with relevance to sharing FFN weights.)
Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation, In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes “shared layers” with shared decoder FFN weights.)
Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Subformer: Exploring weight sharing for parameter efficiency in generative transformers, In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4081–4090, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2101.00234 (Parameter sharing across layers.)
Fahim Dalvi, Hassan Sajjad, Nadir Durrani, and Yonatan Belinkov. 2020. Analyzing redundancy in pretrained transformer models, In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4908–4926, Online. Association for Computational Linguistics. https://arxiv.org/abs/2004.04010 (Detailed analysis finding redundancy in 85% of parameters, with relevance to pruning and sharing.)
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. In International Conference on Learning Representations, https://arxiv.org/abs/1807.03819 (Optimizes Transformers with weight sharing and other ways.)
Sho Takase and Shun Kiyono. 2023. Lessons on parameter sharing across layers in transformers, In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 78–90, Toronto, Canada (Hybrid). Association for Computational Linguistics. https://arxiv.org/abs/2104.06022
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations, In Proceedings of ICLR. https://arxiv.org/abs/1909.11942 (Parameter sharing across layers in the BERT Transformer architecture.)
Raj Dabre and Atsushi Fujita. 2019. Recurrent stacking of layers for compact neural machine translation models, Proceedings of AAAI, 33:6292–6299. https://arxiv.org/abs/1807.05353 (Parameter sharing across layers of a Transformer.)
Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer, In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads.)
Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied transformers: Neural machine translation with shared encoder and decoder, Proceedings of AAAI, 33(01):5466–5473. PDF: https://taoqin.github.io/papers/tiedT.AAAI2019.pdf
Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li, Nov 2021, A Survey on Green Deep Learning, https://arxiv.org/abs/2111.05193 (Contains several sections surveying weight sharing.)
Chu, X.; Zhang, B.; Xu, R., 2021, FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search, In Proceedings of the IEEE International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 12219–12228. http://dx.doi.org/10.1109/ICCV48922.2021.01202, https://arxiv.org/abs/1907.01845 (NAS in the context of weight sharing architectures.)
Aich, S.; Yamazaki, M.; Taniguchi, Y.; Stavness, I., 2020, Multi-Scale Weight Sharing Network for Image Recognition, Pattern Recognit. Lett. 2020, 131, 348–354. http://dx.doi.org/10.1016/j.patrec.2020.01.011, https://arxiv.org/abs/2001.02816
Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
M Mary Shanthi Rani, P Chitra, S Lakshmanan, M Kalpana Devi, R Sangeetha, S Nithya, 2022, DeepCompNet: A novel neural net model compression architecture, Comput Intell Neurosci. 2022 Feb 22;2022:2213273. https://pubmed.ncbi.nlm.nih.gov/35242176/, https://www.hindawi.com/journals/cin/2022/2213273/ (Combines quantization and pruning with weight sharing.)
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for nlp, In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019. https://arxiv.org/abs/1902.00751
X Wang, P Guo, Y Zhang, 2023, Unsupervised Domain Adaptation via Bidirectional Cross-Attention Transformer, ECML PKDD 2023: Machine Learning and Knowledge Discovery in Databases: Research Track pp 309–325, https://arxiv.org/abs/2201.05887 (Attention optimization method that uses weight sharing.)
Noam Shazeer, Nov 2019, Fast Transformer Decoding: One Write-Head is All You Need, https://arxiv.org/abs/1911.02150 (Multi-query attention shares KV tensors across multiple attention heads.)
Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal Transformers, In Proceedings of ICLR. https://openreview.net/forum?id=HyzdRiR9Y7, PDF: https://openreview.net/pdf?id=HyzdRiR9Y7
C Fu, 2023, Machine Learning Algorithm and System Co-design for Hardware Efficiency, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/content/qt52q368p3/qt52q368p3.pdf
S Tan, Y Shen, Z Chen, A Courville, C Gan, Oct 2023, Sparse Universal Transformer, arXiv preprint arXiv:2310.07096, https://arxiv.org/pdf/2310.07096.pdf