Aussie AI

47. Early Exit and Layer Pruning

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“A quitter never wins
and a winner never quits.”

— Napoleon Hill.

 

 

What is Depth Pruning?

Depth pruning is removal of layers in Transformers to adjust the “depth” to which computation proceeds. Transformers have a stack of layers in their encoder and/or decoder, which can be “deep” with many layers, or “shallow” with only a few. Layers can be statically pruned from the model file, or skipped at runtime via early exiting.

The most common type of depth pruning is layer pruning, of which the dynamic form is called early exit inference. However, there are other types of depth pruning in non-Transformer architectures, such as cascades in DNNs/CNNs.

Like all types of pruning, depth pruning can be performed statically or dynamically. The main type of dynamic depth pruning is called “early exit” and is one type of dynamic layer pruning, along with layer skipping. Static depth pruning is a type of model compression, such as static layer pruning or layer fusion, where entire layers of weights are removed from the model file.

Types of Depth Pruning. Various subtypes of depth pruning include:

  • Static layer pruning
  • Early exit (dynamic layer pruning)
  • Layer skipping
  • Layer fusion
  • Layer reordering
  • Cascades (in DNNs/CNNs)
  • Shallow decoder Transformer architecture

There are multiple dimensions along which to prune a model. Depth pruning is orthogonal to pruning along the other model dimensions, such as width pruning and length pruning. As such, depth pruning can be combined with other types of pruning, as in dual pruning and triple pruning (generally called “multi-dimensional pruning”).

Layer Pruning

Layer pruning is a type of structured pruning because it prunes entire layers. More precisely, it is a type of “depth pruning” because it reduces the depth of the stacks of encoders and/or decoders in the Transformer architecture. This technique can sometimes be called “layer compaction”.

Dynamic layer pruning is also called “early exiting” if all remaining layers are skipped, or “layer skipping” if only the current layer is skipped. Reducing layers in the decoder is called a “shallow decoder”, which has been found to be effective, because encoder layers are more important in a Transformer than decoder layers. Layer pruning is also related to “layer fusion” (usually via parameter sharing) and layer reordering.

Layer pruning refers to removing one or more entire layers from the model, which is a subtype of “depth pruning”. Most AI models have multiple hidden layers of nodes, and sometimes a layer can be removed without too great a loss in model accuracy. The layer can be removed statically to create a new model file, or dynamically via some adaptive criterion. Most of the literature focuses on dynamic layer pruning via early exit of the inference algorithm, which stops when it detects that a confidence threshold has been reached.

Layer pruning can be combined with many other methods to create hybrid optimizations. For example, it is orthogonal to quantization, width pruning (e.g. attention head pruning), and length pruning (e.g. token pruning, embeddings pruning).

Static Layer Pruning

Static layer pruning is the removal of entire layers of weights from a model file. This would involve detecting layers that add minimal value during training (or post-training but pre-inference), but it seems to have less chance of success and is relatively under-researched. It is related to the training design choice of how many layers to use in a model, which was once more art than science, but has received research attention more recently as Neural Architecture Search (NAS). Interestingly, some of the “early exit” and “layer skipping” inference techniques effectively change the number of model layers from a static constant to a dynamic choice, and the generalization of that to dynamic layer management may warrant some research.
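
Here is a minimal C++ sketch of what static layer pruning amounts to, assuming you already have a per-layer importance score from some offline analysis (the scoring method is the hard part and is not shown). The LayerWeights type and prune_layers function are hypothetical placeholders, not from any particular engine: layers scoring below the threshold are simply omitted when building the new, shallower model.

    #include <cstddef>
    #include <vector>

    // Placeholder for whatever per-layer tensors the model file stores.
    struct LayerWeights {
        std::vector<float> weights;
    };

    // Keep only the layers whose importance score reaches the threshold.
    std::vector<LayerWeights> prune_layers(const std::vector<LayerWeights>& layers,
                                           const std::vector<float>& importance,
                                           float threshold) {
        std::vector<LayerWeights> kept;
        for (std::size_t i = 0; i < layers.size(); ++i) {
            if (importance[i] >= threshold)
                kept.push_back(layers[i]);   // pruned layers are simply omitted from the new model
        }
        return kept;                         // the new, shallower layer stack
    }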

Pruning beats re-training. If you discover that the last few layers of your model can be pruned completely, why wouldn't you just re-train your model with a smaller number of layers? Well, firstly there's the high cost of training. Furthermore, even if these layers were redundant at the end of training, it doesn't necessarily mean they were unnecessary during training. It's quite possible that weights have propagated through these layers in the early stages of training, becoming unimportant only in the later stages.

Layer pruning research. Research papers on static layer pruning or layer pruning in general:

  1. Sabina Pokhrel, 2022, 4 Popular Model Compression Techniques Explained, January 19, 2022, https://xailient.com/blog/4-popular-model-compression-techniques-explained/
  2. Wei Deng, Junwei Pan, Tian Zhou, Deguang Kong, Aaron Flores, and Guang Lin. 2021. DeepLight: Deep Lightweight Feature Interactions for Accelerating CTR Predictions in Ad Serving, In Proceedings of the 14th ACM international conference on Web search and data mining. 922–930, https://arxiv.org/abs/2002.06987
  3. H Sajjad, F Dalvi, N Durrani, P Nakov, 2020, Poor Man's BERT: Smaller and Faster Transformer Models, arXiv preprint arXiv:2004.03844 https://arxiv.org/abs/2004.03844v1
  4. E Youn, S Prabhu, S Chen, 2023, Compressing Vision Transformers for Low-Resource Visual Learning, arXiv preprint arXiv:2309.02617, PDF: https://arxiv.org/pdf/2309.02617.pdf
  5. Jaeho Lee, Sejun Park, Sangwoo Mo, Sungsoo Ahn, and Jinwoo Shin. 2020, Layer-adaptive sparsity for the magnitude-based pruning, In International Conference on Learning Representations, 2020. https://arxiv.org/abs/2010.07611
  6. Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)

For more research on layer pruning, refer to https://www.aussieai.com/research/layer-pruning.

Early Exit of Inference Layers

Early exit is quitting the main inference loop at one of the layers in the encoder and/or decoder, using the results computed up to that point if there is a high degree of confidence. Implementing this method is surprisingly simple, because a model layer's input and output have the same dimensions. You can simply exit the layer loop early and use the currently computed logits as inputs into the final Softmax layer.
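
Here's a minimal C++ sketch of that loop, under some simplifying assumptions: each “layer” is just a toy square matrix-vector multiply (standing in for a full attention-plus-FFN layer), and confidence is the maximum Softmax probability. The names (apply_layer, infer_with_early_exit) and the 0.9 threshold are illustrative only, not from any particular engine.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Vector = std::vector<float>;
    using Matrix = std::vector<Vector>;   // row-major toy weight matrix

    // Toy "layer": a square matrix-vector multiply, so input and output sizes match.
    // A real Transformer layer (attention + FFN) has the same in/out property.
    Vector apply_layer(const Matrix& w, const Vector& x) {
        Vector y(w.size(), 0.0f);
        for (std::size_t r = 0; r < w.size(); ++r)
            for (std::size_t c = 0; c < x.size(); ++c)
                y[r] += w[r][c] * x[c];
        return y;
    }

    // Softmax over the output vector.
    Vector softmax(const Vector& v) {
        float mx = *std::max_element(v.begin(), v.end());
        Vector p(v.size());
        float sum = 0.0f;
        for (std::size_t i = 0; i < v.size(); ++i) { p[i] = std::exp(v[i] - mx); sum += p[i]; }
        for (float& x : p) x /= sum;
        return p;
    }

    // Run the layer stack, exiting early once confidence (max probability) is high enough.
    Vector infer_with_early_exit(const std::vector<Matrix>& layers,
                                 Vector hidden, float exit_threshold = 0.9f) {
        for (const Matrix& layer : layers) {
            hidden = apply_layer(layer, hidden);   // same-size output lets us stop at any layer
            Vector probs = softmax(hidden);        // cheap confidence check
            if (*std::max_element(probs.begin(), probs.end()) >= exit_threshold)
                return probs;                      // early exit: all remaining layers are skipped
        }
        return softmax(hidden);                    // fell through: every layer was used
    }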

The speedup from early exit is obvious in that it avoids one or more layers entirely. Early exiting is possible in a significant number of inference evaluations (e.g. reportedly 40% in DeeBERT [Xin et al. 2020]), but not always. By skipping all layers thereafter, it avoids a significant amount of computation.

Early exit is terminology that refers specifically to an inference optimization. The general idea of stopping early has also been applied in training, many years prior. The terms “dropout” and “early stopping” have also occasionally been used to mean inference early exit, but usually refer to training method optimizations with a similar goal to reduce training times.

Early exit is effectively dynamic pruning of multiple layers of the model at runtime during inference. The motivation is a speedup by exiting before all of the layers have been completed, using the results up to that point, if there is a high enough degree of confidence. Early exit means avoiding the calculations for all layers after the just-finished one, whereas “layer skipping” can skip one layer to continue on the following layer, and “layer reordering” is a strange generalization where layers can get executed or skipped in any order.

Why does this even work? Technically, it works because the layer inputs and outputs have the same format. But why does it work accurately? Early exit relies on an assumption that each successive layer makes the results more accurate, but with reducing changes. Some research supports this idea, showing that the initial layers do more to prioritize output tokens, and the last few layers tend to “finesse” between multiple reasonable options. After a few layers, the results are often “good enough” to decide on the output token, or at least a reasonably good choice, without finishing all the layers.

Types of Early Exit

Early exit is a form of dynamic layer pruning, since it skips (prunes) some of the model layers.

There are different ways to do early exiting. Early exiting algorithms have to use a decision method, usually called a “classifier”, to choose whether or not to exit at a given layer. Xu and McAuley (2022) categorize three different subtypes of early exits, based on the criteria used to decide when to exit:

  • Confidence estimation
  • Internal ensemble
  • Learning to exit

Confidence estimation uses a metric to predict when confidence is high enough to exit; internal ensemble uses multiple metrics, requiring enough of them to agree; learning to exit has the model attempt to learn when to exit.
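
As a rough illustration of the first two criteria, here is a C++ sketch of two confidence metrics and a simple agreement-style check loosely in the spirit of the internal ensemble idea. The threshold values are arbitrary assumptions, and real systems typically attach a small trained classifier at each candidate exit layer.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Metric 1: maximum probability -- how sure is the top candidate token?
    float max_probability(const std::vector<float>& probs) {
        return *std::max_element(probs.begin(), probs.end());
    }

    // Metric 2: entropy -- how spread out is the distribution? Lower means more confident.
    float entropy(const std::vector<float>& probs) {
        float h = 0.0f;
        for (float p : probs)
            if (p > 0.0f) h -= p * std::log(p);
        return h;
    }

    // Simple agreement-style decision: only exit if both metrics agree that confidence is high.
    bool should_exit(const std::vector<float>& probs,
                     float min_top_prob = 0.9f, float max_entropy = 0.5f) {
        return max_probability(probs) >= min_top_prob && entropy(probs) <= max_entropy;
    }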

A special type of early exit is the “shallow decoder” Transformer architecture, discussed later in this chapter.

Always-exit test: Why not early exit on 100% of inference calculations? For starters, you give up the full model's accuracy on every single query, whereas varying the number of layers helps optimize between easy and hard queries. Furthermore, this strategy is not really early exit! Always exiting with a simplistic decision test, such as always exiting at layer N=5, is effectively the same as static layer pruning of layers N>=6, but without the benefit of reduced model storage space. However, implementing this always-exit test dynamically can still be beneficial when testing the efficacy of the model in terms of its layer count, such as when deciding the number of layers to use. Accuracy of the model for different values of N can be tested dynamically without rebuilding the model file.
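
A sketch of that test harness in C++: cap the layer loop at a fixed N so that accuracy can be measured for several values of N against an evaluation set, without rebuilding the model file. The template parameters are placeholders for whatever state and layer types your engine uses.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Always-exit test: run only the first n layers, for evaluating different layer counts.
    template <typename State, typename LayerFn>
    State run_first_n_layers(const std::vector<LayerFn>& layers, State hidden, std::size_t n) {
        const std::size_t limit = std::min(n, layers.size());
        for (std::size_t i = 0; i < limit; ++i)
            hidden = layers[i](hidden);   // every layer maps State -> State
        return hidden;
    }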

Early exit is also one of multiple strategies for “dynamic inference”. Some papers refer to dynamic inference changes as “adaptive neural networks”, where they change execution depending on the inputs. Some types of early exit, such as hierarchical early exit, are similar to research on cascades for DNNs and CNNs.

Early Exit Research

There are numerous papers on “early exit” of the inference algorithm without processing all the layers, and they show no sign of abating. This overall technique can be categorized as “dynamic layer pruning,” “dynamic depth pruning,” or “dynamic depth models”.


Survey papers on early exit (dynamic layer pruning) include:

  1. Canwen Xu, Julian McAuley, 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  2. Y. Matsubara, M. Levorato, and F. Restuccia, 2022, Split computing and early exiting for deep learning applications: Survey and research challenges, ACM Comput. Surveys, Mar 2022, https://arxiv.org/abs/2103.04505
  3. Stefanos Laskaridis, Alexandros Kouris, Nicholas D. Lane, 2021, Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions, EMDL'21: Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, June 2021, Pages 1–6, https://doi.org/10.1145/3469116.3470012, https://dl.acm.org/doi/abs/10.1145/3469116.3470012

Specific research papers on early exit (dynamic layer pruning):

  1. Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin, 2020, DeeBERT: Dynamic early exiting for accelerating bert inference, arXiv preprint arXiv:2004.12993, 2020, https://arxiv.org/abs/2004.12993 (Code: https://github.com/castorini/DeeBERT)
  2. Angela Fan, Edouard Grave, and Armand Joulin, 2019, Reducing transformer depth on demand with structured dropout, arXiv:1909.11556, https://arxiv.org/abs/1909.11556
  3. Surat Teerapittayanon, Bradley McDanel, and Hsiang Tsung Kung, BranchyNet: Fast inference via early exiting from deep neural networks, 2017, arXiv:1709.01686, https://arxiv.org/abs/1709.01686
  4. S. Teerapittayanon, B. McDanel, H.T. Kung, 2017, Distributed deep neural networks over the cloud, the edge and end devices, IEEE, Atlanta, GA, USA, 2017, pp. 328–339, doi:10.1109/ICDCS.2017.226., 5–8 June, https://doi.org/10.1109/ICDCS.2017.226
  5. Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang, 2021, Accelerating BERT Inference for Sequence Labeling via Early-Exit, May 2021, https://arxiv.org/abs/2105.13878
  6. Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis, 2021, Multi-Exit Vision Transformer for Dynamic Inference, June 2021, https://arxiv.org/abs/2106.15183
  7. Nikolaos Passalis, Jenni Raitoharju, Anastasios Tefas, Moncef Gabbouj, 2020, Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits, Pattern Recognition Volume 105, September 2020, 107346, https://doi.org/10.1016/j.patcog.2020.107346
  8. Xiangjie Li, Chenfei Lou, Yuchi Chen, Zhengping Zhu, Yingtao Shen, Yehan Ma, An Zou, 2022, Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference, DOI: https://doi.org/10.1609/aaai.v37i7.26042, https://ojs.aaai.org/index.php/AAAI/article/view/26042
  10. Vanderlei Bonato and Christos Bouganis, 2021, Class-specific early exit design methodology for convolutional neural networks, Applied Soft Computing (2021), https://www.sciencedirect.com/science/article/abs/pii/S1568494621002398, https://doi.org/10.1016/j.asoc.2021.107316, https://spiral.imperial.ac.uk/bitstream/10044/1/90316/2/Paper___Early_Exit___Applied_Soft_Computing.pdf
  11. E. Baccarelli, S. Scardapane, M. Scarpiniti, A. Momenzadeh, A. Uncini, 2020, Optimized training and scalable implementation of Conditional Deep Neural Networks with early exits for Fog-supported IoT applications, Information Sciences 521 (June 2020), 107–143, DOI: https://doi.org/10.1016/j.ins.2020.02.041, http://www.sciencedirect.com/science/article/pii/
  12. S. Wang, T. Tuor, T. Salonidis, K.K. Leung, C. Makaya, T. He, K. Chan, 2018, When edge meets learning: Adaptive control for resource-constrained distributed machine learning, in: IEEE Conference on Computer Communications (IEEE INFOCOM 2018), 2018, pp. 63–71, doi:10.1109/INFOCOM.2018.8486403, https://doi.org/10.1109/INFOCOM.2018.8486403, Honolulu, HI, USA, 16–19 April, 2018.
  13. Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei, 2020, BERT Loses Patience: Fast and Robust Inference with Early Exit, https://doi.org/10.48550/arXiv.2006.04152, https://arxiv.org/abs/2006.04152
  14. S. Scardapane, M. Scarpiniti, E. Baccarelli, A. Uncini, 2020, Why should we add early exits to neural networks?, Cognitive Computation 12 (5) (2020), 954–966, doi:10.1007/s12559-020-09734-4, http://dx.doi.org/10.1007/s12559-020-09734-4
  15. Kaiyuan Liao, Yi Zhang, Xuancheng Ren, Qi Su, Xu Sun, Bin He, 2021, A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2021, DOI: 10.18653/v1/2021.naacl-main.162 https://aclanthology.org/2021.naacl-main.162/
  16. Zizhao Wang, Wei Bao, Dong Yuan, Liming Ge, Nguyen H. Tran, Albert Y. Zomaya, 2019, SEE: Scheduling Early Exit for Mobile DNN Inference during Service Outage, MSWIM '19: Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, November 2019, Pages 279–288, https://doi.org/10.1145/3345768.3355917, https://dl.acm.org/doi/abs/10.1145/3345768.3355917
  17. Xinrui Tan, Hongjia Li, Liming Wang, Xueqing Huang, Zhen Xu, 2021, Empowering Adaptive Early-Exit Inference with Latency Awareness, DOI: https://doi.org/10.1609/aaai.v35i11.17181, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/17181/16988
  18. Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, and Donald Metzler. 2022, Confident adaptive language modeling, arXiv preprint arXiv:2207.07061, 2022, https://arxiv.org/abs/2207.07061
  19. Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. 2018. Multi-scale dense networks for resource efficient image classification, In International Conference on Learning Representations (ICLR), https://arxiv.org/abs/1703.09844
  20. Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-deep networks: Understanding and mitigating network overthinking, In International Conference on Machine Learning (ICML), volume 97, pages 3301–3310. PMLR, https://arxiv.org/abs/1810.07052
  21. Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020. The right tool for the job: Matching model and instance complexities, In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651, Association for Computational Linguistics, https://arxiv.org/abs/2004.07453
  22. Ji Xin, Rodrigo Nogueira, Yaoliang Yu, and Jimmy Lin. 2020. Early exiting BERT for efficient document ranking, In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 83–88, Online. Association for Computational Linguistics. PDF: https://cs.uwaterloo.ca/~jimmylin/publications/Xin_etal_SustaiNLP2020.pdf
  23. Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. 2021. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression, In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, Association for Computational Linguistics, https://aclanthology.org/2021.eacl-main.8/, Code: https://github.com/castorini/berxit
  24. V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh, 2018, SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks, In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 662–673, https://doi.org/10.1109/ISCA.2018.00061, https://ieeexplore.ieee.org/document/8416863
  25. D Li, W Wu, L Zeng, K Li, 2023, Es-Fedavg: Early-Exit Personalized Federated Learning with Sparse Weight for Adaptive Computation, January 2023, SSRN Electronic Journal, DOI:10.2139/ssrn.4361705, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4361705, https://www.researchgate.net/publication/368592513_Es-Fedavg_Early-Exit_Personalized_Federated_Learning_with_Sparse_Weight_for_Adaptive_Computation
  26. Ting-Kuei Hu, Tianlong Chen, Haotao Wang, and Zhangyang Wang. 2020, Triple wins: Boosting accuracy, robustness and efficiency together by enabling input-adaptive inference, In ICLR, Feb 2020, https://arxiv.org/abs/2002.10025
  27. Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. 2018, Skipnet: Learning dynamic routing in convolutional networks, In ECCV, 2018, https://arxiv.org/abs/1711.09485
  28. Weiyu Ju; Wei Bao; Dong Yuan; Liming Ge; Bing Bing Zhou, 2021, Learning Early Exit for Deep Neural Network Inference on Mobile Devices through Multi-Armed Bandits, 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 10-13 May 2021, https://ieeexplore.ieee.org/abstract/document/9499356, https://doi.org/10.1109/CCGrid51090.2021.00011
  29. Weiyu Ju, Wei Bao, Liming Ge, Dong Yuan, 2021, Dynamic Early Exit Scheduling for Deep Neural Network Inference through Contextual Bandits, CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, October 2021, Pages 823–832, https://doi.org/10.1145/3459637.3482335, https://dl.acm.org/doi/abs/10.1145/3459637.3482335
  30. Andong Li; Chengshi Zheng; Lu Zhang; Xiaodong Li, 2021, Learning to Inference with Early Exit in the Progressive Speech Enhancement, 2021 29th European Signal Processing Conference (EUSIPCO), 23-27 August 2021, https://ieeexplore.ieee.org/abstract/document/9616248, https://doi.org/10.23919/EUSIPCO54536.2021.9616248, https://arxiv.org/abs/2106.11730
  31. Polina Karpikova, Ekaterina Radionova, Anastasia Yaschenko, Andrei Spiridonov, Leonid Kostyushko, Riccardo Fabbricatore, Aleksei Ivakhnenko; 2023, FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), April 2023, pp. 12032-12043 https://openaccess.thecvf.com/content/CVPR2023/html/Karpikova_FIANCEE_Faster_Inference_of_Adversarial_Networks_via_Conditional_Early_Exits_CVPR_2023_paper.html, https://arxiv.org/abs/2304.10306
  32. Rongkang Dong, Yuyi Mao, and Jun Zhang. 2022, Resource-Constrained Edge AI with Early Exit Prediction, Journal of Communications and Information Networks, 7(2):122–134, June 2022, https://arxiv.org/abs/2206.07269
  33. Qunliang Xing, Mai Xu, Tianyi Li, and Zhenyu Guan. 2020, Early exit or not: Resource-efficient blind quality enhancement for compressed images, In Computer Vision – ECCV 2020, pages 275–292. Springer International Publishing. 2020, https://arxiv.org/abs/2006.16581
  34. M. Wołczyk et al., 2021, Zero time waste: Recycling predictions in early exit neural networks, in Proc. 35th Conf. Neural Inf. Process. Syst. (NeurIPS), Virtual Conference, Dec. 2021, https://arxiv.org/abs/2106.05409
  35. M. Phuong and C. H. Lampert, 2019, Distillation-based training for multi-exit architectures, in Proc. IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, Korea (South), Oct.-Nov. 2019, https://ieeexplore.ieee.org/document/9009834
  36. S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, 2020, SPINN: Synergistic progressive inference of neural networks over device and cloud, in Proc. Annu. Inf. Conf. Mobile Comput. Netw. (MobiCom), London, UK, Sep. 2020, https://arxiv.org/abs/2008.06402
  37. M. Wang, J. Mo, J. Lin, Z. Wang, and L. Du, 2019, DynExit: A dynamic early-exit strategy for deep residual networks, in Proc. IEEE Int. Wkshop. Signal Process. Syst. (SiPS), Nanjing, China, Oct. 2019, https://ieeexplore.ieee.org/abstract/document/9020551
  39. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015, Going deeper with convolutions, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, https://arxiv.org/abs/1409.4842
  40. Maciej Wołczyk, Bartosz Wojcik, Klaudia Bałazy, Igor T Podolak, Jacek Tabor, Marek Smieja, and Tomasz Trzcinski. 2021, Zero time waste: Recycling predictions in early exit neural networks, In Advances in Neural Information Processing Systems, volume 34, pages 2516–2528. Curran Associates, Inc. 2021, https://arxiv.org/abs/2106.05409
  41. Enrique S. Marquez, Jonathon S. Hare, and Mahesan Niranjan. 2018, Deep cascade learning, IEEE Transactions on Neural Networks and Learning Systems, 29(11):5475–5485. 2018, https://ieeexplore.ieee.org/document/8307262
  42. Simone Scardapane, Danilo Comminiello, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. 2020, Differentiable branching in deep networks for fast inference, In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4167–4171. 2020, https://ieeexplore.ieee.org/document/9054209
  43. Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, and Bart Dhoedt, Feb 2017, The cascading neural network: building the internet of smart things, Knowledge and Information Systems, 52(3):791–814, https://link.springer.com/article/10.1007/s10115-017-1029-1
  44. Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, and Joseph E Gonzalez. 2017, Idk cascades: Fast deep learning by learning not to overthink, arXiv preprint arXiv:1706.00885. 2017, https://arxiv.org/abs/1706.00885
  45. Simone Scardapane, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. 2020, Why should we add early exits to neural networks?, Cognitive Computation, 12(5):954–966. 2020, https://arxiv.org/abs/2004.12814
  46. Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. 2017, Adaptive neural networks for efficient inference, In International Conference on Machine Learning, pages 527–536. PMLR. 2017, https://arxiv.org/abs/1702.07811
  47. Xin Dai, Xiangnan Kong, and Tian Guo. 2020, Epnet: Learning to exit with flexible multi-branch network, In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM ’20, page 235–244, New York, NY, USA. Association for Computing Machinery. 2020, https://dl.acm.org/doi/10.1145/3340531.3411973
  48. Xinshi Chen, Hanjun Dai, Yu Li, Xin Gao, and Le Song. 2020, Learning to stop while learning to predict, In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1520–1530. PMLR. 2020, https://arxiv.org/abs/2006.05082
  49. P. Panda, A. Sengupta, K. Roy, 2016, Conditional deep learning for energy-efficient and enhanced pattern recognition, in: 2016 Design, Automation Test in Europe Conference Exhibition (DATE), 2016, pp. 475–480, https://arxiv.org/abs/1509.08971
  50. Francesco Busolin, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Salvatore Trani, May 2021, Learning Early Exit Strategies for Additive Ranking Ensembles, https://arxiv.org/abs/2105.02568
  51. B. Barla Cambazoglu, Hugo Zaragoza, Olivier Chapelle, Jiang Chen, Ciya Liao, Zhaohui Zheng, and Jon Degenhardt. 2010. Early exit optimizations for additive machine learned ranking systems, In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pages 411–420, New York, New York, https://dl.acm.org/doi/10.1145/1718487.1718538
  52. Eunhyeok Park, Dongyoung Kim, Soobeom Kim, Yong-Deok Kim, Gunhee Kim, Sungroh Yoon, Sungjoo Yoo, 2015, Big/little deep neural network for ultra low power inference, 2015. In CODES '15: Proceedings of the 10th International Conference on Hardware/Software Codesign and System Synthesis, October 2015, Pages 124–132, https://dl.acm.org/doi/10.5555/2830840.2830854
  53. Geng, S.; Gao, P.; Fu, Z.; and Zhang, Y., 2021, RomeBERT: Robust Training of Multi-Exit BERT, arXiv preprint arXiv:2101.09755, https://arxiv.org/abs/2101.09755
  54. Zhou, W.; Xu, C.; and McAuley, J. J., 2022. BERT Learns to Teach: Knowledge Distillation with Meta Learning, In ACL. https://arxiv.org/abs/2106.04570
  55. Tianxiang Sun, Yunhua Zhou, Xiangyang Liu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu, 2021. Early Exiting with Ensemble Internal Classifiers, arXiv preprint arXiv:2105.13792, https://arxiv.org/abs/2105.13792
  56. Zhu, W. 2021. LeeBERT: Learned Early Exit for BERT with cross-level optimization, In ACL-IJCNLP, PDF: https://aclanthology.org/2021.acl-long.231.pdf
  57. Zhang, Z.; Zhu, W.; Zhang, J.; et al. 2022. PCEE-BERT: Accelerating BERT Inference via Patient and Confident Early Exiting, In NAACL-HLT (Findings), https://aclanthology.org/2022.findings-naacl.25/
  58. Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer, ArXiv, abs/1910.10073, https://arxiv.org/abs/1910.10073
  59. Tal Schuster, Adam Fisch, Tommi Jaakkola, Regina Barzilay, 2021. Consistent accelerated inference via confident adaptive transformers, arXiv preprint arXiv:2104.08803. https://arxiv.org/abs/2104.08803
  60. Guan, Y.; Li, Z.; Leng, J.; et al. 2022. Transkimmer: Transformer Learns to Layer-wise Skim, In ACL, https://arxiv.org/abs/2205.07324
  61. H. Tann, S. Hashemi, R. I. Bahar, and S. Reda. Runtime configurable deep neural networks for energy-accuracy trade-off, In CODES + ISSS, pages 34:1–34:10, 2016. https://ieeexplore.ieee.org/document/9166549
  62. G Li, X Ma, Q Yu, L Liu, H Liu, X Wang, 2023, CoAxNN: Optimizing on-device deep learning with conditional approximate neural networks, Journal of Systems Architecture, https://www.sciencedirect.com/science/article/abs/pii/S1383762123001571
  63. X Gao, Y Liu, T Huang, Z Hou, 2023, PF-BERxiT: Early Exiting for BERT with Parameter-efficient Fine-tuning and Flexible early exiting strategy, Neurocomputing, https://www.sciencedirect.com/science/article/abs/pii/S0925231223008135
  64. Z Zeng, Y Hong, H Dai, H Zhuang, C Chen, August 2023, ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference, PDF: https://www.researchgate.net/publication/373392419_ConsistentEE_A_Consistent_and_Hardness-Guided_Early_Exiting_Method_for_Accelerating_Language_Models_Inference
  65. Duggal, R., Freitas, S., Dhamnani, S., Chau, D.H., Sun, J., 2020, ELF: an early-exiting framework for long-tailed classification, Arxiv Preprint Arxiv:2006.11979 (2020) https://arxiv.org/abs/2006.11979
  66. H Yu, D Liu, Z Zhang, J Wang, 2023, A Dynamic Transformer Network with Early Exit Mechanism for Fast Detection of Multiscale Surface Defects, IEEE Transactions on Instrumentation and Measurement (Early Access), https://ieeexplore.ieee.org/document/10242087
  67. A Zniber, O Karrakchou, M Ghogho, 2023, Dynamic Early Exiting Predictive Coding Neural Networks, arXiv preprint arXiv:2309.02022, https://arxiv.org/pdf/2309.02022.pdf
  68. Y. Long, I. Chakraborty, and K. Roy, 2020, Conditionally deep hybrid neural networks across edge and cloud, arXiv:2005.10851, https://arxiv.org/abs/2005.10851
  69. Berestizshevsky, K., Even, G., 2019, Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence, In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26
  70. Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q., 2018, Multi-scale dense networks for resource efficient image classification, In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844, https://arxiv.org/abs/1703.09844 (Has multiple models combined in an early-exit configuration.)
  71. A Moos, 2023, Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs, arXiv preprint arXiv:2309.03530, https://arxiv.org/pdf/2309.03530.pdf (Fast inference for a soccer-playing robot with cascade-like hierarchical early exits.)
  72. Francesco Daghero, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini, Enrico Macii, Massimo Poncino, 2020, Energy-Efficient Adaptive Machine Learning on IoT End-Nodes With Class-Dependent Confidence, 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp.1-4, 2020. https://ieeexplore.ieee.org/document/9294863, https://arxiv.org/abs/2204.03431v1 (An improved stopping policy for early exits on easy-input classification tasks.)
  73. Kyungchul Park, Chanyoung Oh, Youngmin Yi, 2020, BPNet: Branch-pruned Conditional Neural Network for Systematic Time-accuracy Tradeoff, 2020 57th ACM/IEEE Design Automation Conference (DAC), pp.1-6, 2020. https://ieeexplore.ieee.org/document/9218545
  74. T Shen, C Lee, V Narayanan, Oct 2023, Multi-Exit Vision Transformer with Custom Fine-Tuning for Fine-Grained Image Recognition, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222298 (Early exit from multiple places, combined with self-distillation.)
  75. Sehoon Kim, Karttikeya Mangalam, Suhong Moon, John Canny, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer, Sep 2023, Speculative Decoding with Big Little Decoder, https://arxiv.org/abs/2302.07863 (Early exiting in the context of speculative decoder optimizations.)
  76. Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., and Smith, N. A., 2020, The right tool for the job: Matching model and instance complexities, In Annual Meeting of the Association for Computational Linguistics, 2020. https://arxiv.org/abs/2004.07453 (Early exit with “wisdom of committees” decisions.)
  77. X Li, Y Shen, A Zou, Y Ma, 2023, EENet: Energy Efficient Neural Networks with Run-time Power Management, 2023 60th ACM/IEEE Design Automation Conference (DAC), https://ieeexplore.ieee.org/abstract/document/10247701 (Learns early exit characteristics and decision methods over time.)
  78. K Liu, S Moon, 2023, Self-supervised efficient sample weighting for multi-exit networks, Knowledge-Based Systems, https://www.sciencedirect.com/science/article/abs/pii/S0950705123007530 (Early exiting during both training and inference to reduce the disparity.)
  79. Divya J. Bajpai, Vivek K. Trivedi, Sohan L. Yadav, Manjesh K. Hanawal, 2023, SplitEE: Early Exit in Deep Neural Networks with Split Computing, arXiv preprint arXiv:2309.09195, https://arxiv.org/abs/2309.09195
  80. George August Wright, Umberto Cappellazzo, Salah Zaiem, Desh Raj, Lucas Ondel Yang, Daniele Falavigna, Alessio Brutti, Sep 2023, Training dynamic models using early exits for automatic speech recognition on resource-constrained devices, https://arxiv.org/abs/2309.09546
  81. X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks, IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
  82. J Wang, B Li, GL Zhang, 2023, Early Classification for Dynamic Inference of Neural Networks, arXiv preprint arXiv:2309.13443, https://arxiv.org/pdf/2309.13443.pdf
  83. S Tang, Y Wang, C Ding, Y Liang, Y Li, D Xu, 2023, DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation, arXiv preprint arXiv:2309.17074, https://arxiv.org/pdf/2309.17074.pdf (Uses uncertainty-based confidence to decide on early-exit in diffusion models.)
  84. S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a “shallow-deep module” and parallel decoding.)
  85. F Regol, J Chataoui, M Coates, Oct 2023, Jointly-Learned Exit and Inference for a Dynamic Neural Network: JEI-DNN, arXiv preprint arXiv:2310.09163, http://export.arxiv.org/abs/2310.09163
  86. Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a small subset of the input to speed up inference with early-exit based on confidence level.)
  87. Hajin Shim, Sung Ju Hwang, and Eunho Yang. 2018, Joint active feature acquisition and classification with variable-size set encoding, NeurIPS, pages 1368–1378, 2018. https://papers.nips.cc/paper/2018/file/e5841df2166dd424a57127423d276bbe-Paper.pdf

For more research on early exit, refer to https://www.aussieai.com/research/early-exit.

Layer Skipping

Layer skipping refers to bypassing the processing of a single layer and moving on to the next, rather than “early exiting” to skip all the remaining layers. This is a form of dynamic depth pruning, because it reduces the number of layers that the model will execute, using some criteria.

Although much of the existing research is about early exit to skip all further layers, there is some research on choosing to skip a single layer. Note that layer skipping is a dynamic inference optimization, because static layer skipping is effectively the same as static layer pruning.
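
Here's a C++ sketch of the difference: a per-layer gate decides whether to run or bypass each individual layer, instead of exiting the whole loop. The gate function here is a placeholder; in the research below it is typically learned or computed from the input.

    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vector  = std::vector<float>;
    using LayerFn = std::function<Vector(const Vector&)>;
    using GateFn  = std::function<bool(std::size_t, const Vector&)>;  // (layer index, hidden state)

    // Layer skipping: bypass individual layers, but keep going through the stack.
    Vector infer_with_layer_skipping(const std::vector<LayerFn>& layers,
                                     Vector hidden, const GateFn& should_run) {
        for (std::size_t i = 0; i < layers.size(); ++i) {
            if (should_run(i, hidden))
                hidden = layers[i](hidden);   // run this layer as normal
            // else: skip just this one layer and continue with the next,
            // which works because layer input and output have the same dimensions.
        }
        return hidden;
    }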

Research papers on layer skipping (selective dynamic layer pruning):

  1. Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez, 2018, Skipnet: Learning dynamic routing in convolutional networks, In ECCV, 2018, https://arxiv.org/abs/1711.09485
  3. Alex Graves. 2016, Adaptive computation time for recurrent neural networks, arXiv preprint arXiv:1603.08983, 2016, https://arxiv.org/abs/1603.08983
  4. Jianghao Shen, Yue Wang, Pengfei Xu, Yonggan Fu, Zhangyang Wang, Yingyan Lin, 2020, Fractional Skipping: Towards Finer-Grained Dynamic CNN Inference, January 2020, DOI: https://doi.org/10.1609/aaai.v34i04.6025, https://arxiv.org/abs/2001.00705
  5. YG Jiang, C Cheng, H Lin, Y Fu, 2020, Learning layer-skippable inference network, IEEE Transactions on Image Processing, Volume 29, pp. 8747-8759, 28 August 2020, https://ieeexplore.ieee.org/abstract/document/9180094
  6. H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, 2015, A convolutional neural network cascade for face detection, 2015, in CVPR, https://paperswithcode.com/paper/a-convolutional-neural-network-cascade-for
  7. F. Yang, W. Choi, and Y. Lin, 2016, Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers, 2016, in CVPR, https://ieeexplore.ieee.org/document/7780603
  8. Andreas Veit and Serge Belongie, 2018, Convolutional networks with adaptive inference graphs, In ECCV, 2018, https://arxiv.org/abs/1711.11503
  9. X. Dong, J. Huang, Y. Yang, and S. Yan, 2017, More is less: A more complicated network with less inference complexity, in CVPR, 2017. https://arxiv.org/abs/1703.08651
  10. Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov, 2020, On the Effect of Dropping Layers of Pre-trained Transformer Models, arXiv preprint arXiv:2004.03844, 2020 (revised Aug 2022), https://arxiv.org/abs/2004.03844 (Examined dropping alternative layers, layer fusion, and other layer pruning strategies.)

For more research on layer skipping, refer to https://www.aussieai.com/research/layer-pruning#skipping.

Layer Fusion

Layer fusion is a type of weight sharing (parameter sharing), where two layers are made identical by having them use the same weights. Training a multi-layer model creates a different set of weights for each layer. However, some layers can end up being very similar, with weight matrices that are close enough. Hence, one layer can simply have its own weight matrix thrown away and use the other layer's weights instead. This is conceptually the same as running the same layer twice.
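
A C++ sketch of the data-structure view of this, assuming a hypothetical LayerWeights type and fuse_layers function: fused layer slots simply point at one shared weight set, so those weights are stored (and loaded) only once.

    #include <cstddef>
    #include <memory>
    #include <vector>

    // Hypothetical per-layer weight tensors (stand-in for attention and FFN matrices).
    struct LayerWeights {
        std::vector<float> attention_weights;
        std::vector<float> ffn_weights;
    };

    struct Model {
        // Each entry is a layer "slot"; fused layers share the same LayerWeights object.
        std::vector<std::shared_ptr<const LayerWeights>> layers;
    };

    // Fuse layer j into layer i: discard layer j's own weights and reuse layer i's.
    void fuse_layers(Model& m, std::size_t i, std::size_t j) {
        m.layers[j] = m.layers[i];   // both slots now run with identical weights
    }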

Parameter sharing reduces the data size of the model, improving storage utilization and in-memory size, but not necessarily reducing the number of arithmetic operations. However, Transformers can be memory-bound, so having fewer parameters transferred from memory can also reduce latency in some architectures.

Layer fusion and kernel fusion are two different optimizations. Whereas layer fusion merges two layers at the high-level by sharing weights, kernel fusion merges two lower-level operations into one. For example, two Transformer components can be merged into a single kernel, such as merging MatMuls with LayerNorm (i.e. “fused LayerNorm”). Kernel fusion involves combining their programmatic algorithms, and thereby improves computation speed, but doesn't change model storage size, whereas layer fusion does compress the model.

Research papers on layer fusion:

  1. Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  2. Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation, In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes “shared layers” with shared decoder FFN weights.)
  3. Machel Reid, Edison Marrese-Taylor, and Yutaka Matsuo. 2021. Subformer: Exploring weight sharing for parameter efficiency in generative transformers, In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4081–4090, Punta Cana, Dominican Republic. Association for Computational Linguistics. https://arxiv.org/abs/2101.00234 (Parameter sharing across layers.)
  4. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers, In International Conference on Learning Representations. https://arxiv.org/abs/1807.03819 (Optimizes Transformers with weight sharing and other ways.)
  5. Sho Takase and Shun Kiyono. 2023. Lessons on parameter sharing across layers in transformers, In Proceedings of The Fourth Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 78–90, Toronto, Canada (Hybrid). Association for Computational Linguistics. https://arxiv.org/abs/2104.06022
  6. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A lite bert for self-supervised learning of language representations, In Proceedings of ICLR. https://arxiv.org/abs/1909.11942 (Parameter sharing across layers in the BERT Transformer architecture.)
  7. Raj Dabre and Atsushi Fujita. 2019. Recurrent stacking of layers for compact neural machine translation models, Proceedings of AAAI, 33:6292–6299. https://arxiv.org/abs/1807.05353 (Parameter sharing across layers of a Transformer.)
  8. Tong Xiao, Yinqiao Li, Jingbo Zhu, Zhengtao Yu, and Tongran Liu. 2019. Sharing attention weights for fast transformer, In Proceedings of IJCAI, pages 5292–5298, https://arxiv.org/abs/1906.11024 (Parameter sharing of attention heads.)
  9. Yingce Xia, Tianyu He, Xu Tan, Fei Tian, Di He, and Tao Qin. 2019. Tied transformers: Neural machine translation with shared encoder and decoder, Proceedings of AAAI, 33(01):5466–5473. PDF: https://taoqin.github.io/papers/tiedT.AAAI2019.pdf
  10. Osorio, J.; Armejach, A.; Petit, E.; Henry, G.; Casas, M., 2022, A BF16 FMA is All You Need for DNN Training, IEEE Trans. Emerg. Top. Comput. 2022, 10, 1302–1314. http://dx.doi.org/10.1109/TETC.2022.3187770, https://ieeexplore.ieee.org/document/9823406 (Special fused operators to allow full training using BF16 number representations.)
  11. Okan Köpüklü, Maryam Babaee, Stefan Hörmann, Gerhard Rigoll, Feb 2019, Convolutional neural networks with layer reuse, 2019 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/8802998/, https://arxiv.org/pdf/1901.09615 (The method of repeatedly re-using the same entire layers.)
  12. M. Alwani, H. Chen, M. Ferdman and P. A. Milder, 2016, Fused-layer CNN accelerators, 49th Annual IEEE/ACM International Symposium on Microarchitecture MICRO 2016, pp. 22:1-22:12, October 15-19, 2016, https://doi.org/10.1109/MICRO.2016.7783725, https://ieeexplore.ieee.org/document/7783725
  13. E. Georganas, S. Avancha, K. Banerjee, D. Kalamkar, G. Henry, H. Pabst, and A. Heinecke, 2018, Anatomy of high-performance deep learning convolutions on simd architectures, in Accepted to Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’18. IEEE Press, 2018, https://arxiv.org/abs/1808.05567 (Investigates layer fusion and sublayer fusion, i.e. kernel fusion.)
  14. L Waeijen, S Sioutas, M Peemen, M Lindwer, 2021, ConvFusion: A model for layer fusion in convolutional neural networks, IEEE Access (Volume: 9), https://ieeexplore.ieee.org/abstract/document/9646923/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09646923.pdf (Analysis of loop tiling, loop reordering, data flow, recomputation, and layer fusion.)

For more research on layer fusion, refer to https://www.aussieai.com/research/layer-fusion.

Layer Reordering

An interesting technique that generalizes the use of layers is “layer reordering.” The idea is motivated by the realization that Transformer layers are building blocks which output the same format as their input. Hence, not only can you remove a layer (early exit or layer pruning), skip a layer (layer skipping), or run the same layer twice (layer fusion), but you can generalize the idea in any way. You can pick and choose which layers to run, in what order, and how often. You could even run every layer twice, or run all the layers in reverse, or whatever.
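
A tiny C++ sketch of the generalization: since every layer maps a hidden state to a hidden state of the same shape, the execution schedule can be any list of layer indices, including repeats, omissions, or a reversed order. Whether any particular order actually works well is exactly what the research below investigates.

    #include <cstddef>
    #include <functional>
    #include <vector>

    using Vector  = std::vector<float>;
    using LayerFn = std::function<Vector(const Vector&)>;

    // Run the layers in an arbitrary caller-specified order.
    Vector run_in_order(const std::vector<LayerFn>& layers,
                        const std::vector<std::size_t>& order, Vector hidden) {
        for (std::size_t idx : order)
            hidden = layers.at(idx)(hidden);   // e.g. order = {0, 1, 1, 2} runs layer 1 twice
        return hidden;
    }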

Layer reordering usually refers to entire Transformer layers. For other types of merging or reordering of separate sub-layer structures within Transformer layers, see kernel operator fusion. For discussion of the order of layer normalization subcomponents, see normalization reordering.

Layer reordering seems like it shouldn't work. After all, didn't we expend all those GPU cycles to carefully work out the correct weights for each layer? Isn't it true that the first layers do the broad analysis and the upper layers do the finessing? So, early exiting makes some kind of sense, because it just skips the finer details at the end, but randomly reordering things seems weird. Nevertheless, there are some research papers that explore layer reordering and its generalizations.

Research papers on layer reordering:

  1. Ofir Press, Noah A. Smith, Omer Levy, 2019, Improving Transformer Models by Reordering their Sublayers, arXiv preprint arXiv:1911.03864, 2019, https://arxiv.org/abs/1911.03864 (Layer reordering! Includes analysis of multiple layers, and also reordering self-attention and feed-forward sub-components in a “sandwich” architecture.)
  2. Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu, Mar 2021, IOT: Instance-wise Layer Reordering for Transformer Structures, https://arxiv.org/abs/2103.03457
  3. Elicia Ye, March 2023, Greedy Ordering of Layer Weight Matrices in Transformers Improves Translation, https://arxiv.org/abs/2302.02123

For more research on layer reordering, refer to https://www.aussieai.com/research/layer-pruning#reordering.

Shallow Decoder Transformer Architecture

Various research has discovered that it's fine for an AI engine to be shallow, but mostly in its decoder. For the encoder, it is more important that it runs all of its layers. What this might suggest is that it's hard to read, and easy to write, if you're an AI engine.

The discovery of the relative importance of layers has been related to research into “layer pruning” and “early exit” architectures. Finding that a Transformer's encoder layers are far more important than layers in the decoder suggested the concept of a “deep encoder, shallow decoder” architecture, where the encoder retains many layers, but the decoder has fewer, or even only a single layer. The “shallow decoder” terminology seems to have been introduced by Kasai et al. (2020), but is based on earlier research examining layer dependency in Transformers.

The shallow decoder architecture is a Transformer-specific type of layer pruning, which can be implemented as either static layer pruning (removing some layers permanently from the model) or dynamic layer pruning (skipping layers adaptively during inference execution).
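
Here's a simplified C++ sketch of the deep-encoder/shallow-decoder shape of the architecture, with placeholder layer types and the decoder's cross-attention to the encoder output omitted: the encoder keeps its full stack, while the decoder stack is cut down to only a few layers.

    #include <functional>
    #include <vector>

    using Vector  = std::vector<float>;
    using LayerFn = std::function<Vector(const Vector&)>;

    // Deep encoder, shallow decoder: the imbalance is just the relative stack sizes.
    struct DeepShallowTransformer {
        std::vector<LayerFn> encoder_layers;   // e.g. 12 layers ("deep")
        std::vector<LayerFn> decoder_layers;   // e.g. 1 or 2 layers ("shallow")

        Vector encode(Vector x) const {
            for (const LayerFn& layer : encoder_layers) x = layer(x);
            return x;
        }
        Vector decode(Vector x) const {        // cross-attention to the encoder output omitted
            for (const LayerFn& layer : decoder_layers) x = layer(x);
            return x;
        }
    };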

An interesting question about early exiting relates to decoder-only architectures. Although the early 2017 Transformers were encoder-decoder, many modern Transformers such as the GPT family are decoder-only. Can the deep encoder, shallow decoder architecture be emulated in decoder-only architectures by doing dynamic early exit to different levels in the prefill phase versus the later decoding phases? I'm not sure if I've seen a research paper on that.
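
Purely as a speculative sketch of that idea, and not an established technique, a decoder-only engine could use a different layer budget (or early-exit threshold) for the prefill phase than for the token-by-token decoding phase; the numbers here are arbitrary.

    #include <cstddef>

    enum class Phase { Prefill, Decode };

    // Hypothetical per-phase layer budget: a "deep" pass over the prompt,
    // "shallower" passes while emitting output tokens.
    std::size_t layer_budget(Phase phase, std::size_t total_layers) {
        if (phase == Phase::Prefill)
            return total_layers;        // run the full stack when processing the prompt
        return (total_layers + 1) / 2;  // arbitrary: roughly half the layers per output token
    }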

Note that this shallow decoder research is also related to the papers that show that pruning attention heads in the decoder still leads to a useable transformer (see “attention head pruning” in Chapter 48). Some papers have even suggested that removing the Feed Forward Network (FFN) from the decoder was possible (see “FFN pruning” in Chapter 34). Again, there's a question here as to whether pruning these components can generalize to decoder-only architectures.

Research papers on shallow-decoder architectures:

  1. Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, and Noah A. Smith. 2020. Deep encoder, shallow decoder: Reevaluating the speed-quality tradeoff in machine translation, CoRR, abs/2006.10369. https://arxiv.org/abs/2006.10369 Code: https://github.com/jungokasai/deep-shallow
  2. Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu, 2021, Bag of Tricks for Optimizing Transformer Efficiency, Findings of the Association for Computational Linguistics: EMNLP 2021, November 2021, https://aclanthology.org/2021.findings-emnlp.357/
  3. Wenxuan Wang and Zhaopeng Tu. 2020. Rethinking the value of transformer components, In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, pages 6019– 6029. International Committee on Computational Linguistics. https://arxiv.org/abs/2011.03803v1 (This paper primarily does measurement of the importance of Transformer components.)
  4. Wangchunshu Zhou, Ronan Le Bras, Yejin Choi, June 2023, Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference, https://arxiv.org/abs/2306.02379 (An interesting paper that considers using two or more layers as “modules” that can be weaved into a new model somehow, which somewhat generalizes layer pruning or shallow decoder architectures.)
  5. Cristóbal Eyzaguirre, Felipe del Río, Vladimir Araujo, Álvaro Soto, 2021, DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference, Sep 2021, ArXiv preprint, abs/2109.11745, https://arxiv.org/abs/2109.11745
  6. Antonio Valerio Miceli Barone, Jindrich Helcl, Rico Sennrich, Barry Haddow, and Alexandra Birch. 2017, Deep architectures for neural machine translation, In Proc. of WMT, 2017. https://arxiv.org/abs/1707.07631 (Different stacked architectures in RNNs.)
  7. Young Jin Kim, Marcin Junczys-Dowmunt, Hany Hassan, Alham Fikri Aji, Kenneth Heafield, Roman Grundkiewicz, and Nikolay Bogoychev. 2019, From research to production and back: Ludicrously fast neural machine translation, In Proc. of WNGT, 2019. https://www.aclweb.org/anthology/D19-5632/, Code: https://github.com/marian-nmt/marian
  8. Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz. 2019, Universal transformers, In Proc. of ICLR, 2019. https://arxiv.org/abs/1807.03819
  9. Raj Dabre and Atsushi Fujita. 2019, Recurrent stacking of layers for compact neural machine translation models, In Proc. of AAAI, 2019. https://arxiv.org/abs/1807.05353 (Examines stacking layers of Transformers, including increasing the layers with parameter sharing.)
  10. Shazeer, N. M. 2019, Fast transformer decoding: One write-head is all you need, ArXiv, abs/1911.02150, 2019, https://arxiv.org/abs/1911.02150
  11. Sun, X., Ge, T., Wei, F., and Wang, H., 2021, Instantaneous grammatical error correction with shallow aggressive decoding, ArXiv, abs/2106.04970, 2021, https://arxiv.org/abs/2106.04970
  12. Bapna, A., Arivazhagan, N., and Firat, O., 2020, Controlling computation versus quality for neural sequence models, ArXiv, abs/2002.07106, Apr 2020, https://arxiv.org/abs/2002.07106
  13. Xiang Kong, Adithya Renduchintala, James Cross, Yuqing Tang, Jiatao Gu, Xian Li, 2022, Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders, https://arxiv.org/abs/2206.02079
  14. Yilin Yang, Longyue Wang, Shuming Shi, Prasad Tadepalli, Stefan Lee, and Zhaopeng Tu. 2020. On the Sub-layer Functionalities of Transformer Decoder, In Findings of EMNLP. Online, 4799–4811. https://doi.org/10.18653/v1/2020.findings-emnlp.432, https://arxiv.org/abs/2010.02648 (Investigates the depth of decoders; also concludes that the FFN can be removed from the decoder.)
  15. Telmo Pessoa Pires, António V. Lopes, Yannick Assogba, Hendra Setiawan, 2023, One Wide Feedforward is All You Need, arXiv preprint arXiv:2309.01826, https://arxiv.org/abs/2309.01826 (Removes the decoder FFNs entirely and shares a single encoder FFN across multiple encoder layers, and also increases the single FFN's size.)
  16. Tao Ge, Si-Qing Chen, and Furu Wei. 2022. EdgeFormer: A parameter-efficient transformer for on-device seq2seq generation, In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10786–10798, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics, https://arxiv.org/abs/2202.07959 (Includes “shared layers” with shared decoder FFN weights.)
  17. Q Wei, G Zeng, B Zeng, 2023, Efficient Training for Visual Tracking with Deformable Transformer, arXiv preprint arXiv:2309.02676, https://arxiv.org/pdf/2309.02676.pdf (Optimization and also investigated effects of number of decoder layers.)
  18. Xin Sun, Tao Ge, Furu Wei, and Houfeng Wang. 2021, Instantaneous grammatical error correction with shallow aggressive decoding, In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5937–5947, 2021. https://arxiv.org/abs/2106.04970, Code: https://github.com/AutoTemp/Shallow-Aggressive-Decoding (Aggressive decoding emits as many tokens as possible, combined with a shallow decoder architecture.)
  19. J Kasai, 2023, Towards Efficient, Customizable, and Communal Natural Language Processing, Ph.D. thesis, Computer Science and Engineering, University of Washington, https://www.proquest.com/openview/604084b574dcd05e41eb6e33682a3537/1 (Shallow decoding is only part of this wide-ranging and impressive Ph.D. thesis, by one of the early proponents of shallow decoding architectures.)
  20. S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a “shallow-deep module” and parallel decoding.)
  21. Kaya Y., Hong S., Dumitras T., 2019, Shallow-deep networks: Understanding and mitigating network overthinking, Proceedings of the international conference on machine learning, ICML (2019), pp. 3301-3310, https://arxiv.org/abs/1810.07052 (Shallow-deep method in a single model.)

For more research on the shallow decoder architecture, refer to https://www.aussieai.com/research/shallow-decoder.

 

Next: Chapter 48. Width Pruning

Up: Table of Contents
