Aussie AI
Early Exit
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
One inference loop optimization is to exit before all of the layers have been completed, using the results computed up to that point when there is a high degree of confidence. This method has been called "early exit" or "dynamic layer pruning" (see layer pruning). Early exit avoids the calculations for all layers after the just-finished one, whereas "layer skipping" can skip a single layer and continue with the following one, and "layer reordering" is a stranger generalization in which layers may be executed or skipped in any order.
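The basic control flow can be sketched as a loop over layers with a confidence check after each one. This is a minimal illustration, not any particular paper's method; the `layers` and `unembed` callables and the 0.9 threshold are hypothetical stand-ins:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def early_exit_inference(layers, unembed, x, threshold=0.9):
    """Run the layer stack, exiting early when confidence is high enough.

    layers: list of callables, each mapping a hidden state to the next.
    unembed: callable projecting a hidden state to vocabulary logits.
    threshold: exit once the top token's probability reaches this value.
    Returns the predicted token and the number of layers executed.
    """
    for i, layer in enumerate(layers):
        x = layer(x)
        probs = softmax(unembed(x))
        if probs.max() >= threshold:
            return int(probs.argmax()), i + 1  # early exit: skip the rest
    return int(probs.argmax()), len(layers)    # no exit: all layers ran
```

With a confident input this loop returns after only a few of the layers, which is where the computation saving comes from.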
The idea was also applied in training, many years earlier. The terms "dropout" and "early stopping" have occasionally been used to mean inference early exit, but usually refer to training optimizations with the analogous goal of reducing training time.
Early exiting is possible in a significant fraction of inference evaluations (e.g. reportedly 40% in DeeBERT [Xin et al. 2020]), but not always. By skipping all subsequent layers, it avoids a significant amount of computation.
Deciding to exit. The simplest approach is to run a fixed number of layers, but this is not theoretically interesting. There are more accurate ways to decide when to exit. Xu and McAuley (2022) categorize three subtypes of early exit, based on the criterion used to decide when to exit:
- Confidence estimation
- Internal ensemble
- Learning to exit
Confidence estimation uses a metric to predict that confidence is high enough to exit; internal ensemble uses multiple metrics, requiring enough of them to agree; learning to exit has the model learn when to exit.
Computing Confidence. There are numerous possible variations on deciding when the layer is "certain" or "confident" enough about its prediction:
- A single token has a high-enough probability (i.e. a single probability threshold) at a single layer.
- The highest-probability token has a probability sufficiently higher than the second-highest token (comparative probability threshold).
- Where multiple layer probability predictions are high for a single token (multi-layer threshold).
- Where multiple layer predictions are the same or similar enough (i.e. stable multi-layer thresholds).
- Per-layer thresholds. A token must meet a different threshold at each layer, which assumes that early layers are more likely to be wrong, whereas deeper layers are more likely to be accurate, so their thresholds can be lower.
- Combinations of any of the above in various ways.
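The first few of these criteria are simple tests on the softmax distribution over the vocabulary. Here is a hedged sketch of three of them (single threshold, comparative margin, and per-layer thresholds); the function names and the specific threshold values and linear per-layer schedule are illustrative assumptions, not from any particular paper:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def exit_single_threshold(logits, tau=0.9):
    """Single probability threshold: top token probability >= tau."""
    return softmax(logits).max() >= tau

def exit_margin(logits, margin=0.5):
    """Comparative threshold: top probability beats the runner-up by a margin."""
    p = np.sort(softmax(logits))[::-1]
    return (p[0] - p[1]) >= margin

def exit_per_layer(logits, layer, n_layers, tau_first=0.95, tau_last=0.7):
    """Per-layer thresholds: deeper layers use a lower (easier) threshold,
    here interpolated linearly from tau_first down to tau_last."""
    tau = tau_first + (tau_last - tau_first) * layer / (n_layers - 1)
    return softmax(logits).max() >= tau
```

A multi-layer or "stable prediction" criterion would additionally track the top token across several consecutive layers, which is closely related to the patience metric below.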
There are also other methods for choosing the exit point based on metrics such as:
- Entropy
- Patience
- Computation budget
- Desired speedup
- Latency (response time needed)
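Two of these metrics are easy to make concrete. Entropy measures how spread out the token distribution is (low entropy means a confident, peaked prediction), and patience (as popularized by PABEE, Zhou et al. 2020) exits after several consecutive layers agree on the same prediction. The sketch below is illustrative; the 0.5 entropy cutoff and the class interface are assumptions:

```python
import numpy as np

def entropy_exit(probs, max_entropy=0.5):
    """Exit when the entropy (in nats) of the token distribution is low,
    i.e. the probability mass is concentrated on a few tokens."""
    p = probs[probs > 0]               # skip zeros to avoid log(0)
    h = -(p * np.log(p)).sum()
    return h <= max_entropy

class PatienceExit:
    """Patience criterion: exit once `patience` consecutive layers
    predict the same token."""
    def __init__(self, patience=3):
        self.patience = patience
        self.last = None
        self.count = 0

    def update(self, predicted_token):
        """Call once per layer with that layer's top token; returns True
        when the exit condition is met."""
        if predicted_token == self.last:
            self.count += 1
        else:
            self.last, self.count = predicted_token, 1
        return self.count >= self.patience
```

Budget-, speedup-, and latency-based criteria are coarser: they bound the number of layers executed rather than inspecting the prediction itself.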
Computation overhead of unembedding. A practical point is that computing confidence estimates requires the token probabilities or logits. These are not normally computed at every layer, but only once at the end of the layer stack. Hence, analyzing confidence based on logit probabilities introduces the overhead of extra "unembedding" computations (i.e. multiplication by the inverse or transpose of the embeddings matrix). There is also the cost of analyzing the logit probabilities, such as a "max" or "top-k" computation over the vector of token probabilities, but this is likely less significant than the unembedding matrix multiplication.
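The asymmetry in cost is easy to see from the shapes involved. In this sketch the dimensions are deliberately small and hypothetical (real LLMs use e.g. a model dimension around 4096 and vocabularies of 32,000 tokens or more), and the (tied) embedding matrix is random rather than trained:

```python
import numpy as np

# Hypothetical, scaled-down sizes for illustration.
d_model, vocab_size = 512, 8192
rng = np.random.default_rng(0)
hidden = rng.standard_normal(d_model).astype(np.float32)             # one layer's output
W_embed = rng.standard_normal((vocab_size, d_model)).astype(np.float32)

# Unembedding: project the hidden state onto the vocabulary. This costs
# O(vocab_size * d_model) multiply-adds at every layer where a confidence
# check is performed.
logits = W_embed @ hidden

# The confidence analysis itself is only O(vocab_size), comparatively cheap:
# e.g. a margin test between the top two logits.
margin = logits.max() - np.partition(logits, -2)[-2]
```

So the dominant added cost of per-layer confidence checks is the extra matrix-vector product, not the max/top-k analysis that follows it.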
Accuracy benefits. Although most papers view early exit as an approximation that reduces accuracy of a model, some papers have noted advantages of early exiting. For example, executing fewer layers can reduce overfitting and the vanishing gradient problem.
Related Depth Techniques. Effectively, early exit of the inference loop is a form of "dynamic layer pruning" at runtime, and is therefore a type of "dynamic depth pruning" of the model. Early exiting is a special case of layer skipping, the more general idea of skipping some layers: early exit skips all remaining layers from a single exit point. Read more about other types of dynamic pruning on the layer or depth dimension of a model in the related sections on layer pruning and layer skipping.
Early exit is also one of multiple strategies for "dynamic inference". Some papers refer to such dynamic networks as "adaptive neural networks", which change their execution depending on the inputs. Some types of early exit, such as hierarchical early exit, are similar to research on cascades for DNNs and CNNs.
Survey Papers for Early Exit
Various survey papers cover early exit specifically, and review papers on model compression or inference optimization often include coverage of early exit methods.
Early exit surveys. Papers that specifically survey the SOTA for early-exit:
- Y. Matsubara, M. Levorato, and F. Restuccia, “Split computing and early exiting for deep learning applications: Survey and research challenges,” ACM Comput. Surveys, Mar 2022, https://arxiv.org/abs/2103.04505
- Stefanos Laskaridis, Alexandros Kouris, Nicholas D. Lane, Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions, EMDL'21: Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, June 2021, Pages 1–6, https://doi.org/10.1145/3469116.3470012, https://dl.acm.org/doi/abs/10.1145/3469116.3470012, https://arxiv.org/pdf/2106.05022.pdf
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
General surveys. Papers that cover many topics and include a section on early exit:
- Canwen Xu, Julian McAuley, 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including early exit.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
Research on Early Exit
Early exit was one of the earliest optimization techniques discovered for neural networks. There is no shortage of research papers.
- Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin, DeeBERT: Dynamic early exiting for accelerating bert inference, arXiv preprint arXiv:2004.12993, 2020, https://arxiv.org/abs/2004.12993 (Code: https://github.com/castorini/DeeBERT)
- Angela Fan, Edouard Grave, and Armand Joulin, Reducing transformer depth on demand with structured dropout, 2019, arXiv:1909.11556, https://arxiv.org/abs/1909.11556
- Surat Teerapittayanon, Bradley McDanel, and Hsiang Tsung Kung, BranchyNet: Fast inference via early exiting from deep neural networks, 2017, arXiv:1709.01686, https://arxiv.org/abs/1709.01686
- S. Teerapittayanon, B. McDanel, H.T. Kung, Distributed deep neural networks over the cloud, the edge and end devices, IEEE, Atlanta, GA, USA, 2017, pp. 328–339, doi:10.1109/ICDCS.2017.226., 5–8 June, https://doi.org/10.1109/ICDCS.2017.226
- Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang, Accelerating BERT Inference for Sequence Labeling via Early-Exit, May 2021, https://arxiv.org/abs/2105.13878
- Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis, Multi-Exit Vision Transformer for Dynamic Inference, June 2021, https://arxiv.org/abs/2106.15183
- Nikolaos Passalis, Jenni Raitoharju, Anastasios Tefas, Moncef Gabbouj, Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits, Pattern Recognition Volume 105, September 2020, 107346, https://doi.org/10.1016/j.patcog.2020.107346
- Xiangjie Li, Chenfei Lou, Yuchi Chen, Zhengping Zhu, Yingtao Shen, Yehan Ma, An Zou, Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference, DOI: https://doi.org/10.1609/aaai.v37i7.26042, https://ojs.aaai.org/index.php/AAAI/article/view/26042
- Vanderlei Bonato and Christos Bouganis, 2021, Class-specific early exit design methodology for convolutional neural networks, Applied Soft Computing (2021), https://www.sciencedirect.com/science/article/abs/pii/S1568494621002398, https://doi.org/10.1016/j.asoc.2021.107316, https://spiral.imperial.ac.uk/bitstream/10044/1/90316/2/Paper___Early_Exit___Applied_Soft_Computing.pdf
- E. Baccarelli, S. Scardapane, M. Scarpiniti, A. Momenzadeh, A. Uncini, Optimized training and scalable implementation of Conditional Deep Neural Networks with early exits for Fog-supported IoT applications, Information Sciences 521 (June 2020), 107–143, DOI: https://doi.org/10.1016/j.ins.2020.02.041, http://www.sciencedirect.com/science/article/pii/
- S. Wang, T. Tuor, T. Salonidis, K.K. Leung, C. Makaya, T. He, K. Chan, When edge meets learning: Adaptive control for resource-constrained distributed machine learning, in: IEEE Conference on Computer Communications (IEEE INFOCOM 2018), 2018, pp. 63–71, doi:10.1109/INFOCOM.2018.8486403, https://doi.org/10.1109/INFOCOM.2018.8486403, Honolulu, HI, USA, 16–19 April, 2018.
- Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei, BERT Loses Patience: Fast and Robust Inference with Early Exit, https://doi.org/10.48550/arXiv.2006.04152, https://arxiv.org/abs/2006.04152
- S. Scardapane, M. Scarpiniti, E. Baccarelli, A. Uncini, Why should we add early exits to neural networks?, Cognitive Computation 12 (5) (2020), 954–966, doi:10.1007/s12559-020-09734-4, http://dx.doi.org/10.1007/s12559-020-09734-4, https://arxiv.org/pdf/2004.12814.pdf
- Kaiyuan Liao, Yi Zhang, Xuancheng Ren, Qi Su, Xu Sun, Bin He, A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2021, DOI: 10.18653/v1/2021.naacl-main.162 https://aclanthology.org/2021.naacl-main.162/
- Zizhao Wang, Wei Bao, Dong Yuan, Liming Ge, Nguyen H. Tran, Albert Y. Zomaya, SEE: Scheduling Early Exit for Mobile DNN Inference during Service Outage, MSWIM '19: Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, November 2019, Pages 279–288, https://doi.org/10.1145/3345768.3355917, https://dl.acm.org/doi/abs/10.1145/3345768.3355917
- Xinrui Tan, Hongjia Li, Liming Wang, Xueqing Huang, Zhen Xu, Empowering Adaptive Early-Exit Inference with Latency Awareness, DOI: https://doi.org/10.1609/aaai.v35i11.17181, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/17181/16988
- Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. arXiv preprint arXiv:2207.07061, 2022, https://arxiv.org/abs/2207.07061
- Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. 2018. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations (ICLR), https://arxiv.org/abs/1703.09844
- Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-deep networks: Understanding and mitigating network overthinking. In International Conference on Machine Learning (ICML), volume 97, pages 3301–3310. PMLR, https://arxiv.org/abs/1810.07052
- Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020. The right tool for the job: Matching model and instance complexities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651, Association for Computational Linguistics, https://arxiv.org/abs/2004.07453
- Ji Xin, Rodrigo Nogueira, Yaoliang Yu, and Jimmy Lin. 2020. Early exiting BERT for efficient document ranking. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 83–88, Online. Association for Computational Linguistics. PDF: https://cs.uwaterloo.ca/~jimmylin/publications/Xin_etal_SustaiNLP2020.pdf
- Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. 2021. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, Association for Computational Linguistics, https://aclanthology.org/2021.eacl-main.8/, Code: https://github.com/castorini/berxit
- V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh, 2018, SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks, In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 662–673, https://doi.org/10.1109/ISCA.2018.00061, https://ieeexplore.ieee.org/document/8416863
- D. Li, W. Wu, L. Zeng, K. Li, Es-Fedavg: Early-Exit Personalized Federated Learning with Sparse Weight for Adaptive Computation, January 2023, SSRN Electronic Journal, DOI:10.2139/ssrn.4361705, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4361705, https://www.researchgate.net/publication/368592513_Es-Fedavg_Early-Exit_Personalized_Federated_Learning_with_Sparse_Weight_for_Adaptive_Computation
- Ting-Kuei Hu, Tianlong Chen, Haotao Wang, and Zhangyang Wang. Triple wins: Boosting accuracy, robustness and efficiency together by enabling input-adaptive inference. In ICLR, Feb 2020, https://arxiv.org/abs/2002.10025
- Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In ECCV, 2018, https://arxiv.org/abs/1711.09485
- Learning Early Exit for Deep Neural Network Inference on Mobile Devices through Multi-Armed Bandits Weiyu Ju; Wei Bao; Dong Yuan; Liming Ge; Bing Bing Zhou, 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 10-13 May 2021, https://ieeexplore.ieee.org/abstract/document/9499356, https://doi.org/10.1109/CCGrid51090.2021.00011
- Weiyu Ju, Wei Bao, Liming Ge, Dong Yuan, Dynamic Early Exit Scheduling for Deep Neural Network Inference through Contextual Bandits, CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, October 2021, Pages 823–832, https://doi.org/10.1145/3459637.3482335, https://dl.acm.org/doi/abs/10.1145/3459637.3482335
- Andong Li; Chengshi Zheng; Lu Zhang; Xiaodong Li, Learning to Inference with Early Exit in the Progressive Speech Enhancement, 2021 29th European Signal Processing Conference (EUSIPCO), 23-27 August 2021, https://ieeexplore.ieee.org/abstract/document/9616248, https://doi.org/10.23919/EUSIPCO54536.2021.9616248, https://arxiv.org/abs/2106.11730
- FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits, Polina Karpikova, Ekaterina Radionova, Anastasia Yaschenko, Andrei Spiridonov, Leonid Kostyushko, Riccardo Fabbricatore, Aleksei Ivakhnenko; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), April 2023, pp. 12032-12043 https://openaccess.thecvf.com/content/CVPR2023/html/Karpikova_FIANCEE_Faster_Inference_of_Adversarial_Networks_via_Conditional_Early_Exits_CVPR_2023_paper.html, https://arxiv.org/abs/2304.10306
- Rongkang Dong, Yuyi Mao, and Jun Zhang. Resource-Constrained Edge AI with Early Exit Prediction. Journal of Communications and Information Networks, 7(2):122–134, June 2022, https://arxiv.org/abs/2206.07269
- Qunliang Xing, Mai Xu, Tianyi Li, and Zhenyu Guan. Early exit or not: Resource-efficient blind quality enhancement for compressed images. In Computer Vision – ECCV 2020, pages 275–292. Springer International Publishing. 2020, https://arxiv.org/abs/2006.16581
- M. Wołczyk et al., “Zero time waste: Recycling predictions in early exit neural networks,” in Proc. 35th Conf. Neural Inf. Process. Syst. (NeurIPS), Virtual Conference, Dec. 2021, https://arxiv.org/abs/2106.05409
- M. Phuong and C. H. Lampert, “Distillation-based training for multi-exit architectures,” in Proc. IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, Korea (South), Oct.-Nov. 2019, https://ieeexplore.ieee.org/document/9009834
- S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, “SPINN: Synergistic progressive inference of neural networks over device and cloud,” in Proc. Annu. Inf. Conf. Mobile Comput. Netw. (MobiCom), London, UK, Sep. 2020, https://arxiv.org/abs/2008.06402
- M. Wang, J. Mo, J. Lin, Z. Wang, and L. Du, “DynExit: A dynamic early-exit strategy for deep residual networks,” in Proc. IEEE Int. Wkshop. Signal Process. Syst. (SiPS), Nanjing, China, Oct. 2019, https://ieeexplore.ieee.org/abstract/document/9020551
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, https://arxiv.org/abs/1409.4842
- Maciej Wołczyk, Bartosz Wojcik, Klaudia Bałazy, Igor T Podolak, Jacek Tabor, Marek Smieja, and Tomasz Trzcinski. Zero time waste: Recycling predictions in early exit neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 2516–2528. Curran Associates, Inc. 2021, https://arxiv.org/abs/2106.05409
- Enrique S. Marquez, Jonathon S. Hare, and Mahesan Niranjan. Deep cascade learning. IEEE Transactions on Neural Networks and Learning Systems, 29(11):5475–5485. 2018, https://ieeexplore.ieee.org/document/8307262
- Simone Scardapane, Danilo Comminiello, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. Differentiable branching in deep networks for fast inference. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4167–4171. 2020, https://ieeexplore.ieee.org/document/9054209
- Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, and Bart Dhoedt, Feb 2017, The cascading neural network: building the internet of smart things. Knowledge and Information Systems, 52(3):791–814, https://link.springer.com/article/10.1007/s10115-017-1029-1
- Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, and Joseph E Gonzalez. Idk cascades: Fast deep learning by learning not to overthink. arXiv preprint arXiv:1706.00885. 2017, https://arxiv.org/abs/1706.00885
- Simone Scardapane, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. Why should we add early exits to neural networks? Cognitive Computation, 12(5):954–966. 2020, https://arxiv.org/abs/2004.12814
- Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for efficient inference. In International Conference on Machine Learning, pages 527–536. PMLR. 2017, https://arxiv.org/abs/1702.07811
- Xin Dai, Xiangnan Kong, and Tian Guo. Epnet: Learning to exit with flexible multi-branch network. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM ’20, page 235–244, New York, NY, USA. Association for Computing Machinery. 2020, https://dl.acm.org/doi/10.1145/3340531.3411973
- Xinshi Chen, Hanjun Dai, Yu Li, Xin Gao, and Le Song. Learning to stop while learning to predict. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1520–1530. PMLR. 2020, https://arxiv.org/abs/2006.05082
- P. Panda, A. Sengupta, K. Roy, Conditional deep learning for energy-efficient and enhanced pattern recognition, in: 2016 Design, Automation Test in Europe Conference Exhibition (DATE), 2016, pp. 475–480, https://arxiv.org/abs/1509.08971
- Francesco Busolin, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Salvatore Trani, May 2021, Learning Early Exit Strategies for Additive Ranking Ensembles, https://arxiv.org/abs/2105.02568
- B. Barla Cambazoglu, Hugo Zaragoza, Olivier Chapelle, Jiang Chen, Ciya Liao, Zhaohui Zheng, and Jon Degenhardt. 2010. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pages 411–420, New York, New York, https://dl.acm.org/doi/10.1145/1718487.1718538
- Eunhyeok Park, Dongyoung Kim, Soobeom Kim, Yong-Deok Kim, Gunhee Kim, Sungroh Yoon, Sungjoo Yoo, Big/little deep neural network for ultra low power inference. 2015. In CODES '15: Proceedings of the 10th International Conference on Hardware/Software Codesign and System Synthesis, October 2015, Pages 124–132, https://dl.acm.org/doi/10.5555/2830840.2830854
- Geng, S.; Gao, P.; Fu, Z.; and Zhang, Y., 2021, RomeBERT: Robust Training of Multi-Exit BERT, arXiv preprint arXiv:2101.09755, https://arxiv.org/abs/2101.09755
- Zhou, W.; Xu, C.; and McAuley, J. J. 2022. BERT Learns to Teach: Knowledge Distillation with Meta Learning. In ACL. https://arxiv.org/abs/2106.04570
- Tianxiang Sun, Yunhua Zhou, Xiangyang Liu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu, 2021. Early Exiting with Ensemble Internal Classifiers. arXiv preprint arXiv:2105.13792, https://arxiv.org/abs/2105.13792
- Zhu, W. 2021. LeeBERT: Learned Early Exit for BERT with cross-level optimization. In ACL-IJCNLP, PDF: https://aclanthology.org/2021.acl-long.231.pdf
- Zhang, Z.; Zhu, W.; Zhang, J.; et al. 2022. PCEE-BERT: Accelerating BERT Inference via Patient and Confident Early Exiting. In NAACL-HLT (Findings), https://aclanthology.org/2022.findings-naacl.25/
- Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. ArXiv, abs/1910.10073, https://arxiv.org/abs/1910.10073
- Tal Schuster, Adam Fisch, Tommi Jaakkola, Regina Barzilay, 2021. Consistent accelerated inference via confident adaptive transformers. arXiv preprint arXiv:2104.08803. https://arxiv.org/abs/2104.08803
- Guan, Y.; Li, Z.; Leng, J.; et al. 2022. Transkimmer: Transformer Learns to Layer-wise Skim. In ACL, https://arxiv.org/abs/2205.07324
- H. Tann, S. Hashemi, R. I. Bahar, and S. Reda. Runtime configurable deep neural networks for energy-accuracy trade-off. In CODES + ISSS, pages 34:1–34:10, 2016. https://ieeexplore.ieee.org/document/9166549
- G Li, X Ma, Q Yu, L Liu, H Liu, X Wang, 2023, CoAxNN: Optimizing on-device deep learning with conditional approximate neural networks, Journal of Systems Architecture, https://www.sciencedirect.com/science/article/abs/pii/S1383762123001571
- X Gao, Y Liu, T Huang, Z Hou, 2023, PF-BERxiT: Early Exiting for BERT with Parameter-efficient Fine-tuning and Flexible early exiting strategy, Neurocomputing, https://www.sciencedirect.com/science/article/abs/pii/S0925231223008135
- Z Zeng, Y Hong, H Dai, H Zhuang, C Chen, August 2023, ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference, PDF: https://www.researchgate.net/publication/373392419_ConsistentEE_A_Consistent_and_Hardness-Guided_Early_Exiting_Method_for_Accelerating_Language_Models_Inference
- Duggal, R., Freitas, S., Dhamnani, S., Chau, D.H., Sun, J.: ELF: an early-exiting framework for long-tailed classification. Arxiv Preprint Arxiv:2006.11979 (2020) https://arxiv.org/abs/2006.11979
- H Yu, D Liu, Z Zhang, J Wang, 2023, IEEE Transactions on Instrumentation and Measurement (Early Access), A Dynamic Transformer Network with Early Exit Mechanism for Fast Detection of Multiscale Surface Defects, https://ieeexplore.ieee.org/document/10242087
- A Zniber, O Karrakchou, M Ghogho, 2023, Dynamic Early Exiting Predictive Coding Neural Networks, arXiv preprint arXiv:2309.02022, https://arxiv.org/pdf/2309.02022.pdf
- Y. Long, I. Chakraborty, and K. Roy, 2020, “Conditionally deep hybrid neural networks across edge and cloud,” arXiv:2005.10851, https://arxiv.org/abs/2005.10851
- Berestizshevsky, K., Even, G.: Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence. In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26
- Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844, https://arxiv.org/abs/1703.09844 (Has multiple models combined in an early-exit configuration.)
- A Moos, 2023, Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs, arXiv preprint arXiv:2309.03530, https://arxiv.org/pdf/2309.03530.pdf (Fast inference for a soccer-playing robot with cascade-like hierarchical early exits.)
- Francesco Daghero, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini, Enrico Macii, Massimo Poncino, "Energy-Efficient Adaptive Machine Learning on IoT End-Nodes With Class-Dependent Confidence", 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp.1-4, 2020. https://ieeexplore.ieee.org/document/9294863, https://arxiv.org/abs/2204.03431v1 (An improved stopping policy for early exits on easy-input classification tasks.)
- Kyungchul Park, Chanyoung Oh, Youngmin Yi, "BPNet: Branch-pruned Conditional Neural Network for Systematic Time-accuracy Tradeoff", 2020 57th ACM/IEEE Design Automation Conference (DAC), pp.1-6, 2020. https://ieeexplore.ieee.org/document/9218545
- T Shen, C Lee, V Narayanan, Oct 2023, Multi-Exit Vision Transformer with Custom Fine-Tuning for Fine-Grained Image Recognition, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222298 (Early exit from multiple places, combined with self-distillation.)
- Sehoon Kim, Karttikeya Mangalam, Suhong Moon, John Canny, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer, Sep 2023, Speculative Decoding with Big Little Decoder, https://arxiv.org/abs/2302.07863 (Early exiting in the context of speculative decoder optimizations.)
- Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., and Smith, N. A. The right tool for the job: Matching model and instance complexities. In Annual Meeting of the Association for Computational Linguistics, 2020. https://arxiv.org/abs/2004.07453 (Early exit with "wisdom of committees" decisions.)
- X Li, Y Shen, A Zou, Y Ma, 2023, EENet: Energy Efficient Neural Networks with Run-time Power Management, 2023 60th ACM/IEEE Design Automation Conference (DAC), https://ieeexplore.ieee.org/abstract/document/10247701 (Learns early exit characteristics and decision methods over time.)
- K Liu, S Moon, 2023, Self-supervised efficient sample weighting for multi-exit networks, Knowledge-Based Systems, https://www.sciencedirect.com/science/article/abs/pii/S0950705123007530 (Early exiting during both training and inference to reduce the disparity.)
- Divya J. Bajpai, Vivek K. Trivedi, Sohan L. Yadav, Manjesh K. Hanawal, 2023, SplitEE: Early Exit in Deep Neural Networks with Split Computing, arXiv preprint arXiv:2309.09195, https://arxiv.org/abs/2309.09195
- George August Wright, Umberto Cappellazzo, Salah Zaiem, Desh Raj, Lucas Ondel Yang, Daniele Falavigna, Alessio Brutti, Sep 2023, Training dynamic models using early exits for automatic speech recognition on resource-constrained devices, https://arxiv.org/abs/2309.09546
- X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
- J Wang, B Li, GL Zhang, 2023, Early Classification for Dynamic Inference of Neural Networks, arXiv preprint arXiv:2309.13443, https://arxiv.org/pdf/2309.13443.pdf
- S Tang, Y Wang, C Ding, Y Liang, Y Li, D Xu, 2023, arXiv preprint arXiv:2309.17074, DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation, https://arxiv.org/pdf/2309.17074.pdf (Uses uncertainty-based confidence to decide on early-exit in diffusion models.)
- S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
- F Regol, J Chataoui, M Coates, Oct 2023, Jointly-Learned Exit and Inference for a Dynamic Neural Network: JEI-DNN, arXiv preprint arXiv:2310.09163, http://export.arxiv.org/abs/2310.09163
- Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a small subset of the input to speed up inference with early-exit based on confidence level.)
- Hajin Shim, Sung Ju Hwang, and Eunho Yang. Joint active feature acquisition and classification with variable-size set encoding. NeurIPS, pages 1368–1378, 2018. https://papers.nips.cc/paper/2018/file/e5841df2166dd424a57127423d276bbe-Paper.pdf
- Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic layer depth routing based on easy vs hard queries to optimize training.)
- Ji-Ye Jeon; Xuan Truong Nguyen; Hyuk-Jae Lee, Jan 2024, Mitigation of Over-Confidence in Scale-Adjusted Training for Early-Exit Networks, 2024 International Conference on Electronics, Information, and Communication (ICEIC), 28-31 January, 2024, https://ieeexplore.ieee.org/abstract/document/10457149 (Making early-exit more accurate by avoiding overconfidence of wrong predictions.)
- Divya Jyoti Bajpai, Aastha Jaiswal, Manjesh Kumar Hanawal, 19 Jan 2024, I-SplitEE: Image classification in Split Computing DNNs with Early Exits, https://arxiv.org/abs/2401.10541
- Li, L., Wang, C., Qiu, M. et al., 2024, Accelerating BERT inference with GPU-efficient exit prediction. Front. Comput. Sci. 18, 183308 (2024). https://doi.org/10.1007/s11704-022-2341-9, https://link.springer.com/article/10.1007/s11704-022-2341-9
- Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Li Li, 18 Jan 2024, When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference, https://arxiv.org/abs/2401.09964 (Analysing the importance of different layers in code completion use case.)
- Xuchen Pan, Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou, 1 Feb 2024, EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models, https://arxiv.org/abs/2402.00518 Code: https://github.com/pan-x-c/EE-LLM
- Jingcun Wang, Bing Li, Grace Li Zhang, 2024, Early-Exit with Class Exclusion for Efficient Inference of Neural Networks, https://www.hwai.tu-darmstadt.de/fileadmin/user_upload/main.pdf (Early exit for classification use case.)
- Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
- Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 4 Mar 2024, DyCE: Dynamic Configurable Exiting for Deep Learning Compression and Scaling, https://arxiv.org/abs/2403.01695 (General framework for dynamic early exit in complicated architectures.)
- Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 4 Mar 2024, Not all Layers of LLMs are Necessary during Inference, https://arxiv.org/abs/2403.02181
- Filip Szatkowski, Fei Yang, Bartłomiej Twardowski, Tomasz Trzciński, Joost van de Weijer, 12 Mar 2024, Accelerated Inference and Reduced Forgetting: The Dual Benefits of Early-Exit Networks in Continual Learning, https://arxiv.org/abs/2403.07404 (Early exit optimizations applied to continual learning.)
- Max Sponner, Lorenzo Servadei, Bernd Waschneck, Robert Wille, Akash Kumar, 12 Mar 2024, Efficient Post-Training Augmentation for Adaptive Inference in Heterogeneous and Distributed IoT Environments, https://arxiv.org/abs/2403.07957 (Early exit for image classification and IoT devices.)
- Max Sponner, Lorenzo Servadei, Bernd Waschneck, Robert Wille, Akash Kumar, 12 Mar 2024, Temporal Decisions: Leveraging Temporal Correlation for Efficient Decisions in Early Exit Neural Networks https://arxiv.org/abs/2403.07958 (Using time-based data to help decisions for early-exit, in video sequences.)
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- Lars Wolfgang Folkerts, Nektarios Georgios Tsoutsos, 2024, FHE-MENNs: Accelerating Fully Homomorphic Private Inference with Multi-Exit Neural Networks, PDF: https://trustworthycomputing.github.io/FHE-MENN/FHEMENNs.pdf
- Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 26 Mar 2024, Tiny Models are the Computational Saver for Large Models, https://arxiv.org/abs/2403.17726v1 (Choose tiny or small models after an initial layer of the larger model, combining early exit with easy-hard queries for multi-model inference.)
- Yun Zhu, Yaoke Wang, Haizhou Shi, Siliang Tang, 28 Jan 2024, Efficient Tuning and Inference for Large Language Models on Textual Graphs, https://arxiv.org/abs/2401.15569 (Optimizing Graph Neural Networks on textual graphs using caching and early exit inference.)
- Matteo Gambella, Jary Pomponi, Simone Scardapane, Manuel Roveri, 24 Jan 2024, NACHOS: Neural Architecture Search for Hardware Constrained Early Exit Neural Networks, https://arxiv.org/abs/2401.13330
- Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf (Early exit in training.)
- Metod Jazbec, Patrick Forré, Stephan Mandt, Dan Zhang, Eric Nalisnick, 10 Nov 2023, Anytime-Valid Confidence Sequences for Consistent Uncertainty Estimation in Early-Exit Neural Networks, https://arxiv.org/abs/2311.05931
- Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha, Nov 2023, DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models, https://arxiv.org/abs/2311.08623
- Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui, 26 Nov 2023, Learning to Skip for Language Modeling, https://arxiv.org/abs/2311.15436 (Generalizes token-based early exiting to skip entire layers.)
- Mikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk, Nov 2023, Adaptive Early Exiting for Collaborative Inference over Noisy Wireless Channels, https://arxiv.org/abs/2311.18098 (Early exiting combined with collaborative inference.)
- Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou, Dec 2023, EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, https://arxiv.org/abs/2312.04916 Code: https://github.com/pan-x-c/EE-LLM
- H Kim, S Yoon, S Pack, 2023, Poster: Adaptive In-Network Inference using Early-Exits, Proceedings of the 6th on European P4 Workshop, https://dl.acm.org/doi/abs/10.1145/3630047#page=62
- Haoyu Wang, Yaqing Wang, Tianci Liu, Tuo Zhao, and Jing Gao, 2023, HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference https://aclanthology.org/2023.findings-emnlp.283.pdf (Layer skipping during fine-tuning.)
- JY Jeon, XT Nguyen, S Ryu, HJ Lee, 2024, USDN: A Unified Sample-wise Dynamic Network with Mixed-Precision and Early-Exit, https://openaccess.thecvf.com/content/WACV2024/papers/Jeon_USDN_A_Unified_Sample-Wise_Dynamic_Network_With_Mixed-Precision_and_Early-Exit_WACV_2024_paper.pdf
- F Ilhan, KH Chow, S Hu, T Huang, S Tekin, W Wei, 2024, Adaptive Deep Neural Network Inference Optimization with EENet, https://openaccess.thecvf.com/content/WACV2024/papers/Ilhan_Adaptive_Deep_Neural_Network_Inference_Optimization_With_EENet_WACV_2024_paper.pdf
- Mohammed Ayyat; Tamer Nadeem; Bartosz Krawczyk, Dec 2023, ClassyNet: Class-Aware Early Exit Neural Networks for Edge Devices, IEEE Internet of Things Journal (Early Access), https://ieeexplore.ieee.org/abstract/document/10365527
- Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
- MM Rastikerdar, J Huang, S Fang, H Guan, D Ganesan, Oct 2023, Efficient IoT Inference via Context-Awareness, https://arxiv.org/pdf/2310.19112.pdf (Does dynamic context-aware "classifier switching" which is similar to cascades and/or early exiting.)
- N Varshney, A Chatterjee, M Parmar, C Baral, Oct 2023, arXiv preprint arXiv:2310.18581, Accelerating LLM Inference by Enabling Intermediate Layer Decoding, https://arxiv.org/pdf/2310.18581.pdf (Dynamic confidence-based early exiting analysis on LLama models.)
- Tan Rong Loo, T. Hui Teo, Mulat Ayinet Tiruye, I-Chyn Wey, 2022, High-Performance Asynchronous CNN Accelerator with Early Termination, 2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp.140-144, 2022. https://ieeexplore.ieee.org/document/10008416
- Zhiwei Liang, Yuezhi Zhou, 2022, Dispense Mode for Inference to Accelerate Branchynet, 2022 IEEE International Conference on Image Processing (ICIP), pp.1246-1250, 2022. https://ieeexplore.ieee.org/document/9897574
- Francesco Daghero, Alessio Burrello, Chen Xie, Luca Benini, Andrea Calimera, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari, Low-Overhead Early-Stopping Policies for Efficient Random Forests Inference on Microcontrollers, VLSI-SoC: Technology Advancement on SoC Design, vol.661, pp.25, 2022. https://doi.org/10.1007/978-3-031-16818-5_2
- T Hu, C Meinel, H Yang, 2023, Flexible BERT with Width-and Depth-dynamic Inference, 2023 International Joint Conference on Neural Networks (IJCNN), https://ieeexplore.ieee.org/abstract/document/10191515/ (A 2023 version of BERT that does dual pruning with early exit and width gating.)
- C Luo, J Chen, X Feng, J Zhang, J Li, 2023, Sustainable Collaborative Inference in Intelligent Transportation Systems IEEE Transactions on Intelligent Transportation, https://ieeexplore.ieee.org/document/10239242
- E Mohammed, O Mashaal, H Abou-Zeid, 2023, Using Early Exits for Fast Inference in Automatic Modulation Classification, arXiv preprint arXiv:2308.11100, 2023, https://arxiv.org/pdf/2308.11100.pdf
- Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. 2017, Adaptive neural networks for efficient inference. In International Conference on Machine Learning, pages 527–536. PMLR, 2017. https://arxiv.org/abs/1702.07811
- Huang, G.; Chen, D.; Li, T.; Wu, F.; van der Maaten, L.; and Weinberger, K. Q., 2017. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, https://arxiv.org/abs/1703.09844 (Doing dynamic inference, early-exit & changing the features.)
- Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
- Metod Jazbec, Alexander Timans, Tin Hadži Veljković, Kaspar Sakmann, Dan Zhang, Christian A. Naesseth, Eric Nalisnick, 31 May 2024, Fast yet Safe: Early-Exiting with Risk Control, https://arxiv.org/abs/2405.20915 (Adjusting the early exit threshold according to a risk tolerance.)
- Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
- Lianming Huang, Shangyu Wu, Yufei Cui, Ying Xiong, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 24 May 2024, RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference, https://arxiv.org/abs/2405.15198 (Early exit classifiers built with pre-computation using a retrieval database.)
- Myeonggu Kang, Junyoung Park, Hyein Shin, Jaekang Shin, Lee-Sup Kim, 2024, ToEx: Accelerating Generation Stage of Transformer-based Language Models via Token-adaptive Early Exit, IEEE Transactions on Computers, PrePrints pp. 1-14, DOI Bookmark: 10.1109/TC.2024.3404051, https://www.computer.org/csdl/journal/tc/5555/01/10535998/1X7QeuzvX4A
- Pietro Farina, Subrata Biswas, Eren Yıldız, Khakim Akhunov, Saad Ahmed, Bashima Islam, Kasım Sinan Yıldırım, 16 May 2024, Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems, https://arxiv.org/abs/2405.10426
- William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
- Haoyi Wu, Kewei Tu, 17 May 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Use the KV cache for only the final layer as the KV cache for all other layers, or alternatively, use only the cache from a few layers, also possibly using a few standard layers as "warmup layers". This idea is conceptually similar to "propagation" of the KV cache in early exit methods or to layer fusion of weights.)
- Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
- Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji, 9 May 2024, Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference, https://arxiv.org/abs/2405.05803 Code: https://github.com/lzhxmu/VTW (Removing all visual tokens in later layers of a multimodal model, which is effectively early exiting or token pruning, but affecting only the vision part of the multimodal Transformer.)
- Jiaming Huang, Yi Gao, Wei Dong, 13 May 2024, Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks, WWW '24: Proceedings of the ACM on Web Conference 2024 May 2024, Pages 2777–2785, https://doi.org/10.1145/3589334.3645340 https://dl.acm.org/doi/abs/10.1145/3589334.3645340
- Caelin Kaplan, Tareq Si Salem, Angelo Rodio, Chuan Xu, Giovanni Neglia, 7 May 2024, Federated Learning for Cooperative Inference Systems: The Case of Early Exit Networks, https://arxiv.org/abs/2405.04249
- Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
- Dieter Verbruggen, Sofie Pollin, Hazem Sallouha, 6 May 2024, Computational Efficient Width-Wise Early Exits in Modulation Classification, https://arxiv.org/abs/2405.03222
- Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
- Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding where the draft model is an early exit within the larger model, with two advantages: (a) the draft and verifier models thereby share KV cache data for the early layers, and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, https://arxiv.org/abs/2310.12072 (Speculatively executes multiple future tokens in parallel with the current token, using multiple high-probability tokens from the early layers of inference of the current token. This allows multiple speculations of the autoregressive inference of the next token to start before the current token is finished.)
- Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London., https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
- M Sponner, B Waschneck, A Kumar , 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys,, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme for drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
- Divya Jyoti Bajpai, Manjesh Kumar Hanawal, 23 May 2024, CEEBERT: Cross-Domain Inference in Early Exit BERT, https://arxiv.org/abs/2405.15039
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Utkarsh Saxena, Kaushik Roy, 2023, McQueen: Mixed Precision Quantization of Early Exit Networks, https://papers.bmvc2023.org/0511.pdf (Combination of mixed-precision quantization, with precision specifiable statically at a layerwise granularity, with early exit dynamic depth optimizations.)
- Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
- Cristóbal Eyzaguirre, Felipe del Río, Vladimir Araujo, Álvaro Soto, DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference, Sep 2021, ArXiv preprint, abs/2109.11745, https://arxiv.org/abs/2109.11745
- Y. Matsubara, M. Levorato, and F. Restuccia, Mar 2022, "Split computing and early exiting for deep learning applications: Survey and research challenges," ACM Comput. Surveys, https://arxiv.org/abs/2103.04505
- Mor Geva, Avi Caciularu, Guy Dar, Paul Roit, Shoval Sadde, Micah Shlain, Bar Tamir, and Yoav Goldberg. 2022a. LM-Debugger: An interactive tool for inspection and intervention in transformer-based language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 12–21, Abu Dhabi, UAE. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-demos.2
- Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-main.3
- Ting Hu, Christoph Meinel, Haojin Yang, 2024, A flexible BERT model enabling width- and depth-dynamic inference, Computer Speech & Language 4 April 2024, 101646, https://www.sciencedirect.com/science/article/pii/S0885230824000299 (Dual pruning method with layerwise "neural grafting" that gives dynamic width models, and combined with early exit on the depth dimension.)
- David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-token layer skipping for a type of adaptive inference with conditional computation.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
- Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, Chitta Baral, Nov 2023, Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE, https://arxiv.org/abs/2310.18581 (Improved training for early-exit networks by handling intermediate layer weight updates.)
- Tzu-Quan Lin, Hung-yi Lee, Hao Tang, 8 Jun 2024, DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models, https://arxiv.org/abs/2406.05464
- L. Li, K. Ota, and M. Dong, 2018, “Deep learning for smart industry: Efficient manufacture inspection system with fog computing,” IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4665–4673, Oct. 2018. https://ieeexplore.ieee.org/document/8370640
- David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- X Hou, J Liu, X Tang, C Li, KT Cheng, L Li, M Guo, 2023, MMExit: Enabling Fast and Efficient Multi-modal DNN Inference with Adaptive Network Exits, https://link.springer.com/chapter/10.1007/978-3-031-39698-4_29
- J Zhang, M Tan, P Dai, W Zhu, 2023, LECO: Improving Early Exiting via Learned Exits and Comparison-based Exiting Mechanism https://aclanthology.org/2023.acl-srw.43/ https://aclanthology.org/2023.acl-srw.43.pdf
- X Gao, W Zhu, J Gao, C Yin, 2023, F-PABEE: Flexible-patience-based Early Exiting for Single-label and Multi-label text Classification Tasks https://ieeexplore.ieee.org/abstract/document/10095864/
- T Tambe, 2023, Architecting High Performance Silicon Systems for Accurate and Efficient On-Chip Deep Learning, https://dash.harvard.edu/bitstream/handle/1/37375806/Final_Draft_PhD_Dissertation_Thierry_Tambe.pdf?sequence=1&isAllowed=y
- J Xin, 2023, Efficient Inference of Transformers in Natural Language Processing: Early Exiting and Beyond, https://uwspace.uwaterloo.ca/handle/10012/19111 https://uwspace.uwaterloo.ca/bitstream/handle/10012/19111/Xin_Ji.pdf?sequence=1
- Jinting Chen, Zhaocheng Zhu, Cheng Li, Yuming Zhao, Oct 2019, Self-Adaptive Network Pruning, https://arxiv.org/abs/1910.08906
- Emanuele Lattanzi; Chiara Contoli; Valerio Freschi, 2023, A Study on the Energy Sustainability of Early Exit Networks for Human Activity Recognition IEEE Transactions on Sustainable Computing (Early Access), pp.1-14, 8 August 2023. https://ieeexplore.ieee.org/abstract/document/10213213
- M Omer Mohammed Elamin Elshaigi, 2023, Adaptive Deep Neural Networks for Human Pose Estimation on Autonomous Nano-Drones, Masters Thesis, PDF: https://webthesis.biblio.polito.it/secure/27689/1/tesi.pdf
- Aviv Slobodkin, Leshem Choshen, and Omri Abend. 2021. Mediators in determining what processing BERT performs first. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 86–93, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.8
- Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593– 4601, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1452
- Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1448
- Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.446
- J Alammar. 2021. Ecco: An open source library for the explainability of transformer language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 249–257, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-demo.30
- Z Fei, X Yan, S Wang, Q Tian, 2022, DeeCap: Dynamic Early Exiting for Efficient Image Captioning, CVPR 2022, http://openaccess.thecvf.com/content/CVPR2022/html/Fei_DeeCap_Dynamic_Early_Exiting_for_Efficient_Image_Captioning_CVPR_2022_paper.html
- Alexander Yom Din, Taelin Karidi, Leshem Choshen, Mor Geva, Mar 2023, Jump to Conclusions: Short-Cutting Transformers With Linear Transformations, https://arxiv.org/abs/2303.09435
- Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal transformers. Proceedings of ICLR 2019, https://arxiv.org/abs/1807.03819
- Rippel, O.; Gelbart, M.; and Adams, R. 2014. Learning ordered representations with nested dropout. In International Conference on Machine Learning, 1746–1754. PMLR https://arxiv.org/abs/1402.0915
- Ori Ram, Liat Bezalel, Adi Zicher, Yonatan Belinkov, Jonathan Berant, and Amir Globerson. 2022. What are you token about? dense retrieval as distributions over the vocabulary. arXiv preprint arXiv:2212.10380 https://arxiv.org/abs/2212.10380
- Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. 2022. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535. https://arxiv.org/abs/2209.02535
- Wang, J., Chen, K., Chen, G., Shou, L., McAuley, J.: Skipbert: Efficient inference with shallow layer skipping. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7287–7301 (2022) https://aclanthology.org/2022.acl-long.503/ (Skips early layers of a model via precomputed lookup tables based on detecting known token n-grams in the prompt.)
- Max Lamparth and Anka Reuel. 2023. Analyzing and editing inner mechanisms of backdoored language models. arXiv preprint arXiv:2302.12461. https://arxiv.org/abs/2302.12461
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014), https://dl.acm.org/doi/abs/10.5555/2627435.2670313
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
- Jiachen Jiang, Jinxin Zhou, Zhihui Zhu, 20 Jun 2024, On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier, https://arxiv.org/abs/2406.14479 (Using layer similarity for early exit classifiers, which is also related to layer fusion.)
- Florian Valade, 17 July 2024, Accelerating Large Language Model Inference with Self-Supervised Early Exits, hal-04644928, https://hal.science/hal-04644928/document
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Bartłomiej Krzepkowski, Monika Michaluk, Franciszek Szarwacki, Piotr Kubaty, Jary Pomponi, Tomasz Trzciński, Bartosz Wójcik, Kamil Adamczewski, 19 Jul 2024, Joint or Disjoint: Mixing Training Regimes for Early-Exit Models, https://arxiv.org/abs/2407.14320
- Feng (Shelley) Xia, May 2024, Exploring Early Exiting Strategies for Deep Neural Networks, Masters Thesis, Princeton University, https://www.proquest.com/openview/40f3d8735c8cc93d77bc6724c6535190/1?pq-origsite=gscholar&cbl=18750&diss=y
- Ruijie Miao, Yihan Yan, Xinshuo Yao, Tong Yang, 25 Jul 2024, An Efficient Inference Framework for Early-exit Large Language Models, https://arxiv.org/abs/2407.20272 (Faster early exit using batching and KV cache resolution.)
- Filipe Laitenberger, Max Belitsky, Denys Sheremet, Oliver Savolainen, Mark Bodracska, 2024, Exploring Monotonicity in Early-Exiting Language Models, https://openreview.net/pdf?id=BM1Aijdheb
- Mehrnaz Mofakhami, Reza Bayat, Ioannis Mitliagkas, Joao Monteiro, Valentina Zantedeschi, 2024, Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones, ICML 2024 Workshop on Efficient Systems for Foundation Models (ES-FoMo-II), Vienna, Austria, https://openreview.net/pdf?id=7zf1584SUG
- Marco Colocrese, Erdem Koyuncu, Hulya Seferoglu, 8 Aug 2024, Early-Exit meets Model-Distributed Inference at Edge Networks, https://arxiv.org/abs/2408.05247
- R. Narmeen, P. Mach, Z. Becvar and I. Ahmad, 16 August 2024, Joint Exit Selection and Offloading Decision for Applications Based on Deep Neural Networks, IEEE Internet of Things Journal, doi: 10.1109/JIOT.2024.3444898, https://doi.org/10.1109/JIOT.2024.3444898 https://ieeexplore.ieee.org/abstract/document/10638073
- J. Wang, B. Li and G. L. Zhang, 2024, Early-Exit with Class Exclusion for Efficient Inference of Neural Networks, 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), Abu Dhabi, United Arab Emirates, 2024, pp. 263-267, doi: 10.1109/AICAS59952.2024.10595861, https://ieeexplore.ieee.org/document/10595861 Code: https://github.com/HWAI-TUDa/EarlyClassExclusion (Reduces the early exit tokens under consideration for the classifier at each layer, which is similar to slimmable networks.)
- Basar Kutukcu, Sabur Baidya, Sujit Dey, 2024, SLEXNet: Adaptive Inference Using Slimmable Early Exit Neural Networks, https://doi.org/10.1145/3689632 https://dl.acm.org/doi/pdf/10.1145/3689632 (Combined width and depth pruning with slimmable and early exit networks.)
- Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193
- Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim, 19 Jul 2024 (v5), SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, https://arxiv.org/abs/2402.09025 https://github.com/jiwonsong-dev/SLEB
- David Spuler, June 2024, Aussie AI, Optimizing On-Device Transformer Inference for Source Code Checking: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901675
- David Spuler, June 2024, Aussie AI, Heuristic Optimization of Transformer On-Device Inference: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901670
- David Spuler, June 2024, Aussie AI, Speculative Decoding With Early Exit for Optimized Transformer On-Device Inference: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901656
- Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
- A Hannan, A Brutti, D Falavigna, 2024, LDASR: An Experimental Study on Layer Drop using Conformer-based Architecture, https://eurasip.org/Proceedings/Eusipco/Eusipco2024/pdfs/0000151.pdf
- Youva Addad, Alexis Lechervy, Frédéric Jurie, 2024, Balancing Accuracy and Efficiency in Budget-Aware Early-Exiting Neural Networks, https://lechervy.users.greyc.fr/publi/C/publi_pdf/icpr24.pdf
- Jordan Dotzel, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, Sep 2024, Opportunities for Post-Training Dynamic Layer Sparsity in Large Vision and Language Models, https://openaccess.thecvf.com/content/CVPR2024W/ELVM/papers/Dotzel_Opportunities_for_Post-Training_Dynamic_Layer_Sparsity_in_Large_Vision_and_CVPRW_2024_paper.pdf (Layerwise dynamic sparsity for vision models.)
- Wang, Z., Han, J. (2024). Improve Shallow Decoder Based Transformer with Structured Expert Prediction. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_15 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_15
- Zheng Liu, Jinchao Zhu, Nannan Li, Gao Huang, 21 Sep 2024, Multiple-Exit Tuning: Towards Inference-Efficient Adaptation for Vision Transformer, https://arxiv.org/abs/2409.13999 (Early exit ideas applied to PETL-based fine-tuning.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
- Divya Jyoti Bajpai and Manjesh Kumar Hanawal, Oct 2024, CAPEEN: Image Captioning with Early Exits and Knowledge Distillation, https://www.researchgate.net/profile/Divya-Jyoti-Bajpai/publication/384595581_CAPEEN_Image_Captioning_with_Early_Exits_and_Knowledge_Distillation/links/66fe753e9e6e82486ffe97ef/CAPEEN-Image-Captioning-with-Early-Exits-and-Knowledge-Distillation.pdf
- Divya Jyoti Bajpai, Manjesh Kumar Hanawal, 6 Oct 2024, Distributed Inference on Mobile Edge and Cloud: An Early Exit based Clustering Approach, https://arxiv.org/abs/2410.05338
- R. G. Pacheco et al., "UCBEE: A Multi Armed Bandit Approach for Early-Exit in Neural Networks," in IEEE Transactions on Network and Service Management, doi: 10.1109/TNSM.2024.3479076. https://ieeexplore.ieee.org/abstract/document/10714362
- M. Sponner, L. Servadei, B. Waschneck, R. Wille and A. Kumar, "Harnessing Temporal Information for Efficient Edge AI," 2024 9th International Conference on Fog and Mobile Edge Computing (FMEC), Malmö, Sweden, 2024, pp. 5-13, doi: 10.1109/FMEC62297.2024.10710223. https://ieeexplore.ieee.org/abstract/document/10710223
- Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal, 16 Oct 2024, FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction, https://arxiv.org/abs/2410.12513
- Amrit Diggavi Seshadri, 3 Oct 2024 (v2), Normalized Narrow Jump To Conclusions: Normalized Narrow Shortcuts for Parameter Efficient Early Exit Transformer Prediction, https://arxiv.org/abs/2409.14091 https://ui.adsabs.harvard.edu/abs/2024arXiv240914091D/abstract
- Haoyan Luo, Lucia Specia, 16 Oct 2024, Tuning Language Models by Mixture-of-Depths Ensemble, https://arxiv.org/abs/2410.13077
- Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 24 Oct 2024, Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952
- Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates, 26 Oct 2024, Dynamic layer selection in decoder-only transformers, https://arxiv.org/abs/2410.20022
- Z. Wen et al., "SwiftNet: A Cost-Efficient Deep Learning Framework With Diverse Applications," in IEEE Transactions on Industrial Informatics, doi: 10.1109/TII.2024.3448500. https://ieeexplore.ieee.org/abstract/document/10740525
- Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
- M. Sponner, L. Servadei, B. Waschneck, R. Wille and A. Kumar, "Leveraging Temporal Patterns: Automated Augmentation to Create Temporal Early Exit Networks for Efficient Edge AI," in IEEE Access, doi: 10.1109/ACCESS.2024.3497158. https://ieeexplore.ieee.org/document/10752535
- Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
- Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav, November 20, 2024, Faster Text Generation with Self-Speculative Decoding, https://huggingface.co/blog/layerskip
- Weiqiao Shan, Long Meng, Tong Zheng, Yingfeng Luo, Bei Li, Junxin Wang, Tong Xiao, Jingbo Zhu, 2 Dec 2024, Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization, https://arxiv.org/abs/2412.01455
- Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti, 6 Dec 2024, BEExformer: A Fast Inferencing Transformer Architecture via Binarization with Multiple Early Exits, https://arxiv.org/abs/2412.05225
Early Exit in Generalized Speculative Decoding
An early-exiting large model can serve as the smaller "draft" model in speculative decoding, rather than using a separate small model: the first few layers produce the draft token, and the full model verifies it. See generalized speculative decoding.
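This draft-and-verify structure can be sketched in a few lines. The toy "model" below (a chain of arithmetic layers with a rounding "LM head") and all function names are illustrative assumptions, not a real inference API; the point is only the control flow: the draft runs a prefix of the layers, and the full forward pass verifies.

```python
# Toy sketch: early exit as the draft model in speculative decoding.
# The 4-layer "model", layer(), and lm_head() are stand-ins, not a real API.

NUM_LAYERS = 4
EXIT_LAYER = 2  # the draft pass exits after this many layers

def layer(state, i):
    # Stand-in for one transformer layer: a simple deterministic update.
    return state + 0.1 * (i + 1)

def lm_head(state):
    # Stand-in for the output projection: quantize the state to a "token".
    return round(state * 10)

def forward(state, num_layers):
    for i in range(num_layers):
        state = layer(state, i)
    return lm_head(state)

def speculative_step(state):
    draft_token = forward(state, EXIT_LAYER)   # cheap: only 2 layers
    full_token = forward(state, NUM_LAYERS)    # verification: all 4 layers
    accepted = (draft_token == full_token)
    # On acceptance the draft token is free; on rejection, fall back
    # to the full model's token, so accuracy is never lost.
    return (draft_token if accepted else full_token), accepted
```

In a real system the verification pass batches several drafted tokens at once, so rejected drafts cost little and accepted ones amortize the full model's layers.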
Early Exit and KV Caching
One of the downsides of early exit is that it interferes with the well-known optimization of KV caching. Caching the results of the key-value (KV) computations is one of the earliest recognized optimizations for autoregressive decoding of output sequences. However, computing the cache requires executing all layers, which early exiting skips. At the next inference step, the cache is therefore stale for the unexecuted layers.
Hence, early exit saves computation but invalidates part of the KV cache, requiring extra work to repair it. Researchers have examined this issue and found solutions including simple cache recomputation, propagating the last executed layer's KV cache into the skipped layers, and modifying the early exit method to avoid the issue entirely. For more details about this research see: KV caching with early exit.
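The propagation fix can be shown with a toy cache. The layout below (a per-layer dict keyed by token position, with string stand-ins for the K/V tensors) is purely illustrative: when decoding exits at `exit_layer`, the deeper layers' slots are filled with a copy of the last executed layer's K/V, so later attention steps find an entry for every layer, at the cost of using approximate values.

```python
# Minimal sketch of KV cache propagation after an early exit.
# The cache layout and the string "K/V" values are illustrative stand-ins.

NUM_LAYERS = 4

def decode_token_with_early_exit(kv_cache, position, exit_layer):
    last_kv = None
    for lyr in range(exit_layer):
        kv = (f"k{lyr}@{position}", f"v{lyr}@{position}")  # computed K/V
        kv_cache[lyr][position] = kv
        last_kv = kv
    # The skipped layers never compute K/V for this position; copy the
    # last executed layer's K/V into their slots (an approximation).
    for lyr in range(exit_layer, NUM_LAYERS):
        kv_cache[lyr][position] = last_kv

# One cache dict per layer, keyed by token position.
cache = [dict() for _ in range(NUM_LAYERS)]
decode_token_with_early_exit(cache, position=0, exit_layer=2)
```

After this call, every layer has a cache entry at position 0, but layers 2 and 3 hold layer 1's values rather than their own.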
Parallelism can further optimize this idea. Notably, the speedup can be obtained without accuracy loss simply by computing the missing KV cache entries in parallel with the inference of the next token, making it a lossless optimization that keeps both efficiency and accuracy. Traditional early exit simply skips the KV computation for the remaining layers and uses approximate KV data. Instead, an overlapping or pipelining approach lets the early exit emit its token immediately, allowing autoregressive decoding of the next token to begin, similar to a speculative decoding optimization. The skipped layers are still executed, even though decoding of the current token is finished, but only to create the "missing" KV cache entries in parallel with the start of the next token's generation. The full KV cache data for every layer is thus available, via parallel computation, before it is needed by the next token's layers.
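The pipelined, lossless variant can be sketched with a background thread: the token is emitted at the exit layer, and a worker backfills the skipped layers' exact K/V entries while the next token's decode begins. As before, the cache layout, `compute_kv`, and the string values are illustrative assumptions; a real implementation would overlap this on a GPU stream rather than a Python thread.

```python
# Sketch of pipelined KV backfill after early exit: emit the token at the
# exit layer, then compute the skipped layers' true K/V concurrently.
# All names and the toy cache are illustrative, not a real inference API.

from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 4

def compute_kv(lyr, position):
    # Stand-in for running layer `lyr` just far enough to produce its K/V.
    return (f"k{lyr}@{position}", f"v{lyr}@{position}")

def backfill(kv_cache, position, exit_layer):
    # Execute the skipped layers only to fill their exact cache entries.
    for lyr in range(exit_layer, NUM_LAYERS):
        kv_cache[lyr][position] = compute_kv(lyr, position)

def decode_with_pipelined_backfill(kv_cache, position, exit_layer, pool):
    for lyr in range(exit_layer):
        kv_cache[lyr][position] = compute_kv(lyr, position)
    token = "token"  # emitted from the exit layer's hidden state
    # Backfill runs concurrently with the next token's early layers.
    future = pool.submit(backfill, kv_cache, position, exit_layer)
    return token, future

pool = ThreadPoolExecutor(max_workers=1)
cache = [dict() for _ in range(NUM_LAYERS)]
token, fut = decode_with_pipelined_backfill(cache, 0, 2, pool)
fut.result()  # cache is now exact for all layers at position 0
```

Because the backfill computes the true K/V values (not copies from an earlier layer), the next token attends to an exact cache, which is why this variant is lossless.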
More AI Research
Read more about:
- Layer pruning
- Shallow decoder architecture
- Inference Optimizations
- Loop Optimizations
- Code Optimizations
- « Research Home