Aussie AI

Early Exit

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

One inference loop optimization is to exit before all of the layers have been completed, using the results up to that point if there is a high degree of confidence. This method is called "early exit" or "dynamic layer pruning" (see layer pruning). Early exit means avoiding the calculations for all layers after the just-finished one, whereas "layer skipping" skips one layer and continues with the following layer, and "layer reordering" is a more unusual generalization where layers can be executed or skipped in any order.

The idea was applied in training many years earlier. The terms "dropout" and "early stopping" have also occasionally been used to mean inference early exit, but they usually refer to training optimizations with the similar goal of reducing training time.

Early exiting is possible in a significant number of inference evaluations (e.g. reportedly 40% in DeeBERT [Xin et al. 2020]), but not always. By skipping all layers thereafter, it avoids a significant amount of computation.
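The basic control flow can be sketched in a few lines of Python. This is a minimal illustration rather than any particular paper's method: the names `run_with_early_exit`, `confidence`, and `threshold` are hypothetical, each layer is modeled as a plain function on a list of floats, and the confidence function is supplied by the caller.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def run_with_early_exit(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    confidence: Callable[[Vector], float],
    threshold: float,
) -> Tuple[Vector, int]:
    """Run layers in order and stop once confidence(hidden) >= threshold.

    Returns the last hidden state and the number of layers actually executed.
    (Hypothetical helper for illustration only.)
    """
    executed = 0
    for layer in layers:
        hidden = layer(hidden)
        executed += 1
        if confidence(hidden) >= threshold:
            break  # early exit: every remaining layer is skipped
    return hidden, executed


# Toy usage: six identical "layers" that each add 1.0 to the activations.
toy_layers = [lambda h: [x + 1.0 for x in h]] * 6
state, layers_run = run_with_early_exit([0.0], toy_layers, max, 3.0)
```

If the threshold is never reached, the loop simply runs every layer, so the worst case costs the full stack plus the confidence checks.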

Deciding to exit. The simplest idea is to always run a fixed number of layers, but this is not especially interesting theoretically. There are more accurate ways to decide when to exit. Xu and McAuley (2022) categorize early exit methods into three subtypes, based on the criteria used to decide when to exit:

  • Confidence estimation
  • Internal ensemble
  • Learning to exit

Confidence estimation is the use of a metric predicting that confidence is high enough to exit; internal ensemble uses multiple metrics with a requirement for enough metrics to agree; learning to exit has the model attempting to learn when to exit.
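As a rough sketch of the internal ensemble idea as described above, the following Python function exits only when enough internal predictors agree on the same output. The function name `ensemble_exit` and the `min_votes` parameter are assumptions made for illustration, and simple majority-style voting is just one possible agreement rule.

```python
from collections import Counter
from typing import Optional, Sequence


def ensemble_exit(predictions: Sequence[int], min_votes: int) -> Optional[int]:
    """Return the agreed token/class id if at least min_votes predictors agree.

    Returns None when agreement is insufficient, meaning "keep running layers".
    (Hypothetical helper; the voting rule is an illustrative assumption.)
    """
    if not predictions:
        return None
    label, votes = Counter(predictions).most_common(1)[0]
    return label if votes >= min_votes else None
```

Here `predictions` would hold the output of each internal classifier (or each metric's preferred token) seen so far.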

Computing Confidence. There are numerous possible variations on deciding when the layer is "certain" or "confident" enough about its prediction:

  • A single token has a high-enough probability (i.e. a single probability threshold) at a single layer.
  • The highest probability token has a probability sufficiently higher than that of the second-highest token (comparative probability threshold).
  • Where multiple layer probability predictions are high for a single token (multi-layer threshold).
  • Where multiple layer predictions are the same or similar enough (i.e. stable multi-layer thresholds).
  • Where a token meets a threshold that differs for each layer (per-layer threshold), which assumes that early layers are more likely to be wrong, whereas deeper layers are more likely to be accurate and so their threshold can be lower.
  • Combinations of any of the above in various ways.
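Several of the criteria listed above reduce to simple tests on the probability vector at a layer. The sketch below shows three of them in plain Python; the function names and threshold parameters are hypothetical, and real thresholds would need to be tuned for a given model.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top1_confident(probs: Sequence[float], threshold: float) -> bool:
    """Single probability threshold: the best token is likely enough."""
    return max(probs) >= threshold


def margin_confident(probs: Sequence[float], min_margin: float) -> bool:
    """Comparative threshold: best token beats the runner-up by a margin."""
    if len(probs) < 2:
        return False
    best, second = sorted(probs, reverse=True)[:2]
    return (best - second) >= min_margin


def per_layer_confident(
    probs: Sequence[float], layer_index: int, thresholds: Sequence[float]
) -> bool:
    """Per-layer threshold: each layer has its own cut-off value."""
    return max(probs) >= thresholds[layer_index]
```

The multi-layer variants can be built on top of these by recording the result at each layer and requiring several layers to pass before exiting.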

There are also other methods for choosing the exit point based on metrics such as:

  • Entropy
  • Patience
  • Computation budget
  • Desired speedup
  • Latency (response time needed)

Computation overhead of unembedding. A practical point is that computing confidence estimates requires the token probabilities or logits. These are not usually computed at every layer, but only once at the end of the layer stack. Hence, analyzing the confidence based on logits introduces the overhead of extra "unembedding" computations at each layer where an exit is considered (i.e. multiplication by the unembedding matrix, which is the transpose of the embedding matrix in models with tied weights). There is also the cost of analyzing the resulting probabilities, such as a "max" or "top-k" type computation on the vector of token probabilities, but this is likely less significant than the unembedding matrix multiplication.
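To make the overhead concrete, the sketch below computes logits from a hidden vector and counts the multiply-accumulate operations involved. It assumes tied weights (the embedding matrix is reused for unembedding) and stores the embedding with one row per vocabulary token; the function names are hypothetical and the sizes in the usage line are illustrative values, not measurements from any model.

```python
from typing import List, Sequence


def unembed(hidden: Sequence[float], embedding: Sequence[Sequence[float]]) -> List[float]:
    """Compute one logit per vocabulary token as dot(embedding_row, hidden).

    Assumes tied weights: this is the hidden vector multiplied by the
    transpose of the embedding matrix.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in embedding]


def unembed_macs(vocab_size: int, d_model: int) -> int:
    """Multiply-accumulates for one unembedding of a single token position."""
    return vocab_size * d_model


def exit_check_overhead_macs(vocab_size: int, d_model: int, checked_layers: int) -> int:
    """Extra multiply-accumulates if confidence is checked at several layers."""
    return checked_layers * unembed_macs(vocab_size, d_model)


# Illustrative sizes only (hypothetical vocabulary and model width).
extra = exit_check_overhead_macs(vocab_size=50000, d_model=4096, checked_layers=31)
```

Because the cost grows with the number of layers checked, one design option is to test for exit only at a subset of layers rather than after every layer.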

Accuracy benefits. Although most papers view early exit as an approximation that reduces accuracy of a model, some papers have noted advantages of early exiting. For example, executing fewer layers can reduce overfitting and the vanishing gradient problem.

Related Depth Techniques. Effectively, early exit of the inference loop is a form of "dynamic layer pruning" at runtime, and is therefore a type of "dynamic depth pruning" of the model. Early exiting is also a special case of layer skipping, which is the more general idea of skipping some layers: early exit skips all of the remaining layers at a single exit point.
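The relationship between the two techniques can be shown with a small sketch in which layer skipping is driven by a boolean mask, and early exit is just a mask that skips everything from the exit point onward. The names `run_with_skips` and `early_exit_mask` are hypothetical, and a "layer" is again any function from state to state.

```python
from typing import Callable, List, Sequence


def run_with_skips(
    hidden: list,
    layers: Sequence[Callable[[list], list]],
    skip: Sequence[bool],
) -> list:
    """General layer skipping: run layer i unless skip[i] is True."""
    for layer, skipped in zip(layers, skip):
        if not skipped:
            hidden = layer(hidden)
    return hidden


def early_exit_mask(num_layers: int, exit_after: int) -> List[bool]:
    """Skip mask equivalent to exiting once `exit_after` layers have run."""
    return [i >= exit_after for i in range(num_layers)]


# Toy layers that record their own index, so the executed path is visible.
toy_layers = [lambda h, k=k: h + [k] for k in range(5)]
path_early_exit = run_with_skips([], toy_layers, early_exit_mask(5, 2))
path_skipping = run_with_skips([], toy_layers, [False, True, False, True, False])
```

In a real implementation an early exit would break out of the loop rather than iterate over skipped layers, but the mask form makes the equivalence explicit.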

Early exit is also one of multiple strategies for "dynamic inference". Some papers refer to dynamic inference changes as "adaptive neural networks", where they change execution depending on the inputs. Some types of early exit, such as hierarchical early exit, are similar to research on cascades for DNNs and CNNs.

Survey Papers for Early Exit

This section lists survey papers on early exit specifically, along with review papers on model compression or inference optimization that include coverage of early exit methods.

Early exit surveys. Papers that specifically survey the SOTA for early-exit:

  • Y. Matsubara, M. Levorato, and F. Restuccia, “Split computing and early exiting for deep learning applications: Survey and research challenges,” ACM Comput. Surveys, Mar 2022, https://arxiv.org/abs/2103.04505
  • Stefanos Laskaridis, Alexandros Kouris, Nicholas D. Lane, Adaptive Inference through Early-Exit Networks: Design, Challenges and Directions, EMDL'21: Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, June 2021, Pages 1–6, https://doi.org/10.1145/3469116.3470012, https://dl.acm.org/doi/abs/10.1145/3469116.3470012, https://arxiv.org/pdf/2106.05022.pdf

General surveys. Papers that cover many topics and include a section on early exit:

  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
  • Canwen Xu, Julian McAuley, 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including early exit.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)

Research on Early Exit

Early exit was one of the earliest optimization techniques discovered for neural networks. There is no shortage of research papers.

  • Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, and Jimmy Lin, DeeBERT: Dynamic early exiting for accelerating BERT inference, arXiv preprint arXiv:2004.12993, 2020, https://arxiv.org/abs/2004.12993 (Code: https://github.com/castorini/DeeBERT)
  • Angela Fan, Edouard Grave, and Armand Joulin, Reducing transformer depth on demand with structured dropout, 2019, arXiv:1909.11556, https://arxiv.org/abs/1909.11556
  • Surat Teerapittayanon, Bradley McDanel, and Hsiang Tsung Kung, BranchyNet: Fast inference via early exiting from deep neural networks, 2017, arXiv:1709.01686, https://arxiv.org/abs/1709.01686
  • S. Teerapittayanon, B. McDanel, H.T. Kung, Distributed deep neural networks over the cloud, the edge and end devices, IEEE, Atlanta, GA, USA, 2017, pp. 328–339, doi:10.1109/ICDCS.2017.226., 5–8 June, https://doi.org/10.1109/ICDCS.2017.226
  • Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang, Accelerating BERT Inference for Sequence Labeling via Early-Exit, May 2021, https://arxiv.org/abs/2105.13878
  • Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis, Multi-Exit Vision Transformer for Dynamic Inference, June 2021, https://arxiv.org/abs/2106.15183
  • Nikolaos Passalis, Jenni Raitoharju, Anastasios Tefas, Moncef Gabbouj, Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits, Pattern Recognition Volume 105, September 2020, 107346, https://doi.org/10.1016/j.patcog.2020.107346
  • Xiangjie Li, Chenfei Lou, Yuchi Chen, Zhengping Zhu, Yingtao Shen, Yehan Ma, An Zou, Predictive Exit: Prediction of Fine-Grained Early Exits for Computation- and Energy-Efficient Inference, DOI: https://doi.org/10.1609/aaai.v37i7.26042, https://ojs.aaai.org/index.php/AAAI/article/view/26042
  • Vanderlei Bonato and Christos Bouganis, 2021, Class-specific early exit design methodology for convolutional neural networks, Applied Soft Computing (2021), https://www.sciencedirect.com/science/article/abs/pii/S1568494621002398, https://doi.org/10.1016/j.asoc.2021.107316, https://spiral.imperial.ac.uk/bitstream/10044/1/90316/2/Paper___Early_Exit___Applied_Soft_Computing.pdf
  • E. Baccarelli, S. Scardapane, M. Scarpiniti, A. Momenzadeh, A. Uncini, Optimized training and scalable implementation of Conditional Deep Neural Networks with early exits for Fog-supported IoT applications, Information Sciences 521 (June 2020), 107–143, DOI: https://doi.org/10.1016/j.ins.2020.02.041, http://www.sciencedirect.com/science/article/pii/
  • S. Wang, T. Tuor, T. Salonidis, K.K. Leung, C. Makaya, T. He, K. Chan, When edge meets learning: Adaptive control for resource-constrained distributed machine learning, in: IEEE Conference on Computer Communications (IEEE INFOCOM 2018), 2018, pp. 63–71, doi:10.1109/INFOCOM.2018.8486403, https://doi.org/10.1109/INFOCOM.2018.8486403, Honolulu, HI, USA, 16–19 April, 2018.
  • Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, Furu Wei, BERT Loses Patience: Fast and Robust Inference with Early Exit, https://doi.org/10.48550/arXiv.2006.04152, https://arxiv.org/abs/2006.04152
  • S. Scardapane, M. Scarpiniti, E. Baccarelli, A. Uncini, Why should we add early exits to neural networks?, Cognitive Computation 12 (5) (2020), 954–966, doi:10.1007/s12559-020-09734-4, http://dx.doi.org/10.1007/s12559-020-09734-4, https://arxiv.org/pdf/2004.12814.pdf
  • Kaiyuan Liao, Yi Zhang, Xuancheng Ren, Qi Su, Xu Sun, Bin He, A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 2021, DOI: 10.18653/v1/2021.naacl-main.162 https://aclanthology.org/2021.naacl-main.162/
  • Zizhao Wang, Wei Bao, Dong Yuan, Liming Ge, Nguyen H. Tran, Albert Y. Zomaya, SEE: Scheduling Early Exit for Mobile DNN Inference during Service Outage, MSWIM '19: Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, November 2019, Pages 279–288, https://doi.org/10.1145/3345768.3355917, https://dl.acm.org/doi/abs/10.1145/3345768.3355917
  • Xinrui Tan, Hongjia Li, Liming Wang, Xueqing Huang, Zhen Xu, Empowering Adaptive Early-Exit Inference with Latency Awareness, DOI: https://doi.org/10.1609/aaai.v35i11.17181, PDF: https://ojs.aaai.org/index.php/AAAI/article/view/17181/16988
  • Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. arXiv preprint arXiv:2207.07061, 2022, https://arxiv.org/abs/2207.07061
  • Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens van der Maaten, and Kilian Weinberger. 2018. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations (ICLR), https://arxiv.org/abs/1703.09844
  • Yigitcan Kaya, Sanghyun Hong, and Tudor Dumitras. 2019. Shallow-deep networks: Understanding and mitigating network overthinking. In International Conference on Machine Learning (ICML), volume 97, pages 3301–3310. PMLR, https://arxiv.org/abs/1810.07052
  • Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A. Smith. 2020. The right tool for the job: Matching model and instance complexities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6640–6651, Association for Computational Linguistics, https://arxiv.org/abs/2004.07453
  • Ji Xin, Rodrigo Nogueira, Yaoliang Yu, and Jimmy Lin. 2020. Early exiting BERT for efficient document ranking. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 83–88, Online. Association for Computational Linguistics. PDF: https://cs.uwaterloo.ca/~jimmylin/publications/Xin_etal_SustaiNLP2020.pdf
  • Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. 2021. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, Association for Computational Linguistics, https://aclanthology.org/2021.eacl-main.8/, Code: https://github.com/castorini/berxit
  • V. Akhlaghi, A. Yazdanbakhsh, K. Samadi, R. K. Gupta, and H. Esmaeilzadeh, 2018, SnaPEA: Predictive early activation for reducing computation in deep convolutional neural networks, In Proceedings of the 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA’18). IEEE, Los Alamitos, CA, 662–673, https://doi.org/10.1109/ISCA.2018.00061, https://ieeexplore.ieee.org/document/8416863
  • D Li, W Wu, L Zeng, K Li, Es-Fedavg: Early-Exit Personalized Federated Learning with Sparse Weight for Adaptive Computation, January 2023, SSRN Electronic Journal, DOI:10.2139/ssrn.4361705, https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4361705, https://www.researchgate.net/publication/368592513_Es-Fedavg_Early-Exit_Personalized_Federated_Learning_with_Sparse_Weight_for_Adaptive_Computation
  • Ting-Kuei Hu, Tianlong Chen, Haotao Wang, and Zhangyang Wang. Triple wins: Boosting accuracy, robustness and efficiency together by enabling input-adaptive inference. In ICLR, Feb 2020, https://arxiv.org/abs/2002.10025
  • Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, and Joseph E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In ECCV, 2018, https://arxiv.org/abs/1711.09485
  • Learning Early Exit for Deep Neural Network Inference on Mobile Devices through Multi-Armed Bandits Weiyu Ju; Wei Bao; Dong Yuan; Liming Ge; Bing Bing Zhou, 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid), 10-13 May 2021, https://ieeexplore.ieee.org/abstract/document/9499356, https://doi.org/10.1109/CCGrid51090.2021.00011
  • Weiyu Ju, Wei Bao, Liming Ge, Dong Yuan, Dynamic Early Exit Scheduling for Deep Neural Network Inference through Contextual Bandits, CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management, October 2021, Pages 823–832, https://doi.org/10.1145/3459637.3482335, https://dl.acm.org/doi/abs/10.1145/3459637.3482335
  • Andong Li; Chengshi Zheng; Lu Zhang; Xiaodong Li, Learning to Inference with Early Exit in the Progressive Speech Enhancement, 2021 29th European Signal Processing Conference (EUSIPCO), 23-27 August 2021, https://ieeexplore.ieee.org/abstract/document/9616248, https://doi.org/10.23919/EUSIPCO54536.2021.9616248, https://arxiv.org/abs/2106.11730
  • FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits, Polina Karpikova, Ekaterina Radionova, Anastasia Yaschenko, Andrei Spiridonov, Leonid Kostyushko, Riccardo Fabbricatore, Aleksei Ivakhnenko; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), April 2023, pp. 12032-12043 https://openaccess.thecvf.com/content/CVPR2023/html/Karpikova_FIANCEE_Faster_Inference_of_Adversarial_Networks_via_Conditional_Early_Exits_CVPR_2023_paper.html, https://arxiv.org/abs/2304.10306
  • Rongkang Dong, Yuyi Mao, and Jun Zhang. Resource-Constrained Edge AI with Early Exit Prediction. Journal of Communications and Information Networks, 7(2):122–134, June 2022, https://arxiv.org/abs/2206.07269
  • Qunliang Xing, Mai Xu, Tianyi Li, and Zhenyu Guan. Early exit or not: Resource-efficient blind quality enhancement for compressed images. In Computer Vision – ECCV 2020, pages 275–292. Springer International Publishing. 2020, https://arxiv.org/abs/2006.16581
  • M. Wołczyk et al., “Zero time waste: Recycling predictions in early exit neural networks,” in Proc. 35th Conf. Neural Inf. Process. Syst. (NeurIPS), Virtual Conference, Dec. 2021, https://arxiv.org/abs/2106.05409
  • M. Phuong and C. H. Lampert, “Distillation-based training for multi-exit architectures,” in Proc. IEEE/CVF Int. Conf. Comput. Vision (ICCV), Seoul, Korea (South), Oct.-Nov. 2019, https://ieeexplore.ieee.org/document/9009834
  • S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane, “SPINN: Synergistic progressive inference of neural networks over device and cloud,” in Proc. Annu. Inf. Conf. Mobile Comput. Netw. (MobiCom), London, UK, Sep. 2020, https://arxiv.org/abs/2008.06402
  • M. Wang, J. Mo, J. Lin, Z. Wang, and L. Du, “DynExit: A dynamic early-exit strategy for deep residual networks,” in Proc. IEEE Int. Wkshop. Signal Process. Syst. (SiPS), Nanjing, China, Oct. 2019, https://ieeexplore.ieee.org/abstract/document/9020551
  • Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, https://arxiv.org/abs/1409.4842
  • Maciej Wołczyk, Bartosz Wojcik, Klaudia Bałazy, Igor T Podolak, Jacek Tabor, Marek Smieja, and Tomasz Trzcinski. Zero time waste: Recycling predictions in early exit neural networks. In Advances in Neural Information Processing Systems, volume 34, pages 2516–2528. Curran Associates, Inc. 2021, https://arxiv.org/abs/2106.05409
  • Enrique S. Marquez, Jonathon S. Hare, and Mahesan Niranjan. Deep cascade learning. IEEE Transactions on Neural Networks and Learning Systems, 29(11):5475–5485. 2018, https://ieeexplore.ieee.org/document/8307262
  • Simone Scardapane, Danilo Comminiello, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. Differentiable branching in deep networks for fast inference. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4167–4171. 2020, https://ieeexplore.ieee.org/document/9054209
  • Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, and Bart Dhoedt, Feb 2017, The cascading neural network: building the internet of smart things. Knowledge and Information Systems, 52(3):791–814, https://link.springer.com/article/10.1007/s10115-017-1029-1
  • Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, and Joseph E Gonzalez. Idk cascades: Fast deep learning by learning not to overthink. arXiv preprint arXiv:1706.00885. 2017, https://arxiv.org/abs/1706.00885
  • Simone Scardapane, Michele Scarpiniti, Enzo Baccarelli, and Aurelio Uncini. Why should we add early exits to neural networks? Cognitive Computation, 12(5):954–966. 2020, https://arxiv.org/abs/2004.12814
  • Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for efficient inference. In International Conference on Machine Learning, pages 527–536. PMLR. 2017, https://arxiv.org/abs/1702.07811
  • Xin Dai, Xiangnan Kong, and Tian Guo. Epnet: Learning to exit with flexible multi-branch network. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM ’20, page 235–244, New York, NY, USA. Association for Computing Machinery. 2020, https://dl.acm.org/doi/10.1145/3340531.3411973
  • Xinshi Chen, Hanjun Dai, Yu Li, Xin Gao, and Le Song. Learning to stop while learning to predict. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 1520–1530. PMLR. 2020, https://arxiv.org/abs/2006.05082
  • P. Panda, A. Sengupta, K. Roy, Conditional deep learning for energy-efficient and enhanced pattern recognition, in: 2016 Design, Automation Test in Europe Conference Exhibition (DATE), 2016, pp. 475–480, https://arxiv.org/abs/1509.08971
  • Francesco Busolin, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Salvatore Trani, May 2021, Learning Early Exit Strategies for Additive Ranking Ensembles, https://arxiv.org/abs/2105.02568
  • B. Barla Cambazoglu, Hugo Zaragoza, Olivier Chapelle, Jiang Chen, Ciya Liao, Zhaohui Zheng, and Jon Degenhardt. 2010. Early exit optimizations for additive machine learned ranking systems. In Proceedings of the Third ACM International Conference on Web Search and Data Mining (WSDM 2010), pages 411–420, New York, New York, https://dl.acm.org/doi/10.1145/1718487.1718538
  • Eunhyeok Park, Dongyoung Kim, Soobeom Kim, Yong-Deok Kim, Gunhee Kim, Sungroh Yoon, Sungjoo Yoo, Big/little deep neural network for ultra low power inference. 2015. In CODES '15: Proceedings of the 10th International Conference on Hardware/Software Codesign and System Synthesis, October 2015, Pages 124–132, https://dl.acm.org/doi/10.5555/2830840.2830854
  • Geng, S.; Gao, P.; Fu, Z.; and Zhang, Y., 2021, RomeBERT: Robust Training of Multi-Exit BERT, arXiv preprint arXiv:2101.09755, https://arxiv.org/abs/2101.09755
  • Zhou, W.; Xu, C.; and McAuley, J. J. 2022. BERT Learns to Teach: Knowledge Distillation with Meta Learning. In ACL. https://arxiv.org/abs/2106.04570
  • Tianxiang Sun, Yunhua Zhou, Xiangyang Liu, Xinyu Zhang, Hao Jiang, Zhao Cao, Xuanjing Huang, Xipeng Qiu, 2021. Early Exiting with Ensemble Internal Classifiers. arXiv preprint arXiv:2105.137, https://arxiv.org/abs/2105.13792
  • Zhu, W. 2021. LeeBERT: Learned Early Exit for BERT with cross-level optimization. In ACL-IJCNLP, PDF: https://aclanthology.org/2021.acl-long.231.pdf
  • Zhang, Z.; Zhu, W.; Zhang, J.; et al. 2022. PCEE-BERT: Accelerating BERT Inference via Patient and Confident Early Exiting. In NAACL-HLT (Findings), https://aclanthology.org/2022.findings-naacl.25/
  • Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. 2020. Depth-adaptive transformer. ArXiv, abs/1910.10073, https://arxiv.org/abs/1910.10073
  • Tal Schuster, Adam Fisch, Tommi Jaakkola, Regina Barzilay, 2021. Consistent accelerated inference via confident adaptive transformers. arXiv preprint arXiv:2104.08803. https://arxiv.org/abs/2104.08803
  • Guan, Y.; Li, Z.; Leng, J.; et al. 2022. Transkimmer: Transformer Learns to Layer-wise Skim. In AC, https://arxiv.org/abs/2205.07324
  • H. Tann, S. Hashemi, R. I. Bahar, and S. Reda. Runtime configurable deep neural networks for energy-accuracy trade-off. In CODES + ISSS, pages 34:1–34:10, 2016. https://ieeexplore.ieee.org/document/9166549
  • G Li, X Ma, Q Yu, L Liu, H Liu, X Wang, 2023, CoAxNN: Optimizing on-device deep learning with conditional approximate neural networks, Journal of Systems Architecture, https://www.sciencedirect.com/science/article/abs/pii/S1383762123001571
  • X Gao, Y Liu, T Huang, Z Hou, 2023, PF-BERxiT: Early Exiting for BERT with Parameter-efficient Fine-tuning and Flexible early exiting strategy, Neurocomputing, https://www.sciencedirect.com/science/article/abs/pii/S0925231223008135
  • Z Zeng, Y Hong, H Dai, H Zhuang, C Chen, August 2023, ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference, PDF: https://www.researchgate.net/publication/373392419_ConsistentEE_A_Consistent_and_Hardness-Guided_Early_Exiting_Method_for_Accelerating_Language_Models_Inference
  • Duggal, R., Freitas, S., Dhamnani, S., Chau, D.H., Sun, J.: ELF: an early-exiting framework for long-tailed classification. Arxiv Preprint Arxiv:2006.11979 (2020) https://arxiv.org/abs/2006.11979
  • H Yu, D Liu, Z Zhang, J Wang, 2023, IEEE Transactions on Instrumentation and Measurement (Early Access), A Dynamic Transformer Network with Early Exit Mechanism for Fast Detection of Multiscale Surface Defects, https://ieeexplore.ieee.org/document/10242087
  • A Zniber, O Karrakchou, M Ghogho, 2023, Dynamic Early Exiting Predictive Coding Neural Networks, arXiv preprint arXiv:2309.02022, https://arxiv.org/pdf/2309.02022.pdf
  • Y. Long, I. Chakraborty, and K. Roy, 2020, “Conditionally deep hybrid neural networks across edge and cloud,” arXiv:2005.10851, https://arxiv.org/abs/2005.10851
  • Berestizshevsky, K., Even, G.: Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence. In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26
  • Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844, https://arxiv.org/abs/1703.09844 (Has multiple models combined in an early-exit configuration.)
  • A Moos, 2023, Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs, arXiv preprint arXiv:2309.03530, https://arxiv.org/pdf/2309.03530.pdf (Fast inference for a soccer-playing robot with cascade-like hierarchical early exits.)
  • Francesco Daghero, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini, Enrico Macii, Massimo Poncino, "Energy-Efficient Adaptive Machine Learning on IoT End-Nodes With Class-Dependent Confidence", 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp.1-4, 2020. https://ieeexplore.ieee.org/document/9294863, https://arxiv.org/abs/2204.03431v1 (An improved stopping policy for early exits on easy-input classification tasks.)
  • Kyungchul Park, Chanyoung Oh, Youngmin Yi, "BPNet: Branch-pruned Conditional Neural Network for Systematic Time-accuracy Tradeoff", 2020 57th ACM/IEEE Design Automation Conference (DAC), pp.1-6, 2020. https://ieeexplore.ieee.org/document/9218545
  • T Shen, C Lee, V Narayanan, Oct 2023, Multi-Exit Vision Transformer with Custom Fine-Tuning for Fine-Grained Image Recognition, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222298 (Early exit from multiple places, combined with self-distillation.)
  • Sehoon Kim, Karttikeya Mangalam, Suhong Moon, John Canny, Jitendra Malik, Michael W. Mahoney, Amir Gholami, Kurt Keutzer, Sep 2023, Speculative Decoding with Big Little Decoder, https://arxiv.org/abs/2302.07863 (Early exiting in the context of speculative decoder optimizations.)
  • Schwartz, R., Stanovsky, G., Swayamdipta, S., Dodge, J., and Smith, N. A. The right tool for the job: Matching model and instance complexities. In Annual Meeting of the Association for Computational Linguistics, 2020. https://arxiv.org/abs/2004.07453 (Early exit with "wisdom of committees" decisions.)
  • X Li, Y Shen, A Zou, Y Ma, 2023, EENet: Energy Efficient Neural Networks with Run-time Power Management, 2023 60th ACM/IEEE Design Automation Conference (DAC), https://ieeexplore.ieee.org/abstract/document/10247701 (Learns early exit characteristics and decision methods over time.)
  • K Liu, S Moon, 2023, Self-supervised efficient sample weighting for multi-exit networks, Knowledge-Based Systems, https://www.sciencedirect.com/science/article/abs/pii/S0950705123007530 (Early exiting during both training and inference to reduce the disparity.)
  • Divya J. Bajpai, Vivek K. Trivedi, Sohan L. Yadav, Manjesh K. Hanawal, 2023, SplitEE: Early Exit in Deep Neural Networks with Split Computing, arXiv preprint arXiv:2309.09195, https://arxiv.org/abs/2309.09195
  • George August Wright, Umberto Cappellazzo, Salah Zaiem, Desh Raj, Lucas Ondel Yang, Daniele Falavigna, Alessio Brutti, Sep 2023, Training dynamic models using early exits for automatic speech recognition on resource-constrained devices, https://arxiv.org/abs/2309.09546
  • X Xu, K Yan, S Han, B Wang, X Tao, P Zhang, 2023, Learning-Based Edge-Device Collaborative DNN Inference in IoVT Networks IEEE Internet of Things Journal, https://ieeexplore.ieee.org/abstract/document/10258387
  • J Wang, B Li, GL Zhang, 2023, Early Classification for Dynamic Inference of Neural Networks, arXiv preprint arXiv:2309.13443, https://arxiv.org/pdf/2309.13443.pdf
  • S Tang, Y Wang, C Ding, Y Liang, Y Li, D Xu, 2023, arXiv preprint arXiv:2309.17074, DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation, https://arxiv.org/pdf/2309.17074.pdf (Uses uncertainty-based confidence to decide on early-exit in diffusion models.)
  • S Bae, J Ko, H Song, SY Yun, Oct 2023, Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding, arXiv preprint arXiv:2310.05424, https://arxiv.org/pdf/2310.05424.pdf (Combination of early-exit with a "shallow-deep module" and parallel decoding.)
  • F Regol, J Chataoui, M Coates, Oct 2023, Jointly-Learned Exit and Inference for a Dynamic Neural Network: JEI-DNN, arXiv preprint arXiv:2310.09163, http://export.arxiv.org/abs/2310.09163
  • Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a small subset of the input to speed up inference with early-exit based on confidence level.)
  • Hajin Shim, Sung Ju Hwang, and Eunho Yang. Joint active feature acquisition and classification with variable-size set encoding. NeurIPS, pages 1368–1378, 2018. https://papers.nips.cc/paper/2018/file/e5841df2166dd424a57127423d276bbe-Paper.pdf
  • Jinmin He, Kai Li, Yifan Zang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng, 25 Jan 2024, Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing, https://arxiv.org/abs/2312.14472 (Dynamic layer depth routing based on easy vs hard queries to optimize training.)
  • Ji-Ye Jeon; Xuan Truong Nguyen; Hyuk-Jae Lee, Jan 2024, Mitigation of Over-Confidence in Scale-Adjusted Training for Early-Exit Networks, 2024 International Conference on Electronics, Information, and Communication (ICEIC), 28-31 January, 2024, https://ieeexplore.ieee.org/abstract/document/10457149 (Making early-exit more accurate by avoiding overconfidence of wrong predictions.)
  • Divya Jyoti Bajpai, Aastha Jaiswal, Manjesh Kumar Hanawal, 19 Jan 2024, I-SplitEE: Image classification in Split Computing DNNs with Early Exits, https://arxiv.org/abs/2401.10541
  • Li, L., Wang, C., Qiu, M. et al., 2024, Accelerating BERT inference with GPU-efficient exit prediction. Front. Comput. Sci. 18, 183308 (2024). https://doi.org/10.1007/s11704-022-2341-9, https://link.springer.com/article/10.1007/s11704-022-2341-9
  • Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Li Li, 18 Jan 2024, When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference, https://arxiv.org/abs/2401.09964 (Analysing the importance of different layers in code completion use case.)
  • Xuchen Pan, Yanxi Chen, Yaliang Li, Bolin Ding, Jingren Zhou, 1 Feb 2024, EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models, https://arxiv.org/abs/2402.00518 Code: https://github.com/pan-x-c/EE-LLM
  • Jingcun Wang, Bing Li, Grace Li Zhang, 2024, Early-Exit with Class Exclusion for Efficient Inference of Neural Networks, https://www.hwai.tu-darmstadt.de/fileadmin/user_upload/main.pdf (Early exit for classification use case.)
  • Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
  • Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 4 Mar 2024, DyCE: Dynamic Configurable Exiting for Deep Learning Compression and Scaling, https://arxiv.org/abs/2403.01695 (General framework for dynamic early exit in complicated architectures.)
  • Siqi Fan, Xin Jiang, Xiang Li, Xuying Meng, Peng Han, Shuo Shang, Aixin Sun, Yequan Wang, Zhongyuan Wang, 4 Mar 2024, Not all Layers of LLMs are Necessary during Inference, https://arxiv.org/abs/2403.02181
  • Filip Szatkowski, Fei Yang, Bartłomiej Twardowski, Tomasz Trzciński, Joost van de Weijer, 12 Mar 2024, Accelerated Inference and Reduced Forgetting: The Dual Benefits of Early-Exit Networks in Continual Learning, https://arxiv.org/abs/2403.07404 (Early exit optimizations applied to continual learning.)
  • Max Sponner, Lorenzo Servadei, Bernd Waschneck, Robert Wille, Akash Kumar, 12 Mar 2024, Efficient Post-Training Augmentation for Adaptive Inference in Heterogeneous and Distributed IoT Environments, https://arxiv.org/abs/2403.07957 (Early exit for image classification and IoT devices.)
  • Max Sponner, Lorenzo Servadei, Bernd Waschneck, Robert Wille, Akash Kumar, 12 Mar 2024, Temporal Decisions: Leveraging Temporal Correlation for Efficient Decisions in Early Exit Neural Networks, https://arxiv.org/abs/2403.07958 (Using time-based data to help decisions for early-exit, in video sequences.)
  • Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
  • Lars Wolfgang Folkerts, Nektarios Georgios Tsoutsos, 2024, FHE-MENNs: Accelerating Fully Homomorphic Private Inference with Multi-Exit Neural Networks, PDF: https://trustworthycomputing.github.io/FHE-MENN/FHEMENNs.pdf
  • Qingyuan Wang, Barry Cardiff, Antoine Frappé, Benoit Larras, Deepu John, 26 Mar 2024, Tiny Models are the Computational Saver for Large Models, https://arxiv.org/abs/2403.17726v1 (Choose tiny or small models after an initial layer of the larger model, combining early exit with easy-hard queries for multi-model inference.)
  • Yun Zhu, Yaoke Wang, Haizhou Shi, Siliang Tang, 28 Jan 2024, Efficient Tuning and Inference for Large Language Models on Textual Graphs, https://arxiv.org/abs/2401.15569 (Optimizing Graph Neural Networks on textual graphs using caching and early exit inference.)
  • Matteo Gambella, Jary Pomponi, Simone Scardapane, Manuel Roveri, 24 Jan 2024, NACHOS: Neural Architecture Search for Hardware Constrained Early Exit Neural Networks, https://arxiv.org/abs/2401.13330
  • Mohamed Nabih Ali, Daniele Falavigna, Alessio Brutti, 2024, Fed-EE: Federating Heterogeneous ASR Models using Early-Exit Architectures, PDF: https://cris.fbk.eu/bitstream/11582/343747/1/paper_49.pdf (Early exit in training.)
  • Metod Jazbec, Patrick Forré, Stephan Mandt, Dan Zhang, Eric Nalisnick, 10 Nov 2023, Anytime-Valid Confidence Sequences for Consistent Uncertainty Estimation in Early-Exit Neural Networks, https://arxiv.org/abs/2311.05931
  • Peng Tang, Pengkai Zhu, Tian Li, Srikar Appalaraju, Vijay Mahadevan, R. Manmatha, Nov 2023, DEED: Dynamic Early Exit on Decoder for Accelerating Encoder-Decoder Transformer Models, https://arxiv.org/abs/2311.08623
  • Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui, 26 Nov 2023, Learning to Skip for Language Modeling, https://arxiv.org/abs/2311.15436 (Generalizes token-based early exiting to skip entire layers.)
  • Mikolaj Jankowski, Deniz Gunduz, Krystian Mikolajczyk, Nov 2023, Adaptive Early Exiting for Collaborative Inference over Noisy Wireless Channels, https://arxiv.org/abs/2311.18098 (Early exiting combined with collaborative inference.)
  • Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou, Dec 2023, EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism, https://arxiv.org/abs/2312.04916 Code: https://github.com/pan-x-c/EE-LLM
  • H Kim, S Yoon, S Pack, 2023, Poster: Adaptive In-Network Inference using Early-Exits, Proceedings of the 6th on European P4 Workshop, https://dl.acm.org/doi/abs/10.1145/3630047#page=62
  • Haoyu Wang, Yaqing Wang, Tianci Liu, Tuo Zhao, and Jing Gao, 2023, HadSkip: Homotopic and Adaptive Layer Skipping of Pre-trained Language Models for Efficient Inference, https://aclanthology.org/2023.findings-emnlp.283.pdf (Layer skipping during fine-tuning.)
  • JY Jeon, XT Nguyen, S Ryu, HJ Lee, 2024, USDN: A Unified Sample-wise Dynamic Network with Mixed-Precision and Early-Exit, https://openaccess.thecvf.com/content/WACV2024/papers/Jeon_USDN_A_Unified_Sample-Wise_Dynamic_Network_With_Mixed-Precision_and_Early-Exit_WACV_2024_paper.pdf
  • F Ilhan, KH Chow, S Hu, T Huang, S Tekin, W Wei, 2024, Adaptive Deep Neural Network Inference Optimization with EENet, https://openaccess.thecvf.com/content/WACV2024/papers/Ilhan_Adaptive_Deep_Neural_Network_Inference_Optimization_With_EENet_WACV_2024_paper.pdf
  • Mohammed Ayyat; Tamer Nadeem; Bartosz Krawczyk, Dec 2023, ClassyNet: Class-Aware Early Exit Neural Networks for Edge Devices, IEEE Internet of Things Journal (Early Access), https://ieeexplore.ieee.org/abstract/document/10365527
  • Sangmin Bae, Jongwoo Ko, Hwanjun Song, and Se-Young Yun. 2023. Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding. arXiv preprint arXiv:2310.05424. https://arxiv.org/abs/2310.05424 (Using early exits as the draft model in generalized speculative decoding.)
  • MM Rastikerdar, J Huang, S Fang, H Guan, D Ganesan, Oct 2023, Efficient IoT Inference via Context-Awareness, https://arxiv.org/pdf/2310.19112.pdf (Does dynamic context-aware "classifier switching" which is similar to cascades and/or early exiting.)
  • N Varshney, A Chatterjee, M Parmar, C Baral, Oct 2023, Accelerating LLM Inference by Enabling Intermediate Layer Decoding, arXiv preprint arXiv:2310.18581, https://arxiv.org/pdf/2310.18581.pdf (Dynamic confidence-based early exiting analysis on LLaMA models.)
  • Tan Rong Loo, T. Hui Teo, Mulat Ayinet Tiruye, I-Chyn Wey, 2022, High-Performance Asynchronous CNN Accelerator with Early Termination, 2022 IEEE 15th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), pp.140-144, 2022. https://ieeexplore.ieee.org/document/10008416
  • Zhiwei Liang, Yuezhi Zhou, 2022, Dispense Mode for Inference to Accelerate Branchynet, 2022 IEEE International Conference on Image Processing (ICIP), pp.1246-1250, 2022. https://ieeexplore.ieee.org/document/9897574
  • Francesco Daghero, Alessio Burrello, Chen Xie, Luca Benini, Andrea Calimera, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari, Low-Overhead Early-Stopping Policies for Efficient Random Forests Inference on Microcontrollers, VLSI-SoC: Technology Advancement on SoC Design, vol.661, pp.25, 2022. https://doi.org/10.1007/978-3-031-16818-5_2
  • T Hu, C Meinel, H Yang, 2023, Flexible BERT with Width-and Depth-dynamic Inference, 2023 International Joint Conference on Neural Networks (IJCNN), https://ieeexplore.ieee.org/abstract/document/10191515/ (A 2023 version of BERT that does dual pruning with early exit and width gating.)
  • C Luo, J Chen, X Feng, J Zhang, J Li, 2023, Sustainable Collaborative Inference in Intelligent Transportation Systems, IEEE Transactions on Intelligent Transportation, https://ieeexplore.ieee.org/document/10239242
  • E Mohammed, O Mashaal, H Abou-Zeid, 2023, Using Early Exits for Fast Inference in Automatic Modulation Classification, arXiv preprint arXiv:2308.11100, 2023, https://arxiv.org/pdf/2308.11100.pdf
  • Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. 2017, Adaptive neural networks for efficient inference. In International Conference on Machine Learning, pages 527–536. PMLR, 2017. https://arxiv.org/abs/1702.07811
  • Huang, G.; Chen, D.; Li, T.; Wu, F.; van der Maaten, L.; and Weinberger, K. Q., 2017. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, https://arxiv.org/abs/1703.09844 (Doing dynamic inference, early-exit & changing the features.)
  • Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, 6 Jun 2024, Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism, https://arxiv.org/abs/2406.03853
  • Metod Jazbec, Alexander Timans, Tin Hadži Veljković, Kaspar Sakmann, Dan Zhang, Christian A. Naesseth, Eric Nalisnick, 31 May 2024, Fast yet Safe: Early-Exiting with Risk Control, https://arxiv.org/abs/2405.20915 (Adjusting the early exit threshold according to a risk tolerance.)
  • Wei Zhong, Manasa Bharadwaj, 30 May 2024, S3D: A Simple and Cost-Effective Self-Speculative Decoding Scheme for Low-Memory GPUs, https://arxiv.org/abs/2405.20314 (Self-speculative decoding using early layers, multi-token non-autoregressive token predictions for the draft model, and layer skipping.)
  • Lianming Huang, Shangyu Wu, Yufei Cui, Ying Xiong, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue, 24 May 2024, RAEE: A Training-Free Retrieval-Augmented Early Exiting Framework for Efficient Inference, https://arxiv.org/abs/2405.15198 (Early exit classifiers built with pre-computation using a retrieval database.)
  • Myeonggu Kang, Junyoung Park, Hyein Shin, Jaekang Shin, Lee-Sup Kim, 2024, ToEx: Accelerating Generation Stage of Transformer-based Language Models via Token-adaptive Early Exit, IEEE Transactions on Computers, PrePrints pp. 1-14, DOI: 10.1109/TC.2024.3404051, https://www.computer.org/csdl/journal/tc/5555/01/10535998/1X7QeuzvX4A
  • Pietro Farina, Subrata Biswas, Eren Yıldız, Khakim Akhunov, Saad Ahmed, Bashima Islam, Kasım Sinan Yıldırım, 16 May 2024, Memory-efficient Energy-adaptive Inference of Pre-Trained Models on Batteryless Embedded Systems, https://arxiv.org/abs/2405.10426
  • William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly, 21 May 2024, Reducing Transformer Key-Value Cache Size with Cross-Layer Attention, https://arxiv.org/abs/2405.12981 (Sharing KV cache values across layers in MQA, every 2nd or 3rd layer, to reduce overall KV cache size by 2 or 3 times.)
  • Haoyi Wu, Kewei Tu, 17 May 2024, Layer-Condensed KV Cache for Efficient Inference of Large Language Models, https://arxiv.org/abs/2405.10637 Code: https://github.com/whyNLP/LCKV (Use the KV cache for only the final layer as the KV cache for all other layers, or alternatively, use only the cache from a few layers, also possibly using a few standard layers as "warmup layers". This idea is conceptually similar to "propagation" of the KV cache in early exit methods or to layer fusion of weights.)
  • Yao Fu, 14 May 2024, Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis, https://arxiv.org/abs/2405.08944 (The KV cache size is the main bottleneck for long context processing, in both prefill and decoding phases, and includes analysis of different optimizations to address this.)
  • Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji, 9 May 2024, Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference, https://arxiv.org/abs/2405.05803 Code: https://github.com/lzhxmu/VTW (Removing all visual tokens in later layers of a multimodal model, which is effectively early exiting or token pruning, but affecting only the vision part of the multimodal Transformer.)
  • Jiaming Huang, Yi Gao, Wei Dong, 13 May 2024, Unlocking the Non-deterministic Computing Power with Memory-Elastic Multi-Exit Neural Networks, WWW '24: Proceedings of the ACM on Web Conference 2024, May 2024, Pages 2777–2785, https://doi.org/10.1145/3589334.3645340 https://dl.acm.org/doi/abs/10.1145/3589334.3645340
  • Caelin Kaplan, Tareq Si Salem, Angelo Rodio, Chuan Xu, Giovanni Neglia, 7 May 2024, Federated Learning for Cooperative Inference Systems: The Case of Early Exit Networks, https://arxiv.org/abs/2405.04249
  • Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei, 9 May 2024 (v2), You Only Cache Once: Decoder-Decoder Architectures for Language Models, https://arxiv.org/abs/2405.05254 Code: https://aka.ms/YOCO (A novel decoder-decoder architecture with fast KV caching and cross-attention.)
  • Dieter Verbruggen, Sofie Pollin, Hazem Sallouha, 6 May 2024, Computational Efficient Width-Wise Early Exits in Modulation Classification, https://arxiv.org/abs/2405.03222
  • Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang, 29 Apr 2024, Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting, https://arxiv.org/abs/2404.18911 Code: https://github.com/Equationliu/Kangaroo (Speculative decoding where the draft model is an early exit of layers in the verifier model, but the draft model is also sped up further by early exiting confidence analysis.)
  • Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu, 25 Apr 2024, Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, https://arxiv.org/abs/2404.16710 (Multiple contributions including training with early exit, and speculative decoding with a draft model that is early exit within the larger model, with the advantages: (a) the draft and verifier model thereby share KV cache data for the early layers and (b) avoidance of the problems with an outdated KV cache normally caused by early exiting.)
  • Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao, 3 Jan 2024 (v2), SPEED: Speculative Pipelined Execution for Efficient Decoding, https://arxiv.org/abs/2310.12072 (Speculatively executing multiple future tokens in parallel to the current token, by using multiple tokens with high probability from the early layers of inference of the current token in the model. This allows multiple speculations of the autoregressive inference of the next token to start before the current token is finished.)
  • Georgy Tyukin, 2 Apr 2024, Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations, Masters Thesis, Data Science and Machine Learning, University College London, https://arxiv.org/abs/2404.05741 (Reviews various model compression and inference optimization techniques, and specifically analyzes layer skipping and sublayer skipping, such as attention head pruning and FFN/MLP pruning.)
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques with much focus on image and video processing optimization for LLMs.)
  • Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865 (Presents an input-adaptive layer skipping scheme that drops about 30% of FFN calculations. Avoids the KV caching problems by only skipping FFN computations in layers.)
  • Divya Jyoti Bajpai, Manjesh Kumar Hanawal, 23 May 2024, CEEBERT: Cross-Domain Inference in Early Exit BERT, https://arxiv.org/abs/2405.15039
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Utkarsh Saxena, Kaushik Roy, McQueen: Mixed Precision Quantization of Early Exit Networks, https://papers.bmvc2023.org/0511.pdf (Combination of mixed-precision quantization, with precision specifiable statically to a layerwise granularity, with early exit dynamic depth optimizations.)
  • Canwen Xu, Julian McAuley, Nov 2022, A Survey on Model Compression and Acceleration for Pretrained Language Models, https://arxiv.org/abs/2202.07105
  • Cristóbal Eyzaguirre, Felipe del Río, Vladimir Araujo, Álvaro Soto, DACT-BERT: Differentiable Adaptive Computation Time for an Efficient BERT Inference, Sep 2021, ArXiv preprint, abs/2109.11745, https://arxiv.org/abs/2109.11745
  • Y. Matsubara, M. Levorato, and F. Restuccia, Mar 2022, “Split computing and early exiting for deep learning applications: Survey and research challenges,” ACM Comput. Surveys, https://arxiv.org/abs/2103.04505
  • Mor Geva, Avi Caciularu, Guy Dar, Paul Roit, Shoval Sadde, Micah Shlain, Bar Tamir, and Yoav Goldberg. 2022a. LM-debugger: An interactive tool for inspection and intervention in transformer-based language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 12–21, Abu Dhabi, UAE. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-demos.2
  • Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. https://aclanthology.org/2022.emnlp-main.3
  • Ting Hu, Christoph Meinel, Haojin Yang, 2024, A flexible BERT model enabling width- and depth-dynamic inference, Computer Speech & Language, 4 April 2024, 101646, https://www.sciencedirect.com/science/article/pii/S0885230824000299 (Dual pruning method with layerwise "neural grafting" that gives dynamic width models, and combined with early exit on the depth dimension.)
  • David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-token layer skipping for a type of adaptive inference with conditional computation.)
  • Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 15 Mar 2024 (v5), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer (A large survey of a variety of LLM optimizations.)
  • Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, Chitta Baral, Nov 2023, Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE, https://arxiv.org/abs/2310.18581 (Improved training for early-exit networks by handling intermediate layer weight updates.)
  • Tzu-Quan Lin, Hung-yi Lee, Hao Tang, 8 Jun 2024, DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models, https://arxiv.org/abs/2406.05464
  • L. Li, K. Ota, and M. Dong, 2018, “Deep learning for smart industry: Efficient manufacture inspection system with fog computing,” IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4665–4673, Oct. 2018. https://ieeexplore.ieee.org/document/8370640
  • David Spuler, March 2024, Chapter 47. Early Exit and Layer Pruning, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • X Hou, J Liu, X Tang, C Li, KT Cheng, L Li, M Guo, 2023, MMExit: Enabling Fast and Efficient Multi-modal DNN Inference with Adaptive Network Exits, https://link.springer.com/chapter/10.1007/978-3-031-39698-4_29
  • J Zhang, M Tan, P Dai, W Zhu, 2023, LECO: Improving Early Exiting via Learned Exits and Comparison-based Exiting Mechanism, https://aclanthology.org/2023.acl-srw.43/ https://aclanthology.org/2023.acl-srw.43.pdf
  • X Gao, W Zhu, J Gao, C Yin, 2023, F-PABEE: Flexible-patience-based Early Exiting for Single-label and Multi-label text Classification Tasks, https://ieeexplore.ieee.org/abstract/document/10095864/
  • T Tambe, 2023, Architecting High Performance Silicon Systems for Accurate and Efficient On-Chip Deep Learning, https://dash.harvard.edu/bitstream/handle/1/37375806/Final_Draft_PhD_Dissertation_Thierry_Tambe.pdf?sequence=1&isAllowed=y
  • J Xin, 2023, Efficient Inference of Transformers in Natural Language Processing: Early Exiting and Beyond, https://uwspace.uwaterloo.ca/handle/10012/19111 https://uwspace.uwaterloo.ca/bitstream/handle/10012/19111/Xin_Ji.pdf?sequence=1
  • Jinting Chen, Zhaocheng Zhu, Cheng Li, Yuming Zhao, Oct 2019, Self-Adaptive Network Pruning, https://arxiv.org/abs/1910.08906
  • Emanuele Lattanzi; Chiara Contoli; Valerio Freschi, 2023, A Study on the Energy Sustainability of Early Exit Networks for Human Activity Recognition, IEEE Transactions on Sustainable Computing (Early Access), pp.1-14, 8 August 2023. https://ieeexplore.ieee.org/abstract/document/10213213
  • M Omer Mohammed Elamin Elshaigi, 2023, Adaptive Deep Neural Networks for Human Pose Estimation on Autonomous Nano-Drones, Masters Thesis, PDF: https://webthesis.biblio.polito.it/secure/27689/1/tesi.pdf
  • Aviv Slobodkin, Leshem Choshen, and Omri Abend. 2021. Mediators in determining what processing BERT performs first. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 86–93, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.8
  • Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1452
  • Elena Voita, Rico Sennrich, and Ivan Titov. 2019. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4396–4406, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1448
  • Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.emnlp-main.446
  • J Alammar. 2021. Ecco: An open source library for the explainability of transformer language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, pages 249–257, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.acl-demo.30
  • Z Fei, X Yan, S Wang, Q Tian, 2022, DeeCap: Dynamic early exiting for efficient image captioning, http://openaccess.thecvf.com/content/CVPR2022/html/Fei_DeeCap_Dynamic_Early_Exiting_for_Efficient_Image_Captioning_CVPR_2022_paper.html
  • Alexander Yom Din, Taelin Karidi, Leshem Choshen, Mor Geva, Mar 2023, Jump to Conclusions: Short-Cutting Transformers With Linear Transformations, https://arxiv.org/abs/2303.09435
  • Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers. Proceedings of ICLR, 2018, March 2019, https://arxiv.org/abs/1807.03819
  • Rippel, O.; Gelbart, M.; and Adams, R. 2014. Learning ordered representations with nested dropout. In International Conference on Machine Learning, 1746–1754. PMLR https://arxiv.org/abs/1402.0915
  • E Mohammed, O Mashaal, H Abou-Zeid, arXiv preprint arXiv:2308.11100, 2023, Using Early Exits for Fast Inference in Automatic Modulation Classification, https://arxiv.org/abs/2308.11100
  • Ori Ram, Liat Bezalel, Adi Zicher, Yonatan Belinkov, Jonathan Berant, and Amir Globerson. 2022. What are you token about? dense retrieval as distributions over the vocabulary. arXiv preprint arXiv:2212.10380 https://arxiv.org/abs/2212.10380
  • Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. 2022. Analyzing transformers in embedding space. arXiv preprint arXiv:2209.02535. https://arxiv.org/abs/2209.02535
  • Wang, J., Chen, K., Chen, G., Shou, L., McAuley, J.: SkipBERT: Efficient inference with shallow layer skipping. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 7287–7301 (2022) https://aclanthology.org/2022.acl-long.503/ (Skips early layers of a model via precomputed lookup tables based on detecting known token n-grams in the prompt.)
  • Max Lamparth and Anka Reuel. 2023. Analyzing and editing inner mechanisms of backdoored language models. arXiv preprint arXiv:2302.12461. https://arxiv.org/abs/2302.12461
  • Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014), https://dl.acm.org/doi/abs/10.5555/2627435.2670313
  • Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
  • Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha, 24 May 2024 (v2), A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models, https://arxiv.org/abs/2405.13019
  • Jiachen Jiang, Jinxin Zhou, Zhihui Zhu, 20 Jun 2024, On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier, https://arxiv.org/abs/2406.14479 (Using layer similarity for early exit classifiers, which is also related to layer fusion.)
  • Florian Valade, 17 July 2024, Accelerating Large Language Model Inference with Self-Supervised Early Exits, hal-04644928, https://hal.science/hal-04644928/document
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
  • Bartłomiej Krzepkowski, Monika Michaluk, Franciszek Szarwacki, Piotr Kubaty, Jary Pomponi, Tomasz Trzciński, Bartosz Wójcik, Kamil Adamczewski, 19 Jul 2024, Joint or Disjoint: Mixing Training Regimes for Early-Exit Models, https://arxiv.org/abs/2407.14320
  • Feng (Shelley) Xia, May 2024, Exploring Early Exiting Strategies for Deep Neural Networks, Masters Thesis, Princeton University, https://www.proquest.com/openview/40f3d8735c8cc93d77bc6724c6535190/1?pq-origsite=gscholar&cbl=18750&diss=y
  • Ruijie Miao, Yihan Yan, Xinshuo Yao, Tong Yang, 25 Jul 2024, An Efficient Inference Framework for Early-exit Large Language Models, https://arxiv.org/abs/2407.20272 (Faster early exit using batching and KV cache resolution.)
  • Filipe Laitenberger, Max Belitsky, Denys Sheremet, Oliver Savolainen, Mark Bodracska, 2024, Exploring Monotonicity in Early-Exiting Language Models, https://openreview.net/pdf?id=BM1Aijdheb
  • Mehrnaz Mofakhami, Reza Bayat, Ioannis Mitliagkas, Joao Monteiro, Valentina Zantedeschi, 2024, Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones, ICML 2024 Workshop on Efficient Systems for Foundation Models (ES-FoMo-II), Vienna, Austria, https://openreview.net/pdf?id=7zf1584SUG
  • Marco Colocrese, Erdem Koyuncu, Hulya Seferoglu, 8 Aug 2024, Early-Exit meets Model-Distributed Inference at Edge Networks, https://arxiv.org/abs/2408.05247
  • R. Narmeen, P. Mach, Z. Becvar and I. Ahmad, 16 August 2024, Joint Exit Selection and Offloading Decision for Applications Based on Deep Neural Networks, IEEE Internet of Things Journal, doi: 10.1109/JIOT.2024.3444898, https://doi.org/10.1109/JIOT.2024.3444898 https://ieeexplore.ieee.org/abstract/document/10638073
  • J. Wang, B. Li and G. L. Zhang, 2024, Early-Exit with Class Exclusion for Efficient Inference of Neural Networks, 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), Abu Dhabi, United Arab Emirates, 2024, pp. 263-267, doi: 10.1109/AICAS59952.2024.10595861, https://ieeexplore.ieee.org/document/10595861 Code: https://github.com/HWAI-TUDa/EarlyClassExclusion (Reduces the early exit tokens under consideration for the classifier at each layer, which is similar to slimmable networks.)
  • Basar Kutukcu, Sabur Baidya, Sujit Dey, 2024, SLEXNet: Adaptive Inference Using Slimmable Early Exit Neural Networks, https://doi.org/10.1145/3689632 https://dl.acm.org/doi/pdf/10.1145/3689632 (Combined width and depth pruning with slimmable and early exit networks.)
  • Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193
  • Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim, 19 Jul 2024 (v5), SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks, https://arxiv.org/abs/2402.09025 https://github.com/jiwonsong-dev/SLEB
  • David Spuler, June 2024, Aussie AI, Optimizing On-Device Transformer Inference for Source Code Checking: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901675
  • David Spuler, June 2024, Aussie AI, Heuristic Optimization of Transformer On-Device Inference: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901670
  • David Spuler, June 2024, Aussie AI, Speculative Decoding With Early Exit for Optimized Transformer On-Device Inference: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901656
  • Eric Samikwa, 2024, Resource-Aware Distributed Machine Learning for Artificial Intelligence of Things, Ph.D. thesis, Faculty of Science, University of Bern, Switzerland, https://boristheses.unibe.ch/5378/1/24samikwa_e_1_.pdf https://doi.org/10.48549/5378 (Multi-edge device with early exit, "micro-split" scheduling, split/federated learning, and distributed inference.)
  • A Hannan, A Brutti, D Falavigna, 2024, LDASR: An Experimental Study on Layer Drop using Conformer-based Architecture, https://eurasip.org/Proceedings/Eusipco/Eusipco2024/pdfs/0000151.pdf
  • Youva Addad, Alexis Lechervy, Frédéric Jurie, 2024, Balancing Accuracy and Efficiency in Budget-Aware Early-Exiting Neural Networks, https://lechervy.users.greyc.fr/publi/C/publi_pdf/icpr24.pdf
  • Jordan Dotzel, Carly Jiang, Mohamed Abdelfattah, Zhiru Zhang, Sep 2024, Opportunities for Post-Training Dynamic Layer Sparsity in Large Vision and Language Models, https://openaccess.thecvf.com/content/CVPR2024W/ELVM/papers/Dotzel_Opportunities_for_Post-Training_Dynamic_Layer_Sparsity_in_Large_Vision_and_CVPRW_2024_paper.pdf (Layerwise dynamic sparsity for vision models.)
  • Wang, Z., Han, J. (2024). Improve Shallow Decoder Based Transformer with Structured Expert Prediction. In: Wand, M., Malinovská, K., Schmidhuber, J., Tetko, I.V. (eds) Artificial Neural Networks and Machine Learning – ICANN 2024. ICANN 2024. Lecture Notes in Computer Science, vol 15022. Springer, Cham. https://doi.org/10.1007/978-3-031-72350-6_15 https://link.springer.com/chapter/10.1007/978-3-031-72350-6_15
  • Zheng Liu, Jinchao Zhu, Nannan Li, Gao Huang, 21 Sep 2024, Multiple-Exit Tuning: Towards Inference-Efficient Adaptation for Vision Transformer, https://arxiv.org/abs/2409.13999 (Early exit ideas applied to PETL-based fine-tuning.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty, 25 Sep 2024, Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction, https://arxiv.org/abs/2409.17422 https://github.com/SalesforceAIResearch/GemFilter (Use the early layers of a model to choose the most relevant tokens, similar to early exiting, and then compress the input token sequences based on the importance of these tokens. Notably, this reduces latency and also increases accuracy on long contexts.)
  • Divya Jyoti Bajpai and Manjesh Kumar Hanawal, Oct 2024, CAPEEN: Image Captioning with Early Exits and Knowledge Distillation, https://www.researchgate.net/profile/Divya-Jyoti-Bajpai/publication/384595581_CAPEEN_Image_Captioning_with_Early_Exits_and_Knowledge_Distillation/links/66fe753e9e6e82486ffe97ef/CAPEEN-Image-Captioning-with-Early-Exits-and-Knowledge-Distillation.pdf
  • Divya Jyoti Bajpai, Manjesh Kumar Hanawal, 6 Oct 2024, Distributed Inference on Mobile Edge and Cloud: An Early Exit based Clustering Approach, https://arxiv.org/abs/2410.05338
  • R. G. Pacheco et al., "UCBEE: A Multi Armed Bandit Approach for Early-Exit in Neural Networks," in IEEE Transactions on Network and Service Management, doi: 10.1109/TNSM.2024.3479076. https://ieeexplore.ieee.org/abstract/document/10714362
  • M. Sponner, L. Servadei, B. Waschneck, R. Wille and A. Kumar, "Harnessing Temporal Information for Efficient Edge AI," 2024 9th International Conference on Fog and Mobile Edge Computing (FMEC), Malmö, Sweden, 2024, pp. 5-13, doi: 10.1109/FMEC62297.2024.10710223. https://ieeexplore.ieee.org/abstract/document/10710223
  • Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal, 16 Oct 2024, FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction, https://arxiv.org/abs/2410.12513
  • Amrit Diggavi Seshadri, 3 Oct 2024 (v2), Normalized Narrow Jump To Conclusions: Normalized Narrow Shortcuts for Parameter Efficient Early Exit Transformer Prediction, https://arxiv.org/abs/2409.14091 https://ui.adsabs.harvard.edu/abs/2024arXiv240914091D/abstract
  • Haoyan Luo, Lucia Specia, 16 Oct 2024, Tuning Language Models by Mixture-of-Depths Ensemble, https://arxiv.org/abs/2410.13077
  • Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec, 24 Oct 2024, Dynamic Vocabulary Pruning in Early-Exit LLMs, https://arxiv.org/abs/2410.18952
  • Theodore Glavas, Joud Chataoui, Florence Regol, Wassim Jabbour, Antonios Valkanas, Boris N. Oreshkin, Mark Coates, 26 Oct 2024, Dynamic layer selection in decoder-only transformers, https://arxiv.org/abs/2410.20022
  • Z. Wen et al., "SwiftNet: A Cost-Efficient Deep Learning Framework With Diverse Applications," in IEEE Transactions on Industrial Informatics, doi: 10.1109/TII.2024.3448500. https://ieeexplore.ieee.org/abstract/document/10740525
  • Haim Barad, Jascha Achterberg, Tien Pei Chou, Jean Yu, 30 Oct 2024, Accelerated AI Inference via Dynamic Execution Methods, https://arxiv.org/abs/2411.00853
  • M. Sponner, L. Servadei, B. Waschneck, R. Wille and A. Kumar, "Leveraging Temporal Patterns: Automated Augmentation to Create Temporal Early Exit Networks for Efficient Edge AI," in IEEE Access, doi: 10.1109/ACCESS.2024.3497158. https://ieeexplore.ieee.org/document/10752535
  • Samarth N Ramesh, Zhixue Zhao, 22 Nov 2024, Efficient Pruning of Text-to-Image Models: Insights from Pruning Stable Diffusion, https://arxiv.org/abs/2411.15113 (Comprehensive analysis of different types of pruning on diffusion image models.)
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
  • Aritra Roy Gosthipaty, Mostafa Elhoushi, Pedro Cuenca, Vaibhav Srivastav, November 20, 2024, Faster Text Generation with Self-Speculative Decoding, https://huggingface.co/blog/layerskip
  • Weiqiao Shan, Long Meng, Tong Zheng, Yingfeng Luo, Bei Li, junxin Wang, Tong Xiao, Jingbo Zhu, 2 Dec 2024, Early Exit Is a Natural Capability in Transformer-based Models: An Empirical Study on Early Exit without Joint Optimization, https://arxiv.org/abs/2412.01455
  • Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti, 6 Dec 2024, BEExformer: A Fast Inferencing Transformer Architecture via Binarization with Multiple Early Exits, https://arxiv.org/abs/2412.05225

Early Exit in Generalized Speculative Decoding

Early exiting of a large model can be used as the smaller "draft" model in speculative decoding, rather than using a separate small model. See generalized speculative decoding.
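
As a rough illustration of this idea, the following is a minimal sketch of greedy "self-speculative" decoding, where the first few layers of a model draft tokens and the full layer stack verifies them. Everything in it is a hypothetical toy: the model is a random stack of small matrices, the "attention" is just a mean over token embeddings, and the names `next_token`, `self_speculative`, `exit_layer` and `gamma` are illustrative assumptions rather than any library's API. A real system would verify all drafted positions in one batched pass instead of one call per position.

```python
import math
import random

VOCAB, DIM, N_LAYERS = 16, 8, 6

def make_model(seed=0):
    """Build a toy model: embeddings, a stack of layer matrices, an output head."""
    rng = random.Random(seed)
    embed = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
    layers = [[[rng.uniform(-0.3, 0.3) for _ in range(DIM)] for _ in range(DIM)]
              for _ in range(N_LAYERS)]
    head = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]
    return embed, layers, head

def next_token(model, tokens, n_layers):
    """Greedy next token using only the first n_layers layers (early exit)."""
    embed, layers, head = model
    # Toy stand-in for attention: average the embeddings of the context.
    h = [sum(embed[t][i] for t in tokens) / len(tokens) for i in range(DIM)]
    for W in layers[:n_layers]:
        # Residual update, so shallow and deep outputs stay related.
        h = [h[i] + math.tanh(sum(W[i][j] * h[j] for j in range(DIM)))
             for i in range(DIM)]
    logits = [sum(head[v][i] * h[i] for i in range(DIM)) for v in range(VOCAB)]
    return max(range(VOCAB), key=logits.__getitem__)

def greedy(model, prompt, n_new):
    """Baseline: every token is produced by the full layer stack."""
    toks = list(prompt)
    for _ in range(n_new):
        toks.append(next_token(model, toks, N_LAYERS))
    return toks

def self_speculative(model, prompt, n_new, exit_layer=2, gamma=4):
    """Draft gamma tokens with the early-exit model, verify with the full model."""
    toks = list(prompt)
    target_len = len(prompt) + n_new
    while len(toks) < target_len:
        # Draft phase: cheap early-exit predictions.
        draft = list(toks)
        for _ in range(gamma):
            draft.append(next_token(model, draft, exit_layer))
        # Verify phase: keep the full model's token at each drafted position,
        # and stop at the first position where the draft disagreed.
        for pos in range(len(toks), len(draft)):
            full = next_token(model, draft[:pos], N_LAYERS)
            toks.append(full)
            if full != draft[pos] or len(toks) == target_len:
                break
    return toks

model = make_model()
baseline = greedy(model, [1, 2, 3], 10)
speculative = self_speculative(model, [1, 2, 3], 10)
```

Because every emitted token is the full model's own greedy choice, the speculative output matches the baseline exactly; the early-exit draft only affects how many tokens are accepted per verification round.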

Early Exit and KV Caching

One of the downsides of early exit is that it interferes with the well-known optimization of KV caching. Caching the results of the KV computations is one of the earliest recognized optimizations for autoregressive decoding of output sequences. However, filling the cache for a token requires executing every layer, and early exiting skips the later ones. In the next inference computation, the cache is therefore out-of-date for the unexecuted layers.

Hence, early exit saves computation, but damages the KV cache, leading to extra computation to fix this. Researchers have examined this issue and found solutions involving simple cache recomputation, propagating the last executed layer's KV cache to the skipped layers, or avoiding the issue entirely by modifying early exit methods. For more details about this research, see: KV caching with early exit.
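
The propagation option can be sketched as follows. This is a minimal illustration under assumptions: the cache is modeled as one Python list per layer, the KV entries are placeholder strings, and `append_with_propagation` is a hypothetical helper name. Copying a shallow layer's entry into deeper layers is an approximation, and it only makes sense when all layers use the same KV shape.

```python
def append_with_propagation(cache, computed_kv):
    """Append one token's KV entry to every layer of the cache.

    cache:       list with one list of entries per layer.
    computed_kv: entries for the layers that actually ran (0..exit layer).
    Skipped layers receive a copy of the last executed layer's entry.
    """
    if not computed_kv:
        raise ValueError("at least one layer must have been executed")
    last = computed_kv[-1]
    for layer, layer_cache in enumerate(cache):
        entry = computed_kv[layer] if layer < len(computed_kv) else last
        layer_cache.append(entry)

# Four-layer toy cache; entries are placeholder strings, not real tensors.
cache = [[] for _ in range(4)]
append_with_propagation(cache, ["t0-L0", "t0-L1", "t0-L2", "t0-L3"])  # full depth
append_with_propagation(cache, ["t1-L0", "t1-L1"])  # early exit after layer 1
```

After the second call, every layer holds two entries, so a later token that runs to full depth still finds a cache slot for the early-exited token at layers 2 and 3.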

Parallelism can be used to further optimize this idea. An important point about these optimizations is that the speedup can be obtained without accuracy loss simply by computing the missing KV cache in parallel with the already-started inference for the next token. This gives both efficiency and accuracy, as it is a lossless optimization. Early exit has traditionally skipped the KV computation of the remaining layers and used an approximation of the KV data instead. However, there is an overlapping or pipelining idea whereby early exit triggers a token to be emitted, allowing the autoregressive decoding of the next token to start, which is similar to a speculative decoding optimization. The skipped layers can still be executed, even though decoding for the current token is finished, but only to create the "missing" KV cache in parallel with the start of the next token's generation. The full KV cache data for every layer will thus be available via parallel computation before it is needed by the layers of the next token's generation.
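
The scheduling in this pipelining idea can be sketched with a background worker. This is only an illustration of the ordering of work, under several assumptions: `layer_kv` is a hypothetical stand-in for a layer's key/value projection, every token exits at the same fixed layer, and a Python thread is used where a real engine would more likely use a separate GPU stream or device.

```python
from concurrent.futures import ThreadPoolExecutor

N_LAYERS = 4

def layer_kv(layer, token):
    """Hypothetical stand-in for one layer's key/value data for a token."""
    return (layer, token)

def fill_skipped(cache, idx, token, exit_layer):
    """Background job: compute the KV entries of the layers that early exit skipped."""
    for layer in range(exit_layer + 1, N_LAYERS):
        cache[layer][idx] = layer_kv(layer, token)

def decode_with_background_fill(tokens, exit_layer):
    """Run shallow layers in the foreground and fill deep-layer KV in parallel."""
    cache = [[None] * len(tokens) for _ in range(N_LAYERS)]
    pending = None
    with ThreadPoolExecutor(max_workers=1) as pool:
        for idx, tok in enumerate(tokens):
            # Shallow layers of this token overlap with the previous token's fill.
            for layer in range(exit_layer + 1):
                cache[layer][idx] = layer_kv(layer, tok)
            # Any deeper layer would need the previous token's full KV, so wait here.
            if pending is not None:
                pending.result()
            # Early exit: the token is emitted now; deep layers run only for the cache.
            pending = pool.submit(fill_skipped, cache, idx, tok, exit_layer)
        if pending is not None:
            pending.result()
    return cache

tokens = [5, 6, 7]
cache = decode_with_background_fill(tokens, exit_layer=1)
expected = [[layer_kv(layer, t) for t in tokens] for layer in range(N_LAYERS)]
```

The foreground and background jobs write to disjoint layers of the cache, so the resulting cache is identical to one computed by running every layer for every token.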

More AI Research

Read more about: