Aussie AI

Cascades

  • Last Updated 8 December, 2024
  • by David Spuler, Ph.D.

Cascades are a type of model inference optimization where execution flows down through a "cascade" of sub-structures, with the routing sequence depending on the inputs. This optimization mainly relates to earlier types of neural networks (e.g., DNNs and CNNs), rather than Transformer model architectures.

Cascade optimization is similar to "dynamic routing", early exiting (especially "hierarchical early-exit"), and dynamic structural pruning (e.g. filter pruning, channel pruning, width pruning). The general class of algorithms is dynamic inference optimization (also called "adaptive inference"), where the model's execution is changed dynamically, depending on the inputs.

Cascades generalize to "multi-AI" architectures with multiple models acting together, technically called ensemble architectures. Examples include model selection algorithms, big-little architectures, speculative decoding, consensus-based decoding, and collaborative inference.
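The routing idea described above can be sketched as a minimal two-stage cascade: a cheap model answers first, and the input is escalated to an expensive model only when the cheap model's confidence falls below a threshold. The models, threshold value, and confidence logic below are hypothetical stand-ins for illustration, not any particular system's implementation.

```python
# Minimal sketch of two-stage cascade inference.
# Both "models" are placeholder functions returning (label, confidence);
# a real cascade would call actual small/large networks here.

def small_model(x):
    """Cheap first-stage classifier (stand-in)."""
    # Pretend short inputs are "easy" and classified confidently.
    conf = 0.95 if len(x) < 5 else 0.40
    return ("short", conf)

def big_model(x):
    """Expensive fallback model, invoked only on hard inputs (stand-in)."""
    return ("long" if len(x) >= 5 else "short", 0.99)

def cascade_infer(x, threshold=0.8):
    """Route the input down the cascade: accept the cheap model's answer
    if its confidence clears the threshold, otherwise escalate."""
    label, conf = small_model(x)
    if conf >= threshold:
        return label, "small"   # fast path: small model was confident
    label, _ = big_model(x)
    return label, "big"         # slow path: escalated to the big model

print(cascade_infer("abc"))       # easy input handled by the small model
print(cascade_infer("abcdefgh"))  # hard input escalated to the big model
```

The confidence threshold is the key tuning knob: raising it improves accuracy but sends more inputs to the expensive model, trading speed for quality.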

Research on Cascades

Research papers on cascade optimizations:

  • P. Panda, A. Sengupta, and K. Roy, “Conditional deep learning for energy-efficient and enhanced pattern recognition,” in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016. https://arxiv.org/abs/1509.08971
  • Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris, "MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at the Consumer Edge", 2023 IEEE Symposium on Computers and Communications (ISCC), pp.411-416, 2023. https://ieeexplore.ieee.org/document/10217872
  • Oihane Gómez-Carmona, Diego Casado-Mansilla, Diego López-de-Ipiña, Javier García-Zubia, "Optimizing Computational Resources for Edge Intelligence Through Model Cascade Strategies", IEEE Internet of Things Journal, vol.9, no.10, pp.7404-7417, 2022. https://ieeexplore.ieee.org/document/9564246
  • Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, Bart Dhoedt, "The cascading neural network: building the Internet of Smart Things", Knowledge and Information Systems, 2017. https://doi.org/10.1007/s10115-017-1029-1
  • Wang, X., Luo, Y., Crankshaw, D., Tumanov, A., Yu, F., and Gonzalez, J. E. (2018). IDK Cascades: Fast deep learning by learning not to overthink. https://arxiv.org/abs/1706.00885
  • Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J. Smola. 2020. Transformer on a Diet. arXiv e-prints (2020), arXiv:2002.06170. https://arxiv.org/abs/2002.06170
  • K. Neshatpour, F. Behnia, H. Homayoun, and A. Sasan. ICNN: An iterative implementation of convolutional neural networks to enable energy and computational complexity aware dynamic approximation. In Design, Automation, and Test in Europe Conference, pages 551–556, 2018. https://ieeexplore.ieee.org/document/8342068 (Sequences of small feed-forward networks focus on parts of an image.)
  • Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017 https://arxiv.org/abs/1605.07648 (Not cascades, but similar conceptually.)
  • H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015. https://ieeexplore.ieee.org/document/7299170
  • Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3476–3483. IEEE, 2013. https://ieeexplore.ieee.org/document/6619290
  • Thomas Dean, Mark A Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijayanarasimhan, and Jay Yagnik. 2013. Fast, accurate detection of 100,000 object classes on a single machine. In Proc. CVPR. https://web.stanford.edu/class/cs231m/references/hashing-dpm.pdf
  • A. Kouris, S. I. Venieris, C. Bouganis, Cascade CNN: Pushing the performance limits of quantisation in convolutional neural networks, in: 2018 28th International Conference on Field Programmable Logic and Applications (FPL), 2018, pp. 155–1557. https://doi.org/10.1109/FPL.2018.00034
  • A. Kouris, S. Venieris, C.-S. Bouganis, A throughput-latency co-optimised cascade of convolutional neural network classifiers, IEEE, 2019. http://hdl.handle.net/10044/1/75445
  • E. S. Marquez, J. S. Hare, M. Niranjan, Deep cascade learning, IEEE Transactions on Neural Networks and Learning Systems 29 (11) (2018) 5475–5485. https://doi.org/10.1109/TNNLS.2018.2805098
  • Berestizshevsky, K., Even, G.: Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence. In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26 (Early exit; somewhat related to cascades.)
  • Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844 https://arxiv.org/abs/1703.09844 (Hierarchical early-exit scheme with multiple models is conceptually similar to cascades.)
  • Jayakodi, N.K., Chatterjee, A., Choi, W., Doppa, J.R., Pande, P.P.: Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(11), 2881–2893 (2018). https://doi.org/10.1109/tcad.2018.2857338, https://arxiv.org/abs/1901.10584
  • Passalis, N., Raitoharju, J., Tefas, A., Gabbouj, M.: Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits. Pattern Recognition 105, 107346 (2020). https://doi.org/10.1016/j.patcog.2020.107346, PDF: https://hal.science/hal-03265174/document (Hierarchical early exit is similar to cascades.)
  • A Moos, 2023, Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs, arXiv preprint arXiv:2309.03530, https://arxiv.org/pdf/2309.03530.pdf (Fast inference for a soccer-playing robot with cascade-like hierarchical early exits.)
  • F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. CVPR, 2016. https://ieeexplore.ieee.org/document/7780603, PDF: https://www.cvlibs.net/projects/autonomous_vision_survey/literature/Yang2016CVPR.pdf (Cascaded rejection classifiers.)
  • Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, Yizhou Yu May 2015, HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition, https://arxiv.org/abs/1410.0736
  • Y Tang, T Iwaguchi, H Kawasaki, 2023, Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy, arXiv preprint arXiv:2309.03445, https://arxiv.org/abs/2309.03445, Code: https://github.com/piggy2009/DM_underwater (Skipping iteratively is somewhat similar to cascading.)
  • D. Kang, J. Emmons, F. Abuzaid, P. Bailis and M. Zaharia, "NoScope: Optimizing neural network queries over video at scale", Proc. VLDB Endowment, vol. 10, no. 11, pp. 1586-1597, 2017. https://arxiv.org/abs/1703.02529 (Cascades when analyzing images in video in real-time.)
  • P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. https://ieeexplore.ieee.org/document/990517
  • Zhaowei Cai, Mohammad Saberian, Nuno Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In ICCV, 2015. https://ieeexplore.ieee.org/document/8686227
  • Rodrigo Verschae, Javier Ruiz-del-Solar & Mauricio Correa, A unified learning framework for object detection and classification using nested cascades of boosted classifiers. Machine Vision and Applications, 19(2), 2008, https://link.springer.com/article/10.1007/s00138-007-0084-0
  • Francesco Daghero, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini, Enrico Macii, Massimo Poncino, "Energy-Efficient Adaptive Machine Learning on IoT End-Nodes With Class-Dependent Confidence", 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp.1-4, 2020. https://ieeexplore.ieee.org/document/9294863, https://arxiv.org/abs/2204.03431v1 (An improved stopping policy for early exits on easy-input classification tasks.)
  • Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the Eighteenth European Conference on Computer Systems, pages 233–248, 2023. https://dl.acm.org/doi/10.1145/3552326.3587438, PDF: https://yidingwang.xyz/public/files/tabi_eurosys23.pdf (Has multiple models, some big, some small, with characteristics similar to ensembles, big-little, and cascades.)
  • P Kavehzadeh, M Valipour, M Tahaei, A Ghodsi, 2023, Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT), arXiv preprint, https://arxiv.org/pdf/2309.08968.pdf (Cascade-like item: SortedNet method unlocks the potential of intermediate layers.)
  • Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar, 29 May 2024, Faster Cascades via Speculative Decoding, https://arxiv.org/abs/2405.19261 (A combination of cascades with speculative decoding.)
  • Z. Chen, X. Yang, J. Lin, C. Sun, J. Huang, and K. C.-C. Chang, 27 Feb 2024 (v4), “Cascade speculative drafting for even faster llm inference,” arXiv preprint arXiv:2312.11462, 2023. https://arxiv.org/abs/2312.11462 Code: https://github.com/lfsszd/CS-Drafting (Uses non-LLM draft models based on statistical language models and also uses multiple draft models in a hierarchy.)
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques, with much focus on image and video processing optimization for LLMs.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • MM Rastikerdar, J Huang, S Fang, H Guan, D Ganesan, Oct 2023, Efficient IoT Inference via Context-Awareness, https://arxiv.org/pdf/2310.19112.pdf (Does dynamic context-aware "classifier switching" which is similar to cascades and/or early exiting.)
  • Seulki Lee and Shahriar Nirjon. 2020. SubFlow: A Dynamic Induced-Subgraph Strategy Toward Real-Time DNN Inference and Training. In 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE, 15–29. https://ieeexplore.ieee.org/document/9113121
  • Mingfei Gao, Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. 2018. Dynamic zoom-in network for fast object detection in large images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 6926–6935. https://ieeexplore.ieee.org/document/8578822 https://arxiv.org/abs/1711.05187 (Analysis of objects in image from coarse to fine resolution, similar to cascading.)
  • Huizi Mao, Taeyoung Kong, and Bill Dally. 2019. CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video. In Proceedings of Machine Learning and Systems 2019, MLSys 2019, Stanford, CA, USA, March 31-April 2, 2019. https://arxiv.org/abs/1810.00434
  • Denoyer, Ludovic and Gallinari, Patrick, 2014, Deep sequential neural network. CoRR, abs/1410.0510, http://arxiv.org/abs/1410.0510
  • M Salehi, S Mehta, A Kusupati, A Farhadi, H Hajishirzi, 2023 Sharcs: Efficient transformers through routing with dynamic width sub-networks https://arxiv.org/pdf/2310.12126.pdf (Direct queries to subnetworks with different widths.)
  • Minkyu Kim and Jae Sun Seo. 2021. An energy-efficient deep convolutional neural network accelerator featuring conditional computing and low external memory access. IEEE Journal of Solid-State Circuits 56, 3 (2021), 803–813, https://ieeexplore.ieee.org/document/9229157
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith, 2 Jul 2024, Revisiting Cascaded Ensembles for Efficient Inference https://arxiv.org/abs/2407.02348
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Kai Zhang, Liqian Peng, Congchao Wang, Alec Go, Xiaozhong Liu, 10 Oct 2024, LLM Cascade with Multi-Objective Optimal Consideration, https://arxiv.org/abs/2410.08014
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris, 5 Dec 2024, MultiTASC++: A Continuously Adaptive Scheduler for Edge-Based Multi-Device Cascade Inference, https://arxiv.org/abs/2412.04147

More AI Research

Read more about: