Aussie AI

Cascades

  • Last Updated 8 December, 2024
  • by David Spuler, Ph.D.

Cascades are a type of model inference optimization where execution flows down through a "cascade" of sub-structures, with the routing sequence depending on the inputs. This optimization mainly relates to earlier types of neural networks (e.g., DNNs and CNNs), rather than Transformer model architectures.

Cascade optimization is similar to "dynamic routing", early exiting (especially "hierarchical early-exit"), and dynamic structural pruning (e.g. filter pruning, channel pruning, width pruning). The general class of algorithms is dynamic inference optimization (also called "adaptive inference"), where the model's execution is changed dynamically, depending on the inputs.

Cascades generalize to "multi-AI" architectures with multiple models acting together, technically called ensemble architectures. Examples include model selection algorithms, big-little architectures, speculative decoding, consensus-based decoding, and collaborative inference.
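The routing idea described above can be sketched as a minimal two-stage cascade: a cheap model answers first, and the input is escalated to an expensive model only when the cheap model's confidence falls below a threshold. The models, threshold value, and confidence logic below are hypothetical stand-ins for illustration, not any particular system's implementation.

```python
# Minimal sketch of two-stage cascade inference.
# Both "models" are placeholder functions returning (label, confidence);
# a real cascade would call actual small/large networks here.

def small_model(x):
    """Cheap first-stage classifier (stand-in)."""
    # Pretend short inputs are "easy" and classified confidently.
    conf = 0.95 if len(x) < 5 else 0.40
    return ("short", conf)

def big_model(x):
    """Expensive fallback model, invoked only on hard inputs (stand-in)."""
    return ("long" if len(x) >= 5 else "short", 0.99)

def cascade_infer(x, threshold=0.8):
    """Route the input down the cascade: accept the cheap model's answer
    if its confidence clears the threshold, otherwise escalate."""
    label, conf = small_model(x)
    if conf >= threshold:
        return label, "small"   # fast path: small model was confident
    label, _ = big_model(x)
    return label, "big"         # slow path: escalated to the big model

print(cascade_infer("abc"))       # easy input handled by the small model
print(cascade_infer("abcdefgh"))  # hard input escalated to the big model
```

The confidence threshold is the key tuning knob: raising it improves accuracy but sends more inputs to the expensive model, trading speed for quality.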

Research on Cascades

Research papers on cascade optimizations:

  • P. Panda, A. Sengupta, and K. Roy, “Conditional deep learning for energy-efficient and enhanced pattern recognition,” in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016. https://arxiv.org/abs/1509.08971
  • Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris, "MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at the Consumer Edge", 2023 IEEE Symposium on Computers and Communications (ISCC), pp.411-416, 2023. https://ieeexplore.ieee.org/document/10217872
  • Oihane Gómez-Carmona, Diego Casado-Mansilla, Diego López-de-Ipiña, Javier García-Zubia, "Optimizing Computational Resources for Edge Intelligence Through Model Cascade Strategies", IEEE Internet of Things Journal, vol.9, no.10, pp.7404-7417, 2022. https://ieeexplore.ieee.org/document/9564246
  • Sam Leroux, Steven Bohez, Elias De Coninck, Tim Verbelen, Bert Vankeirsbilck, Pieter Simoens, Bart Dhoedt, "The cascading neural network: building the Internet of Smart Things", Knowledge and Information Systems, 2017. https://doi.org/10.1007/s10115-017-1029-1
  • Wang, X., Luo, Y., Crankshaw, D., Tumanov, A., Yu, F., and Gonzalez, J. E. (2018). IDK Cascades: Fast deep learning by learning not to overthink. https://arxiv.org/abs/1706.00885
  • Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, and Alexander J. Smola. 2020. Transformer on a Diet. arXiv e-prints (2020), arXiv:2002.06170. https://arxiv.org/abs/2002.06170
  • K. Neshatpour, F. Behnia, H. Homayoun, and A. Sasan. ICNN: An iterative implementation of convolutional neural networks to enable energy and computational complexity aware dynamic approximation. In Design, Automation, and Test in Europe Conference, pages 551–556, 2018. https://ieeexplore.ieee.org/document/8342068 (Sequences of small feed-forward networks focus on parts of an image.)
  • Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. In ICLR, 2017 https://arxiv.org/abs/1605.07648 (Not cascades, but similar conceptually.)
  • H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015. https://ieeexplore.ieee.org/document/7299170
  • Y. Sun, X. Wang, and X. Tang. Deep convolutional network cascade for facial point detection. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3476–3483. IEEE, 2013. https://ieeexplore.ieee.org/document/6619290
  • Thomas Dean, Mark A Ruzon, Mark Segal, Jonathon Shlens, Sudheendra Vijayanarasimhan, and Jay Yagnik. 2013. Fast, accurate detection of 100,000 object classes on a single machine. In Proc. CVPR. https://web.stanford.edu/class/cs231m/references/hashing-dpm.pdf
  • A. Kouris, S. I. Venieris, C. Bouganis, Cascade CNN: Pushing the performance limits of quantisation in convolutional neural networks, in: 2018 28th International Conference on Field Programmable Logic and Applications (FPL), 2018, pp. 155–1557. https://doi.org/10.1109/FPL.2018.00034
  • A. Kouris, S. Venieris, C.-S. Bouganis, A throughput-latency co-optimised cascade of convolutional neural network classifiers, IEEE, 2019. http://hdl.handle.net/10044/1/75445
  • E. S. Marquez, J. S. Hare, M. Niranjan, Deep cascade learning, IEEE Transactions on Neural Networks and Learning Systems 29 (11) (2018) 5475–5485. https://doi.org/10.1109/TNNLS.2018.2805098
  • Berestizshevsky, K., Even, G.: Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence. In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26 (Early exit; somewhat related to cascades.)
  • Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.Q.: Multi-scale dense networks for resource efficient image classification. In: 6th International Conference on Learning Representations, ICLR 2018 (2018). https://doi.org/10.48550/arXiv.1703.09844 https://arxiv.org/abs/1703.09844 (Hierarchical early-exit scheme with multiple models is conceptually similar to cascades.)
  • Jayakodi, N.K., Chatterjee, A., Choi, W., Doppa, J.R., Pande, P.P.: Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(11), 2881–2893 (2018). https://doi.org/10.1109/tcad.2018.2857338, https://arxiv.org/abs/1901.10584
  • Passalis, N., Raitoharju, J., Tefas, A., Gabbouj, M.: Efficient adaptive inference for deep convolutional neural networks using hierarchical early exits. Pattern Recognition 105, 107346 (2020). https://doi.org/10.1016/j.patcog.2020.107346, PDF: https://hal.science/hal-03265174/document (Hierarchical early exit is similar to cascades.)
  • A Moos, 2023, Efficient Single Object Detection on Image Patches with Early Exit Enhanced High-Precision CNNs, arXiv preprint arXiv:2309.03530, https://arxiv.org/pdf/2309.03530.pdf (Fast inference for a soccer-playing robot with cascade-like hierarchical early exits.)
  • F. Yang, W. Choi, and Y. Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. CVPR, 2016. https://ieeexplore.ieee.org/document/7780603, PDF: https://www.cvlibs.net/projects/autonomous_vision_survey/literature/Yang2016CVPR.pdf (Cascaded rejection classifiers.)
  • Zhicheng Yan, Hao Zhang, Robinson Piramuthu, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, Yizhou Yu May 2015, HD-CNN: Hierarchical Deep Convolutional Neural Network for Large Scale Visual Recognition, https://arxiv.org/abs/1410.0736
  • Y Tang, T Iwaguchi, H Kawasaki, 2023, Underwater Image Enhancement by Transformer-based Diffusion Model with Non-uniform Sampling for Skip Strategy, arXiv preprint arXiv:2309.03445, https://arxiv.org/abs/2309.03445, Code: https://github.com/piggy2009/DM_underwater (Skipping iteratively is somewhat similar to cascading.)
  • D. Kang, J. Emmons, F. Abuzaid, P. Bailis and M. Zaharia, "NoScope: Optimizing neural network queries over video at scale", Proc. VLDB Endowment, vol. 10, no. 11, pp. 1586-1597, 2017. https://arxiv.org/abs/1703.02529 (Cascades when analyzing images in video in real-time.)
  • P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In CVPR, 2001. https://ieeexplore.ieee.org/document/990517
  • Zhaowei Cai, Mohammad Saberian, Nuno Vasconcelos. Learning complexity-aware cascades for deep pedestrian detection. In ICCV, 2015. https://ieeexplore.ieee.org/document/8686227
  • Rodrigo Verschae, Javier Ruiz-del-Solar & Mauricio Correa, A unified learning framework for object detection and classification using nested cascades of boosted classifiers. Machine Vision and Applications, 19(2), 2008, https://link.springer.com/article/10.1007/s00138-007-0084-0
  • Francesco Daghero, Alessio Burrello, Daniele Jahier Pagliari, Luca Benini, Enrico Macii, Massimo Poncino, "Energy-Efficient Adaptive Machine Learning on IoT End-Nodes With Class-Dependent Confidence", 2020 27th IEEE International Conference on Electronics, Circuits and Systems (ICECS), pp.1-4, 2020. https://ieeexplore.ieee.org/document/9294863, https://arxiv.org/abs/2204.03431v1 (An improved stopping policy for early exits on easy-input classification tasks.)
  • Yiding Wang, Kai Chen, Haisheng Tan, and Kun Guo. Tabi: An efficient multi-level inference system for large language models. In Proceedings of the Eighteenth European Conference on Computer Systems, pages 233–248, 2023. https://dl.acm.org/doi/10.1145/3552326.3587438, PDF: https://yidingwang.xyz/public/files/tabi_eurosys23.pdf (Has multiple models, some big, some small, with characteristics similar to ensembles, big-little, and cascades.)
  • P Kavehzadeh, M Valipour, M Tahaei, A Ghodsi, 2023, Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT), arXiv preprint, https://arxiv.org/pdf/2309.08968.pdf (Cascade-like item: SortedNet method unlocks the potential of intermediate layers.)
  • Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, Sanjiv Kumar, 29 May 2024, Faster Cascades via Speculative Decoding, https://arxiv.org/abs/2405.19261 (A combination of cascades with speculative decoding.)
  • Z. Chen, X. Yang, J. Lin, C. Sun, J. Huang, and K. C.-C. Chang, 27 Feb 2024 (v4), “Cascade speculative drafting for even faster llm inference,” arXiv preprint arXiv:2312.11462, 2023. https://arxiv.org/abs/2312.11462 Code: https://github.com/lfsszd/CS-Drafting (Uses non-LLM draft models based on statistical language models and also uses multiple draft models in a hierarchy.)
  • M Sponner, B Waschneck, A Kumar, 2024, Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning, ACM Computing Surveys, PDF: https://dl.acm.org/doi/pdf/10.1145/3657283 (Survey of various adaptive inference optimization techniques, with much focus on image and video processing optimization for LLMs.)
  • Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
  • MM Rastikerdar, J Huang, S Fang, H Guan, D Ganesan, Oct 2023, Efficient IoT Inference via Context-Awareness, https://arxiv.org/pdf/2310.19112.pdf (Does dynamic context-aware "classifier switching" which is similar to cascades and/or early exiting.)
  • Seulki Lee and Shahriar Nirjon. 2020. SubFlow: A Dynamic Induced-Subgraph Strategy Toward Real-Time DNN Inference and Training. In 2020 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS). IEEE, 15–29. https://ieeexplore.ieee.org/document/9113121
  • Mingfei Gao, Ruichi Yu, Ang Li, Vlad I Morariu, and Larry S Davis. 2018. Dynamic zoom-in network for fast object detection in large images. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 6926–6935. https://ieeexplore.ieee.org/document/8578822 https://arxiv.org/abs/1711.05187 (Analysis of objects in image from coarse to fine resolution, similar to cascading.)
  • Huizi Mao, Taeyoung Kong, and Bill Dally. 2019. CaTDet: Cascaded Tracked Detector for Efficient Object Detection from Video. In Proceedings of Machine Learning and Systems 2019, MLSys 2019, Stanford, CA, USA, March 31-April 2, 2019. https://arxiv.org/abs/1810.00434
  • Denoyer, Ludovic and Gallinari, Patrick, 2014, Deep sequential neural network. CoRR, abs/1410.0510, http://arxiv.org/abs/1410.0510
  • M Salehi, S Mehta, A Kusupati, A Farhadi, H Hajishirzi, 2023 Sharcs: Efficient transformers through routing with dynamic width sub-networks https://arxiv.org/pdf/2310.12126.pdf (Direct queries to subnetworks with different widths.)
  • Minkyu Kim and Jae Sun Seo. 2021. An energy-efficient deep convolutional neural network accelerator featuring conditional computing and low external memory access. IEEE Journal of Solid-State Circuits 56, 3 (2021), 803–813, https://ieeexplore.ieee.org/document/9229157
  • David Spuler, March 2024, Chapter 54. Ensemble Multi-Model Architectures, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Steven Kolawole, Don Dennis, Ameet Talwalkar, Virginia Smith, 2 Jul 2024, Revisiting Cascaded Ensembles for Efficient Inference https://arxiv.org/abs/2407.02348
  • Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Kai Zhang, Liqian Peng, Congchao Wang, Alec Go, Xiaozhong Liu, 10 Oct 2024, LLM Cascade with Multi-Objective Optimal Consideration, https://arxiv.org/abs/2410.08014
  • Lingjiao Chen, Matei Zaharia, James Zou, 9 May 2023, FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, https://arxiv.org/abs/2305.05176
  • Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris, 5 Dec 2024, MultiTASC++: A Continuously Adaptive Scheduler for Edge-Based Multi-Device Cascade Inference, https://arxiv.org/abs/2412.04147

More AI Research

Read more about: