Aussie AI

Types of Adaptive Inference

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Adaptive inference is not yet one of the mainstream optimizations in AI engines. However, there are many research areas where this general idea is applied in a particular way to speed up inference. The main techniques in which the computation path “adapts” to the user's input include:

  • Early exit (dynamic layer pruning)
  • Layer skipping
  • Layer reordering
  • Dynamic width pruning
  • Token pruning (dynamic length pruning)
  • Prompt compression
  • Semantic caching
  • Cascades (see the sketch after this list)
  • Stochastic algorithms (intentional randomness)
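
As a concrete illustration of the general idea, consider a cascade: a small, cheap model runs first, and the engine only escalates to a large, expensive model when the small model's confidence is too low. The C++ sketch below shows just this control flow; the Result type, the callable models, and the 0.9 confidence threshold are illustrative assumptions, not any particular engine's API.

    #include <functional>
    #include <string>

    // Result of one inference call: an answer plus a confidence in [0,1].
    struct Result {
        std::string answer;
        float confidence;
    };

    // A "model" here is just a callable, standing in for a real engine.
    using Model = std::function<Result(const std::string&)>;

    // Cascade: run the cheap model first, and only pay for the expensive
    // model when the cheap one is unsure. The sequence of computations
    // thus adapts to how hard the input is.
    Result cascade_inference(const Model& small_model,
                             const Model& large_model,
                             const std::string& prompt,
                             float threshold = 0.9f)
    {
        Result r = small_model(prompt);
        if (r.confidence >= threshold)
            return r;                 // easy input: small model suffices
        return large_model(prompt);   // hard input: escalate
    }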

As another example, “early exiting” is a type of “dynamic layer pruning” in which the remaining layers of weights are simply skipped at runtime once an intermediate result looks confident enough. Instead of running through all 12 layers of GPT-2, the engine might decide to stop after only 9 layers for some inputs.
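
The control flow of early exit can be sketched as a loop over layers with a confidence test after each one. This is a minimal sketch only: the Layer type, its forward() call, and the max-activation confidence heuristic are placeholder assumptions (a real engine would more likely test the top softmax probability of a small classifier head attached to the intermediate layer).

    #include <algorithm>
    #include <vector>

    // Placeholder for one layer of weights: forward() transforms the
    // activation vector in place (the real matrix arithmetic is omitted).
    struct Layer {
        void forward(std::vector<float>& activations) const {
            // ... weight multiplications would go here ...
        }
    };

    // Illustrative confidence heuristic: the largest activation value.
    float exit_confidence(const std::vector<float>& activations) {
        if (activations.empty()) return 0.0f;
        return *std::max_element(activations.begin(), activations.end());
    }

    // Early exit: stop after any layer whose intermediate result is
    // already confident enough, skipping all remaining layers.
    void run_with_early_exit(const std::vector<Layer>& layers,
                             std::vector<float>& activations,
                             float threshold = 0.95f)
    {
        for (const Layer& layer : layers) {
            layer.forward(activations);
            if (exit_confidence(activations) >= threshold)
                break;   // e.g. exit after 9 of GPT-2's 12 layers
        }
    }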

Adaptive inference research. The above specific techniques are covered in other chapters, with pointers to research papers. Research papers on adaptive inference in general include:

  1. Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, “Input Hardness Adaptive Models” for methods of running faster on easy image classification problems.)
  2. Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference, In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html
  3. Nan, F. and Saligrama, V. 2017. Adaptive classification for prediction under a budget, In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (editors), Advances in Neural Information Processing Systems, volume 30, pages 4727–4737. Curran Associates. https://arxiv.org/abs/1705.10194, PDF: https://proceedings.neurips.cc/paper_files/paper/2017/file/d9ff90f4000eacd3a6c9cb27f78994cf-Paper.pdf
  4. Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget, arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
  5. Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. 2022. Dynamic neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 44, pages 7436–7456, Los Alamitos, CA, USA. IEEE Computer Society. https://arxiv.org/abs/2102.04906, https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837 (Survey of dynamic inference techniques, where the engine is adaptive to the input.)
  6. Graves, A. (2016). Adaptive computation time for recurrent neural networks, CoRR, abs/1603.08983. http://arxiv.org/abs/1603.08983
  7. Jernite, Y., Grave, E., Joulin, A., and Mikolov, T. (2017). Variable computation in recurrent neural networks, In International Conference on Learning Representations. https://openreview.net/forum?id=S1LVSrcge, https://arxiv.org/abs/1611.06188
  8. D. Stamoulis, T.-W. Chin, A. K. Prakash, H. Fang, S. Sajja, M. Bognar, and D. Marculescu, 2018, Designing adaptive neural networks for energy-constrained image classification, in Proceedings of the International Conference on Computer-Aided Design, 2018, pp. 1–8. https://ieeexplore.ieee.org/document/8587753
  9. Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, Joseph E. Gonzalez, 2018, Skipnet: Learning dynamic routing in convolutional networks, in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 409–424. https://arxiv.org/abs/1711.09485
  10. Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang, Yingyan Lin, 2020, Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference, IEEE Journal of Selected Topics in Signal Processing, 2020. https://arxiv.org/abs/1907.04523
  11. P. Panda, A. Sengupta, and K. Roy, 2016, Conditional deep learning for energy-efficient and enhanced pattern recognition, in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016. https://arxiv.org/abs/1509.08971
  12. Berestizshevsky, K., Even, G., 2019, Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence, In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26
  13. Jayakodi, N.K., Chatterjee, A., Choi, W., Doppa, J.R., Pande, P.P., Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(11), 2881–2893 (2018). https://doi.org/10.1109/tcad.2018.2857338, https://arxiv.org/abs/1901.10584
  14. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 10, pp. 2090–2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input-dependent inference optimization via layer-wise weight clustering and early exit based on a termination condition.)
  15. Maedeh Hemmat, Azadeh Davoodi, March 2019, Dynamic Reconfiguration of CNNs for Input-Dependent Approximation, 20th International Symposium on Quality Electronic Design (ISQED), https://ieeexplore.ieee.org/document/8697843 (Dynamically decides how many clusters of similar weights to use, depending on the input.)
  16. B. Wójcik, M. Przewięźlikowski, F. Szatkowski, Oct 2023, Zero time waste in pre-trained early exit neural networks, Neural Networks, https://www.sciencedirect.com/science/article/pii/S0893608023005555, PDF: https://papers.nips.cc/paper/2021/file/149ef6419512be56a93169cd5e6fa8fd-Paper.pdf (Attempts to quickly handle easy inputs by caching classifications from prior layers for early exit decisions.)

For more general research on adaptive inference, refer to https://www.aussieai.com/research/inference-optimization#dynamic.

Non-adaptive methods. Note that there are numerous other ways to speed up inference at runtime that do not change the sequence of computations for different inputs. They are not “adaptive” because they perform the same inference calculations every time, only faster. For example, these methods are all inference optimizations, but not true adaptive inference:

  • Quantization (see the sketch after this list)
  • Model compression
  • Static pruning
  • Sparsity
  • Knowledge distillation
  • Inference caching
  • KV Caching
  • Integer algorithms
  • Zero-multiplication models
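
To see the contrast, consider quantization: the weights are converted to 8-bit integers once, offline, and every subsequent inference executes exactly the same (cheaper) instruction sequence no matter what the input is. Below is a minimal sketch of a quantized dot product kernel, assuming a simple symmetric scaling scheme rather than any particular engine's storage format.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Non-adaptive optimization: this quantized kernel runs identically
    // for every input; the arithmetic is cheaper (8-bit multiplies
    // accumulated in 32 bits), but the computation path never changes.
    int32_t quantized_dot_product(const std::vector<int8_t>& weights,
                                  const std::vector<int8_t>& inputs)
    {
        int32_t acc = 0;
        std::size_t n = std::min(weights.size(), inputs.size());
        for (std::size_t i = 0; i < n; ++i)
            acc += int32_t(weights[i]) * int32_t(inputs[i]);
        return acc;  // caller rescales by the static quantization scales
    }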

 
