Aussie AI

Types of Adaptive Inference

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Adaptive inference is not yet one of the mainstream optimizations in AI engines. However, there are many research areas where this general idea is applied in a particular way to speed up inference. The main techniques in which the computation path “adapts” to the user's input include:

  • Early exit (dynamic layer pruning)
  • Layer skipping
  • Layer reordering
  • Dynamic width pruning
  • Token pruning (dynamic length pruning)
  • Prompt compression
  • Semantic caching
  • Cascades (see the sketch after this list)
  • Stochastic algorithms (intentional randomness)
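
As a concrete illustration of the general idea, consider a cascade: a small, cheap model runs first, and the engine only escalates to a large, expensive model when the small model's confidence is too low. The C++ sketch below shows just this control flow; the Result type, the callable models, and the 0.9 confidence threshold are illustrative assumptions, not any particular engine's API.

    #include <functional>
    #include <string>

    // Result of one inference call: an answer plus a confidence in [0,1].
    struct Result {
        std::string answer;
        float confidence;
    };

    // A "model" here is just a callable, standing in for a real engine.
    using Model = std::function<Result(const std::string&)>;

    // Cascade: run the cheap model first, and only pay for the expensive
    // model when the cheap one is unsure. The sequence of computations
    // thus adapts to how hard the input is.
    Result cascade_inference(const Model& small_model,
                             const Model& large_model,
                             const std::string& prompt,
                             float threshold = 0.9f)
    {
        Result r = small_model(prompt);
        if (r.confidence >= threshold)
            return r;                 // easy input: small model suffices
        return large_model(prompt);   // hard input: escalate
    }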

As another example, “early exiting” is a type of “dynamic layer pruning” in which the remaining layers of weights are simply skipped at runtime once an intermediate result looks confident enough. Instead of running through all 12 layers of GPT-2, the engine might decide to stop after only 9 layers for some inputs.
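
The control flow of early exit can be sketched as a loop over layers with a confidence test after each one. This is a minimal sketch only: the Layer type, its forward() call, and the max-activation confidence heuristic are placeholder assumptions (a real engine would more likely test the top softmax probability of a small classifier head attached to the intermediate layer).

    #include <algorithm>
    #include <vector>

    // Placeholder for one layer of weights: forward() transforms the
    // activation vector in place (the real matrix arithmetic is omitted).
    struct Layer {
        void forward(std::vector<float>& activations) const {
            // ... weight multiplications would go here ...
        }
    };

    // Illustrative confidence heuristic: the largest activation value.
    float exit_confidence(const std::vector<float>& activations) {
        if (activations.empty()) return 0.0f;
        return *std::max_element(activations.begin(), activations.end());
    }

    // Early exit: stop after any layer whose intermediate result is
    // already confident enough, skipping all remaining layers.
    void run_with_early_exit(const std::vector<Layer>& layers,
                             std::vector<float>& activations,
                             float threshold = 0.95f)
    {
        for (const Layer& layer : layers) {
            layer.forward(activations);
            if (exit_confidence(activations) >= threshold)
                break;   // e.g. exit after 9 of GPT-2's 12 layers
        }
    }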

Adaptive inference research. The above specific techniques are covered in other chapters, with pointers to research papers. Research papers on adaptive inference in general include:

  1. Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, “Input Hardness Adaptive Models” for methods of running faster on easy image classification problems.)
  2. Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference, In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html
  3. Nan, F. and Saligrama, V. 2017. Adaptive classification for prediction under a budget, In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (editors), Advances in Neural Information Processing Systems, volume 30, pages 4727–4737. Curran Associates. https://arxiv.org/abs/1705.10194, PDF: https://proceedings.neurips.cc/paper_files/paper/2017/file/d9ff90f4000eacd3a6c9cb27f78994cf-Paper.pdf
  4. Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget, arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
  5. Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. 2022. Dynamic neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 44, pages 7436–7456, Los Alamitos, CA, USA. IEEE Computer Society. https://arxiv.org/abs/2102.04906, https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837 (Survey of dynamic inference techniques, where the engine is adaptive to the input.)
  6. Graves, A. (2016). Adaptive computation time for recurrent neural networks, CoRR, abs/1603.08983. http://arxiv.org/abs/1603.08983
  7. Jernite, Y., Grave, E., Joulin, A., and Mikolov, T. (2017). Variable computation in recurrent neural networks, In International Conference on Learning Representations. https://openreview.net/forum?id=S1LVSrcge, https://arxiv.org/abs/1611.06188
  8. D. Stamoulis, T.-W. Chin, A. K. Prakash, H. Fang, S. Sajja, M. Bognar, and D. Marculescu, 2018, Designing adaptive neural networks for energy-constrained image classification, in Proceedings of the International Conference on Computer-Aided Design, 2018, pp. 1–8. https://ieeexplore.ieee.org/document/8587753
  9. Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, Joseph E. Gonzalez, 2018, Skipnet: Learning dynamic routing in convolutional networks, in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 409–424. https://arxiv.org/abs/1711.09485
  10. Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang, Yingyan Lin, 2020, Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference, IEEE Journal of Selected Topics in Signal Processing, 2020. https://arxiv.org/abs/1907.04523
  11. P. Panda, A. Sengupta, and K. Roy, 2016, Conditional deep learning for energy-efficient and enhanced pattern recognition, in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016. https://arxiv.org/abs/1509.08971
  12. Berestizshevsky, K., Even, G., 2019, Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence, In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26
  13. Jayakodi, N.K., Chatterjee, A., Choi, W., Doppa, J.R., Pande, P.P., Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(11), 2881–2893 (2018). https://doi.org/10.1109/tcad.2018.2857338, https://arxiv.org/abs/1901.10584
  14. Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 40, no. 10, pp. 2090–2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input-dependent inference optimization via layer-wise weight clustering and early exit based on a termination condition.)
  15. Maedeh Hemmat, Azadeh Davoodi, March 2019, Dynamic Reconfiguration of CNNs for Input-Dependent Approximation, 20th International Symposium on Quality Electronic Design (ISQED), https://ieeexplore.ieee.org/document/8697843 (Dynamically decides how many clusters of similar weights to use, depending on the input.)
  16. B. Wójcik, M. Przewięźlikowski, F. Szatkowski, Oct 2023, Zero time waste in pre-trained early exit neural networks, Neural Networks, https://www.sciencedirect.com/science/article/pii/S0893608023005555, PDF: https://papers.nips.cc/paper/2021/file/149ef6419512be56a93169cd5e6fa8fd-Paper.pdf (Attempts to quickly handle easy inputs by caching classifications from prior layers for early exit decisions.)

For more general research on adaptive inference, refer to https://www.aussieai.com/research/inference-optimization#dynamic.

Non-adaptive methods. Note that there are numerous other ways to speed up inference at runtime that do not change the sequence of computations for different inputs. They are not “adaptive” because they perform the same inference calculations every time, only faster. For example, these methods are all inference optimizations, but not true adaptive inference:

  • Quantization (see the sketch after this list)
  • Model compression
  • Static pruning
  • Sparsity
  • Knowledge distillation
  • Inference caching
  • KV Caching
  • Integer algorithms
  • Zero-multiplication models
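
To see the contrast, consider quantization: the weights are converted to 8-bit integers once, offline, and every subsequent inference executes exactly the same (cheaper) instruction sequence no matter what the input is. Below is a minimal sketch of a quantized dot product kernel, assuming a simple symmetric scaling scheme rather than any particular engine's storage format.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Non-adaptive optimization: this quantized kernel runs identically
    // for every input; the arithmetic is cheaper (8-bit multiplies
    // accumulated in 32 bits), but the computation path never changes.
    int32_t quantized_dot_product(const std::vector<int8_t>& weights,
                                  const std::vector<int8_t>& inputs)
    {
        int32_t acc = 0;
        std::size_t n = std::min(weights.size(), inputs.size());
        for (std::size_t i = 0; i < n; ++i)
            acc += int32_t(weights[i]) * int32_t(inputs[i]);
        return acc;  // caller rescales by the static quantization scales
    }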

 
