Aussie AI
Conditional Computation
-
Last Updated 3 September, 2024
-
by David Spuler, Ph.D.
Conditional computation is an optimization technique for AI model inference where simple computations are done first, so that more complicated and expensive computations are only done "conditionally" and often avoided completely. Other names for conditional computation as a programming optimization technique include "skipping", "lazy evaluation", "easy case first", "simple case first", and "common case first".
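As a simple illustration of the "easy case first" idea, the sketch below shows a dot product loop that applies a cheap zero test to each weight, and only performs the more expensive multiply-accumulate when that test fails. This is a toy example for illustration only, not a production kernel (real kernels would typically vectorize and minimize branching).

```cpp
// Minimal sketch of "easy case first" conditional computation:
// a dot product that tests for zero weights (a cheap comparison)
// before doing the more expensive multiply-accumulate.
#include <cstdio>
#include <vector>

float dot_product_zero_skipping(const std::vector<float>& weights,
                                const std::vector<float>& inputs) {
    float sum = 0.0f;
    for (size_t i = 0; i < weights.size(); ++i) {
        if (weights[i] == 0.0f) continue;   // cheap test first: skip the multiply entirely
        sum += weights[i] * inputs[i];      // expensive work done only conditionally
    }
    return sum;
}

int main() {
    std::vector<float> w = {0.0f, 1.5f, 0.0f, -2.0f};
    std::vector<float> x = {3.0f, 2.0f, 5.0f, 1.0f};
    std::printf("dot = %f\n", dot_product_zero_skipping(w, x));  // 1.5*2 + (-2)*1 = 1.0
    return 0;
}
```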
When applied to neural network inference, conditional computation is a type of dynamic inference (or "adaptive inference"), where the computations change dynamically based on the input sequence, and only parts of the full model are activated. Some examples of conditional computation algorithms for dynamic inference include:
- Zero skipping (including skipping negative values sent to ReLU)
- Layer skipping
- Dynamic sparsification
- Dynamic pruning (e.g. channel pruning, filter pruning, head pruning)
- Early exiting layers (see the sketch after this list)
- Low-rank matrix factorization
- Cascades
- Speculative decoding
- Big-little architectures (dynamically selecting either a small or large model)
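To make one of these techniques concrete, here is a minimal sketch of layer-wise early exiting: after each layer, a cheap confidence check decides whether the remaining (more expensive) layers can be skipped. The toy "layer" and "confidence" functions below are invented placeholders for illustration, not a real model or library API.

```cpp
// Hedged sketch of layer-wise early exiting, one form of conditional computation.
// The "layer" here is a toy stand-in; a real model would run attention/FFN blocks
// and use a trained intermediate classifier for the confidence estimate.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Toy stand-in for one expensive model layer: amplifies the hidden state.
static std::vector<float> run_layer(const std::vector<float>& hidden) {
    std::vector<float> out(hidden.size());
    for (size_t i = 0; i < hidden.size(); ++i)
        out[i] = 1.2f * hidden[i];
    return out;
}

// Toy confidence: softmax probability of the largest element of the hidden state.
static float confidence(const std::vector<float>& hidden) {
    float max_v = *std::max_element(hidden.begin(), hidden.end());
    float denom = 0.0f;
    for (float v : hidden) denom += std::exp(v - max_v);
    return 1.0f / denom;   // softmax value of the maximum element
}

int main() {
    const int num_layers = 12;
    const float exit_threshold = 0.95f;  // exit once the prediction is confident enough
    std::vector<float> hidden = {0.2f, 3.5f, -1.0f, 0.7f};

    int layers_run = 0;
    for (int layer = 0; layer < num_layers; ++layer) {
        hidden = run_layer(hidden);
        ++layers_run;
        // Conditional computation: skip the remaining layers if already confident.
        if (confidence(hidden) >= exit_threshold)
            break;
    }
    std::printf("ran %d of %d layers (confidence %.3f)\n",
                layers_run, num_layers, confidence(hidden));
    return 0;
}
```

With this amplifying toy layer, the loop exits after only a couple of the twelve layers; in a real model the intermediate classifier and the exit threshold would be trained and tuned to preserve accuracy.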
Research on Conditional Computation
Research papers on various types of conditional computation, with an initial cheap computation to avoid a larger subsequent computation, include:
- Yuxiang Huan, Yifan Qin, Yantian You, Lirong Zheng, and Zhuo Zou. Sep 2016. A multiplication reduction technique with near-zero approximation for embedded learning in IoT devices. 2016 29th IEEE International System-on-Chip Conference (SOCC), 102–107. https://ieeexplore.ieee.org/abstract/document/7905445 (Avoids multiplications involving near-zero values by efficiently counting the number of leading zeros in the floating-point representation using bitwise arithmetic.)
- Duvindu Piyasena, Rukshan Wickramasinghe, Debdeep Paul, Siew Kei Lam, and Meiqing Wu. 2019. Reducing dynamic power in streaming CNN hardware accelerators by exploiting computational redundancies. Proceedings 29th International Conference on Field-Programmable Logic and Applications, FPL 2019 (9 2019), 354–359, https://ieeexplore.ieee.org/document/8891989 PDF: https://siewkeilam.github.io/ei-research-group/Paper/2019H-Duvindu-FPL.pdf ("Negative skipping": quickly estimates the computed values, thereby skipping entire computations that would yield negative results, since those would be zeroed by the ReLU activation; a simplified sketch of this idea appears after this list.)
- Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel, 2022, Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp 1–36 https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737 (Extensive survey with a section on "Skipping" which discusses conditional computation.)
- T. Ujiie, M. Hiromoto, and T. Sato. 2016. Approximated Prediction Strategy for Reducing Power Consumption of Convolutional Neural Network Processor. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 870–876. https://ieeexplore.ieee.org/document/7789603, PDF: https://openaccess.thecvf.com/content_cvpr_2016_workshops/w14/papers/Ujiie_Approximated_Prediction_Strategy_CVPR_2016_paper.pdf ("Negative skipping": uses fast logic with ternary weights to quickly approximate the value of a convolution, so as to skip it entirely if the result is expected to be negative.)
- JA Chen, W Niu, B Ren, Y Wang, X Shen, 2023, Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Survey paper covering various data redundancy optimizations such as skipping or reusing computations for similar data.)
- Mingcong Song, Jiechen Zhao, Yang Hu, Jiaqi Zhang, Tao Li, 2018, Prediction based execution on deep neural networks, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/abstract/document/8416870/, https://www.researchgate.net/profile/Mingcong-Song/publication/326566905_Prediction_Based_Execution_on_Deep_Neural_Networks/links/5bd68551a6fdcc3a8dad72ff/Prediction-Based-Execution-on-Deep-Neural-Networks.pdf
- H Park, D Kim, J Ahn, S Yoo, 2016, Zero and data reuse-aware fast convolution for deep neural networks on GPU, 2016 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), https://dl.acm.org/doi/abs/10.1145/2968456.2968476, https://ieeexplore.ieee.org/document/7750981 (Zero-skipping by prediction of the results.)
- Tianyang Lin, Yuxin Wang, Xiangyang Liu, Xipeng Qiu, June 2021, A Survey of Transformers, AI Open, https://arxiv.org/abs/2106.04554 (Examines some Transformer models with "Adaptive Computation Transformer" (ACT) architectures.)
- Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. 2020. Controlling Computation versus Quality for Neural Sequence Models. arXiv:2002.07106 [cs.LG], https://arxiv.org/abs/2002.07106
- Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. 2019. Universal Transformers. In Proceedings of ICLR. https://openreview.net/forum?id=HyzdRiR9Y7, PDF: https://openreview.net/pdf?id=HyzdRiR9Y7
- Huang G., Chen D., Li T., Wu F., van der Maaten L., Weinberger K.Q., 2018, Multi-scale dense networks for resource efficient image classification, International conference on learning representations (2018), https://arxiv.org/abs/1703.09844
- Wang Y., Lv K., Huang R., Song S., Yang L., Huang G., 2020, Glance and focus: a dynamic approach to reducing spatial redundancy in image classification, Advances in neural information processing systems, Vol. 33 (2020), pp. 2432-2444, https://arxiv.org/abs/2010.05300, Code: https://github.com/blackfeather-wang/GFNet-Pytorch (Focuses on a small subset of the input to speed up inference with early-exit based on confidence level.)
- Hajin Shim, Sung Ju Hwang, and Eunho Yang. Joint active feature acquisition and classification with variable-size set encoding. NeurIPS, pages 1368–1378, 2018. https://papers.nips.cc/paper/2018/file/e5841df2166dd424a57127423d276bbe-Paper.pdf
- Weizhe Hua, Yuan Zhou, Christopher M De Sa, Zhiru Zhang, and G Edward Suh. Channel gating neural networks. NeurIPS, pages 1884–1894, 2019, https://arxiv.org/abs/1805.12549
- Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, and Stephen Lin. 2020. Spatially adaptive inference with stochastic feature sampling and interpolation. arXiv preprint arXiv:2003.08866, https://arxiv.org/abs/2003.08866
- Yang L., Han Y., Chen X., Song S., Dai J., Huang G., 2020, Resolution adaptive networks for efficient inference, 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (2020), pp. 2369-2378, https://arxiv.org/abs/2003.07326
- Bengio Y., Léonard N., Courville A., 2013, Estimating or propagating gradients through stochastic neurons for conditional computation, arXiv:1308.3432, https://arxiv.org/abs/1308.3432
- Davis A., Arel I., 2013, Low-rank approximations for conditional feedforward computation in deep neural networks, arXiv:1312.4461, https://arxiv.org/abs/1312.4461
- Ignacio de Gregorio, April 2024, Mixture-of-Depths, a Dazzling New AI Breakthrough: Conditional Computing is Finally Here, Medium, https://medium.com/@ignacio.de.gregorio.noblejas/mixture-of-depths-a-dazzling-new-ai-breakthrough-be958fc629b2 (Mixture-of-depths is a layer-wise, per-token limit on attention computations, akin to width pruning combined with dynamic depth.)
- David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, Adam Santoro, 2 Apr 2024, Mixture-of-Depths: Dynamically allocating compute in transformer-based language models, https://arxiv.org/abs/2404.02258 (Per-layer pruning of which tokens participate in the attention computations, giving a type of lengthwise pruning combined with dynamic width pruning or a slimmable network approach.)
- Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane, 15 Dec 2023, Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference, https://arxiv.org/abs/2312.10193 (Modifies its computation depending on the difficulty of each input token.)
- Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia, 23 Dec 2023, Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems, https://arxiv.org/abs/2312.15234
- Rafael Fão de Moura, Paulo C Santos, João Paulo C de Lima, Marco AZ Alves, Antonio CS Beck, and Luigi Carro. 2019. Skipping CNN convolutions through efficient memoization. In International Conference on Embedded Computer Systems. Springer, 65–76. https://link.springer.com/chapter/10.1007/978-3-030-27562-4_5
- Weijie Chen, Yuan Zhang, Di Xie, and Shiliang Pu. 2019. A layer decomposition-recomposition framework for neuron pruning towards accurate lightweight networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3355–3362. https://arxiv.org/abs/1812.06611 (Layerwise dynamic structural pruning of unimportant neurons.)
- Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, and Tomoaki Nishimura. 2020. Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error. IJCAI. https://arxiv.org/abs/1808.08558 (A type of structured pruning based on information loss metrics.)
- J Ainslie, T Lei, M de Jong, S Ontañón, 2023, CoLT5: Faster Long-Range Transformers with Conditional Computation, https://arxiv.org/abs/2303.09752
- Denoyer, Ludovic and Gallinari, Patrick, 2014, Deep sequential neural network. CoRR, abs/1410.0510, http://arxiv.org/abs/1410.0510
- Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, Zhifeng Chen, 2020, GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, https://arxiv.org/abs/2006.16668
- M Lin, J Fu, Y Bengio, 2019, Conditional computation for continual learning, arXiv preprint arXiv:1906.06635, https://arxiv.org/abs/1906.06635
- Y Lou, F Xue, Z Zheng, Y You, 2022, Cross-token modeling with conditional computation, arXiv preprint arXiv:2109.02008, https://arxiv.org/abs/2109.02008
- Simone Scardapane, Alessandro Baiocchi, Alessio Devoto, Valerio Marsocci, Pasquale Minervini, Jary Pomponi, 12 Mar 2024, Conditional computation in neural networks: principles and research trends, https://arxiv.org/abs/2403.07965 (Investigated three types of dynamic inference: MoE, early exit, and token selection.)
- Hengyuan Hu, 2016, Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures, Papers with Code, https://paperswithcode.com/paper/network-trimming-a-data-driven-neuron-pruning
- Xitong Gao, 2019, Dynamic Channel Pruning: Feature Boosting and Suppression, Papers with Code, https://paperswithcode.com/paper/dynamic-channel-pruning-feature-boosting-and
- V Vanhoucke, A Senior, MZ Mao, 2011, Improving the speed of neural networks on CPUs, Google Research, https://research.google/pubs/pub37631.pdf
- Erdem Koyuncu, 20 Mar 2023, Memorization Capacity of Neural Networks with Conditional Computation, https://arxiv.org/abs/2303.11247
- Folino, F., Folino, G., Pisani, F.S. et al., 2024, Efficiently approaching vertical federated learning by combining data reduction and conditional computation techniques. J Big Data 11, 77 (2024). https://doi.org/10.1186/s40537-024-00933-6 https://link.springer.com/article/10.1186/s40537-024-00933-6 https://link.springer.com/content/pdf/10.1186/s40537-024-00933-6.pdf
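As referenced in the Piyasena et al. and Ujiie et al. entries above, "negative skipping" uses an initial cheap estimate to predict whether a result will be negative, skipping the full computation because ReLU would zero it out anyway. The sketch below illustrates only the general pattern; the estimator used here (a partial dot product over a short prefix) is an invented placeholder, not the estimation schemes used in those papers.

```cpp
// Hedged sketch of "negative skipping": a cheap approximate dot product predicts
// the sign of the result, and the full-precision computation is skipped when the
// result is predicted to be negative, since ReLU would zero it anyway.
#include <cstdio>
#include <vector>

// Cheap estimate: partial dot product over a small prefix of the vectors
// (placeholder estimator; real schemes use ternary weights, low precision, etc.).
static float estimate_dot(const std::vector<float>& w, const std::vector<float>& x,
                          size_t prefix) {
    float sum = 0.0f;
    for (size_t i = 0; i < prefix && i < w.size(); ++i)
        sum += w[i] * x[i];
    return sum;
}

// Full-precision dot product (the expensive computation we hope to skip).
static float full_dot(const std::vector<float>& w, const std::vector<float>& x) {
    float sum = 0.0f;
    for (size_t i = 0; i < w.size(); ++i)
        sum += w[i] * x[i];
    return sum;
}

// ReLU(w.x) with negative skipping: if the cheap estimate is well below zero,
// skip the full dot product and output zero directly.
float relu_dot_with_negative_skipping(const std::vector<float>& w,
                                      const std::vector<float>& x) {
    const float margin = -1.0f;           // skip only when confidently negative
    if (estimate_dot(w, x, 4) < margin)
        return 0.0f;                      // expensive computation avoided
    float dot = full_dot(w, x);
    return dot > 0.0f ? dot : 0.0f;       // standard ReLU
}

int main() {
    std::vector<float> w = {-1.0f, -2.0f, 0.5f, -0.5f, 0.1f, 0.2f};
    std::vector<float> x = { 1.0f,  1.0f, 1.0f,  1.0f, 1.0f, 1.0f};
    std::printf("output = %f\n", relu_dot_with_negative_skipping(w, x));  // skipped, 0.0
    return 0;
}
```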
More AI Research
Read more about:
- Caching and Data Reuse
- Zero Skipping
- Code Optimizations
- Inference Optimizations
- Loop Optimizations
- « Research Home