Aussie AI
Probabilistic Optimizations
-
Last Updated 14 October, 2024
-
by David Spuler, Ph.D.
I'm going to say it straight up: all neural networks are probabilistic algorithms. The entire concept of a neural network is a probabilistic approximation of reality based on what it has seen during training. In that sense, an AI engine is one big probabilistic algorithm, and optimizing it with another low-level probabilistic algorithm really creates a doubly-probabilistic algorithm. After all, conceptually speaking, the outputs of an AI engine, called "logits", represent the probabilities of outputting each particular word. The core functionality of training an AI model is teaching it which words are "likely", until it becomes good enough to "guesstimate" which words to output based on the probabilities it calculates.
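For concreteness, here is a minimal C++ sketch of the softmax step that turns raw logits into a probability distribution over the vocabulary; the function name and the three-word vocabulary are purely illustrative, not taken from any particular engine.

```c++
#include <cmath>
#include <cstdio>
#include <vector>

// Convert raw logits into a probability distribution (softmax).
std::vector<float> softmax(const std::vector<float>& logits) {
    float maxval = logits[0];
    for (float x : logits) if (x > maxval) maxval = x;   // subtract the max for numerical stability
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - maxval);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;                     // normalize so the probabilities sum to 1
    return probs;
}

int main() {
    std::vector<float> logits = { 2.0f, 1.0f, 0.1f };    // made-up logits for a tiny 3-word vocabulary
    for (float p : softmax(logits)) printf("%.3f ", p);  // prints approximately 0.659 0.242 0.099
    printf("\n");
    return 0;
}
```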
Similar to probabilistic algorithms are "stochastic algorithms", where an amount of randomness is intentionally injected to produce results. For example, in the core theory of Computer Science, the "Monte Carlo method" is a well-known technique for estimating the area under a curve using randomly generated points. Some of these concepts can also be applied to AI models. Stochastic algorithms are technically a subset of probabilistic algorithms, since there are probabilistic methods that do not require random numbers, but still rely on probabilities for correct execution (e.g., neural networks). An example of a stochastic algorithm for neural networks is "stochastic quantization", where intentional randomness is added in an attempt to improve the accuracy of a quantized model.
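As a small refresher on the Monte Carlo idea, here is a standalone C++ sketch that estimates the area under the curve y = x*x on [0,1] by counting random points that land beneath it; the sample count and the seed are arbitrary choices.

```c++
#include <cstdio>
#include <random>

int main() {
    std::mt19937 rng(42);                                 // fixed seed so the run is reproducible
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    const int samples = 1000000;
    int under = 0;
    for (int i = 0; i < samples; ++i) {
        double x = unif(rng);
        double y = unif(rng);
        if (y <= x * x) ++under;                          // the random point fell under the curve
    }
    printf("Estimated area: %f (exact value: 1/3)\n", (double)under / samples);
    return 0;
}
```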
The main use of probabilistic algorithms in AI inference is the randomness injected into decoding algorithms via the top-k, top-p, and temperature hyper-parameters. The randomness in these methods ensures that the AI engine doesn't output the exact same response every time it receives the same input, so it can show creativity and originality in its responses to prompts. Note that the goal of randomness in decoding algorithms is variety of responses; it does not improve the speed of inference. However, the final stage of decoding is rarely a bottleneck, so the randomness doesn't degrade latency either.
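To make the decoding step concrete, here is a simplified C++ sketch of temperature scaling combined with top-k sampling over a logits vector. It is a hypothetical standalone function (assuming k is no larger than the vocabulary size), not the API of any real engine.

```c++
#include <algorithm>
#include <cmath>
#include <numeric>
#include <random>
#include <vector>

// Pick the next token: keep only the k highest logits, apply temperature scaling,
// then sample randomly in proportion to the resulting probabilities.
int sample_top_k(const std::vector<float>& logits, int k, float temperature, std::mt19937& rng) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });      // top-k logits first
    std::vector<double> weights(k);
    double sum = 0.0;
    for (int i = 0; i < k; ++i) {
        weights[i] = std::exp((logits[idx[i]] - logits[idx[0]]) / temperature);  // temperature scaling
        sum += weights[i];
    }
    std::uniform_real_distribution<double> unif(0.0, sum);
    double r = unif(rng);
    double cumulative = 0.0;
    for (int i = 0; i < k; ++i) {
        cumulative += weights[i];
        if (r <= cumulative) return idx[i];               // randomly chosen token index
    }
    return idx[k - 1];                                    // fallback for floating-point round-off
}
```

Lowering the temperature sharpens the distribution towards the most likely token, while raising it flattens the distribution and increases the variety of responses.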
The randomness of stochastic algorithms is not always beneficial. For example, it can be helpful to turn off the randomness in these decoding algorithms when regression testing an AI engine, a situation where receiving exactly the same responses is desirable.
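A minimal sketch of switching the randomness off is greedy (argmax) decoding, shown below; the function name is illustrative only. Another option is to keep the sampling code but fix the random seed, which also makes runs repeatable.

```c++
#include <algorithm>
#include <vector>

// Greedy (argmax) decoding: always pick the single most likely token, so the
// same prompt produces exactly the same output on every run.
int greedy_token(const std::vector<float>& logits) {
    return (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());
}
```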
Probabilistic Optimizations
Probabilistic optimizations are methods of speeding up inference based on the relative probabilities of events. Neural networks can use probabilistic algorithms in both high-level and lower-level ways. There are two main types of probabilistic algorithms:
- Common case first. These are optimizations where a likely case is handled first, using fast code, but the unlikely case is also handled as a backup, with slower code. In such cases, the probabilistic algorithm does not affect the result of the neural network, only its speed (see the sketch after this list).
- Lossy or error-accepting algorithms. These are probabilistic algorithms where the unlikely case is simply ignored, and an error in such cases is accepted as part of the logic. Such algorithms may be appropriate for neural networks because they have an inherent resilience to errors.
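As an illustration of the "common case first" pattern, here is a minimal C++ sketch of zero-skipping in a dot product over a pruned (sparse) weight vector; the function name is illustrative. The common zero case is handled with a cheap test-and-skip, the uncommon nonzero case falls back to the normal multiply-accumulate, and the result is identical either way.

```c++
#include <vector>

// Common case first: in a heavily pruned model most weights are zero, so test
// for the likely zero case first and skip the arithmetic; the unlikely nonzero
// case is still handled by the slower multiply-accumulate code.
float sparse_dot(const std::vector<float>& weights, const std::vector<float>& activations) {
    float sum = 0.0f;
    for (size_t i = 0; i < weights.size(); ++i) {
        if (weights[i] == 0.0f) continue;       // fast path: a zero weight adds nothing
        sum += weights[i] * activations[i];     // slow path: uncommon nonzero weight
    }
    return sum;
}
```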
Arguably, many inference optimization techniques and almost all model compression optimizations are lossy probabilistic optimizations, where a known level of error is tolerated. They take the probabilistic AI engine and add some known extra sources of error to it, and the result is a faster, but slightly less accurate, model. Examples of such "probabilistic optimizations" that offer these trade-offs at a high level include techniques such as:
- Quantization (accepts errors resulting from the loss of precision).
- Early exits (assumes that later layers won't change results, tolerating cases where this is erroneous).
- Magnitude pruning (assumes that low-value weights have little impact, accepting the resulting errors; see the sketch after this list).
- Big-little models (common case first method; see ensemble algorithms)
- Approximate algorithms (various types)
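As an example of the lossy type, here is a minimal C++ sketch of magnitude pruning, which simply zeroes out low-magnitude weights on the assumption that they rarely change the output; the threshold value would be chosen empirically in practice.

```c++
#include <cmath>
#include <vector>

// Magnitude pruning: zero out any weight whose absolute value is below a threshold,
// accepting the small error this introduces in exchange for a sparser, faster model.
void magnitude_prune(std::vector<float>& weights, float threshold) {
    for (float& w : weights) {
        if (std::fabs(w) < threshold) {
            w = 0.0f;   // the unlikely case that this weight mattered is simply ignored
        }
    }
}
```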
Some examples of the lower-level specific uses of probabilistic algorithms in neural networks include:
- Bloom filters
- Stochastic quantization (see the rounding sketch after this list)
- Approximate arithmetic multiplication algorithms
- Approximate matrix multiplication algorithms
- Loop perforation
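As one concrete low-level example, here is a minimal C++ sketch of stochastic rounding, the core idea behind stochastic quantization: each value is rounded up or down at random, with probability proportional to its distance from each neighbor, so the rounding errors tend to cancel out on average.

```c++
#include <cmath>
#include <random>

// Stochastic rounding: round x up with probability equal to its fractional part,
// otherwise round down, so the expected value of the result equals x.
int stochastic_round(float x, std::mt19937& rng) {
    float lower = std::floor(x);
    float frac = x - lower;                                // how far x sits above the lower integer
    std::uniform_real_distribution<float> unif(0.0f, 1.0f);
    return (int)lower + (unif(rng) < frac ? 1 : 0);        // round up with probability frac
}
```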
Probabilistic Algorithm Research Papers
General research papers on specific probabilistic algorithms (including stochastic algorithms) in neural networks:
- Ao Ren, Ji Li, Zhe Li, Caiwen Ding, Xuehai Qian, Qinru Qiu, Bo Yuan, Yanzhi Wang, 2017, "SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing", ACM SIGPLAN Notices, vol. 52, no. 4, pp. 405-418, 2017. https://arxiv.org/abs/1611.05939 (Stochastic method with multiplication and addition approximations via AND gates and multiplexers.)
- R. Hojabr et al., "SkippyNN: An embedded stochastic-computing accelerator for convolutional neural networks", Proc. 56th ACM/IEEE Design Automat. Conf. (DAC), pp. 1-6, Jun. 2019. https://ieeexplore.ieee.org/document/8806970
- Nikola Popovic, Danda Pani Paudel, Thomas Probst, Luc Van Gool, July 2022, Improving the Behaviour of Vision Transformers with Token-consistent Stochastic Layers, https://arxiv.org/abs/2112.15111
- N Popovic, DP Paudel, T Probst, L Van Gool, 2023, Token-Consistent Dropout For Calibrated Vision Transformers, 2023 IEEE International Conference on Image Processing (ICIP), https://ieeexplore.ieee.org/abstract/document/10222084
- Carlos Florensa, Yan Duan, Pieter Abbeel, April 2017, Stochastic Neural Networks for Hierarchical Reinforcement Learning, arXiv preprint arXiv:1704.03012, 2017, https://arxiv.org/abs/1704.03012
- E. Wong, 1991, Stochastic neural networks, Algorithmica, Springer, https://link.springer.com/article/10.1007/BF01759054
- Roohi Menon, Hangyu Zhou, Gerald Ogbonna, Vikram Raghavan, 2021, Stochastic programming, SYSEN 6800 Fall 2021, Cornell University Computational Optimization Open Textbook, https://optimization.cbe.cornell.edu/index.php?title=Stochastic_programming
- John R. Birge, François Louveaux, 2011, Introduction to Stochastic Programming, Springer Series in Operations Research and Financial Engineering (ORFE), https://link.springer.com/book/10.1007/978-1-4614-0237-4
- Sairam Sri Vatsavai, Venkata Sai Praneeth Karempudi, Ishan Thakkar, Ahmad Salehi, Todd Hastings, Feb 2023, SCONNA: A Stochastic Computing Based Optical Accelerator for Ultra-Fast, Energy-Efficient Inference of Integer-Quantized CNNs, https://arxiv.org/abs/2302.07036, Code: https://github.com/uky-UCAT/SC_ONN_SIM.git
- Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, and Stephen Lin. 2020. Spatially adaptive inference with stochastic feature sampling and interpolation. arXiv preprint arXiv:2003.08866, https://arxiv.org/abs/2003.08866
- Bengio, Yoshua, Léonard, Nicholas, and Courville, Aaron, 2013, Estimating or propagating gradients through stochastic neurons for conditional computation, arXiv preprint arXiv:1308.3432, https://arxiv.org/abs/1308.3432
- Hanting Chen, Zhicheng Liu, Xutao Wang, Yuchuan Tian, Yunhe Wang, 1 Apr 2024 (v2), DiJiang: Efficient Large Language Models through Compact Kernelization, https://arxiv.org/abs/2403.19928 (Using the Monte Carlo method to achieve a linear attention approximation.)
- Salar Shakibhamedan, Amin Aminifar, Nima TaheriNejad, Axel Jantsch, 2024, EASE: Energy Optimization through Adaptation — A Review of Runtime Energy-Aware Approximate Deep Learning Algorithms, https://eclectx.org/Publications/2024_M13.pdf (Survey paper on techniques for adaptive inference with a focus on approximations of inference, including loop perforation, stochastic algorithms, approximate arithmetic, quantization, pruning, and low-rank methods.)
- Jing Yang Lee, Kong Aik Lee, Woon-Seng Gan, Nov 2023, Partially Randomizing Transformer Weights for Dialogue Response Diversity, https://arxiv.org/abs/2311.10943
- Zhengmian Hu, Heng Huang, 2024, Accelerated Speculative Sampling Based on Tree Monte Carlo, Proceedings of the 41st International Conference on Machine Learning, PMLR 235:19216-19251, 2024. https://proceedings.mlr.press/v235/hu24f.html https://openreview.net/forum?id=stMhi1Sn2G PDF: https://raw.githubusercontent.com/mlresearch/v235/main/assets/hu24f/hu24f.pdf
- Lingyun Yao, Martin Trapp, Jelin Leslin, Gaurav Singh, Peng Zhang, Karthekeyan Periasamy, Martin Andraud, 22 May 2024, On Hardware-efficient Inference in Probabilistic Circuits, https://arxiv.org/abs/2405.13639
- Wenlun Zhang, Shimpei Ando, Yung-Chin Chen, Satomi Miyagi, Shinya Takamaeda-Yamazaki, Kentaro Yoshioka, 29 Aug 2024, PACiM: A Sparsity-Centric Hybrid Compute-in-Memory Architecture via Probabilistic Approximation, https://arxiv.org/abs/2408.16246
- Haiyue Ma, Jian Liu, Ronny Krashinsky, 10 Oct 2024, Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM, https://arxiv.org/abs/2410.07531
More AI Research
Read more about: