Aussie AI

Softmax Optimization

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.
The Softmax function is a significant cost in Transformer inference because it is part of the attention mechanism, whereas it was less of a bottleneck in earlier neural network architectures. A vanilla Softmax implementation is very expensive because it involves computing the exponentials of all of the logits. Various attempts have been made to optimize and approximate Softmax calculations, including:
  • Softmax optimizations
  • Softmax approximations
  • Integer-only Softmax
  • Pruned Softmax (removal)
  • Fused Softmax (kernel fusion)
  • Softmax replacements (use different functions)

Note that there are several other areas of theory that are relevant to Softmax optimization and approximation. The denominator of the Softmax formula is a "sum of exponentials", and this type of calculation also appears in Logarithmic Number System (LNS) addition. The sum-of-exponentials calculation also appears in "log-sum-exp networks", which are somewhat related to "tropical algebra". The approximations used in "max-plus networks" may also be relevant to Softmax approximations.
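
For context, the "log-sum-exp" function mentioned above is simply the logarithm of that sum of exponentials (a standard definition, stated here for reference):

    \mathrm{LSE}(z_1, \ldots, z_n) = \log\left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)

so the Softmax denominator is exactly exp(LSE(z)).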

What is Softmax?

The purpose of Softmax is to take a vector of computed values and normalize them into probabilities. After Softmax, the output vector contains a new normalized set of values that sum to 1, each representing the probability of the token/word associated with that vector element.

So why do we need all the exponentials? The idea is that the input vector contains "logits", which are logarithms of probabilities, so exponentiating each one brings it out of the log domain back into the probability domain. The results are then normalized so that they are probabilities that total exactly 1.
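
In formula terms (a standard definition, not specific to any one implementation), given a vector of logits z_1, ..., z_n, the i-th Softmax output is:

    \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}

Each output is positive, and all of the outputs sum to exactly 1.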

Softmax and Temperature: At the end of each pass through the decoder, the Softmax function is used to normalize the logits before they are processed by a decoding algorithm to choose the next output token. As part of this method, the Softmax function is usually changed to a "scaled Softmax" that uses a parameter called the "temperature".

What is the temperature? The temperature is a hyper-parameter that controls the level of randomness or unpredictability in the output. A higher temperature setting means that the decoder is more likely to output lower-probability tokens. If the temperature is very small (close to zero), the decoder almost always outputs the highest-probability token, making the output much less random.

What is the value of the temperature? The temperature is a positive floating-point number, which may be between 0 and 1, or may be greater than 1. A temperature of exactly zero cannot be used, as it would cause a divide-by-zero error. A temperature of 1.0 leaves the Softmax function unchanged (i.e., the scaling has no effect). Since the logits are scaled by the reciprocal of the temperature, the overall effect is that a larger temperature setting makes the output more random (it runs "hotter" and gets more "bubbly"). If the temperature is below 1.0, dividing by this fraction spreads the logits further apart, which sharpens the distribution and reduces the randomness of the output. If the temperature is greater than 1.0, the logits are contracted towards each other, flattening the distribution so that lower-probability tokens are more likely to be chosen, thereby increasing output randomness. Read more about decoding algorithms and temperature.
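
As a minimal sketch of the scaled Softmax (using a hypothetical function name, not code from any particular library, and assuming the temperature has already been validated as positive), each logit is simply divided by the temperature before the usual exponentiate-and-normalize steps:

    #include <math.h>  // Declares expf()

    // Hypothetical example: Softmax scaled by a temperature parameter.
    // temperature == 1.0 leaves the distribution unchanged;
    // temperature < 1.0 sharpens it; temperature > 1.0 flattens it (more random sampling).
    void softmax_with_temperature(float v[], int n, float temperature)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i] / temperature);  // Scale each logit, then exponentiate
            sum += v[i];
        }
        for (int i = 0; i < n; i++) {
            v[i] /= sum;  // Normalize to probabilities summing to 1
        }
    }

This naive version has the same overflow and numerical-stability issues as the basic Softmax code below.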

Example: Basic Softmax in C++

The Softmax function is inefficient to compute by default. First, the whole vector is scanned to compute the sum of the exponentials of each value. Then the whole vector is re-scanned to divide each vector element by this sum of exponentials.

Here is a naive implementation of Softmax in C++:

    #include <math.h>  // Declares expf()
    ....  // Other library headers (the yassert assertion macro is assumed to be defined elsewhere)

    float yapi_vector_sum_of_exponentials(float v[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            float e = expf(v[i]);   // Exponential of each logit
            yassert(e > 0.0f);      // Catches underflow of expf to zero
            sum += e;
        }
        return sum;
    }

    void yapi_vector_softmax_basic(float v[], int n)
    {
        float denom = yapi_vector_sum_of_exponentials(v, n);  // Denominator...
        if (denom == 0.0f) {
            yassert(denom != 0.0f);
            return;  // fail (should not occur)
        }
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]) / denom;  // Note: recomputes expf for each element
        }
    }

Not only is this computation very slow (all those calls to expf!), but it is also prone to overflow and underflow. A practical Softmax implementation needs to be further optimized and scaled to avoid these problems. Further optimizations may include calls to hardware acceleration APIs, pre-computed tables to approximate the "exp" function, and converting the loops to pointer arithmetic.
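
As an illustration of the scaling fix (a sketch of the standard "subtract the maximum" trick, not the library's production code), subtracting the largest logit before exponentiation keeps expf in a safe range without changing the mathematical result; the sketch also fuses the loops so that expf is called only once per element:

    #include <math.h>  // Declares expf()

    // Sketch: numerically stable Softmax using the standard max-subtraction trick.
    void softmax_stable_sketch(float v[], int n)
    {
        // Find the maximum logit
        float maxval = v[0];
        for (int i = 1; i < n; i++) {
            if (v[i] > maxval) maxval = v[i];
        }
        // Exponentiate shifted logits and accumulate the denominator (one expf per element)
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i] - maxval);  // Largest exponent is now exp(0) = 1, so no overflow
            sum += v[i];
        }
        // Normalize to probabilities
        for (int i = 0; i < n; i++) {
            v[i] /= sum;
        }
    }

The shift by the maximum cancels out in the division, so the resulting probabilities are identical to the unshifted version (up to floating-point rounding).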

Softmax Optimization

Research papers on Softmax optimizations in general:

  • Y Fu, C Zhou, T Huang, E Han, Y He, H Jiao, 2024, SoftAct: A High-Precision Softmax Architecture for Transformers Supporting Nonlinear Functions, https://ieeexplore.ieee.org/abstract/document/10495359/ (Hardware-optimized Softmax and non-linear activation functions.)
  • Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Robin Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu, 4 Apr 2024, Outlier-Efficient Hopfield Layers for Large Transformer-Based Models, https://arxiv.org/abs/2404.03828 Code: https://github.com/MAGICS-LAB/OutEffHop (Addresses outliers in quantization with a modified Softmax and an advanced Hopfield memory model.)
  • Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
  • M. Milakov and N. Gimelshein, “Online normalizer calculation for softmax,” 2018. https://arxiv.org/abs/1805.02867
  • Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai, 18 Feb 2024, In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness, https://arxiv.org/abs/2402.11639
  • Shiwei Liu, Guanchen Tao, Yifei Zou, Derek Chow, Zichen Fan, Kauna Lei, Bangfei Pan, Dennis Sylvester, Gregory Kielian, Mehdi Saligane, 20 Feb 2024 (v2), ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters, https://arxiv.org/abs/2402.10930 (Efficient version of approximate Softmax with better parallelization.)
  • David Spuler, March 2024, Chapter 25. SoftMax, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Jingyu Wang; Lu Zhang; Xueqing Li; Huazhong Yang; Yongpan Liu, Nov 2023, ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10304367
  • Tianhua Xia, Sai Qian Zhang, 22 Nov 2023, Softmax Acceleration with Adaptive Numeric Format for both Training and Inference, https://arxiv.org/abs/2311.13290 (Hardware-based Softmax accelerator.)
  • Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
  • Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer El Showk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, Christopher Olah, June 2022, Softmax Linear Units, https://transformer-circuits.pub/2022/solu/index.html (Early SoftMax paper.)
  • Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
  • Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
  • J. Kim, S. Kim, K. Choi and I. -C. Park, 2024, Hardware-Efficient SoftMax Architecture With Bit-Wise Exponentiation and Reciprocal Calculation, IEEE Transactions on Circuits and Systems I: Regular Papers, doi: 10.1109/TCSI.2024.3443270, https://ieeexplore.ieee.org/abstract/document/10640134
  • Yichuan Deng, Zhihang Li, Zhao Song, 26 Apr 2023 (v2), Attention Scheme Inspired Softmax Regression, https://arxiv.org/abs/2304.10411
  • Jiuxiang Gu, Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song, 6 May 2024, Exploring the Frontiers of Softmax: Provable Optimization, Applications in Diffusion Model, and Beyond, https://arxiv.org/abs/2405.03251
  • Tianhua Xia and Sai Qian Zhang. 2024. Hyft: A Reconfigurable Softmax Accelerator with Hybrid Numeric Format for both Training and Inference. In Proceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '24). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/3665314.3670816 https://dl.acm.org/doi/abs/10.1145/3665314.3670816 PDF: https://dl.acm.org/doi/pdf/10.1145/3665314.3670816
  • Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy, 4 Oct 2024, EXAQ: Exponent Aware Quantization For LLMs Acceleration, https://arxiv.org/abs/2410.03185
  • Bettayeb, M., Halawani, Y., Khan, M.U. et al. Efficient memristor accelerator for transformer self-attention functionality. Sci Rep 14, 24173 (2024). https://doi.org/10.1038/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z.pdf
  • Shuai Dong, Junyi Yang, Xiaoqi Peng, Hongyang Shang, Ye Ke, Xiaofeng Yang, Hongjie Liu, Arindam Basu, 20 Nov 2024, Topkima-Former: Low-energy, Low-Latency Inference for Transformers using top-k In-memory ADC, https://arxiv.org/abs/2411.13050
  • Francesco Franco, Dec 2024, Activation Functions: ReLU, Sigmoid, Tanh and Softmax, https://medium.com/@francescofranco_39234/four-key-activation-functions-relu-sigmoid-tanh-and-softmax-6d2525eb55a4
  • Andrea Belano, Yvan Tortorella, Angelo Garofalo, Luca Benini, Davide Rossi, Francesco Conti, 9 Dec 2024, A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU, https://arxiv.org/abs/2412.06321

Softmax Approximation

Softmax can be approximated in various ways, and there are also integer-only Softmax methods. Research papers on approximating the Softmax algorithm include:

Integer-Only Softmax

Softmax can be optimized by conversion to integer arithmetic. See also the overview of integer arithmetic in neural networks.

Research papers with an integer-only Softmax function:

Softmax Pruning

In addition to approximating Softmax, consideration has been given to "pruning" the Softmax component, i.e. removing it completely.

Softmax Alternatives

There is also research on replacing Softmax with a different component:

More AI Research

Read more about: