Aussie AI

Softmax Optimization

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.
The Softmax function is a significant cost in Transformer inference because it is part of the attention mechanism, whereas it was less of a bottleneck in earlier neural network architectures. A vanilla Softmax implementation is very expensive because it involves computing the exponentials of all of the logits. Various attempts have been made to optimize and approximate Softmax calculations, including:
  • Softmax optimizations
  • Softmax approximations
  • Integer-only Softmax
  • Pruned Softmax (removal)
  • Fused Softmax (kernel fusion)
  • Softmax replacements (use different functions)

Note that there are several other areas of theory that are relevant to Softmax optimization and approximation. The denominator of the Softmax formula is a "sum of exponentials", and this type of calculation also appears in Logarithmic Number System (LNS) addition. The sum-of-exponentials calculation also appears in "log-sum-exp networks", which are somewhat related to "tropical algebra". The approximations used in "max-plus networks" may also be relevant to Softmax approximations.
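
For context, the "log-sum-exp" function mentioned above is simply the logarithm of that sum of exponentials (a standard definition, stated here for reference):

    \mathrm{LSE}(z_1, \ldots, z_n) = \log\left( e^{z_1} + e^{z_2} + \cdots + e^{z_n} \right)

so the Softmax denominator is exactly exp(LSE(z)).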

What is Softmax?

The purpose of Softmax is to take a vector of computed values and normalize them into probabilities. After Softmax, the output vector contains a new normalized set of values that sum to 1, each representing the probability of the token/word associated with that vector element.

So why do we need all the exponentials? The idea is that the input vector contains "logits", which are logarithms of probabilities, so exponentiating each one brings it out of the log domain back into the probability domain. The results are then normalized so that they are probabilities that total exactly 1.
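
In formula terms (a standard definition, not specific to any one implementation), given a vector of logits z_1, ..., z_n, the i-th Softmax output is:

    \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}

Each output is positive, and all of the outputs sum to exactly 1.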

Softmax and Temperature: At the end of each pass through the decoder, the Softmax function is used to normalize the logits before they are processed by a decoding algorithm to choose the next output token. As part of this method, the Softmax function is usually changed to a "scaled Softmax" that uses a parameter called the "temperature".

What is the temperature? The temperature is a hyper-parameter that controls the level of randomness or unpredictability in the output. A higher temperature setting means that the decoder is more likely to output lower-probability tokens. If the temperature is very small (close to zero), the decoder almost always outputs the highest-probability token, making the output much less random.

What is the value of the temperature? The temperature is a positive floating-point number, which may be between 0 and 1, or may be greater than 1. A temperature of exactly zero cannot be used, as it would cause a divide-by-zero error. A temperature of 1.0 leaves the Softmax function unchanged (i.e., the scaling has no effect). Since the logits are scaled by the reciprocal of the temperature, the overall effect is that a larger temperature setting makes the output more random (it runs "hotter" and gets more "bubbly"). If the temperature is below 1.0, dividing by this fraction spreads the logits further apart, which sharpens the distribution and reduces the randomness of the output. If the temperature is greater than 1.0, the logits are contracted towards each other, flattening the distribution so that lower-probability tokens are more likely to be chosen, thereby increasing output randomness. Read more about decoding algorithms and temperature.
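
As a minimal sketch of the scaled Softmax (using a hypothetical function name, not code from any particular library, and assuming the temperature has already been validated as positive), each logit is simply divided by the temperature before the usual exponentiate-and-normalize steps:

    #include <math.h>  // Declares expf()

    // Hypothetical example: Softmax scaled by a temperature parameter.
    // temperature == 1.0 leaves the distribution unchanged;
    // temperature < 1.0 sharpens it; temperature > 1.0 flattens it (more random sampling).
    void softmax_with_temperature(float v[], int n, float temperature)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i] / temperature);  // Scale each logit, then exponentiate
            sum += v[i];
        }
        for (int i = 0; i < n; i++) {
            v[i] /= sum;  // Normalize to probabilities summing to 1
        }
    }

This naive version has the same overflow and numerical-stability issues as the basic Softmax code below.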

Example: Basic Softmax in C++

The Softmax function is inefficient to compute by default. First, the whole vector is scanned to compute the sum of the exponentials of each value. Then the whole vector is re-scanned to divide each vector element by this sum of exponentials.

Here is a naive implementation of Softmax in C++:

    #include <math.h>  // Declares expf()
    ....  // Other library headers (the yassert assertion macro is assumed to be defined elsewhere)

    float yapi_vector_sum_of_exponentials(float v[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            float e = expf(v[i]);   // Exponential of each logit
            yassert(e > 0.0f);      // Catches underflow of expf to zero
            sum += e;
        }
        return sum;
    }

    void yapi_vector_softmax_basic(float v[], int n)
    {
        float denom = yapi_vector_sum_of_exponentials(v, n);  // Denominator...
        if (denom == 0.0f) {
            yassert(denom != 0.0f);
            return;  // fail (should not occur)
        }
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i]) / denom;  // Note: recomputes expf for each element
        }
    }

Not only is this computation very slow (all those calls to expf!), but it is also prone to overflow and underflow. A practical Softmax implementation needs to be further optimized and scaled to avoid these problems. Further optimizations may include calls to hardware acceleration APIs, pre-computed tables to approximate the "exp" function, and converting the loops to pointer arithmetic.
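
As an illustration of the scaling fix (a sketch of the standard "subtract the maximum" trick, not the library's production code), subtracting the largest logit before exponentiation keeps expf in a safe range without changing the mathematical result; the sketch also fuses the loops so that expf is called only once per element:

    #include <math.h>  // Declares expf()

    // Sketch: numerically stable Softmax using the standard max-subtraction trick.
    void softmax_stable_sketch(float v[], int n)
    {
        // Find the maximum logit
        float maxval = v[0];
        for (int i = 1; i < n; i++) {
            if (v[i] > maxval) maxval = v[i];
        }
        // Exponentiate shifted logits and accumulate the denominator (one expf per element)
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            v[i] = expf(v[i] - maxval);  // Largest exponent is now exp(0) = 1, so no overflow
            sum += v[i];
        }
        // Normalize to probabilities
        for (int i = 0; i < n; i++) {
            v[i] /= sum;
        }
    }

The shift by the maximum cancels out in the division, so the resulting probabilities are identical to the unshifted version (up to floating-point rounding).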

Softmax Optimization

Research papers on Softmax optimizations in general:

  • Y Fu, C Zhou, T Huang, E Han, Y He, H Jiao, 2024, SoftAct: A High-Precision Softmax Architecture for Transformers Supporting Nonlinear Functions, https://ieeexplore.ieee.org/abstract/document/10495359/ (Hardware-optimized Softmax and non-linear activation functions.)
  • Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Robin Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu, 4 Apr 2024, Outlier-Efficient Hopfield Layers for Large Transformer-Based Models, https://arxiv.org/abs/2404.03828 Code: https://github.com/MAGICS-LAB/OutEffHop (Addresses outliers in quantization with a modified Softmax and an advanced Hopfield memory model.)
  • Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang, 9 Jan 2024, FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs, https://arxiv.org/abs/2401.03868 (Does FFN optimization by splitting FFNs into two categories, those commonly firing and those rarely used, in both RELU and non-RELU models; effectively this is FFN pruning of a subset of FFNs.)
  • M. Milakov and N. Gimelshein, “Online normalizer calculation for softmax,” 2018. https://arxiv.org/abs/1805.02867
  • Liam Collins, Advait Parulekar, Aryan Mokhtari, Sujay Sanghavi, Sanjay Shakkottai, 18 Feb 2024, In-Context Learning with Transformers: Softmax Attention Adapts to Function Lipschitzness, https://arxiv.org/abs/2402.11639
  • Shiwei Liu, Guanchen Tao, Yifei Zou, Derek Chow, Zichen Fan, Kauna Lei, Bangfei Pan, Dennis Sylvester, Gregory Kielian, Mehdi Saligane, 20 Feb 2024 (v2), ConSmax: Hardware-Friendly Alternative Softmax with Learnable Parameters, https://arxiv.org/abs/2402.10930 (Efficient version of approximate Softmax with better parallelization.)
  • David Spuler, March 2024, Chapter 25. SoftMax, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
  • Jingyu Wang; Lu Zhang; Xueqing Li; Huazhong Yang; Yongpan Liu, Nov 2023, ULSeq-TA: Ultra-Long Sequence Attention Fusion Transformer Accelerator Supporting Grouped Sparse Softmax and Dual-Path Sparse LayerNorm, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (Early Access), https://ieeexplore.ieee.org/abstract/document/10304367
  • Tianhua Xia, Sai Qian Zhang, 22 Nov 2023, Softmax Acceleration with Adaptive Numeric Format for both Training and Inference, https://arxiv.org/abs/2311.13290 (Hardware-based Softmax accelerator.)
  • Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang, 2024, FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics, Part of Proceedings of Machine Learning and Systems 6 (MLSys 2024) Conference, PDF: https://proceedings.mlsys.org/paper_files/paper/2024/file/5321b1dabcd2be188d796c21b733e8c7-Paper-Conference.pdf (Next generation of Flash Decoding, with improved asynchronous parallelism of Softmax in both prefill and decoding phases, heuristic dataflow management algorithms, and enhanced GEMM during the decoding phase.)
  • Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer El Showk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, Christopher Olah, June 2022, Softmax Linear Units, https://transformer-circuits.pub/2022/solu/index.html (Early SoftMax paper.)
  • Chen, C, 2024, Hardware‑software co‑exploration and optimization for next‑generation learning machines. Doctoral thesis, Nanyang Technological University, Singapore. https://hdl.handle.net/10356/178423 (Extensive coverage of hardware design with multiple contributions to accelerating various neural network types, ranging from acceleration of various single non-linear functions and end-to-end optimization algorithms. Specific topics include data compression, non-maximum suppression, MHA, and MatMul/GEMM optimizations.)
  • Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao, July 11, 2024, FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision, https://arxiv.org/abs/2407.08608 https://tridao.me/blog/2024/flash3/
  • J. Kim, S. Kim, K. Choi and I. -C. Park, 2024, Hardware-Efficient SoftMax Architecture With Bit-Wise Exponentiation and Reciprocal Calculation, IEEE Transactions on Circuits and Systems I: Regular Papers, doi: 10.1109/TCSI.2024.3443270, https://ieeexplore.ieee.org/abstract/document/10640134
  • Yichuan Deng, Zhihang Li, Zhao Song, 26 Apr 2023 (v2), Attention Scheme Inspired Softmax Regression, https://arxiv.org/abs/2304.10411
  • Jiuxiang Gu, Chenyang Li, Yingyu Liang, Zhenmei Shi, Zhao Song, 6 May 2024, Exploring the Frontiers of Softmax: Provable Optimization, Applications in Diffusion Model, and Beyond, https://arxiv.org/abs/2405.03251
  • Tianhua Xia and Sai Qian Zhang. 2024. Hyft: A Reconfigurable Softmax Accelerator with Hybrid Numeric Format for both Training and Inference. In Proceedings of the 29th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '24). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/3665314.3670816 https://dl.acm.org/doi/abs/10.1145/3665314.3670816 PDF: https://dl.acm.org/doi/pdf/10.1145/3665314.3670816
  • Moran Shkolnik, Maxim Fishman, Brian Chmiel, Hilla Ben-Yaacov, Ron Banner, Kfir Yehuda Levy, 4 Oct 2024, EXAQ: Exponent Aware Quantization For LLMs Acceleration, https://arxiv.org/abs/2410.03185
  • Bettayeb, M., Halawani, Y., Khan, M.U. et al. Efficient memristor accelerator for transformer self-attention functionality. Sci Rep 14, 24173 (2024). https://doi.org/10.1038/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z https://www.nature.com/articles/s41598-024-75021-z.pdf
  • Shuai Dong, Junyi Yang, Xiaoqi Peng, Hongyang Shang, Ye Ke, Xiaofeng Yang, Hongjie Liu, Arindam Basu, 20 Nov 2024, Topkima-Former: Low-energy, Low-Latency Inference for Transformers using top-k In-memory ADC, https://arxiv.org/abs/2411.13050
  • Francesco Franco, Dec 2024, Activation Functions: ReLU, Sigmoid, Tanh and Softmax, https://medium.com/@francescofranco_39234/four-key-activation-functions-relu-sigmoid-tanh-and-softmax-6d2525eb55a4
  • Andrea Belano, Yvan Tortorella, Angelo Garofalo, Luca Benini, Davide Rossi, Francesco Conti, 9 Dec 2024, A Flexible Template for Edge Generative AI with High-Accuracy Accelerated Softmax & GELU, https://arxiv.org/abs/2412.06321

Softmax Approximation

Softmax can be approximated in various ways, and there are also integer-only Softmax methods. Research papers on approximating the Softmax algorithm include:

Integer-Only Softmax

Softmax can be optimized by conversion to integer arithmetic. See also the overview of integer arithmetic in neural networks.

Research papers with an integer-only Softmax function:

Softmax Pruning

In addition to approximating Softmax, consideration has been given to "pruning" the Softmax component, i.e. removing it completely.

Softmax Alternatives

There is also research on replacing Softmax with a different component:

More AI Research

Read more about: