Aussie AI

Advanced AI Mathematics Research

  • Last Updated 29 September, 2024
  • by David Spuler, Ph.D.

There is much research on advanced mathematics to optimize inference algorithms and training, mainly focused on the number representations and alternative number systems covered below.

Advanced Numeric Representations

There are various ways to represent numbers in computers. Programming languages provide a few standard representations, but these are not the only options. Here's a list of the basic ones:

  • Integer 2's complement
  • Integer 1's complement
  • Floating-point
  • Fixed-point

And here are some of the more advanced research types:

  • Brain float (BF16 and BF32)
  • Block floating-point (BFP)
  • Hybrid block floating-point (HBFP)

Although much research focuses on 16-bit floating-point or integer representations of weights, there are some more novel and interesting alternatives. Programmers are used to the current ways that computers store numbers as bits, but these are just traditions that arose from trade-offs which may no longer apply with modern increases in computing power. None of these techniques seems to be in practical use yet, but it's possible that a significant breakthrough still lies in this area of research.

General number representation research:

Fixed-Point Arithmetic

Most programming languages use floating-point arithmetic to represent fractional numbers, but there's a simpler method: fixed-point arithmetic. The idea is to have a single scaling factor, rather than storing an exponent in each number. The main advantage is that it's all integers.

Fixed point has many applications that are simpler than AI. For example, storing money in a dollars-and-cents format with a scaling factor of 100 (cents) allows $10.23 to be represented as the integer "1023" rather than a more complex floating-point number.

To understand how it works, let's work in decimal with a scaling factor of 1000. The fractional number "1.234" when scaled by 1000 becomes "1234", and "5.67" becomes "5670".

Addition of fixed-point numbers is just integer addition. Adding "1.234+5.67" becomes the integer addition "1234+5670", which yields the integer "6904", representing "6.904". There is no additional scaling required after integer addition.

Multiplication of two fixed-point numbers is integer multiplication plus a re-scaling step. Multiplying "1.234*5.67" to get "6.99678" becomes "1234*5670", which gives the integer "6,996,780". You can see that the number is too high because it has been scaled twice. Hence, we need to divide the integer by the 1,000 scaling factor to get "6,996" (truncating the lower fractional digits), which represents "6.996". So fixed-point format changes a floating-point multiplication into an integer multiplication, followed by an integer division to re-adjust the scaling factor.

In practice, fixed-point numbers would use a power-of-two scaling factor (e.g. 1,024), rather than a decimal (e.g. 1,000), because that (a) allows bitshifting to re-adjust the scale in multiplication, and (b) also simplifies conversion to/from floating point numbers, because they're stored with a power-of-two exponent.
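To make this concrete, here is a minimal C++ sketch of fixed-point arithmetic with a power-of-two scaling factor of 1024 (10 fractional bits). The type name, constants, and helper functions are purely illustrative, not from any particular library.

    #include <cstdint>
    #include <cstdio>

    // Minimal fixed-point sketch: 32-bit integers with 10 fractional bits (scale = 1024).
    typedef int32_t fixed_t;
    const int kFracBits = 10;                // scaling factor 2^10 = 1024
    const int32_t kScale = 1 << kFracBits;

    fixed_t to_fixed(float x)   { return (fixed_t)(x * kScale); }   // float -> fixed
    float   to_float(fixed_t x) { return (float)x / kScale; }       // fixed -> float

    fixed_t fixed_add(fixed_t a, fixed_t b) { return a + b; }       // plain integer add
    fixed_t fixed_mul(fixed_t a, fixed_t b) {
        // Integer multiply over-scales by 2^10, so shift right once to re-scale.
        return (fixed_t)(((int64_t)a * (int64_t)b) >> kFracBits);
    }

    int main() {
        fixed_t x = to_fixed(1.234f), y = to_fixed(5.67f);
        printf("sum=%f product=%f\n", to_float(fixed_add(x, y)), to_float(fixed_mul(x, y)));
        return 0;
    }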

Note that fixed point is effectively just storing the mantissa bits as an integer, and not storing the exponent anywhere. The sign bit is now simply the first bit of the integer, as in normal integer representations, so negative fixed-point values are represented as negative integers. We've also changed the meaning of the mantissa so it's not "normalized", which means there's no longer any "implicit" or "hidden" leading 1 bit. In fixed-point, the integer is exactly as given, without any hidden leading bit.

Memory reduction. Fixed point numbers require less memory storage. It's like we've taken all the exponent bits and thrown them away. In standard floating-point 32-bit numbers (i.e., float), there are 8 bits for the exponent (plus 1 sign bit, 23 mantissa bits, and 1 extra hidden implicit mantissa bit). With fixed-point, these 8 bits aren't needed, because we've effectively made the exponent a global variable for all the numbers. Hence, memory requirements have reduced from 32 bits to 24 bits.

Precision. We can store 1 decimal digit in about log2(10)=3.32 bits. Hence, the 24 bits of IEEE 32-bit floating-point (FP32) can store about 7 digits of precision. If we store in fixed-point with a 1024 scale factor, then we've only really got 3 digits after the decimal point. On the other hand, a higher scaling factor could be used. Generally, fixed-point has more trouble storing a high range of values (large magnitude to small fractional values) with high precision (number of fractional digits), compared to floating-point.

But this is AI and we may not need that much precision in computations. Huge weight values are not necessary in most models, and could be reduced if needed anyway. At runtime, the activations are frequently normalized, so they shouldn't reach huge magnitudes during dynamic computations. Also, by discarding the 8-bit exponent, we could even gain precision in fixed-point, in terms of fractional digits, because we can use a 32-bit mantissa (rather than only 24 bits).

Integer AI inference. Hence, a simple way to get an integer AI engine is to switch to fixed-point with a power-of-two scaling factor. Vector dot products become fixed-point multiplications (i.e. an integer multiplication and a bitshift re-scale for each pair), followed by addition of all the paired products (i.e. accumulate).

In fact, we can further optimize this. Since all the paired products are over-scaled by the same factor, we can do the bitshift re-scale once, after adding them all up. Furthermore, the paired multiplications and additions can be merged into "fused multiply-add" primitives (FMA or MAC), followed by one single bitshift re-scale. The only problem with that idea is the risk of integer overflow if either the computed values or the scaling factor is too large.
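As a rough illustration of this deferred re-scaling, here is a hedged C++ sketch of a fixed-point dot product that accumulates the raw paired products in a 64-bit integer and applies a single bitshift at the end (the function and constant names are illustrative assumptions):

    #include <cstdint>
    #include <cstddef>

    // Dot product of two fixed-point vectors (10 fractional bits each).
    // Each product a[i]*b[i] is over-scaled by 2^10; instead of shifting every
    // product, accumulate in 64 bits and do one shift at the end.
    const int kFracBits = 10;

    int32_t fixed_dot(const int32_t* a, const int32_t* b, size_t n) {
        int64_t acc = 0;                           // wide accumulator to delay overflow
        for (size_t i = 0; i < n; ++i) {
            acc += (int64_t)a[i] * (int64_t)b[i];  // multiply-accumulate (MAC)
        }
        return (int32_t)(acc >> kFracBits);        // single re-scale for the whole sum
    }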

End-to-end integers. Using fixed point arithmetic allows integer calculations for the main tensors and their matrix multiplication or vector dot products. However, there are some problems with the non-linear components such as activations (e.g. GELU, although RELU is easy to do in integers), Softmax, and other pieces.

On the other hand, we could use a pre-computed lookup table to map integer-to-integer for any of these non-linear functions. This is a fast calculation, and there is also often hardware acceleration support for LUT-based calculations.

The size of the LUTs can be an issue. A 16-bit LUT with 2^16 entries (about 65,000 values at 2 bytes each) is easy to do, even on a laptop, but if we're using 32-bit integers then the LUT has 2^32 entries (about 4.3 billion), each 4 bytes, so we're looking at over 17GB of read-only storage for the table (and we'll need more than one). This amount of storage may be problematic on low-end devices such as phones and tablets, but is fine for a high-end data center box with lots of RAM. Where it's a problem, we can also approximate a 32-bit LUT with only 16 or 24 bits, losing some precision.
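As an example of the 16-bit case, here is a hedged sketch of building a 2^16-entry table for an integer GELU, indexed directly by the bit pattern of a 16-bit fixed-point input. The fractional-bit count and function names are assumptions for illustration only.

    #include <cstdint>
    #include <cmath>
    #include <vector>

    // Precomputed LUT: maps every 16-bit fixed-point input to GELU(x) in the same format.
    const int kLutFracBits = 10;                     // assumed fractional bits
    const float kLutScale = (float)(1 << kLutFracBits);

    std::vector<int16_t> build_gelu_lut() {
        std::vector<int16_t> lut(1 << 16);
        for (int i = 0; i < (1 << 16); ++i) {
            int16_t raw = (int16_t)(uint16_t)i;      // reinterpret table index as signed input
            float x = raw / kLutScale;
            float g = 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));  // GELU via erf
            lut[i] = (int16_t)std::lrint(g * kLutScale);
        }
        return lut;
    }
    // Usage: int16_t y = gelu_lut[(uint16_t)x];  // one lookup replaces the whole nonlinearity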

Research on Fixed-Point Arithmetic. Papers include:

  • Arash Fayyazi, Mahdi Nazemi, Armin Abdollahi, Massoud Pedram, 7 Jul 2023, BlendNet: Design and Optimization of a Neural Network-Based Inference Engine Blending Binary and Fixed-Point Convolutions, https://arxiv.org/abs/2307.03784
  • D. D. Lin, S. S. Talathi and V. S. Annapureddy, Fixed point quantization of deep convolutional networks, Proc. 33rd Int. Conf. Int. Conf. Mach. Learn., vol. 48, pp. 2849-2858, Jun. 2016. https://arxiv.org/abs/1511.06393 (Fixed point quantization.)
  • S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, Deep learning with limited numerical precision, Proc. 32nd Int. Conf. Mach. Learn. (ICML), vol. 37, pp. 1737-1746, 2015. https://arxiv.org/abs/1502.02551 (Fixed point quantization.)
  • S. Anwar, K. Hwang and W. Sung, Fixed point optimization of deep convolutional neural networks for object recognition, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 1131-1135, Apr. 2015. https://ieeexplore.ieee.org/document/7178146 (Fixed point quantization.)
  • P. Gysel, M. Motamedi and S. Ghiasi, Hardware-oriented approximation of convolutional neural networks, Proc. Int. Conf. Learn. Represent., pp. 1-15, 2016. https://arxiv.org/abs/1604.03168 (Fixed point quantization.)
  • Wikipedia, 2024 (accessed), Fixed-point arithmetic, https://en.wikipedia.org/wiki/Fixed-point_arithmetic
  • Reza Arablouei, Liang Wang, Caitlin Phillips, Lachlan Currie, Jordan Yates, Greg Bishop-Hurley, 20 Dec 2022 ( v2), In-situ animal behavior classification using knowledge distillation and fixed-point quantization, https://arxiv.org/abs/2209.04130
  • Ahmad Shawahna, Sadiq M. Sait, Aiman El-Maleh, Irfan Ahmad, 22 Mar 2022, FxP-QNet: A Post-Training Quantizer for the Design of Mixed Low-Precision DNNs with Dynamic Fixed-Point Representation, https://arxiv.org/abs/2203.12091
  • Qing Jin, Jian Ren, Richard Zhuang, Sumant Hanumante, Zhengang Li, Zhiyu Chen, Yanzhi Wang, Kaiyuan Yang, Sergey Tulyakov, 10 Feb 2022, F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization, https://arxiv.org/abs/2202.05239
  • Lorenz Kummer, Kevin Sidak, Tabea Reichmann, Wilfried Gansterer, 27 Aug 2021 ( v4), Adaptive Precision Training (AdaPT): A dynamic fixed point quantized training approach for DNNs, https://arxiv.org/abs/2107.13490
  • Hamed F. Langroudi, Vedant Karia, Tej Pandit, Dhireesha Kudithipudi, 6 Apr 2021, TENT: Efficient Quantization of Neural Networks on the tiny Edge with Tapered FixEd PoiNT, https://arxiv.org/abs/2104.02233
  • Je Yang, Seongmin Hong, Joo-Young Kim, 24 Feb 2021, FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism, https://arxiv.org/abs/2102.12103
  • Rishabh Goyal, Joaquin Vanschoren, Victor van Acht, Stephan Nijssen, 3 Feb 2021, Fixed-point Quantization of Convolutional Neural Networks for Quantized Inference on Embedded Platforms, https://arxiv.org/abs/2102.02147
  • Heming Sun, Zhengxue Cheng, Masaru Takeuchi, Jiro Katto, 9 Jul 2020, End-to-end Learned Image Compression with Fixed Point Weight Quantization, https://arxiv.org/abs/2007.04684
  • Lukas Enderich, Fabian Timm, Wolfram Burgard, 19 Feb 2020, SYMOG: learning symmetric mixture of Gaussian modes for improved fixed-point quantization, https://arxiv.org/abs/2002.08204
  • Hongxing Gao, Wei Tao, Dongchao Wen, Tse-Wei Chen, Kinya Osa, Masami Kato, 19 Nov 2019, IFQ-Net: Integrated Fixed-point Quantization Networks for Embedded Vision, https://arxiv.org/abs/1911.08076
  • MohammadHossein AskariHemmat, Sina Honari, Lucas Rouhier, Christian S. Perone, Julien Cohen-Adad, Yvon Savaria, Jean-Pierre David, 9 Sep 2019 ( v2), U-Net Fixed-Point Quantization for Medical Image Segmentation, https://arxiv.org/abs/1908.01073
  • Charbel Sakr, Naresh Shanbhag, 31 Dec 2018, Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm, https://arxiv.org/abs/1812.11732
  • Darryl D. Lin, Sachin S. Talathi, V. Sreekanth Annapureddy, 2 Jun 2016 ( v3), Fixed Point Quantization of Deep Convolutional Networks, https://arxiv.org/abs/1511.06393
  • Peng Qiao, Sidun Liu, Tao Sun, Ke Yang, Yong Dou, 29 Jan 2023, Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on Image Restoration, https://arxiv.org/abs/2301.12332
  • Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Andreas Moshovos, 23 Mar 2022, Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models, https://arxiv.org/abs/2203.12758
  • Farhad Taheri, Siavash Bayat-Sarmadi, Hatame Mosanaei-Boorani, Reza Taheri, 19 Feb 2023, Fixflow: A Framework to Evaluate Fixed-point Arithmetic in Light-Weight CNN Inference, https://arxiv.org/abs/2302.09564
  • Sambhav R. Jain, Albert Gural, Michael Wu, Chris H. Dick, 28 Feb 2020 (v3), Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks, https://arxiv.org/abs/1903.08066
  • Seyed H. F. Langroudi, Tej Pandit, Dhireesha Kudithipudi, 22 May 2018, Deep Learning Inference on Embedded Devices: Fixed-Point vs Posit, https://arxiv.org/abs/1805.08624
  • Liangzhen Lai, Naveen Suda, Vikas Chandra, 8 Mar 2017, Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations, https://arxiv.org/abs/1703.03073
  • Naveen Mellempudi, Abhisek Kundu, Dipankar Das, Dheevatsa Mudigere, Bharat Kaul, 1 Feb 2017, Mixed Low-precision Deep Learning Inference using Dynamic Fixed Point, https://arxiv.org/abs/1701.08978
  • Dingyi Dai, Yichi Zhang, Jiahao Zhang, Zhanqiu Hu, Yaohui Cai, Qi Sun, Zhiru Zhang, 31 Jan 2024, Trainable Fixed-Point Quantization for Deep Learning Acceleration on FPGAs, https://arxiv.org/abs/2401.17544
  • Sashank Macha, Om Oza, Alex Escott, Francesco Caliva, Robbie Armitano, Santosh Kumar Cheekatmalla, Sree Hari Krishnan Parthasarathi, Yuzong Liu, 4 Mar 2023, Fixed-point quantization aware training for on-device keyword-spotting, https://arxiv.org/abs/2303.02284
  • Alberto Delmas Lascorz, Mostafa Mahmoud, Ali Hadi Zadeh, Milos Nikolic, Kareem Ibrahim, Christina Giannoula, Ameer Abdelhadi, Andreas Moshovos, 2024, Atalanta: A Bit is Worth a “Thousand” Tensor Values, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 85–102, https://doi.org/10.1145/3620665.3640356 https://dl.acm.org/doi/abs/10.1145/3620665.3640356
  • Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu, 19 Apr 2024, decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points, https://arxiv.org/abs/2404.12759 Code: https://github.com/bytedance/decoupleQ (Decouple parameters into integer and floating-point parts for more accurate quantization at low bitwidths.)
  • SHF Langroudi, 2023, Tapered-Precision Numerical Formats for Deep Learning Inference and Training Ph.D. thesis, Department of Electrical and Computer Engineering Kate Gleason College of Engineering Rochester Institute of Technology Rochester, New York, https://www.proquest.com/openview/a88513887d40ec2e6744025447d7d948/1?pq-origsite=gscholar&cbl=18750&diss=y
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs] (Dec. 2015). http://arxiv.org/abs/1512.03385 arXiv: 1512.03385.
  • Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2017. Trained Ternary Quantization. arXiv:1612.01064 [cs] (Feb. 2017). http://arxiv.org/abs/1612.01064 arXiv: 1612.01064
  • Fabien Geyer, Johannes Freitag, Tobias Schulz, Sascha Uhrig, 16 Jan 2024, Efficient and Mathematically Robust Operations for Certified Neural Networks Inference, https://arxiv.org/abs/2401.08225 https://arxiv.org/pdf/2401.08225 (Finds that fixed point arithmetic is more efficient with comparable accuracy to using floating point.)
  • Matthieu Courbariaux, Yoshua Bengio, Jean-Pierre David, 23 Sep 2015 (v5), Training deep neural networks with low precision multiplications, https://arxiv.org/abs/1412.7024

Block Floating-Point (BFP)

Block floating-point is a numeric representation in which each number in a group (block) has its own mantissa (as usual), but the whole group shares a single exponent. Hence, this saves space by not storing an exponent in every number.

Another way to think about this is that the numbers are stored as integers, but with a scaling factor. This means that BFP is very similar to fixed-point arithmetic, except that it has granular per-block or per-vector scaling factors (the exponents). Hence, fixed-point arithmetic with a chosen scaling factor is effectively a global version of BFP with a single exponent.

BFP has an integer (mantissa) and a scaling factor that's a power-of-two (exponent). This is very similar to dyadic numbers, which is another fancy number system. See further below for research papers on the dyadic number system. Another related number system is the posit number system, where numbers have multiple exponent values.

Pros of BFP. The goals of BFP in inference are advantages such as:

  • Integer-only arithmetic (e.g. for dot product)
  • Memory saving by not storing exponents.
  • No loss in accuracy or precision of results.
  • Hardware support already exists (various chips).

BFP achieves integer-only arithmetic for the dot product: all of the paired mantissas are multiplied and accumulated with integer arithmetic, and then the two shared exponents (one per vector) are added to get the new exponent. Hence, the result of a dot product of two BFP vectors is a single value, represented as two numbers: a mantissa and an exponent.
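As a rough sketch (not a production BFP kernel), the dot product of two BFP blocks might look like this in C++, where each block is a vector of integer mantissas plus one shared exponent; the struct and function names are assumptions for illustration:

    #include <cstdint>
    #include <cstddef>

    // One BFP block: integer mantissas sharing a single power-of-two exponent,
    // so element i represents mantissa[i] * 2^exponent.
    struct BfpBlock {
        const int16_t* mantissa;   // per-number integer mantissas
        int exponent;              // shared exponent for the whole block
        size_t n;
    };

    // Dot product of two BFP blocks: integer multiply-accumulate of the mantissas,
    // plus a single integer addition of the two shared exponents.
    // The result value is out_mantissa * 2^out_exponent.
    void bfp_dot(const BfpBlock& a, const BfpBlock& b, int64_t& out_mantissa, int& out_exponent) {
        int64_t acc = 0;
        for (size_t i = 0; i < a.n; ++i) {
            acc += (int64_t)a.mantissa[i] * (int64_t)b.mantissa[i];
        }
        out_mantissa = acc;
        out_exponent = a.exponent + b.exponent;
    }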

Memory storage is much improved by storing the exponent bits only once per block (or once per vector), rather than in every number. But unlike quantization, there is no loss of precision, because each number keeps the same number of mantissa bits; only the exponent bits are shared across multiple numbers.

Cons of BFP. Notes on practical problems and issues with BFP:

  • Outliers. The basic way that BFP works is to scale all the numbers in a block to its largest value. This largest value sets the exponent, because the exponent must be chosen so as to represent it. However, having very large weights or activations that are "outliers" is problematic. Similarly, very small fractional numbers or high-magnitude negative values can also be difficult. The real issue is the range of values in a single block, and whether a single exponent can represent all of the values in the block without losing precision (digits) for any of them. Typically, the largest value sets the exponent, and it is the low-magnitude values that may lose precision in the conversion to BFP if the exponent is too high (see the encoding sketch after this list).
  • Overflow. Numeric overflow of integer arithmetic may need accounting for. However, this issue is not specific to BFP, and probably is neither better nor worse than other numeric representations.
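To illustrate how the largest value sets the exponent, here is a hedged sketch of converting a block of floats to shared-exponent form with 8-bit mantissas; the mantissa width and function names are assumptions, but the sketch shows how small values lose low-order bits when an outlier forces a large exponent.

    #include <cstdint>
    #include <cmath>
    #include <cstddef>

    // Convert a block of floats to 8-bit mantissas with one shared exponent.
    // The exponent is chosen so the largest-magnitude value just fits in 8 bits;
    // much smaller values in the same block are rounded and may lose precision.
    void bfp_encode(const float* x, size_t n, int8_t* mantissa, int& exponent) {
        float max_abs = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (std::fabs(x[i]) > max_abs) max_abs = std::fabs(x[i]);
        }
        // Shared exponent: scale so max_abs maps near the top of the int8 range (127).
        exponent = (max_abs > 0.0f) ? (int)std::ceil(std::log2(max_abs / 127.0f)) : 0;
        float scale = std::ldexp(1.0f, -exponent);   // 2^(-exponent)
        for (size_t i = 0; i < n; ++i) {
            mantissa[i] = (int8_t)std::lrint(x[i] * scale);  // x[i] is approximately mantissa[i] * 2^exponent
        }
    }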

Detailed Notes. Additional notes about BFP formats:

  • No implicit mantissa bit. Whereas the traditional IEEE floating-point format has an implicit leading 1 in the mantissa (the "hidden" mantissa bit), this is not the case for BFP. Requiring the "normalization" of all numbers to have a leading 1 in the mantissa is not workable with a shared exponent. Instead, a number is represented as 2^exponent * 0.mantissa, rather than the IEEE version, which is 2^exponent * 1.mantissa. Hence, the mantissa bits in BFP are effectively an integer and can be processed as such with integer multiplication.
  • Variations. Different numbers of bits can be used for the mantissas and the exponent to get different formats. Also, mixed precision formats are possible where these parameters vary per block, called "Hybrid BFP" (HBFP).
  • Add-as-integer fails. There is an unusual add-as-integer method for standard IEEE floating-point numbers (Mogami, 2020), whereby simply adding the mantissa bits (and exponent bits) implements Mitchell's approximate multiplication. This would be desirable because it uses integer addition, whereas BFP dot products require integer multiplication. However, trying to emulate add-as-integer by simply adding the mantissa integers in BFP doesn't work well. The problem is that BFP mantissa values are not normalized, and there's no implicit leading 1 bit for the mantissa in BFP. One of the reasons add-as-integer is close to multiplication is that the numbers it adds are fractions after an implicit leading 1 prefix. Adding BFP mantissa values is not at all accurate, and not useful as an approximation.
  • Per-number BFP. Note that if you think of super-sizing BFP so that every number has a separate exponent, such as to overcome the outlier problems, well, that's almost back to standard floating-point arithmetic (and it's lost all of the advantages in reduced memory). The hardware floating-point unit (FPU) does integer arithmetic on the mantissa and exponent bits to implement floating-point arithmetic. Floating-point multiply is an integer multiplication of mantissa bits and integer addition of exponent bits, but it's all done in hardware. However, this super-BFP does differ from standardized IEEE 754 floating-point arithmetic in that (a) it has a leading zero implicitly in the mantissa rather than a leading one bit, and (b) there's no bias factor that adjusts the exponent bits.
  • Global BFP. And if you go the other way, and think you'll have the same exponent number for every single weight in a model file, well, that's effectively going back to fixed-point arithmetic.

BFP Research. Research papers on Block Floating-Point:

Advanced Number Systems

Various alternative number systems have been used in computer software and hardware. They have different trade-offs in terms of precision and arithmetic complexity. Several of these have been examined for model inference, mainly with the aim of reducing the number of multiplications. The main ones include:

  • Residue Number System (RNS)
  • Posit Number System (PNS)
  • Logarithmic Number System (LNS); see logarithmic models
  • Dyadic numbers

If you enjoy this kind of stuff, there are also some more obscure ones:

  • Tropical algebra, based on max/min with addition; see also Max-Plus networks.
  • Minimax algebra (related to tropical algebra)
  • Double-Base Number System (DBNS)
  • Multiple-Base Number System (MBNS)
  • Multi-Dimensional Logarithmic Number System (MDLNS)
  • Semi-Logarithmic Number System (SLNS)

Dyadic Numbers

One type of numeric representation that has received some specific attention in relation to neural networks is dyadic numbers (or "dyadic rationals"). These are rational numbers represented as a division with an integer numerator and a power-of-two denominator.
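As a small illustration (not taken from the papers below), a dyadic rational can be held as an integer numerator plus a power-of-two shift, so multiplication is an integer multiply plus an addition of shifts; the struct and function names are assumptions for this sketch.

    #include <cstdint>

    // A dyadic rational m / 2^k, stored as integer numerator m and shift k.
    struct Dyadic {
        int64_t m;   // numerator
        int k;       // denominator is 2^k
    };

    // Multiplying dyadic rationals: multiply numerators, add the power-of-two shifts.
    // (m1 / 2^k1) * (m2 / 2^k2) = (m1*m2) / 2^(k1+k2)
    Dyadic dyadic_mul(Dyadic a, Dyadic b) {
        Dyadic r;
        r.m = a.m * b.m;
        r.k = a.k + b.k;
        return r;
    }

    // Example: 13/8 is { 13, 3 }, i.e. 13 / 2^3 = 1.625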

  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680
  • Nilsson, J. On numbers badly approximable by dyadic rationals. Isr. J. Math. 171, 93–110 (2009), https://doi.org/10.1007/s11856-009-0042-9
  • Dyadic Rationals and Surreal Number Theory, C. Avalos-Ramos, J. A. Felix-Algandar, J. A. Nieto, IOSR Journal of Mathematics (IOSR-JM), e-ISSN: 2278-5728, p-ISSN: 2319-765X. Volume 16, Issue 5 Ser. IV (Sep– Oct 2020), p.35-43, www.iosrjournals.org, PDF: https://www.academia.edu/download/64798757/E1605043543.pdf
  • David Spuler, March 2024, Chapter 55. Advanced Number Systems, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9

See also dyadic quantization.

Residue Number System

The residue number system (RNS) represents an integer by its remainders (residues) modulo a set of pairwise coprime moduli. Addition, subtraction, and multiplication can then be performed independently on each residue, with no carries propagating between them, which makes the arithmetic highly parallel; comparison, overflow detection, and division are the difficult operations.
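As a tiny hedged sketch (the moduli are chosen arbitrarily), here is an RNS representation with moduli {3, 5, 7}, which uniquely represents integers 0..104; multiplication is just per-residue modular multiplication.

    #include <cstdio>

    // Residue Number System with pairwise coprime moduli {3, 5, 7}: range 0..3*5*7-1 = 0..104.
    const int kModuli[3] = { 3, 5, 7 };

    struct Rns { int r[3]; };   // the three residues

    Rns rns_encode(int x) {
        Rns out;
        for (int i = 0; i < 3; ++i) out.r[i] = x % kModuli[i];
        return out;
    }

    // Multiplication (and addition) work residue-by-residue, with no carries between them.
    Rns rns_mul(Rns a, Rns b) {
        Rns out;
        for (int i = 0; i < 3; ++i) out.r[i] = (a.r[i] * b.r[i]) % kModuli[i];
        return out;
    }

    int main() {
        Rns a = rns_encode(8), b = rns_encode(9);
        Rns c = rns_mul(a, b);   // represents 72 = (72%3, 72%5, 72%7) = (0, 2, 2)
        printf("%d %d %d\n", c.r[0], c.r[1], c.r[2]);
        return 0;
    }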

Posit Number System

The posit number system is an alternative to IEEE floating-point in which the exponent is encoded with a variable-length "regime" field plus optional exponent bits, giving tapered precision: more mantissa bits for values near 1.0 and fewer for very large or very small magnitudes. Posit numbers have been used in neural networks as an alternative to floating-point numbers. Also possible is "posit quantization".

  • JK Lee, L Mukhanov, AS Molahosseini, Resource-Efficient Convolutional Networks: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques, 2023, https://dl.acm.org/doi/abs/10.1145/3587095, PDF: https://dl.acm.org/doi/pdf/10.1145/3587095
  • Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and D. Kudithipudi, Deep positron: A deep neural network using the posit number system. 2019, In Proceedings of the 2019 Design, Automation, and Test in Europe Conference and Exhibition (DATE’19). 1421–1426, https://arxiv.org/abs/1812.01762
  • Zachariah Carmichael, Hamed F. Langroudi, Char Khazanov, Jeffrey Lillie, John L. Gustafson, and Dhireesha Kudithipudi, Performance-efficiency trade-off of low-precision numerical formats in deep neural networks, 2019, In Proceedings of the 2019 Conference for Next Generation Arithmetic (CoNGA’19), ACM, New York, NY, Article 3, 9 pages, https://doi.org/10.1145/3316279.3316282
  • G Alsuhli, V Sakellariou, H Saleh, M Al-Qutayri, Number Systems for Deep Neural Network Architectures: A Survey, 2023, https://arxiv.org/abs/2307.05035 (Survey of number systems with good coverage of Posits.)
  • Jinming Lu, Siyuan Lu, Zhisheng Wang, Chao Fang, Jun Lin, Zhongfeng Wang, Li Du, Sep 2019, Training Deep Neural Networks Using Posit Number System, 2019 32nd IEEE International System-on-Chip Conference (SOCC), https://ieeexplore.ieee.org/abstract/document/9088105/, https://arxiv.org/pdf/1909.03831
  • Raul Murillo, Alberto A. Del Barrio, Guillermo Botella, Min Soo Kim, HyunJin Kim, Nader Bagherzadeh, R Murillo, 2021, PLAM: A posit logarithm-approximate multiplier, https://arxiv.org/pdf/2102.09262
  • F. de Dinechin, L. Forget, J.-M. Muller, and Y. Uguen, “Posits: the good, the bad and the ugly,” in Proceedings of the Conference for Next Generation Arithmetic 2019. New York, NY, USA: ACM, mar 2019, pp. 1–10. https://dl.acm.org/doi/10.1145/3316279.3316285
  • R. Murillo, A. A. Del Barrio, and G. Botella, “Deep PeNSieve: A deep learning framework based on the posit number system,” Digital Signal Processing, vol. 102, p. 102762, Jul 2020. https://www.sciencedirect.com/science/article/abs/pii/S105120042030107X (An example of "posit quantization".)
  • H. F. Langroudi, Z. Carmichael, and D. Kudithipudi, “Deep Learning Training on the Edge with Low-Precision Posits,” arXiv e-prints, pp. 1474–1479, Jul 2019. https://arxiv.org/abs/1907.13216
  • M. K. Jaiswal and H. K. So, “Universal number posit arithmetic generator on FPGA,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), vol. 2018-Janua. IEEE, mar 2018, pp. 1159–1162. https://ieeexplore.ieee.org/document/8342187
  • R. Chaurasiya et al., “Parameterized Posit Arithmetic Hardware Generator,” in 2018 IEEE 36th International Conference on Computer Design (ICCD). IEEE, oct 2018, pp. 334–341. https://ieeexplore.ieee.org/document/8615707
  • M. K. Jaiswal and H. K. So, “PACoGen: A hardware posit arithmetic core generator,” IEEE Access, vol. 7, pp. 74 586–74 601, 2019. https://ieeexplore.ieee.org/document/8731915
  • Y. Uguen, L. Forget, and F. de Dinechin, “Evaluating the Hardware Cost of the Posit Number System,” in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, sep 2019, pp. 106–113. https://ieeexplore.ieee.org/document/8892116
  • R. Murillo, A. A. Del Barrio, and G. Botella, “Customized posit adders and multipliers using the FloPoCo core generator,” in 2020 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, oct 2020, pp. 1–5. https://ieeexplore.ieee.org/document/9180771
  • M. Cococcioni, F. Rossi, E. Ruffaldi, and S. Saponara, “Fast deep neural networks for image processing using posits and ARM scalable vector extension,” Journal of Real-Time Image Processing volume 17, pages759–771 (2020), https://link.springer.com/article/10.1007/s11554-020-00984-x
  • Jinming Lu; Chao Fang; Mingyang Xu; Jun Lin; Zhongfeng Wang, 2020, “Evaluations on Deep Neural Networks Training Using Posit Number System,” IEEE Transactions on Computers, vol. 14, no. 8, pp. 1–1, 2020. https://ieeexplore.ieee.org/document/9066876
  • M. K. Jaiswal and H. K. So, “Architecture Generator for Type-3 Unum Posit Adder/Subtractor,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), vol. 2018-May. IEEE, 2018, pp. 1–5. https://ieeexplore.ieee.org/document/8351142
  • J. L. Gustafson and I. Yonemoto, “Beating Floating Point at its Own Game: Posit Arithmetic,” Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, Jun 2017, https://dl.acm.org/doi/10.14529/jsfi170206, PDF: https://gwern.net/doc/ai/nn/sparsity/low-precision/2017-gustafson.pdf
  • S. H. F. Langroudi, T. Pandit and D. Kudithipudi, "Deep learning inference on embedded devices: Fixed-point vs posit", Proc. 1st Workshop Energy Efficient Mach. Learn. Cognit. Comput. Embedded Appl. (EMC2), pp. 19-23, Mar. 2018. https://arxiv.org/abs/1805.08624

Tropical Algebra (Max-Plus)

Tropical algebra is an algebraic system (a semiring) based on addition and maximum operations. The "max-plus" tropical algebra can be used to analyze zero-multiplication AI models that use addition and maximum functions; see Max-Plus networks. There is also a "min-plus" tropical algebra using minimum and addition operations. Tropical algebra is also closely related to "minimax" algebra.

Some other areas of theory are somewhat related to tropical algebra. One method to approximate Logarithmic Number System (LNS) addition is to use maximum and addition. Also related is the calculation of Softmax, so tropical algebra may have relevance to approximation of Softmax using maximum and addition.
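For intuition, here is a hedged C++ sketch of a single "max-plus" dot product, replacing the usual multiply-accumulate of a dense layer with add-then-maximum; this is an illustration of the algebra, not any specific Max-Plus network from the papers below.

    #include <cstddef>
    #include <limits>

    // One max-plus "dot product": in tropical (max,+) algebra, multiplication becomes
    // addition and addition becomes maximum, so y = max_i (w[i] + x[i]).
    float maxplus_dot(const float* w, const float* x, size_t n) {
        float best = -std::numeric_limits<float>::infinity();
        for (size_t i = 0; i < n; ++i) {
            float v = w[i] + x[i];       // tropical "multiply" = ordinary add
            if (v > best) best = v;      // tropical "add" = maximum
        }
        return best;
    }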

Research papers on tropical algebra:

  • L. Zhang, G. Naitzat, and L.-H. Lim, “Tropical geometry of deep neural networks,” in Proc. Int’l Conf. on Machine Learning, vol. 80, pp. 5824–5832, PMLR, 2018. https://arxiv.org/abs/1805.07091 (Analysis of neural network architecture using tropical algebra.)
  • P. Maragos, V. Charisopoulos, and E. Theodosis, 2021, Tropical Geometry and Machine Learning, https://ieeexplore.ieee.org/document/9394420
  • G Smyrnis, P Maragos, 2019, Tropical polynomial division and neural networks arXiv preprint arXiv:1911.12922, https://arxiv.org/abs/1911.12922 (Lots of tropical algebra theory, but also a neural network approximation tested.)
  • V. Charisopoulos and P. Maragos, 2018, “A tropical approach to neural networks with piecewise linear activations,” arXiv preprint arXiv:1805.08749, https://arxiv.org/abs/1805.08749
  • Diane Maclagan and Bernd Sturmfels. Introduction to tropical geometry, volume 161. American Mathematical Soc., 2015, https://bookstore.ams.org/gsm-161
  • Wikipedia, Tropical geometry, https://en.wikipedia.org/wiki/Tropical_geometry
  • Smyrnis G and Maragos P. Multiclass neural network minimization via tropical Newton polytope approximation. Proceedings of the 37th International Conference on Machine Learning, pp. 9068-9077, https://dl.acm.org/doi/10.5555/3524938.3525779
  • David Spuler, March 2024, Chapter 51. Zero-Multiplication Models, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Ioannis Kordonis, Petros Maragos, 27 Jun 2023, Revisiting Tropical Polynomial Division: Theory, Algorithms and Application to Neural Networks, https://arxiv.org/abs/2306.15157

Minimax Algebra

Papers on minimax algebra:

Log-Sum-Exp Networks

Log-sum-exp (LSE) networks are based on a formula that applies three steps to a vector of numbers: exponentiate each element, sum the results, and take the logarithm of that sum. This is a theoretically interesting area, but not a mainstream neural network architecture.

Haven't we seen the log-sum-exp pattern elsewhere? Yes, several other areas of research are related to log-sum-exp theory. Because logarithmic number system (LNS) addition involves computing exponentials of log-domain values (i.e. antilogarithms), adding them, and then re-converting them to log-domain, this is also emulating a "log of a sum of exponentials" calculation. Hence, log-sum-exp theory relates to approximating LNS addition (for a zero-multiplication logarithmic model). Also, the "sum of exponentials" is the same as the calculation required for the denominator of Softmax calculations, so log-sum-exp theory is also indirectly related to Softmax approximation. Finally, since the use of the maximum function is one way to approximate log-sum-exp (and also LNS addition), the theory of "max-plus networks" based on "tropical algebra" is indirectly related to log-sum-exp networks.
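As a concrete reference point, here is a hedged sketch of the standard numerically stable log-sum-exp computation; factoring out the maximum both avoids overflow and shows why the maximum (as in tropical/max-plus algebra) is a natural approximation, since the result is the max plus a non-negative correction term.

    #include <cmath>
    #include <cstddef>

    // Numerically stable log-sum-exp: log(sum_i exp(x[i])). Assumes n >= 1.
    float log_sum_exp(const float* x, size_t n) {
        float m = x[0];
        for (size_t i = 1; i < n; ++i) if (x[i] > m) m = x[i];   // find the maximum
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) sum += std::exp(x[i] - m);
        return m + std::log(sum);   // dropping log(sum) leaves just the maximum
    }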

Research papers on log-sum-exp (LSE) networks:

Trigonometric Approximations

A very surprising way to replace multiplications with additions is to use trigonometric approximations. Note that this is not the same topic as the "trigonometric neural networks" found in other papers.
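For intuition only (the research papers use their own specific schemes), the classical product-to-sum identity cos(A)cos(B) = (1/2)[cos(A-B) + cos(A+B)] shows how a product can be computed from additions and subtractions plus table lookups of cosines; this pre-logarithm technique was historically known as prosthaphaeresis.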

Double-Base Number System (DBNS)

The DBNS is an advanced number system where a number is represented using two bases, rather than the single base of a normal number system. In non-DBNS terms, decimal is base 10 and binary is base 2. But in DBNS, you can have a "base 2 and 3" number, which is represented as a sum of terms that are each a product of a power of 2 and a power of 3 (i.e., terms of the form 2^a * 3^b).

You can also have more than 2 bases, in which case it is called the Multiple-Base Number System (MBNS).

Note that there is also an extension of the logarithmic number system (LNS) called Multi-dimensional LNS (MDLNS).
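Returning to the two-base idea, here is a hedged sketch of the standard greedy method for expressing an integer as a sum of terms of the form 2^a * 3^b; for example, the greedy method finds 41 = 36 + 4 + 1 = 2^2*3^2 + 2^2 + 1. The function name is an assumption for this sketch.

    #include <cstdio>

    // Greedy double-base (bases 2 and 3) representation: repeatedly subtract the
    // largest term of the form 2^a * 3^b that still fits. This typically produces
    // relatively few terms, which is the appeal of DBNS for reducing arithmetic.
    void dbns_greedy(long long n) {
        while (n > 0) {
            long long best = 1;
            for (long long p3 = 1; p3 <= n; p3 *= 3) {        // powers of 3
                for (long long t = p3; t <= n; t *= 2) {      // times powers of 2
                    if (t > best) best = t;
                }
            }
            printf("%lld ", best);    // one 2^a * 3^b term of the representation
            n -= best;
        }
        printf("\n");
    }

    int main() { dbns_greedy(41); return 0; }   // prints: 36 4 1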

Research papers on the DBNS or MBNS:

Hybrid Number Systems

As if the DBNS/MBNS systems were not enough, there are various "hybrid" number systems that combine features from two or more different number systems. A related idea is a "dynamic number system", where the model changes aspects or parameters of its number system on the fly during inference or training.

Papers on hybrid number systems include:

  • G Alsuhli, V Sakellariou, H Saleh, M Al-Qutayri, Number Systems for Deep Neural Network Architectures: A Survey, 2023, https://arxiv.org/abs/2307.05035 (This survey paper has a section on hybrid number systems and dynamic number systems.)
