Aussie AI

Advanced AI Mathematics Research

  • Last Updated 29 September, 2024
  • by David Spuler, Ph.D.

There is much research on advanced mathematics to optimize inference algorithms and training, mainly focused on the number representations and alternative number systems covered below.

Advanced Numeric Representations

There are various ways to represent numbers in computers. Programming languages provide a few standard representations, but these are not the only options. Here's a list of the basic ones:

  • Integer 2's complement
  • Integer 1's complement
  • Floating-point
  • Fixed-point

And here are some of the more advanced research types:

  • Brain float (BF16 and BF32)
  • Block floating-point (BFP)
  • Hybrid block floating-point (HBFP)

Although much research focuses on 16-bit floating-point or integer representations of weights, there are some more novel and interesting alternatives. Programmers are used to the current ways that computers store numbers as bits, but these are just traditions that arose from trade-offs which may no longer apply with modern increases in computing power. None of these techniques seems to be in practical use yet, but it's possible that a significant breakthrough still lies in this area of research.

General number representation research:

Fixed-Point Arithmetic

Most programming languages use floating-point arithmetic to represent fractional numbers, but there's a simpler method: fixed-point arithmetic. The idea is to have a single scaling factor, rather than storing an exponent in each number. The main advantage is that it's all integers.

Fixed point has many applications that are simpler than AI. For example, storing money in a dollars-and-cents format with a scaling factor of 100 (cents) allows $10.23 to be represented as the integer "1023" rather than a more complex floating-point number.

To understand how it works, let's work in decimal with a scaling factor of 1000. The fractional number "1.234" when scaled by 1000 becomes "1234", and "5.67" becomes "5670".

Addition of fixed-point numbers is just integer addition. Adding "1.234+5.67" becomes the integer addition "1234+5670", which yields the integer "6904", representing "6.904". There is no additional scaling required after integer addition.

Multiplication of two fixed-point numbers is integer multiplication plus a re-scaling step. Multiplying "1.234*5.67" to get "6.99678" becomes "1234*5670", which gives the integer "6,996,780". You can see that the number is too high because it has been scaled twice. Hence, we need to divide the integer by the 1,000 scaling factor to get "6,996" (truncating the lower fractional digits), which represents "6.996". So fixed-point format changes a floating-point multiplication into an integer multiplication, followed by an integer division to re-adjust the scaling factor.

In practice, fixed-point numbers would use a power-of-two scaling factor (e.g. 1,024), rather than a decimal (e.g. 1,000), because that (a) allows bitshifting to re-adjust the scale in multiplication, and (b) also simplifies conversion to/from floating point numbers, because they're stored with a power-of-two exponent.
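To make this concrete, here is a minimal C++ sketch of fixed-point arithmetic with a power-of-two scaling factor of 1024 (10 fractional bits). The type name, constants, and helper functions are purely illustrative, not from any particular library.

    #include <cstdint>
    #include <cstdio>

    // Minimal fixed-point sketch: 32-bit integers with 10 fractional bits (scale = 1024).
    typedef int32_t fixed_t;
    const int kFracBits = 10;                // scaling factor 2^10 = 1024
    const int32_t kScale = 1 << kFracBits;

    fixed_t to_fixed(float x)   { return (fixed_t)(x * kScale); }   // float -> fixed
    float   to_float(fixed_t x) { return (float)x / kScale; }       // fixed -> float

    fixed_t fixed_add(fixed_t a, fixed_t b) { return a + b; }       // plain integer add
    fixed_t fixed_mul(fixed_t a, fixed_t b) {
        // Integer multiply over-scales by 2^10, so shift right once to re-scale.
        return (fixed_t)(((int64_t)a * (int64_t)b) >> kFracBits);
    }

    int main() {
        fixed_t x = to_fixed(1.234f), y = to_fixed(5.67f);
        printf("sum=%f product=%f\n", to_float(fixed_add(x, y)), to_float(fixed_mul(x, y)));
        return 0;
    }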

Note that fixed point is effectively just storing the mantissa bits as an integer, and not storing the exponent anywhere. The sign bit is now simply the first bit of the integer, as in normal integer representations, so negative fixed-point values are represented as negative integers. We've also changed the meaning of the mantissa so it's not "normalized", which means there's no longer any "implicit" or "hidden" leading 1 bit. In fixed-point, the integer is exactly as given, without any hidden leading bit.

Memory reduction. Fixed point numbers require less memory storage. It's like we've taken all the exponent bits and thrown them away. In standard floating-point 32-bit numbers (i.e., float), there are 8 bits for the exponent (plus 1 sign bit, 23 mantissa bits, and 1 extra hidden implicit mantissa bit). With fixed-point, these 8 bits aren't needed, because we've effectively made the exponent a global variable for all the numbers. Hence, memory requirements have reduced from 32 bits to 24 bits.

Precision. We can store 1 decimal digit in about log2(10)=3.32 bits. Hence, the 24 bits of IEEE 32-bit floating-point (FP32) can store about 7 digits of precision. If we store in fixed-point with a 1024 scale factor, then we've only really got 3 digits after the decimal point. On the other hand, a higher scaling factor could be used. Generally, fixed-point has more trouble storing a high range of values (large magnitude to small fractional values) with high precision (number of fractional digits), compared to floating-point.

But this is AI and we may not need that much precision in computations. Huge weight values are not necessary in most models, and could be reduced if needed anyway. At runtime, the activations are frequently normalized, so they shouldn't reach huge magnitudes during dynamic computations. Also, by discarding the 8-bit exponent, we could even gain precision in fixed-point, in terms of fractional digits, because we can use a 32-bit mantissa (rather than only 24 bits).

Integer AI inference. Hence, a simple way to get an integer AI engine is to switch to fixed-point with a power-of-two scaling factor. Vector dot products become fixed-point multiplications (i.e. an integer multiplication and a bitshift re-scale for each pair), followed by addition of all the paired products (i.e. accumulate).

In fact, we can further optimize this. Since all the paired products are over-scaled by the same factor, we can do the bitshift re-scale once, after adding them all up. Furthermore, the paired multiplications and additions can be merged into "fused multiply-add" primitives (FMA or MAC), followed by one single bitshift re-scale. The only problem with that idea is the risk of integer overflow if either the computed values or the scaling factor is too large.
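As a rough illustration of this deferred re-scaling, here is a hedged C++ sketch of a fixed-point dot product that accumulates the raw paired products in a 64-bit integer and applies a single bitshift at the end (the function and constant names are illustrative assumptions):

    #include <cstdint>
    #include <cstddef>

    // Dot product of two fixed-point vectors (10 fractional bits each).
    // Each product a[i]*b[i] is over-scaled by 2^10; instead of shifting every
    // product, accumulate in 64 bits and do one shift at the end.
    const int kFracBits = 10;

    int32_t fixed_dot(const int32_t* a, const int32_t* b, size_t n) {
        int64_t acc = 0;                           // wide accumulator to delay overflow
        for (size_t i = 0; i < n; ++i) {
            acc += (int64_t)a[i] * (int64_t)b[i];  // multiply-accumulate (MAC)
        }
        return (int32_t)(acc >> kFracBits);        // single re-scale for the whole sum
    }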

End-to-end integers. Using fixed point arithmetic allows integer calculations for the main tensors and their matrix multiplication or vector dot products. However, there are some problems with the non-linear components such as activations (e.g. GELU, although RELU is easy to do in integers), Softmax, and other pieces.

On the other hand, we could use a pre-computed lookup table to map integer-to-integer for any of these non-linear functions. This is a fast calculation, and there is also often hardware acceleration support for LUT-based calculations.

The size of the LUTs can be an issue. A 16-bit LUT with 2^16 entries (about 65,000 values at 2 bytes each) is easy to do, even on a laptop, but if we're using 32-bit integers then the LUT has 2^32 entries (about 4.3 billion), each 4 bytes, so we're looking at over 17GB of read-only storage for the table (and we'll need more than one). This amount of storage may be problematic on low-end devices such as phones and tablets, but is fine for a high-end data center box with lots of RAM. Where it's a problem, we can also approximate a 32-bit LUT with only 16 or 24 bits, losing some precision.
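As an example of the 16-bit case, here is a hedged sketch of building a 2^16-entry table for an integer GELU, indexed directly by the bit pattern of a 16-bit fixed-point input. The fractional-bit count and function names are assumptions for illustration only.

    #include <cstdint>
    #include <cmath>
    #include <vector>

    // Precomputed LUT: maps every 16-bit fixed-point input to GELU(x) in the same format.
    const int kLutFracBits = 10;                     // assumed fractional bits
    const float kLutScale = (float)(1 << kLutFracBits);

    std::vector<int16_t> build_gelu_lut() {
        std::vector<int16_t> lut(1 << 16);
        for (int i = 0; i < (1 << 16); ++i) {
            int16_t raw = (int16_t)(uint16_t)i;      // reinterpret table index as signed input
            float x = raw / kLutScale;
            float g = 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));  // GELU via erf
            lut[i] = (int16_t)std::lrint(g * kLutScale);
        }
        return lut;
    }
    // Usage: int16_t y = gelu_lut[(uint16_t)x];  // one lookup replaces the whole nonlinearity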

Research on Fixed-Point Arithmetic. Papers include:

  • Arash Fayyazi, Mahdi Nazemi, Armin Abdollahi, Massoud Pedram, 7 Jul 2023, BlendNet: Design and Optimization of a Neural Network-Based Inference Engine Blending Binary and Fixed-Point Convolutions, https://arxiv.org/abs/2307.03784
  • D. D. Lin, S. S. Talathi and V. S. Annapureddy, Fixed point quantization of deep convolutional networks, Proc. 33rd Int. Conf. Int. Conf. Mach. Learn., vol. 48, pp. 2849-2858, Jun. 2016. https://arxiv.org/abs/1511.06393 (Fixed point quantization.)
  • S. Gupta, A. Agrawal, K. Gopalakrishnan and P. Narayanan, Deep learning with limited numerical precision, Proc. 32nd Int. Conf. Mach. Learn. (ICML), vol. 37, pp. 1737-1746, 2015. https://arxiv.org/abs/1502.02551 (Fixed point quantization.)
  • S. Anwar, K. Hwang and W. Sung, Fixed point optimization of deep convolutional neural networks for object recognition, Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), pp. 1131-1135, Apr. 2015. https://ieeexplore.ieee.org/document/7178146 (Fixed point quantization.)
  • P. Gysel, M. Motamedi and S. Ghiasi, Hardware-oriented approximation of convolutional neural networks, Proc. Int. Conf. Learn. Represent., pp. 1-15, 2016. https://arxiv.org/abs/1604.03168 (Fixed point quantization.)
  • Wikipedia, 2024 (accessed), Fixed-point arithmetic, https://en.wikipedia.org/wiki/Fixed-point_arithmetic
  • Reza Arablouei, Liang Wang, Caitlin Phillips, Lachlan Currie, Jordan Yates, Greg Bishop-Hurley, 20 Dec 2022 ( v2), In-situ animal behavior classification using knowledge distillation and fixed-point quantization, https://arxiv.org/abs/2209.04130
  • Ahmad Shawahna, Sadiq M. Sait, Aiman El-Maleh, Irfan Ahmad, 22 Mar 2022, FxP-QNet: A Post-Training Quantizer for the Design of Mixed Low-Precision DNNs with Dynamic Fixed-Point Representation, https://arxiv.org/abs/2203.12091
  • Qing Jin, Jian Ren, Richard Zhuang, Sumant Hanumante, Zhengang Li, Zhiyu Chen, Yanzhi Wang, Kaiyuan Yang, Sergey Tulyakov, 10 Feb 2022, F8Net: Fixed-Point 8-bit Only Multiplication for Network Quantization, https://arxiv.org/abs/2202.05239
  • Lorenz Kummer, Kevin Sidak, Tabea Reichmann, Wilfried Gansterer, 27 Aug 2021 ( v4), Adaptive Precision Training (AdaPT): A dynamic fixed point quantized training approach for DNNs, https://arxiv.org/abs/2107.13490
  • Hamed F. Langroudi, Vedant Karia, Tej Pandit, Dhireesha Kudithipudi, 6 Apr 2021, TENT: Efficient Quantization of Neural Networks on the tiny Edge with Tapered FixEd PoiNT, https://arxiv.org/abs/2104.02233
  • Je Yang, Seongmin Hong, Joo-Young Kim, 24 Feb 2021, FIXAR: A Fixed-Point Deep Reinforcement Learning Platform with Quantization-Aware Training and Adaptive Parallelism, https://arxiv.org/abs/2102.12103
  • Rishabh Goyal, Joaquin Vanschoren, Victor van Acht, Stephan Nijssen, 3 Feb 2021, Fixed-point Quantization of Convolutional Neural Networks for Quantized Inference on Embedded Platforms, https://arxiv.org/abs/2102.02147
  • Heming Sun, Zhengxue Cheng, Masaru Takeuchi, Jiro Katto, 9 Jul 2020, End-to-end Learned Image Compression with Fixed Point Weight Quantization, https://arxiv.org/abs/2007.04684
  • Lukas Enderich, Fabian Timm, Wolfram Burgard, 19 Feb 2020, SYMOG: learning symmetric mixture of Gaussian modes for improved fixed-point quantization, https://arxiv.org/abs/2002.08204
  • Hongxing Gao, Wei Tao, Dongchao Wen, Tse-Wei Chen, Kinya Osa, Masami Kato, 19 Nov 2019, IFQ-Net: Integrated Fixed-point Quantization Networks for Embedded Vision, https://arxiv.org/abs/1911.08076
  • MohammadHossein AskariHemmat, Sina Honari, Lucas Rouhier, Christian S. Perone, Julien Cohen-Adad, Yvon Savaria, Jean-Pierre David, 9 Sep 2019 ( v2), U-Net Fixed-Point Quantization for Medical Image Segmentation, https://arxiv.org/abs/1908.01073
  • Charbel Sakr, Naresh Shanbhag, 31 Dec 2018, Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm, https://arxiv.org/abs/1812.11732
  • Darryl D. Lin, Sachin S. Talathi, V. Sreekanth Annapureddy, 2 Jun 2016 ( v3), Fixed Point Quantization of Deep Convolutional Networks, https://arxiv.org/abs/1511.06393
  • Peng Qiao, Sidun Liu, Tao Sun, Ke Yang, Yong Dou, 29 Jan 2023, Towards Vision Transformer Unrolling Fixed-Point Algorithm: a Case Study on Image Restoration, https://arxiv.org/abs/2301.12332
  • Ali Hadi Zadeh, Mostafa Mahmoud, Ameer Abdelhadi, Andreas Moshovos, 23 Mar 2022, Mokey: Enabling Narrow Fixed-Point Inference for Out-of-the-Box Floating-Point Transformer Models, https://arxiv.org/abs/2203.12758
  • Farhad Taheri, Siavash Bayat-Sarmadi, Hatame Mosanaei-Boorani, Reza Taheri, 19 Feb 2023, Fixflow: A Framework to Evaluate Fixed-point Arithmetic in Light-Weight CNN Inference, https://arxiv.org/abs/2302.09564
  • Sambhav R. Jain, Albert Gural, Michael Wu, Chris H. Dick, 28 Feb 2020 (v3), Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks, https://arxiv.org/abs/1903.08066
  • Seyed H. F. Langroudi, Tej Pandit, Dhireesha Kudithipudi, 22 May 2018, Deep Learning Inference on Embedded Devices: Fixed-Point vs Posit, https://arxiv.org/abs/1805.08624
  • Liangzhen Lai, Naveen Suda, Vikas Chandra, 8 Mar 2017, Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations, https://arxiv.org/abs/1703.03073
  • Naveen Mellempudi, Abhisek Kundu, Dipankar Das, Dheevatsa Mudigere, Bharat Kaul, 1 Feb 2017, Mixed Low-precision Deep Learning Inference using Dynamic Fixed Point, https://arxiv.org/abs/1701.08978
  • Dingyi Dai, Yichi Zhang, Jiahao Zhang, Zhanqiu Hu, Yaohui Cai, Qi Sun, Zhiru Zhang, 31 Jan 2024, Trainable Fixed-Point Quantization for Deep Learning Acceleration on FPGAs, https://arxiv.org/abs/2401.17544
  • Sashank Macha, Om Oza, Alex Escott, Francesco Caliva, Robbie Armitano, Santosh Kumar Cheekatmalla, Sree Hari Krishnan Parthasarathi, Yuzong Liu, 4 Mar 2023, Fixed-point quantization aware training for on-device keyword-spotting, https://arxiv.org/abs/2303.02284
  • Alberto Delmas Lascorz, Mostafa Mahmoud, Ali Hadi Zadeh, Milos Nikolic, Kareem Ibrahim, Christina Giannoula, Ameer Abdelhadi, Andreas Moshovos, 2024, Atalanta: A Bit is Worth a “Thousand” Tensor Values, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, April 2024, Pages 85–102, https://doi.org/10.1145/3620665.3640356 https://dl.acm.org/doi/abs/10.1145/3620665.3640356
  • Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu, 19 Apr 2024, decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points, https://arxiv.org/abs/2404.12759 Code: https://github.com/bytedance/decoupleQ (Decouple parameters into integer and floating-point parts for more accurate quantization at low bitwidths.)
  • SHF Langroudi, 2023, Tapered-Precision Numerical Formats for Deep Learning Inference and Training Ph.D. thesis, Department of Electrical and Computer Engineering Kate Gleason College of Engineering Rochester Institute of Technology Rochester, New York, https://www.proquest.com/openview/a88513887d40ec2e6744025447d7d948/1?pq-origsite=gscholar&cbl=18750&diss=y
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs] (Dec. 2015). http://arxiv.org/abs/1512.03385 arXiv: 1512.03385.
  • Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. 2017. Trained Ternary Quantization. arXiv:1612.01064 [cs] (Feb. 2017). http://arxiv.org/abs/1612.01064 arXiv: 1612.01064
  • Fabien Geyer, Johannes Freitag, Tobias Schulz, Sascha Uhrig, 16 Jan 2024, Efficient and Mathematically Robust Operations for Certified Neural Networks Inference, https://arxiv.org/abs/2401.08225 https://arxiv.org/pdf/2401.08225 (Finds that fixed point arithmetic is more efficient with comparable accuracy to using floating point.)
  • Matthieu Courbariaux, Yoshua Bengio, Jean-Pierre David, 23 Sep 2015 (v5), Training deep neural networks with low precision multiplications, https://arxiv.org/abs/1412.7024

Block Floating-Point (BFP)

Block floating-point is a numeric representation in which each number in a group (block) has its own mantissa (as usual), but the whole group shares a single exponent. Hence, this saves space by not storing an exponent in every number.

Another way to think about this is that the numbers are stored as integers, but with a scaling factor. This means that BFP is very similar to fixed-point arithmetic, except that it has granular per-block or per-vector scaling factors (the exponents). Hence, fixed-point arithmetic with a chosen scaling factor is effectively a global version of BFP with a single exponent.

BFP has an integer (mantissa) and a scaling factor that's a power-of-two (exponent). This is very similar to dyadic numbers, which is another fancy number system. See further below for research papers on the dyadic number system. Another related number system is the posit number system, where numbers have multiple exponent values.

Pros of BFP. The goals of BFP in inference are advantages such as:

  • Integer-only arithmetic (e.g. for dot product)
  • Memory saving by not storing exponents.
  • No loss in accuracy or precision of results.
  • Hardware support already exists (various chips).

BFP achieves integer-only arithmetic for the dot product: all of the paired mantissas are multiplied and accumulated with integer arithmetic, and then the two shared exponents (one per vector) are added to get the new exponent. Hence, the result of a dot product of two BFP vectors is a single value, represented as two numbers: a mantissa and an exponent.
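As a rough sketch (not a production BFP kernel), the dot product of two BFP blocks might look like this in C++, where each block is a vector of integer mantissas plus one shared exponent; the struct and function names are assumptions for illustration:

    #include <cstdint>
    #include <cstddef>

    // One BFP block: integer mantissas sharing a single power-of-two exponent,
    // so element i represents mantissa[i] * 2^exponent.
    struct BfpBlock {
        const int16_t* mantissa;   // per-number integer mantissas
        int exponent;              // shared exponent for the whole block
        size_t n;
    };

    // Dot product of two BFP blocks: integer multiply-accumulate of the mantissas,
    // plus a single integer addition of the two shared exponents.
    // The result value is out_mantissa * 2^out_exponent.
    void bfp_dot(const BfpBlock& a, const BfpBlock& b, int64_t& out_mantissa, int& out_exponent) {
        int64_t acc = 0;
        for (size_t i = 0; i < a.n; ++i) {
            acc += (int64_t)a.mantissa[i] * (int64_t)b.mantissa[i];
        }
        out_mantissa = acc;
        out_exponent = a.exponent + b.exponent;
    }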

Memory storage is much improved by storing the exponent bits only once per block (or once per vector), rather than in every number. But unlike quantization, there is no loss of precision, because each number keeps the same number of mantissa bits; only the exponent bits are shared across multiple numbers.

Cons of BFP. Notes on practical problems and issues with BFP:

  • Outliers. The basic way that BFP works is to scale all the numbers in a block to its largest value. This largest value sets the exponent, because the exponent must be chosen so as to represent it. However, having very large weights or activations that are "outliers" is problematic. Similarly, very small fractional numbers or high-magnitude negative values can also be difficult. The real issue is the range of values in a single block, and whether a single exponent can represent all of the values in the block without losing precision (digits) for any of them. Typically, the largest value sets the exponent, and it is the low-magnitude values that may lose precision in the conversion to BFP if the exponent is too high (see the encoding sketch after this list).
  • Overflow. Numeric overflow of integer arithmetic may need accounting for. However, this issue is not specific to BFP, and probably is neither better nor worse than other numeric representations.
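To illustrate how the largest value sets the exponent, here is a hedged sketch of converting a block of floats to shared-exponent form with 8-bit mantissas; the mantissa width and function names are assumptions, but the sketch shows how small values lose low-order bits when an outlier forces a large exponent.

    #include <cstdint>
    #include <cmath>
    #include <cstddef>

    // Convert a block of floats to 8-bit mantissas with one shared exponent.
    // The exponent is chosen so the largest-magnitude value just fits in 8 bits;
    // much smaller values in the same block are rounded and may lose precision.
    void bfp_encode(const float* x, size_t n, int8_t* mantissa, int& exponent) {
        float max_abs = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (std::fabs(x[i]) > max_abs) max_abs = std::fabs(x[i]);
        }
        // Shared exponent: scale so max_abs maps near the top of the int8 range (127).
        exponent = (max_abs > 0.0f) ? (int)std::ceil(std::log2(max_abs / 127.0f)) : 0;
        float scale = std::ldexp(1.0f, -exponent);   // 2^(-exponent)
        for (size_t i = 0; i < n; ++i) {
            mantissa[i] = (int8_t)std::lrint(x[i] * scale);  // x[i] is approximately mantissa[i] * 2^exponent
        }
    }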

Detailed Notes. Additional notes about BFP formats:

  • No implicit mantissa bit. Whereas the traditional IEEE floating-point format has an implicit leading 1 in the mantissa (the "hidden" mantissa bit), this is not the case for BFP. Requiring the "normalization" of all numbers to have a leading 1 in the mantissa is not workable with a shared exponent. Instead, a number is represented as 2^exponent * 0.mantissa, rather than the IEEE version, which is 2^exponent * 1.mantissa. Hence, the mantissa bits in BFP are effectively an integer and can be processed as such with integer multiplication.
  • Variations. Different numbers of bits can be used for the mantissas and the exponent to get different formats. Also, mixed precision formats are possible where these parameters vary per block, called "Hybrid BFP" (HBFP).
  • Add-as-integer fails. There is an unusual add-as-integer method for standard IEEE floating-point numbers (Mogami, 2020), whereby simply adding the mantissa bits (and exponent bits) implements Mitchell's approximate multiplication. This would be desirable because it uses integer addition, whereas BFP dot products require integer multiplication. However, trying to emulate add-as-integer by simply adding the mantissa integers in BFP doesn't work well. The problem is that BFP mantissa values are not normalized, and there's no implicit leading 1 bit for the mantissa in BFP. One of the reasons add-as-integer is close to multiplication is that the numbers it adds are fractions after an implicit leading 1 prefix. Adding BFP mantissa values is not at all accurate, and not useful as an approximation.
  • Per-number BFP. Note that if you think of super-sizing BFP so that every number has a separate exponent, such as to overcome the outlier problems, well, that's almost back to standard floating-point arithmetic (and it's lost all of the advantages in reduced memory). The hardware floating-point unit (FPU) does integer arithmetic on the mantissa and exponent bits to implement floating-point arithmetic. Floating-point multiply is an integer multiplication of mantissa bits and integer addition of exponent bits, but it's all done in hardware. However, this super-BFP does differ from standardized IEEE 754 floating-point arithmetic in that (a) it has a leading zero implicitly in the mantissa rather than a leading one bit, and (b) there's no bias factor that adjusts the exponent bits.
  • Global BFP. And if you go the other way, and think you'll have the same exponent number for every single weight in a model file, well, that's effectively going back to fixed-point arithmetic.

BFP Research. Research papers on Block Floating-Point:

Advanced Number Systems

Various alternative number systems have been used in computer software and hardware. They have different trade-offs in terms of precision and arithmetic complexity. Several of these have been examined for model inference, mainly with the aim of reducing the number of multiplications. The main ones include:

  • Residue Number System (RNS)
  • Posit Number System (PNS)
  • Logarithmic Number System (LNS); see logarithmic models
  • Dyadic numbers

If you enjoy this kind of stuff, there are also some more obscure ones:

  • Tropical algebra, based on max/min with addition; see also Max-Plus networks.
  • Minimax algebra (related to tropical algebra)
  • Double-Base Number System (DBNS)
  • Multiple-Base Number System (MBNS)
  • Multi-Dimensional Logarithmic Number System (MDLNS)
  • Semi-Logarithmic Number System (SLNS)

Dyadic Numbers

One type of numeric representation that has received some specific attention in relation to neural networks is dyadic numbers (or "dyadic rationals"). These are rational numbers represented as a division with an integer numerator and a power-of-two denominator.
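As a small illustration (not taken from the papers below), a dyadic rational can be held as an integer numerator plus a power-of-two shift, so multiplication is an integer multiply plus an addition of shifts; the struct and function names are assumptions for this sketch.

    #include <cstdint>

    // A dyadic rational m / 2^k, stored as integer numerator m and shift k.
    struct Dyadic {
        int64_t m;   // numerator
        int k;       // denominator is 2^k
    };

    // Multiplying dyadic rationals: multiply numerators, add the power-of-two shifts.
    // (m1 / 2^k1) * (m2 / 2^k2) = (m1*m2) / 2^(k1+k2)
    Dyadic dyadic_mul(Dyadic a, Dyadic b) {
        Dyadic r;
        r.m = a.m * b.m;
        r.k = a.k + b.k;
        return r;
    }

    // Example: 13/8 is { 13, 3 }, i.e. 13 / 2^3 = 1.625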

  • Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, Kurt Keutzer, HAWQ-V3: Dyadic Neural Network Quantization, Proceedings of the 38th International Conference on Machine Learning, PMLR 139:11875-11886, 2021, https://arxiv.org/abs/2011.10680
  • Nilsson, J. On numbers badly approximable by dyadic rationals. Isr. J. Math. 171, 93–110 (2009), https://doi.org/10.1007/s11856-009-0042-9
  • Dyadic Rationals and Surreal Number Theory, C. Avalos-Ramos, J. A. Felix-Algandar, J. A. Nieto, IOSR Journal of Mathematics (IOSR-JM), e-ISSN: 2278-5728, p-ISSN: 2319-765X. Volume 16, Issue 5 Ser. IV (Sep– Oct 2020), p.35-43, www.iosrjournals.org, PDF: https://www.academia.edu/download/64798757/E1605043543.pdf
  • David Spuler, March 2024, Chapter 55. Advanced Number Systems, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9

See also dyadic quantization.

Residue Number System

The residue number system (RNS) represents an integer by its remainders (residues) modulo a set of pairwise coprime moduli. Addition, subtraction, and multiplication can then be performed independently on each residue, with no carries propagating between them, which makes the arithmetic highly parallel; comparison, overflow detection, and division are the difficult operations.
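As a tiny hedged sketch (the moduli are chosen arbitrarily), here is an RNS representation with moduli {3, 5, 7}, which uniquely represents integers 0..104; multiplication is just per-residue modular multiplication.

    #include <cstdio>

    // Residue Number System with pairwise coprime moduli {3, 5, 7}: range 0..3*5*7-1 = 0..104.
    const int kModuli[3] = { 3, 5, 7 };

    struct Rns { int r[3]; };   // the three residues

    Rns rns_encode(int x) {
        Rns out;
        for (int i = 0; i < 3; ++i) out.r[i] = x % kModuli[i];
        return out;
    }

    // Multiplication (and addition) work residue-by-residue, with no carries between them.
    Rns rns_mul(Rns a, Rns b) {
        Rns out;
        for (int i = 0; i < 3; ++i) out.r[i] = (a.r[i] * b.r[i]) % kModuli[i];
        return out;
    }

    int main() {
        Rns a = rns_encode(8), b = rns_encode(9);
        Rns c = rns_mul(a, b);   // represents 72 = (72%3, 72%5, 72%7) = (0, 2, 2)
        printf("%d %d %d\n", c.r[0], c.r[1], c.r[2]);
        return 0;
    }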

Posit Number System

The posit number system is an alternative to IEEE floating-point in which the exponent is encoded with a variable-length "regime" field plus optional exponent bits, giving tapered precision: more mantissa bits for values near 1.0 and fewer for very large or very small magnitudes. Posit numbers have been used in neural networks as an alternative to floating-point numbers. Also possible is "posit quantization".

  • JK Lee, L Mukhanov, AS Molahosseini, Resource-Efficient Convolutional Networks: A Survey on Model-, Arithmetic-, and Implementation-Level Techniques, 2023, https://dl.acm.org/doi/abs/10.1145/3587095, PDF: https://dl.acm.org/doi/pdf/10.1145/3587095
  • Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and D. Kudithipudi, Deep positron: A deep neural network using the posit number system. 2019, In Proceedings of the 2019 Design, Automation, and Test in Europe Conference and Exhibition (DATE’19). 1421–1426, https://arxiv.org/abs/1812.01762
  • Zachariah Carmichael, Hamed F. Langroudi, Char Khazanov, Jeffrey Lillie, John L. Gustafson, and Dhireesha Kudithipudi, Performance-efficiency trade-off of low-precision numerical formats in deep neural networks, 2019, In Proceedings of the 2019 Conference for Next Generation Arithmetic (CoNGA’19), ACM, New York, NY, Article 3, 9 pages, https://doi.org/10.1145/3316279.3316282
  • G Alsuhli, V Sakellariou, H Saleh, M Al-Qutayri, Number Systems for Deep Neural Network Architectures: A Survey, 2023, https://arxiv.org/abs/2307.05035 (Survey of number systems with good coverage of Posits.)
  • Jinming Lu, Siyuan Lu, Zhisheng Wang, Chao Fang, Jun Lin, Zhongfeng Wang, Li Du, Sep 2019, Training Deep Neural Networks Using Posit Number System, 2019 32nd IEEE International System-on-Chip Conference (SOCC), https://ieeexplore.ieee.org/abstract/document/9088105/, https://arxiv.org/pdf/1909.03831
  • Raul Murillo, Alberto A. Del Barrio, Guillermo Botella, Min Soo Kim, HyunJin Kim, Nader Bagherzadeh, R Murillo, 2021, PLAM: A posit logarithm-approximate multiplier, https://arxiv.org/pdf/2102.09262
  • F. de Dinechin, L. Forget, J.-M. Muller, and Y. Uguen, “Posits: the good, the bad and the ugly,” in Proceedings of the Conference for Next Generation Arithmetic 2019. New York, NY, USA: ACM, mar 2019, pp. 1–10. https://dl.acm.org/doi/10.1145/3316279.3316285
  • R. Murillo, A. A. Del Barrio, and G. Botella, “Deep PeNSieve: A deep learning framework based on the posit number system,” Digital Signal Processing, vol. 102, p. 102762, Jul 2020. https://www.sciencedirect.com/science/article/abs/pii/S105120042030107X (An example of "posit quantization".)
  • H. F. Langroudi, Z. Carmichael, and D. Kudithipudi, “Deep Learning Training on the Edge with Low-Precision Posits,” arXiv e-prints, pp. 1474–1479, Jul 2019. https://arxiv.org/abs/1907.13216
  • M. K. Jaiswal and H. K. So, “Universal number posit arithmetic generator on FPGA,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), vol. 2018-Janua. IEEE, mar 2018, pp. 1159–1162. https://ieeexplore.ieee.org/document/8342187
  • R. Chaurasiya et al., “Parameterized Posit Arithmetic Hardware Generator,” in 2018 IEEE 36th International Conference on Computer Design (ICCD). IEEE, oct 2018, pp. 334–341. https://ieeexplore.ieee.org/document/8615707
  • M. K. Jaiswal and H. K. So, “PACoGen: A hardware posit arithmetic core generator,” IEEE Access, vol. 7, pp. 74 586–74 601, 2019. https://ieeexplore.ieee.org/document/8731915
  • Y. Uguen, L. Forget, and F. de Dinechin, “Evaluating the Hardware Cost of the Posit Number System,” in 2019 29th International Conference on Field Programmable Logic and Applications (FPL). IEEE, sep 2019, pp. 106–113. https://ieeexplore.ieee.org/document/8892116
  • R. Murillo, A. A. Del Barrio, and G. Botella, “Customized posit adders and multipliers using the FloPoCo core generator,” in 2020 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, oct 2020, pp. 1–5. https://ieeexplore.ieee.org/document/9180771
  • M. Cococcioni, F. Rossi, E. Ruffaldi, and S. Saponara, “Fast deep neural networks for image processing using posits and ARM scalable vector extension,” Journal of Real-Time Image Processing volume 17, pages759–771 (2020), https://link.springer.com/article/10.1007/s11554-020-00984-x
  • Jinming Lu; Chao Fang; Mingyang Xu; Jun Lin; Zhongfeng Wang, 2020, “Evaluations on Deep Neural Networks Training Using Posit Number System,” IEEE Transactions on Computers, vol. 14, no. 8, pp. 1–1, 2020. https://ieeexplore.ieee.org/document/9066876
  • M. K. Jaiswal and H. K. So, “Architecture Generator for Type-3 Unum Posit Adder/Subtractor,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS), vol. 2018-May. IEEE, 2018, pp. 1–5. https://ieeexplore.ieee.org/document/8351142
  • J. L. Gustafson and I. Yonemoto, “Beating Floating Point at its Own Game: Posit Arithmetic,” Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, Jun 2017, https://dl.acm.org/doi/10.14529/jsfi170206, PDF: https://gwern.net/doc/ai/nn/sparsity/low-precision/2017-gustafson.pdf
  • S. H. F. Langroudi, T. Pandit and D. Kudithipudi, "Deep learning inference on embedded devices: Fixed-point vs posit", Proc. 1st Workshop Energy Efficient Mach. Learn. Cognit. Comput. Embedded Appl. (EMC2), pp. 19-23, Mar. 2018. https://arxiv.org/abs/1805.08624

Tropical Algebra (Max-Plus)

Tropical algebra is an algebraic system (a semiring) based on addition and maximum operations. The "max-plus" tropical algebra can be used to analyze zero-multiplication AI models that use addition and maximum functions; see Max-Plus networks. There is also a "min-plus" tropical algebra using minimum and addition operations. Tropical algebra is also closely related to "minimax" algebra.

Some other areas of theory are somewhat related to tropical algebra. One method to approximate Logarithmic Number System (LNS) addition is to use maximum and addition. Also related is the calculation of Softmax, so tropical algebra may have relevance to approximation of Softmax using maximum and addition.
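For intuition, here is a hedged C++ sketch of a single "max-plus" dot product, replacing the usual multiply-accumulate of a dense layer with add-then-maximum; this is an illustration of the algebra, not any specific Max-Plus network from the papers below.

    #include <cstddef>
    #include <limits>

    // One max-plus "dot product": in tropical (max,+) algebra, multiplication becomes
    // addition and addition becomes maximum, so y = max_i (w[i] + x[i]).
    float maxplus_dot(const float* w, const float* x, size_t n) {
        float best = -std::numeric_limits<float>::infinity();
        for (size_t i = 0; i < n; ++i) {
            float v = w[i] + x[i];       // tropical "multiply" = ordinary add
            if (v > best) best = v;      // tropical "add" = maximum
        }
        return best;
    }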

Research papers on tropical algebra:

  • L. Zhang, G. Naitzat, and L.-H. Lim, “Tropical geometry of deep neural networks,” in Proc. Int’l Conf. on Machine Learning, vol. 80, pp. 5824–5832, PMLR, 2018. https://arxiv.org/abs/1805.07091 (Analysis of neural network architecture using tropical algebra.)
  • P. Maragos, V. Charisopoulos, and E. Theodosis, 2021, Tropical Geometry and Machine Learning, https://ieeexplore.ieee.org/document/9394420
  • G Smyrnis, P Maragos, 2019, Tropical polynomial division and neural networks arXiv preprint arXiv:1911.12922, https://arxiv.org/abs/1911.12922 (Lots of tropical algebra theory, but also a neural network approximation tested.)
  • V. Charisopoulos and P. Maragos, 2018, “A tropical approach to neural networks with piecewise linear activations,” arXiv preprint arXiv:1805.08749, https://arxiv.org/abs/1805.08749
  • Diane Maclagan and Bernd Sturmfels. Introduction to tropical geometry, volume 161. American Mathematical Soc., 2015, https://bookstore.ams.org/gsm-161
  • Wikipedia, Tropical geometry, https://en.wikipedia.org/wiki/Tropical_geometry
  • Smyrnis G and Maragos P. Multiclass neural network minimization via tropical Newton polytope approximation. Proceedings of the 37th International Conference on Machine Learning, pp. 9068-9077, https://dl.acm.org/doi/10.5555/3524938.3525779
  • David Spuler, March 2024, Chapter 51. Zero-Multiplication Models, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
  • Ioannis Kordonis, Petros Maragos, 27 Jun 2023, Revisiting Tropical Polynomial Division: Theory, Algorithms and Application to Neural Networks, https://arxiv.org/abs/2306.15157

Minimax Algebra

Papers on minimax algebra:

Log-Sum-Exp Networks

Log-sum-exp (LSE) networks are based on a formula that applies three steps to a vector of numbers: exponentiate each element, sum the results, and take the logarithm of that sum. This is a theoretically interesting area, but not a mainstream neural network architecture.

Haven't we seen the log-sum-exp pattern elsewhere? Yes, several other areas of research are related to log-sum-exp theory. Because logarithmic number system (LNS) addition involves computing exponentials of log-domain values (i.e. antilogarithms), adding them, and then re-converting them to log-domain, this is also emulating a "log of a sum of exponentials" calculation. Hence, log-sum-exp theory relates to approximating LNS addition (for a zero-multiplication logarithmic model). Also, the "sum of exponentials" is the same as the calculation required for the denominator of Softmax calculations, so log-sum-exp theory is also indirectly related to Softmax approximation. Finally, since the use of the maximum function is one way to approximate log-sum-exp (and also LNS addition), the theory of "max-plus networks" based on "tropical algebra" is indirectly related to log-sum-exp networks.
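As a concrete reference point, here is a hedged sketch of the standard numerically stable log-sum-exp computation; factoring out the maximum both avoids overflow and shows why the maximum (as in tropical/max-plus algebra) is a natural approximation, since the result is the max plus a non-negative correction term.

    #include <cmath>
    #include <cstddef>

    // Numerically stable log-sum-exp: log(sum_i exp(x[i])). Assumes n >= 1.
    float log_sum_exp(const float* x, size_t n) {
        float m = x[0];
        for (size_t i = 1; i < n; ++i) if (x[i] > m) m = x[i];   // find the maximum
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) sum += std::exp(x[i] - m);
        return m + std::log(sum);   // dropping log(sum) leaves just the maximum
    }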

Research papers on log-sum-exp (LSE) networks:

Trigonometric Approximations

A very surprising way to replace multiplications with additions is to use trigonometric approximations. Note that this is not the same topic as the "trigonometric neural networks" found in other papers.
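For intuition only (the research papers use their own specific schemes), the classical product-to-sum identity cos(A)cos(B) = (1/2)[cos(A-B) + cos(A+B)] shows how a product can be computed from additions and subtractions plus table lookups of cosines; this pre-logarithm technique was historically known as prosthaphaeresis.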

Double-Base Number System (DBNS)

The DBNS is an advanced number system where a number is represented using two bases, rather than the single base of a normal number system. In non-DBNS terms, decimal is base 10 and binary is base 2. But in DBNS, you can have a "base 2 and 3" number, which is represented as a sum of terms that are each a product of a power of 2 and a power of 3 (i.e., terms of the form 2^a * 3^b).

You can also have more than 2 bases, in which case it is called the Multiple-Base Number System (MBNS).

Note that there is also an extension of the logarithmic number system (LNS) called Multi-dimensional LNS (MDLNS).
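Returning to the two-base idea, here is a hedged sketch of the standard greedy method for expressing an integer as a sum of terms of the form 2^a * 3^b; for example, the greedy method finds 41 = 36 + 4 + 1 = 2^2*3^2 + 2^2 + 1. The function name is an assumption for this sketch.

    #include <cstdio>

    // Greedy double-base (bases 2 and 3) representation: repeatedly subtract the
    // largest term of the form 2^a * 3^b that still fits. This typically produces
    // relatively few terms, which is the appeal of DBNS for reducing arithmetic.
    void dbns_greedy(long long n) {
        while (n > 0) {
            long long best = 1;
            for (long long p3 = 1; p3 <= n; p3 *= 3) {        // powers of 3
                for (long long t = p3; t <= n; t *= 2) {      // times powers of 2
                    if (t > best) best = t;
                }
            }
            printf("%lld ", best);    // one 2^a * 3^b term of the representation
            n -= best;
        }
        printf("\n");
    }

    int main() { dbns_greedy(41); return 0; }   // prints: 36 4 1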

Research papers on the DBNS or MBNS:

Hybrid Number Systems

As if the DBNS/MBNS systems were not enough, there are various "hybrid" number systems that combine features from two or more different number systems. A related idea is a "dynamic number system", where the model changes aspects or parameters of its number system on the fly during inference or training.

Papers on hybrid number systems include:

  • G Alsuhli, V Sakellariou, H Saleh, M Al-Qutayri, Number Systems for Deep Neural Network Architectures: A Survey, 2023, https://arxiv.org/abs/2307.05035 (This survey paper has a section on hybrid number systems and dynamic number systems.)
