Aussie AI

Logarithmic LLMs

  • Last Updated 3 November, 2024
  • by David Spuler, Ph.D.

Logarithms use high-school mathematics to change multiplications into additions. And this is an interesting idea, given all the shade that's been thrown at the expensive multiplication operations in neural networks, even with all the hardware acceleration going on.

A few people have thought of this already, and the amount of literature is beyond extensive. There was much theory in the 1970s and 1980s, and some real hardware implementations in the 1990s and 2000s. There's also been a lot of more recent research in this area in the 2010s and 2020s.

Why Logarithms?

The basic idea is that logarithms have this property:

        log (x * y) = log(x) + log(y)

Unfortunately, the same is not true for addition:

        log (x + y) != log(x) + log(y)
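
As a quick illustration, here is a minimal C++ sketch (illustrative only, not from any particular library) of using the first identity to compute a product via an addition of logarithms:

    #include <cmath>
    #include <cstdio>

    int main(void) {
        double x = 3.0, y = 7.0;

        // Normal linear-domain multiplication
        double product1 = x * y;

        // Log-domain version: the multiply becomes an add,
        // and exp() converts the result back to the linear domain.
        double product2 = std::exp(std::log(x) + std::log(y));

        printf("%f %f\n", product1, product2);   // both print 21.000000
        return 0;
    }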

Types of AI Logarithmic Algorithms

Some of the ways in which the use of logarithms has been researched in relation to neural networks include:

  • Logarithmic bitshift quantization using powers-of-two (see quantization)
  • Dyadic number quantization (see dyadic quantization)
  • Multiplication approximations using logs (see advanced math)
  • Logarithmic number system-based models (see LNS section below)
  • Logarithmic arbitrary-base quantization (see quantization)

Logarithmic Number System (LNS)

The LNS is a numeric representation that stores numbers as their logarithms, rather than merely applying standard logarithm identities to ordinary arithmetic. It has been considered for use with neural networks as far back as 1997. In the LNS, multiplication changes to a fast additive operation, but addition itself becomes expensive. Thus, the computation of vector dot products or multiply-add operations in machine learning is problematic, and various theoretical attempts to overcome the difficulties with addition have been researched in the literature.
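
To make the trade-off concrete, here is a minimal C++ sketch of LNS-style arithmetic (illustrative only; real LNS implementations typically use fixed-point log values and hardware-friendly approximations):

    #include <cmath>

    // A value stored in the log-domain: v = log2(x), for x > 0.
    typedef double LnsValue;

    // LNS multiplication: log2(x*y) = log2(x) + log2(y), a cheap addition.
    LnsValue lns_mul(LnsValue a, LnsValue b) {
        return a + b;
    }

    // LNS addition: log2(x+y) has no simple identity, so it requires
    // converting out of the log-domain (exp2), adding in the linear
    // domain, and converting back (log2) -- this is the expensive step.
    LnsValue lns_add(LnsValue a, LnsValue b) {
        return std::log2(std::exp2(a) + std::exp2(b));
    }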

LNS is not the only unusual theoretical number system available. Beyond the familiar floating-point and fixed-point representations, LNS can be compared with other non-standard number systems considered for machine learning, including the Residue Number System (RNS), the Posit number system, and dyadic numbers (see advanced mathematics).

Logarithmic Models

A pure logarithmic model is one that maintains all of its calculations in the Logarithmic Number System. Alsuhli et al. (2023) refers to this approach as an "end-to-end" LNS model, meaning that all calculations are performed in the "log-domain" (i.e. working on logarithms of values, rather than the original values). The idea is basically to change a multiplication by a weight into an addition, and any division into a subtraction. Instead of the weights themselves, the logarithms of the weights are stored and used throughout the layers. Intermediate computations, such as embeddings or probabilities, should also be stored as logarithmic values, so that both sides of a MatMul are in the log-domain, allowing addition to be used instead of multiplication. This requires adjustments to other Transformer architectural components, such as normalization and Softmax. Theoretically, it should be workable once everything is changed to the log-domain. However, practical problems arise because MatMul and vector dot product also require addition operations (after the multiplications), and LNS addition is slow: log-domain addition isn't normal addition, and cannot be easily hardware-accelerated.
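
The bottleneck is easy to see in a vector dot product: every multiplication becomes a cheap addition, but every accumulation step is a slow log-domain addition. Here is a rough sketch, under the simplifying assumption that all weights and activations are positive and already stored as base-2 logarithms:

    #include <cmath>
    #include <vector>

    // Log-domain dot product: assumes all weights and activations are
    // positive and already stored as log2 values (log-domain).
    double lns_dot_product(const std::vector<double>& log_weights,
                           const std::vector<double>& log_activations) {
        // Start from "log of zero"; -infinity is the additive identity here.
        double log_sum = -INFINITY;
        for (size_t i = 0; i < log_weights.size(); ++i) {
            // Multiplication is cheap: just an addition of logarithms.
            double log_product = log_weights[i] + log_activations[i];
            // Accumulation is the hard part: log-domain addition needs
            // exponentials (or an approximation such as a lookup table).
            log_sum = std::log2(std::exp2(log_sum) + std::exp2(log_product));
        }
        return log_sum;  // log2 of the dot product
    }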

Logarithmic weight arithmetic differs from normal weight multiplication. For weights greater than 1, the logarithm is positive and an addition occurs; for weights between 0 and 1, which are effectively a division, the logarithm is negative and a subtraction is used (or, equivalently, addition of a negative value). If the weight is exactly 1, the logarithm is exactly 0, and adding 0 is as harmless as multiplying by 1. The logarithm itself could be represented as either an integer or a floating-point number.
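
For example, using base-2 logarithms:

        log2(4.0)  = +2     (multiply by 4 becomes "add 2" in the log-domain)
        log2(0.25) = -2     (multiply by 0.25 becomes "subtract 2")
        log2(1.0)  =  0     (multiply by 1 becomes "add 0")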

Several problems need to be overcome to use LNS for models, including the cost of addition and the handling of zero and negative numbers. Addition and subtraction are slow and problematic in LNS-based systems, so they must be approximated or accelerated in various ways. It seems ironic to need to accelerate addition, since the whole point of using LNS is to accelerate multiplication by changing it into addition! But these are two different types of addition: the original linear-domain multiplication changes to normal fast addition, but the original addition needs to change to log-domain addition, which is hard.

Zero weights must be handled separately, since the logarithm of zero is not defined (it tends to negative infinity). This requires a test for zero as part of the logic, or an algorithmic method to avoid zero values (e.g. using an extra bit flag to represent zero-ness). Alternatively, a hardware version of LNS would need to handle zero reasonably.

Negative numbers are also problematic in the LNS, and models usually have both positive and negative weights. Since the logarithm of a negative number is undefined, the logarithm of the absolute value of the weight must be stored, with an alternative method (e.g. a sign bit) used to mark negative weights, so that the engine knows to treat that weight's contribution as a subtraction rather than an addition in the LNS arithmetic. Alternatively, weights might be scaled so they are all positive, to avoid the log-of-negatives problem.
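
One simple software encoding that handles both the zero and negative-number issues is sign-magnitude with an explicit zero flag. This is a hypothetical sketch, not a standard hardware format:

    #include <cmath>

    // Sign-magnitude LNS encoding: a hypothetical software sketch.
    struct LnsWeight {
        float log_mag;    // log2(|w|), only meaningful when is_zero is false
        bool  is_neg;     // true if the original weight was negative
        bool  is_zero;    // true if the weight is exactly zero (log2(0) undefined)
    };

    LnsWeight lns_encode(float w) {
        LnsWeight r;
        r.is_zero = (w == 0.0f);
        r.is_neg = (w < 0.0f);
        r.log_mag = r.is_zero ? 0.0f : std::log2(std::fabs(w));
        return r;
    }

    // Log-domain multiply: add the log magnitudes, XOR the signs,
    // and propagate zero (anything times zero is zero).
    LnsWeight lns_multiply(const LnsWeight& a, const LnsWeight& b) {
        LnsWeight r;
        r.is_zero = a.is_zero || b.is_zero;
        r.is_neg = (a.is_neg != b.is_neg);
        r.log_mag = r.is_zero ? 0.0f : (a.log_mag + b.log_mag);
        return r;
    }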

Does it work? Logarithmic numbers haven't become widely used in AI models, possibly because vector dot products and matrix multiplications require not just multiplication, but the addition of the multiplication results, and addition is difficult in LNS (usually approximated). Because LNS arithmetic is approximate, both training and inference typically need to be performed in LNS. Conversion back-and-forth between LNS and floating-point weights and probabilities also adds some overhead (in both training and inference), and possibly some extra inaccuracy for inference. These issues might limit the model's accuracy compared to non-logarithmic floating point.

Furthermore, an LNS model stores the logarithms of weights as floating point numbers, and thus requires floating point addition rather than integer addition. The gain from changing floating point multiplication to floating point addition is nowhere near as large as changing it to integer arithmetic operations (e.g. as used in logarithmic quantization or integer-only quantization methods). Indeed, paradoxically, there are even circumstances where floating point addition is worse than floating point multiplication, because addition requires sequential non-parallelizable sub-operations, but this depends on the hardware acceleration and the exact representation of floating point numbers used.

Another concern is that some papers report that model inference is memory-bound rather than CPU-bound. In such cases, the conversion of arithmetic from multiplication to addition does not address the main bottleneck, and the LNS may have reduced benefit. The LNS does not allow the use of smaller data sizes, since it stores logarithms of weights and internal computations as floating-point, whereas quantization can use integers or smaller bit widths.

Some of the problematic issues with additions involving weights and activation functions, and in relation to training with LNS weights, are described in Alsuhli et al. (2023). These concerns limit the use of LNS numbers in an end-to-end method, and suggest the use of alternatives such as approximate logarithmic multiplication or logarithm-antilogarithm multiplications (Alsuhli et al., 2023). Nevertheless, there are several attempts in the literature to use LNS for model training and inference in various ways, starting with Arnold et al. (1991), using theory dating back to the 1980s.

End-to-End Logarithmic Model Research

Papers on the "end-to-end" use of logarithmic weights are below. Some papers revert to linear-domain addition to resolve the problem of slow accumulation, whereas others use various approximations.

  • D. Miyashita, E. H. Lee, and B. Murmann, “Convolutional neural networks using logarithmic data representation,” arXiv preprint arXiv:1603.01025, 2016. https://arxiv.org/abs/1603.01025 (A major paper on using log-domain weights and activations, using addition of log-domain values instead of multiplication, which also covers the difficulties with accumulation.)
  • G. Alsuhli, V. Sakellariou, H. Saleh, M. Al-Qutayri, Number Systems for Deep Neural Network Architectures: A Survey, 2023, https://arxiv.org/abs/2307.05035 (Extensive survey paper with a deep dive into the theory of LNS and other systems such as Residue Number System and Posit numbers, with application to neural networks. Also covers LNS usage with activation functions and Softmax.)
  • Saeedeh Jahanshahi, Amir Sabbagh Molahosseini & Azadeh Alsadat Emrani Zarandi, uLog: a software-based approximate logarithmic number system for computations on SIMD processors, 2023, Journal of Supercomputing 79, pages 1750–1783 (2023), https://link.springer.com/article/10.1007/s11227-022-04713-y (Paper licensed under CC-BY-4.0, unchanged: http://creativecommons.org/licenses/by/4.0/)
  • A. Sanyal, P. A. Beerel, and K. M. Chugg, 2020, “Neural network training with approximate logarithmic computations,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 3122–3126. https://arxiv.org/abs/1910.09876 (End-to-end LNS model for both training and inference. Converts "leaky-ReLU" activation function and Softmax to log-domain.)
  • J Zhao, S Dai, R Venkatesan, B Zimmer, 2022, LNS-Madam: Low-precision training in logarithmic number system using multiplicative weight update, IEEE Transactions on Computers, Vol. 71, No. 12, Dec 2022, https://ieeexplore.ieee.org/abstract/document/9900267/, PDF: https://ieeexplore.ieee.org/iel7/12/4358213/09900267.pdf (LNS in training of models. Uses different logarithm bases, including fractional powers of two, and LNS addition via table lookups.)
  • E. H. Lee, D. Miyashita, E. Chai, B. Murmann, and S. S. Wong, “Lognet: Energy-efficient neural networks using logarithmic computation,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2017, pp. 5900–5904. https://ieeexplore.ieee.org/document/7953288 (Uses LNS multiplication in the log-domain, but still does accumulate/addition in the linear-domain.)
  • Maxime Christ, Florent de Dinechin, Frédéric Pétrot, 2022, Low-precision logarithmic arithmetic for neural network accelerators, 33rd IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP 2022), IEEE, Jul 2022, Gothenburg, Sweden. doi: 10.1109/ASAP54787.2022.00021, hal-03684585, https://ieeexplore.ieee.org/abstract/document/9912091/, PDF: https://inria.hal.science/hal-03684585/document (Use of LNS in model inference, with coverage of dropping the sign bit and handling of zeros.)
  • J. Johnson, “Rethinking floating point for deep learning,” arXiv preprint arXiv:1811.01721, 2018, https://arxiv.org/abs/1811.01721 (Uses an end-to-end LNS version called "exact log-linear multiply-add (ELMA)" which is a "hybrid log multiply/linear add" method. Uses a Kulisch accumulator for addition.)
  • David Spuler, March 2024, Chapter 52. Logarithmic Models, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9

LNS Addition and Subtraction Theory

Log-domain addition and subtraction are problematic and require Gaussian logarithm functions to compute. Various papers cover different approximations and methods, such as Look-Up Tables (LUTs), Taylor series, interpolation, and co-transformations.

There are several other areas of theory that are relevant to LNS addition. Because LNS addition involves computing exponentials of log-domain values (i.e. antilogarithms), adding them, and then re-converting the result to the log-domain, it is a "log of a sum of exponentials" calculation, which is the same calculation that underlies "log-sum-exp networks". Also, the "sum of exponentials" is the same calculation required for part of Softmax (the denominator), so the theory of Softmax approximation is relevant. Finally, since using the maximum function is one way to approximate LNS addition, the theory of "max-plus networks" based on "tropical algebra" is also relevant.
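
As a concrete illustration, here is a sketch (illustrative only) of exact log-domain addition via the Gaussian logarithm identity, alongside the crude max-plus approximation mentioned above; practical implementations approximate the correction term with LUTs or interpolation:

    #include <cmath>
    #include <algorithm>

    // Exact log-domain addition ("Gaussian logarithm"):
    // log2(x + y) = a + log2(1 + 2^(b - a)), where a = max and b = min
    // of the two log-domain values. The correction term is what LUTs,
    // Taylor series, and interpolation schemes approximate.
    double logadd_exact(double a, double b) {
        double hi = std::max(a, b);
        double lo = std::min(a, b);
        return hi + std::log2(1.0 + std::exp2(lo - hi));
    }

    // Max-plus approximation: drop the correction term entirely.
    // Cheap, but only accurate when the two values differ by a lot.
    double logadd_max(double a, double b) {
        return std::max(a, b);
    }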

Papers on LNS addition and subtraction:

LNS in AI Models (and other Applications)

Other papers on the use of LNS in machine learning applications include:

LNS Hardware Acceleration

Papers on the use of the LNS in hardware-accelerated implementations include:

LNS Mathematical and Algorithmic Theory

Papers on the mathematical basis of the Logarithmic Number System (LNS) and its applied algorithms in theory include:

Logarithmic Algebra

Papers looking at the mathematical theory of logarithms.

LNS Extensions

If you scare easily, you might want to look away... but there's an extension of the LNS called the "Multi-Dimensional Logarithmic Number System" (MDLNS). Its theory is based on the "Multiple-Base Number System" (MBNS). MDLNS and MBNS have both found some applications in digital signal processing. Some papers include:

Some Other Weird Non-Multiplication Alternatives

The use of logarithms is not the only way that researchers have considered to get rid of all those multiplication computations. The main attempts involve either addition, bitwise shifting, or both, but there are obscure attempts using max/min and even trigonometric functions.
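
For example, with logarithmic (power-of-two) quantization, multiplying an integer activation by a weight stored as its base-2 exponent reduces to a bitshift. A minimal sketch, hypothetical and ignoring signs and scaling:

    // Power-of-two weight quantization: the weight 2^k is stored as the
    // small integer exponent k, so multiplying an integer activation by
    // the weight is just a left bitshift (no multiplication needed).
    int shift_multiply(int activation, int weight_exponent) {
        return activation << weight_exponent;   // activation * 2^weight_exponent
    }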

Here are some other non-multiplication research areas:

More AI Research

Read more about: