Aussie AI

Zero Multiplication Inference Algorithms

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

Multiplication causes a lot of trouble. It's slower than addition or bitshifting, and AI models need to calculate the times tables lots of times (literally billions). That adds up to a lot of CPU and GPU time spent doing the same thing.

So why not do zero multiplication instead?

Types of Zero-Multiplication Models

It turns out that there are several ways to get rid of multiplication in LLM inference.

Newest Zero-Multiplication Models

Most of these models remain in the research labs, and are not widely used in industry. Even binary quantization and ternary quantization are not often implemented commercially because of accuracy loss.

Hadamard models. The latest research to get some attention is element-wise multiplication models, which use the "Hadamard product" of two matrices. See Hadamard multiplication models.

Low Bit Quantization for Zero-Multiplication Networks

Firstly, an interesting point is that quantization with a low number of bits can achieve zero-multiplication inference.

Binary quantization: 1-bit binary quantization replaces multiplication with addition or with sign-flips. If the weights are only 1 or 0, then "multiplication" by 1 is an addition, and multiplication by zero becomes a null-test. If the weights are +1 and -1, then it's a sign test followed by an addition or a subtraction, or simply a sign-flip. Oftentimes, these are optimized with bit arithmetic, since binary quantization is 1-bit quantization. Binary quantization is very fast, but has well-known problems with model accuracy. Read more about binary quantization.
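
For instance, here is a minimal sketch (illustrative only; the bit-packing layout and sign convention are assumptions, not from any particular paper) of a dot product with +1/-1 binary weights packed one per bit, where every "multiplication" becomes an addition or a subtraction:

    // Sketch: dot product with 1-bit weights (+1 or -1), packed one weight per bit.
    float binary_weight_vecdot(const float v[], const unsigned char wbits[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            int bit = (wbits[i >> 3] >> (i & 7)) & 1;  // Extract the i-th weight bit
            if (bit) sum += v[i];   // Weight +1: addition
            else     sum -= v[i];   // Weight -1: subtraction (sign-flip)
        }
        return sum;
    }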

Ternary quantization: Similarly, ternary quantization with weights -1, 0, and 1 can be implemented with a sign test, a null test, and additions or subtractions. However, ternary quantization also has problems with model accuracy. Read more about ternary quantization.
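
A similar sketch for ternary weights, stored here as signed bytes for clarity (a packed 2-bit encoding would be more likely in practice):

    // Sketch: dot product with ternary weights in {-1, 0, +1} stored as signed bytes.
    float ternary_weight_vecdot(const float v[], const signed char w[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            if (w[i] > 0)      sum += v[i];  // Weight +1: addition
            else if (w[i] < 0) sum -= v[i];  // Weight -1: subtraction
            // Weight 0: null test, nothing to add (zero skipping)
        }
        return sum;
    }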

2-bit quantization: The 4 possible weights could be implemented by one or two additions, instead of multiplication. This type of 2-bit quantization does not receive as much attention in the literature.
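
As an illustration only (the weight set {-2, -1, +1, +2} is an assumption), a 2-bit weight "multiply" can be handled with at most one extra addition per weight:

    // Sketch: 2-bit weight codes mapped to {-2, -1, +1, +2}; no multiplication.
    float two_bit_weight_vecdot(const float v[], const unsigned char w2[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            switch (w2[i] & 3) {                       // 2-bit weight code
                case 0: sum -= v[i] + v[i]; break;     // Weight -2
                case 1: sum -= v[i];        break;     // Weight -1
                case 2: sum += v[i];        break;     // Weight +1
                case 3: sum += v[i] + v[i]; break;     // Weight +2
            }
        }
        return sum;
    }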

Adder Neural Networks (using Addition-based Metrics)

If multiplication is so bad, can't we just use addition? Yes, we sure can. Cue the "adder" neural networks.

But this is not the same as "additive models" (or "additive neural networks"), a term that is often used in the literature to mean something other than arithmetic addition. Generalized Additive Neural Networks (GANNs) are a different concept.

So, can we change the multiplication operation generically to addition without quantization? I mean, we can change the matrix multiplication C++ code from "*" to "+" and we're done, right? Unsurprisingly, it's not a new idea to build a "dot product-like operation" using addition and subtraction. The earliest replacement of multiplication with addition seems to be Ritter and Sussner (1996).
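
For a flavor of what an addition-based metric looks like, here is a rough sketch (not tied to any specific paper's exact formulation) that replaces the dot product with a negative sum of absolute differences, using only subtraction and addition:

    // Sketch: "dot product-like" similarity using only addition and subtraction,
    // in the style of adder networks (negative sum of absolute differences).
    float adder_style_similarity(const float v[], const float w[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            float diff = v[i] - w[i];              // Subtraction replaces multiplication
            sum -= (diff < 0.0f) ? -diff : diff;   // Accumulate negative absolute difference
        }
        return sum;  // Larger (less negative) means the vectors are more similar
    }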

There are various inference methods that use addition, and a few papers:

Approximate Multiplier Zero-Multiplication Networks

Approximate multiplication algorithms can be used to avoid full multiplications. There is extensive literature on various approximations to multiplication; see Approximate multiplication.

A few of the approximate multiplication papers specific to zero-multiplication neural networks include:

  • Min Soo Kim, Alberto Antonio Del Barrio Garcia, Hyunjin Kim, and Nader Bagherzadeh, The effects of approximate multiplication on convolutional neural networks, July 2020, IEEE Transactions on Emerging Topics, https://arxiv.org/abs/2007.10500
  • M. S. Kim, A. A. Del Barrio, L. T. Oliveira, R. Hermida, and N. Bagherzadeh, “Efficient Mitchell’s approximate log multipliers for convolutional neural networks,” IEEE Transactions on Computers, vol. 68, no. 5, pp. 660–675, 2018, https://ieeexplore.ieee.org/document/8532287
  • M. S. Ansari, V. Mrazek, B. F. Cockburn, L. Sekanina, Z. Vasicek, and J. Han, “Improving the accuracy and hardware efficiency of neural networks using approximate multipliers,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 2, pp. 317–328, 2019, https://ieeexplore.ieee.org/document/8863138
  • V. Mrazek, Z. Vasicek, L. Sekanina, M. A. Hanif, and M. Shafique, “Alwann: Automatic layer-wise approximation of deep neural network accelerators without retraining,” in 2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 2019, pp. 1–8, https://arxiv.org/abs/1907.07229, Code: https://github.com/ehw-fit/tf-approximate
  • V. Mrazek, S. S. Sarwar, L. Sekanina, Z. Vasicek, and K. Roy, “Design of power-efficient approximate multipliers for approximate artificial neural networks,” in 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2016, pp. 1–7, https://ieeexplore.ieee.org/document/7827658
  • S. S. Sarwar, S. Venkataramani, A. Raghunathan, and K. Roy. Multiplier-less artificial neurons exploiting error resiliency for energy-efficient neural computing. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 145–150. IEEE, 2016, https://arxiv.org/abs/1602.08557 (Uses an approximate multiplier.)
  • MINT: Multiplier-less Integer Quantization for Spiking Neural Networks, R Yin, Y Li, A Moitra, P Panda, Sep 2023, https://arxiv.org/abs/2305.09850

Shift-Add Networks

Multiplication can be simulated via bitshifts and addition. See also Logarithmic quantization for power-of-two multiplications via bitshifts.
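
The basic trick, sketched below with an assumed two-term weight encoding, is to quantize each weight to a sum of one or two signed powers of two, so that each "multiplication" becomes bitshifts and an addition on integer activations. For example, a weight of 6 = 2^2 + 2^1 turns w*x into (x<<2) + (x<<1):

    // Sketch: shift-add "multiply" for a weight quantized as w = (+/-)(2^s1 + 2^s2).
    // Assumes a non-negative integer activation x (e.g. after unsigned quantization).
    int shift_add_multiply(int x, int s1, int s2, int negative)
    {
        int product = (x << s1) + (x << s2);   // Two bitshifts and one addition
        return negative ? -product : product;  // Negative weights handled by a sign-flip
    }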

Papers on "shift-add networks" using a combination of bitshift and addition for zero-multiplication:

Add-as-Integer Approximate Multiplication Networks

This method uses an approximate multiplication that is implemented via integer addition. It is a very weird idea and it seems almost magical that it works. The basic trick is to pretend that each 32-bit floating-point number (with its 1 sign bit, 8 exponent bits, and 23 mantissa bits) is actually a signed 32-bit integer, and add them together. This doesn't do full multiplication, but it does an approximation called Mitchell's approximate multiplication.

Example: Add-as-Int Mogami Approximate Multiplication: The method uses C++ casts to trick the compiler into treating the floats as if they were ints. It then needs to subtract an offset to correct for the exponent bias being counted twice. Let's say we want to try optimizing a basic float multiply:

     float fc = f1 * f2;   // Floating-point multiply

This is slow, so we want to try the Mogami (2020) idea to change it into addition instead. Note that fancy coding is required. A simple version doesn't work:

     int c = (int)f1 + (int)f2;  // Not multiplication!
     float fc = (float)c;

That code isn't tricking the compiler and it isn't doing multiplication at all. It does a full conversion from float to int, with all that entails, and this is nothing like floating point multiplication.

Instead, type casting is required. Assuming that both int and float are 32-bit types, a coded version in C++ looks like:

     int c = *(int*)&(f1) + *(int*)&(f2) - 0x3f800000;  // Mogami(2020)
     float fc = *(float*)&c;

How does this even work? I mean, it seems like hocus pocus. The effect is that integer addition on the 8-bit exponent is like doing a multiplication (because exponent bits are the powers). Adding the 23 mantissa bits together isn't really the same, it's not doing multiplication, but it's close enough that it's doing an approximate version of multiplication. Some of the theory of why this works is examined in Kosson & Jaggi (2022). Overall, it seems to work like multiplication on both positive and negative floating point, but faster because it's using integer addition. The accuracy of the multiplication is such that the difference from regular float multiplication (i.e. the error) is less than 15%. In my testing it seemed like it was usually less than 12%, so it's a very good approximation of multiplication, for a significant speedup in arithmetic calculations.
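
As a quick worked example that can be checked by hand, multiplying 2.0 by 3.0 with the add-as-integer trick happens to come out exact, because one of the fractional mantissas is zero:

     // Worked example: 2.0f * 3.0f via add-as-integer
     //   2.0f has bit pattern 0x40000000, and 3.0f is 0x40400000.
     //   0x40000000 + 0x40400000 - 0x3F800000 = 0x40C00000
     //   0x40C00000 re-interpreted as a float is exactly 6.0f.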

Note that the temporary integer variable is hard to get rid of in C++, and might require assembler instead. The "+" operator puts the 32-bit integer into a C++ register, but I can't find a way to re-interpret that temporary int value as a 32-bit float without first storing it to a temporary variable. A simple typecast to float doesn't work in C++:

     float fc = (float) ( *(int*)&(f1) + *(int*)&(f2) - 0x3f800000 );  // Fails...

The above doesn't work because the integer is converted by the float typecast, which is very different from re-interpreting the 32-bit temporary integer as a 32-bit float. In fact, the code above is really just a bug, as I discovered myself. It doesn't really compute anything very meaningful, not even approximately.
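
As a side note, the pointer casts above technically violate the C++ strict-aliasing rules, even though they work on mainstream compilers. A standards-friendly sketch of the same bit re-interpretation uses memcpy (which compilers typically optimize down to a simple register move), again assuming 32-bit int and float:

    #include <string.h>  // for memcpy

    // Sketch: Mogami-style add-as-integer multiply using memcpy for the type punning.
    float add_as_int_multiply(float f1, float f2)
    {
        int i1, i2;
        memcpy(&i1, &f1, sizeof(i1));    // Re-interpret the float bits as an int
        memcpy(&i2, &f2, sizeof(i2));
        int c = i1 + i2 - 0x3f800000;    // Mogami (2020) integer addition and offset
        float fc;
        memcpy(&fc, &c, sizeof(fc));     // Re-interpret the int bits back as a float
        return fc;
    }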

Example: Add-as-Integer Vector Dot Product: Here's what it looks like to put Mogami's method into a vector dot product to create an approximate version (but faster):

    float yapi_vecdot_add_as_int_mogami(float v1[], float v2[], int n)   // Add as integer
    {
	float sum = 0.0;
	for (int i = 0; i < n; i++) {
		int c = *(int*)&(v1[i]) + *(int*)&(v2[i]) - 0x3f800000;  // Mogami(2020)
		sum += *(float*)&c;
	}
	return sum;
    }

This is not a fully optimized version. For example, the iterator variable i should be removed via pointer arithmetic.

Why does Mogami's add-as-integer idea even work? There are a few points that help to explain why it is a close approximation:

  • Implicit leading-1 in the mantissa bits. Because the stored mantissa bits do not include the implicit leading 1, the integer addition is not adding those leading-1 bits together. Instead, it is only adding the lower-order bits of the mantissa, which helps explain why the error rate is low.
  • Fails for fixed-point representations. The Mogami add-as-integer approach is unlikely to be a useful approximation for fixed-point numbers. Note that multiplication of fixed-point numbers is already implemented as integer multiplication. Also note that fixed-point does not use an implicit leading-1 mantissa bit. The same comments apply to hybrid fixed-point approaches such as block floating-point, where the add-as-integer approximation is unlikely to be very accurate.
  • Adjacency of exponent and mantissa bits. With integer addition used on the mantissa bits, if the highest-order stored bits of both mantissas are set, then a 1 is carried up into the next bit position, which is the lowest exponent bit, since the exponent sits next to the mantissa in the IEEE 754 format. Hence, an extra 1 is added to the exponent in this case. Note that the highest-order stored mantissa bit is not actually the highest-order bit of the full significand, but the second-highest, because of the implicit leading-1 bit in both mantissas. So, adding the second-highest significand bits, where both are 1, may add 1 to the exponent (in addition to the two exponents being added together).

Research Papers on Add-as-Int Approximate Multiplication:

Max-Plus and Related Tropical Networks

The "max-plus" or "min-max-plus" networks use maximum or minimum operations, combined with addition, rather than multiplication. The theoretical basis of this idea is called "tropical algebra" which is a specialized mathematics consistenting of min/max and addition to define a pseudo-multiplication operation.

Some other areas of theory are related. Addition in the logarithmic number system (LNS) can be approximated with maximum and addition operations (like "max-plus"). Tropical algebra is also relevant to "log-sum-exp networks", which calculate the logarithm of a sum of exponentials; this is similar to LNS addition and can possibly be approximated in a similar way. Also, the Softmax function has a sum-of-exponentials denominator, so Softmax approximation is similar to LNS addition and could draw on theory from max-plus networks and tropical algebra.

Papers on max-plus neural networks, and others based on max or min operations, include:

Morphological Networks

Another type of neural network that uses max operations is called the "morphological network". This uses maximum, addition, and subtraction operations.

Other Addition-Related Zero-Multiplication Networks

These are papers that use addition to attain zero-multiplication, but not the specific techniques above.

Table Lookups Replace Multiplication

Table lookups of precomputed data are a well-known code optimization that has been applied to inference optimization. Pre-computation can be effective for low-bit arithmetic or for approximating the value of non-linear functions that are computationally expensive to evaluate exactly.

One partial method to remove multiplication is to use table lookups instead. This would seem to remove multiplication, although there is actually a hidden multiplication in the array index calculations of a table lookup, though it is hopefully handled efficiently by the compiler (probably as a bitshift).
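
For instance, with 8-bit quantized weights and activations, all 65,536 possible products can be precomputed once, and inference then does lookups instead of multiplications. A minimal sketch, where the offset-binary encoding and table layout are assumptions for illustration:

    // Sketch: precomputed 256x256 product table for 8-bit quantized values.
    static int product_table[256][256];   // Filled once at startup

    void init_product_table(void)
    {
        for (int a = 0; a < 256; a++)
            for (int b = 0; b < 256; b++)
                product_table[a][b] = (a - 128) * (b - 128);  // Signed 8-bit products
    }

    int lookup_multiply(unsigned char a, unsigned char b)
    {
        return product_table[a][b];   // No multiplication at inference time
    }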

  • Zhou, A.; Yao, A.; Guo, Y.; Xu, L.; and Chen, Y., 2017, Incremental network quantization: Towards lossless CNNs with low-precision weight, arXiv preprint arXiv:1702.03044, https://arxiv.org/abs/1702.03044 (bitshifting)
  • S Fanning, Fixed Point Multiplication-Free Implementation of Deep Neural Networks for Embedded Systems, Masters Thesis, School of Electrical and Electronic Engineering, University College Dublin 2018, https://seanfanning.eu/posts/projects/low-bitwidth-neural-networks/Thesis_SeanFanning_13360951.pdf
  • Mohammad Samragh Razlighi; Mohsen Imani; Farinaz Koushanfar; Tajana Rosing LookNN: Neural network with no multiplication, Design, Automation & Test in Europe Conference & Exhibition (DATE), 27-31 March 2017, https://ieeexplore.ieee.org/document/7927280 (Lookup-table based multiplication.)
  • Daniel Gerlinghoff, Benjamin Chen Ming Choong, Rick Siow Mong Goh, Weng-Fai Wong, Tao Luo, 18 Mar 2024, Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic, https://arxiv.org/abs/2403.11414 (Replacing Multiply-Accumulate MAC on hardware with lookup-tables for low-bit quantization.)

Multiplication-Free Neural Networks

There are zero-multiplication models that use some other arithmetic method instead of addition. In the literature, these algorithms are often called "zero multiplication" or "multiplication-free" algorithms, such as Multiplication-Free Neural Networks (MFNNs). Binary quantization and ternary quantization are not the only options. Papers on inference without any multiplication operations:

Diff-Squared Networks

Squaring the difference between two numbers is well-known in Euclidean distance calculations and statistical variance. This idea has been applied to neural networks as "diff-squared networks". Some methods cited by other papers as "multiplication-free" model research compute a difference (subtraction), but then square it, which is technically still multiplication, but who's counting? However, it isn't using multiplication by weights, so it's a distinct method.
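
A rough sketch of the idea (in the spirit of the EuclidNets papers below, not their exact formulation) replaces the dot product with a negative squared Euclidean distance:

    // Sketch: "diff-squared" similarity: negative squared Euclidean distance.
    // Note that the square is still one multiplication per element (diff * diff),
    // but there is no multiplication by the weights themselves.
    float diff_squared_similarity(const float v[], const float w[], int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++) {
            float diff = v[i] - w[i];   // Subtraction instead of weight multiplication
            sum -= diff * diff;         // Squared difference
        }
        return sum;  // Closer vectors give a larger (less negative) score
    }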

Research papers on squared differences and neural networks:

  • Xinlin Li, Mariana Parazeres, Adam Oberman, Alireza Ghaffari, Masoud Asgharian & Vahid Partovi Nia, EuclidNets: An Alternative Operation for Efficient Inference of Deep Learning Models, SN Computer Science, volume 4, 2023, https://link.springer.com/article/10.1007/s42979-023-01921-y (This uses the square of the difference, which is really still multiplication.)
  • Xinlin Li, Mariana Parazeres, Adam Oberman, Alireza Ghaffari, Masoud Asgharian, Vahid Partovi Nia, EuclidNets: An Alternative Operation for Efficient Inference of Deep Learning Models Dec 2022, https://arxiv.org/abs/2212.11803 (uses squares and Euclidean distances as weights)
  • S. Fan, L. Liu, and Y. Luo. An alternative practice of tropical convolution to traditional convolutional neural networks. In 2021 The 5th International Conference on Compute and Data Analysis, pages 162–168, 2021, https://arxiv.org/abs/2103.02096 (Tropical arithmetic)
  • Y. Luo and S. Fan. Min-max-plus neural networks. arXiv preprint arXiv:2102.06358, 2021, https://arxiv.org/abs/2102.06358 (Tropical arithmetic)
  • Robert Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B (Methodological), Vol. 58, No. 1 (1996), pp. 267-288, https://www.jstor.org/stable/2346178 (Low-level mathematical paper from 1996 about the additive-squares method.)
  • David Spuler, March 2024, Chapter 51. Zero-Multiplication Models, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9

Bitwise Operators for Inference

Instead of addition, could any of the bitwise operations be used to replace multiplication? Yes, and there are various possibilities. Some examples are below; see also bitwise operators for inference efficiency.

Bitshift operators: The bitwise shift operators are good candidates for replacing multiplication (or division) by a power-of-2 integer, as discussed under power-of-two quantization (logarithmic quantization). This is a well-known optimization and has considerable research.
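
For example, if a weight has been logarithmically quantized to a signed power of two, the "multiplication" reduces to a single shift (a sketch, assuming non-negative integer activations):

    // Sketch: power-of-two quantized weight w = (+/-)2^e, multiplied via one bitshift.
    int pow2_weight_multiply(int x, int e, int negative)
    {
        int product = x << e;                  // Multiply by 2^e
        return negative ? -product : product;  // Sign handled by negation
    }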

Bitwise-or is a possible candidate that doesn't seem to have been considered in the research. When applied to positive integers, bitwise-or has some properties similar to addition: its result is less than or equal to the addition result, but greater than or equal to either of the two operands. This assumes bitwise-or on unsigned weights, such as via integer quantization to positive weights, because bitwise-or on negative signed numbers has various oddities. As such, bitwise-or with quantization could be considered for some of the above algorithms that use addition to replace multiplication. The accumulated sum based on bitwise-or would increase slightly more slowly than with pure addition.

Bitwise-and is another candidate, similar to bitwise-or, in that its result is also less than or equal to the addition result; however, the result will also be less than or equal to either operand. This seems less desirable than bitwise-or, but it's all a bit unclear. Experiments are required.

Bitwise-xor seems too odd for realistic usage. It has rather strange mathematical properties. But, who knows.

The use of the bitwise operators (or/and/xor) with quantization for non-multiplication inference is an area that needs some exploration. No papers were found yet.

More Zero Multiplication Research

More general papers on zero-multiplication models:

Zero Skipping (Avoiding Multiplication by Zero)

See Zero skipping research.

Hadamard Product Matrix Multiplication Models

These models are not technically "zero multiplication" LLMs, but they offer a large reduction in multiplication arithmetic. The idea involves simple element-wise multiplication of the elements of two matrices, which is called the "Hadamard product", rather than a dot product computation for each element. Basic matrix multiplication is O(n^3) whereas Hadamard computations are O(n^2), so it's potentially an n-fold decrease for n x n matrices, and it's also a simpler algorithm that's more amenable to followup kernel optimizations like vectorization and kernel fusion.
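
A sketch of the element-wise computation, which uses one multiplication per output element (n^2 in total for n x n matrices), versus n multiplications per element (n^3 in total) for a standard matrix multiplication:

    // Sketch: Hadamard (element-wise) product of two n x n matrices: C[i] = A[i] * B[i].
    void hadamard_product(const float* A, const float* B, float* C, int n)
    {
        for (int i = 0; i < n * n; i++) {
            C[i] = A[i] * B[i];   // One multiply per element: O(n^2) total
        }
    }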

Research papers on Hadamard products in AI engines include:

More AI Research

Read more about: