Aussie AI
Bitwise Operations
-
Last Updated 18 November, 2024
-
by David Spuler, Ph.D.
Bitwise operations are a well-known coding trick that has been applied to neural network optimization. Bitwise shifts can replace multiplication and division by powers of two, and are faster. Other bitwise operators can also be used in various ways in inference algorithms.
C++ Bitwise Integer Operators
Here's a refresher on the C++ bitwise operators:
- x & y (binary bitwise-AND)
- x | y (binary bitwise-OR)
- x ^ y (binary bitwise-XOR)
- x << y (binary left bitshift)
- x >> y (binary right bitshift)
- ~x (unary bitwise-complement)
A few pitfalls in coding C++ bitwise operators should be mentioned:
- Integer-only: the C++ bitwise operators do not work on floating-point data types.
- Quiet overflow: if you do anything to overflow an integer type, nobody's going to tell you. For example, shifting the sign bit too far left with "1<<32" instead of "1<<31" will simply lose it. You might get a compiler warning, though.
- Two is not better than one. The "&" operator is bitwise, but "&&" is logical. Similarly, "|" and "||". It's the reverse for "<" and "<<" or ">" and ">>". Choose the wrong one and you might get a warning, if the stars are aligned and the wind is blowing easterly.
- Operator precedence is tricky and not what you'd expect (it's arguably broken, but rather too late to fix), so use lots of brackets in bitwise expressions, and don't ignore C++ compilation warnings.
- Bitwise operators are not always well-defined on negative values (e.g., before C++20, left-shifting a negative value is officially "undefined behavior" and right-shifting one is implementation-defined), so it's best to use "unsigned" types as operands to bitwise operators.
Testing Bit Flags
The main use of C++ bitwise operators is to use bit flags in integer variables, which is very efficient in both storage space and execution time. A basic 32-bit integer can store 32 bit flags, and a 64-bit integer type (such as "unsigned long long") can store 64.
Here are some example macros for examining the bits in a 32-bit int, which should be of "unsigned int" type:
// Bit Flags in Integers
#define YAPI_ONE_BIT_SET(x, bit)   (( ((unsigned)(x)) & ((unsigned)(bit))) != 0 )
#define YAPI_ANY_BITS_SET(x, bits) (( ((unsigned)(x)) & ((unsigned)(bits))) != 0 )
#define YAPI_ALL_BITS_SET(x, bits) (( ((unsigned)(x)) & ((unsigned)(bits))) == ((unsigned)(bits)) )
#define YAPI_NO_BITS_SET(x, bits)  (( ((unsigned)(x)) & ((unsigned)(bits))) == 0 )
The corresponding macros to set and clear these bit flags are:
#define YAPI_SET_BITS(x, bits)   (( ((unsigned)(x)) | ((unsigned)(bits)) ))
#define YAPI_CLEAR_BITS(x, bits) (( ((unsigned)(x)) & (~((unsigned)(bits))) ))
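Note that these two macros compute and return the new flags value rather than modifying their argument, so the result must be assigned back. A typical usage looks like:

unsigned flags = 0;
flags = YAPI_SET_BITS(flags, 0x4u);       // set bit 2
if (YAPI_ONE_BIT_SET(flags, 0x4u)) { /* bit 2 is now set */ }
flags = YAPI_CLEAR_BITS(flags, 0x4u);     // clear bit 2 again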
The basic idea is that a 32-bit unsigned int stores 32 bit flags. Certain actions on the integer as a whole effectively process all 32 bits in parallel. For example, it is very fast to check whether any bits are set at all by testing whether the whole integer is zero. Example code with these ideas:
if (u != 0) { /* At least one flag set */ }
u3 = u2 & u1;   // Intersection of the 32-bit sets (Bitwise-AND)
u3 = u2 | u1;   // Union of the 32-bit sets (Bitwise-OR)
u3 = u2 ^ u1;   // Toggle bits of the 32-bit sets (Bitwise-XOR)
u3 = ~u1;       // Set complement
The total number of bits set out of 32 can be computed quickly as a "popcount" operation using intrinsic functions, such as "__popcnt" on Microsoft Visual Studio and "__builtin_popcount" for GCC (there are also 64-bit versions). On x86 architectures, popcount is a single assembler instruction (POPCNT), and is therefore very fast.
These macros assume type "unsigned int" with 32 bits, and therefore 32 distinct bit flags in a single integer variable. For more bits, a 64-bit type can be used: "unsigned long long" is at least 64 bits, and "unsigned long" is also 64 bits on many (but not all) platforms. The above macros would need their type casts changed to the 64-bit type rather than just "unsigned" for a 64-bit version. For even more bits, a data structure called a "bit vector" can be implemented as an array of unsigned integers, which generalizes the bit flag idea.
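As an illustration, here is a minimal bit vector sketch built over an array of 32-bit unsigned ints (the class name and interface here are hypothetical; std::bitset and std::vector<bool> are the standard off-the-shelf alternatives):

#include <cstddef>
#include <vector>

class BitVector {   // Minimal sketch of a bit vector over 32-bit words
    std::vector<unsigned> words_;
public:
    explicit BitVector(std::size_t nbits) : words_((nbits + 31u) / 32u, 0u) {}
    void set(std::size_t i)        { words_[i / 32u] |=  (1u << (i % 32u)); }
    void clear(std::size_t i)      { words_[i / 32u] &= ~(1u << (i % 32u)); }
    bool test(std::size_t i) const { return (words_[i / 32u] & (1u << (i % 32u))) != 0; }
};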
Missing Bitwise Operators: NAND, NOR, XNOR
Note that there's no simple operator for NOR, NAND or XNOR in C++. However, they can be simulated using two C++ bitwise operations: a binary bitwise operation followed by the "~" bitwise complement (one's complement) unary operator.
NAND(x,y) = ~(x & y)
NOR(x,y)  = ~(x | y)
XNOR(x,y) = ~(x ^ y)
So, you can just code this as fast C++ macros, right?
#define NAND(x,y) ~(x & y)   // Bug alert!
#define NOR(x,y)  ~(x | y)
#define XNOR(x,y) ~(x ^ y)
No, this is broken in about half a dozen ways. To write macros correctly, you need to ensure there are parentheses around the whole expression, and also around each parameter name, to avoid getting bitten by operator precedence problems in C++ macro expansion. And these versions don't work correctly if you pass in a non-unsigned integer.
Here are some example C++ macros that work for 32 bits:
#define YAPI_BITWISE_NAND(x,y) ( ~ ( ((unsigned int)(x)) & ((unsigned int)(y)) ) )
#define YAPI_BITWISE_NOR(x,y)  ( ~ ( ((unsigned int)(x)) | ((unsigned int)(y)) ) )
#define YAPI_BITWISE_XNOR(x,y) ( ~ ( ((unsigned int)(x)) ^ ((unsigned int)(y)) ) )
You could also declare them as "inline" functions if you prefer. Note that these macros have a lot of parentheses to avoid various insidious precedence errors, and they are also limited to 32-bit operations. For 64-bit, you'd need to create alternative versions using a 64-bit unsigned type.
These NAND/NOR/XNOR macros are convenient, but not very efficient since they perform two arithmetic operations. Single-operation versions are available in assembler if you really need them. And you might need them, since neural networks need these bitwise operations more than normal coding. For example, XNOR is needed as the vector dot product operator for binarized bit vectors, such as in binary quantization and also XNOR neural networks.
Note that there aren't any more weird bitwise operators that do different things on a pair of bits, because there are only 8 possible symmetric (commutative) functions of two bits. Three have C++ bitwise operators (AND/OR/XOR), three are the negated versions coded above (NAND/NOR/XNOR), and the last two are the trivial always-zero and always-one operations. (The full set of 16 two-input Boolean functions also includes non-commutative ones, such as "x & ~y".)
Useful Builtin Bitwise Intrinsic Functions
Integer Bitwise Functions: Some of the lesser-known but useful builtin functions for integer bitwise arithmetic are listed below. Most of these functions are for "int" or "unsigned int" (32-bit), but have companion versions for 64-bit types. There isn't usually a version for "short" 16-bit integers.
Count Leading Zeros (CLZ): Various functions count the leading zeros, or similarly, the offset of the first set bit. This means scanning the bits from left to right to find the most significant set bit. One application of the CLZ intrinsic is a fast way to compute a truncated log2 of an integer, or similarly, computing the highest power-of-two in a number.
- _BitScanReverse (Microsoft intrinsic <intrin.h>): Finds the most-significant bit in a 32-bit integer. There's also _BitScanReverse64.
- clz: Count leading zeros (various versions); also sometimes called "nlz" for "number leading zeros".
- __lzcnt: Leading zeros count in Microsoft Windows intrinsics, use <intrin.h> for Microsoft Visual Studio C++.
- __builtin_clz (count leading zeros): GCC function to count the number of leading prefix zeros in an unsigned integer. Helpful for calculating log2's.
- _CountLeadingZeros: Microsoft <intrin.h> ARM intrinsics.
- BSR: Bit Scan Reverse x86 assembler instruction.
- LZCNT: x86 instruction for leading-zero count, similar to BSR.
Count Trailing Zeros (CTZ): In contrast to the leading zero functions, these functions count the zeros on the right-hand side of an integer, below the least-significant set bit.
- _BitScanForward (Microsoft intrinsic <intrin.h>): Finds the least-significant set bit. The 64-bit version is _BitScanForward64.
- __builtin_ctz (count trailing zeros): GCC function counts zero bits on the right (least-significant bits).
- Brian Kernighan bit trick for the LSB: bitwise-AND of n and n-1 (i.e., in C++, "n & (n-1)" clears the lowest set bit, and the related "n & -n" isolates it).
- ffs/ffsl: Find first set (least-significant bit).
- __builtin_ffs (find first set): GCC function: find first set bit from the least significant bits (from the right bits).
- BSF: Bit Scan Forward x86 assembler instruction.
- TZCNT: x86 instruction for trailing-zero count, similar to BSF.
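As a quick illustration, here is a minimal sketch of isolating and indexing the lowest set bit (assuming GCC or Clang for the __builtin_ctz call; the values are arbitrary):

#include <cstdio>

int main() {
    unsigned x = 40u;                  // binary 101000
    unsigned lowest = x & (~x + 1u);   // isolate lowest set bit: 8 (binary 1000)
    int index = __builtin_ctz(x);      // index of lowest set bit: 3 (undefined for x == 0)
    printf("lowest=%u index=%d\n", lowest, index);
    return 0;
}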
Popcount (Set Bits Count): The count of 1's in a number is known as the "popcount" (which is short for population count) and there are various intrinsics:
- __builtin_popcount: GCC function to count the number of 1's in an unsigned integer.
- BitOperations.PopCount: .NET (C#) intrinsic function for bitwise popcount.
- __popcnt: AMD x86 popcount intrinsic using POPCNT x86 instruction (Microsoft platform)
- _mm_popcnt_u32: Intel x86 popcount intrinsic using POPCNT x86 instruction (Microsoft platform); use <intrin.h> on MSVS C++.
- POPCNT: the name of the x86 assembler instruction for popcount (number of 1's).
Other Bitwise Intrinsics: There are many other builtin functions such as:
Byte-wise Functions: Functions that operate on binary bytes:
- memcpy: copy bytes.
- memmove: copy bytes, tolerating overlapping regions (overlap is undefined behavior for memcpy).
- memset: set bytes to a value, usually used to clear bytes to zero.
- memcmp: compare a sequence of bytes.
- memicmp: compare bytes ignoring letter case (non-standard; _memicmp on Microsoft platforms).
- memchr: search for a byte in a sequence.
- _memccpy: copy bytes up to a specified sentinel byte.
- bcopy: copy bytes (legacy POSIX).
- bzero: clear bytes to zero (legacy POSIX).
- bcmp: compare bytes (legacy POSIX).
- _byteswap_uint64 (Microsoft intrinsic): Swap the bytes of an integer.
- __builtin_bswap16: GCC function to swap the bytes in an integer. There are also __builtin_bswap32 and __builtin_bswap64 versions.
Floating-Point Bitwise Functions: And some useful functions for bitwise manipulation of floating-point numbers:
- frexp: Splits a floating-point number into its mantissa and exponent (standard C/C++; also a Microsoft intrinsic).
- ldexp: Scales a floating-point value by an integer power of two, analogous to bitshifting (on Linux and Microsoft intrinsics).
- scalbn: Similar to ldexp, this scales a float by a power of two.
- std::signbit: Portably tests the sign bit of a floating-point number.
- std::copysign: Portably copies the sign bit from one float, merging it with the value of another (i.e. another's exponent and mantissa).
- logb: Extracts the exponent of a floating-point number.
- ilogb: Extracts into an integer the exponent of a floating-point number (on Linux and Microsoft intrinsics).
- modf: Splits a floating-point number into its whole and fractional parts (on Linux and Microsoft intrinsics).
- fma: Fused multiply-add on floats (standard since C++11; also a Microsoft intrinsic).
- remainder: Computes the IEEE remainder of a division (standard; also a Microsoft intrinsic).
- _fcvt: Convert float to string, lower-level than sprintf (Microsoft intrinsic).
For many of the listed functions, there are additional versions for different floating-point data types, such as float, double and long double.
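For instance, here is a minimal sketch of splitting and reassembling a double with frexp and ldexp (the printed values follow from the standard's definition of frexp):

#include <cmath>
#include <cstdio>

int main() {
    int exp = 0;
    double mant = frexp(6.0, &exp);   // 6.0 == 0.75 * 2^3, so mant = 0.75, exp = 3
    double back = ldexp(mant, exp);   // scales the mantissa back up: 6.0
    printf("mant=%g exp=%d back=%g\n", mant, exp, back);
    return 0;
}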
Example: Popcount in C++
The "popcount" is short for "population count" of a binary number, and is the number of binary 1's in an integer number. This has applications such as quickly counting the number of elements in a bit set or bit vector.
Bitwise arithmetic can be used to check for a '1' value in each bit of an integer. Usually an unsigned type is used (as below), but bit twiddling of signed integers is also possible. This is the slow version in C++ that simply loops through each bit, checking if it is set:
int yapi_popcount_basic(unsigned int x)   // Count number of 1's
{
    const int bitcount = 8 * sizeof(x);
    int ct = 0;
    for (int i = 0; i < bitcount; i++) {
        if (YAPI_ONE_BIT_SET(x, 1u << i)) ct++;
    }
    return ct;
}
Kernighan Popcount Algorithm: A faster version is to use a bit trick attributed to Brian Kernighan, co-author of The C Programming Language. For any nonzero n, the bitwise-AND of n and n-1 clears the rightmost set bit (i.e., the least significant 1). Hence, you can use this to optimize popcount by looping only as many times as there are 1's in the number (rather than always doing 32 iterations). Here's the new C++ code:
int yapi_popcount_kernighan_algorithm(unsigned int x)
{
    // Count number of 1's with the Brian Kernighan bit trick
    int ct = 0;
    while (x != 0) {
        x = x & (x - 1);   // Remove rightmost 1 bit: n & (n-1)
        ct++;
    }
    return ct;
}
Intrinsic Popcount Functions: The Kernighan method is faster, but far from optimal. To do it super-fast, we have to look at existing builtin function primitives. For example, Microsoft intrinsics include "__popcnt" and "_mm_popcnt_u32" (in header file <intrin.h>), and the GCC library has a "__builtin_popcount" function, which counts the number of 1's in an unsigned integer. On x86 CPUs, the underlying intrinsics should be using the x86 assembler instruction named POPCNT. Here is some example C++ code that works for Microsoft Visual Studio:
#include <intrin.h>   // Microsoft intrinsics

int yapi_popcount_intrinsics2(unsigned int x)   // MSVC version
{
    return __popcnt(x);   // Microsoft intrinsic popcount
}
Obviously, a faster version is to declare this one-line function as "inline", or to convert to a C++ preprocessor macro, such as:
#define YAPI_POPCOUNT_MACRO(x) ( __popcnt((unsigned int)(x)) )
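In C++20, there is also the portable std::popcount function in the <bit> header, which compiles down to the native popcount instruction where one exists:

#include <bit>

int ct = std::popcount(0xFFu);   // 8 set bits; portable C++20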
Example: Bitwise Log2 on Integers
Calculating the base-2 logarithm of integers can be quite useful. There are various algorithms that use logarithms in AI.
Let's calculate the integer logarithm of an integer. This means we aren't doing the proper fractional logarithm of a number, but are truncating it down to an integer. For example, log2(7) will be truncated to 2, rather than 2.807. Note that we're assuming the input is an unsigned number, since logarithms of negatives are undefined. Also, we have to decide how to handle zero, because log2(0) is undefined (or negative infinity if you prefer).
A simple way to implement a truncated integer log2 function is to use floating point functions and type casts back to int:
#include <cmath>

int yapi_log2_integer_slow(unsigned int u)   // Slow float-to-int version
{
    return (int)log2f(u);   // Note: log2f(0) is -infinity, and the cast to int is then undefined
}
This works, but it's inefficient to use floating-point arithmetic on integers. Surely there's a faster way?
After some thought about binary bits, we notice that log2 of an integer is just the index position of the highest set bit in the number. The log2 of 1 is 0, because the '1' is in position 0. The log2 of 2 (binary 10) is 1 because the leftmost 1 is in position 1. The log2 of 4 (binary 100) is 2, where the 1 is in index 2. The number 7 is binary 111, so log2 is the position of the leftmost 1, which is position 2. So log2(7) is the same as log2(4), but log2(8) is 3.
There are numerous builtin bitwise functions that can find the leftmost set bit. With sudden insight, we note that we can use "CLZ" (count leading zeros) to compute how many prefix zeros there are before the leftmost 1 bit (i.e., it counts the zeros down from the most-significant bit on the left). We can then compute the bit index position from the right in a 32-bit integer as "32-CLZ". This is on the right track, and a bit of testing shows that the formula to use is "32-CLZ-1".
Here's some example code that uses this CLZ method to compute log2 of an integer. This works on Microsoft Visual Studio using the <intrin.h> header file to declare intrinsics.
#include <intrin.h>

int yapi_log2_integer_clz_intrinsic(unsigned int u)   // LOG2 using CLZ
{
    int clz = __lzcnt(u);             // Count leading zeros
    const int bits = 8 * sizeof(u);   // 32 bits
    return bits - clz - 1;            // Note: yields -1 for u == 0
}
And here's the macro version for those who don't trust compilers to inline properly:
#define YAPI_LOG2_LZCNT(u) ( (8 * sizeof(unsigned)) - (__lzcnt((unsigned)(u))) - 1 )
And this is actually not optimal. We really should reorder the expression to move the "-1" subtraction next to the other constant, noting that "sizeof" is a compile-time constant expression in C++. Putting the two constants together makes sure that the compiler merges them using "constant folding".
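For example, a reordered variant (with a hypothetical name) where the two constant terms sit together for guaranteed folding:

#define YAPI_LOG2_LZCNT_V2(u) ( ((8 * sizeof(unsigned)) - 1) - (__lzcnt((unsigned)(u))) )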
Bug alert! Note that you can't use "ffs" (find first set bit) for this log2 method, because it gives you the offset of the least-significant set bit (i.e., the rightmost set bit rather than the leftmost). On x86 implementations, the CLZ builtin functions are presumably using the x86 LZCNT or BSR assembler instructions, which are both similar. The other x86 instructions, TZCNT (Trailing Zeros Count) and BSF (Bit Scan Forward), are incorrect for this purpose, just like FFS.
Example: Truncate Integer to Highest Power-of-Two
Another simple trick related to the log2 calculation is to truncate a number to its largest power-of-two. This is equivalent to the value of its leftmost set bit in binary representation. For example, 8 (binary 1000) stays as 8, because it's 2^3, but 7 (binary 111) reduces down to 4 (binary 100), which is 2^2.

As with the truncated integer log2 calculation, this method focuses on the leftmost 1 bit, which is known as the Most-Significant Bit (MSB). Whereas the log2 calculation found the index position of that MSB, this power-of-two calculation requires the value of the MSB. In other words, we need to find the bit that is the MSB, and then keep only that bit. A simple way to do this is to compute the log2 of the integer efficiently, and then left-shift a 1 by that many places (using an unsigned type). The basic idea is:
int bitoffset = log2_integer_fast(i);          // index of the MSB
unsigned highestpowerof2 = 1u << bitoffset;    // value of the MSB
Note that this doesn't handle cases like zero, so it still needs a bit of extra work.
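Here is a guarded sketch using the CLZ-based log2 function above (mapping zero to zero is just one possible convention):

unsigned yapi_highest_power_of_2(unsigned u)
{
    if (u == 0) return 0;   // log2(0) is undefined; pick a convention for it
    return 1u << yapi_log2_integer_clz_intrinsic(u);
}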
Bitwise Operations in AI
Some of the areas where bitwise optimizations have been used in neural network research include:
- Power-of-two quantization: By quantizing weights to the nearest integer power-of-two, bitwise shifts can replace multiplication; see Bitwise-shift power-of-two quantization
- Advanced number system division: See dyadic numbers and dyadic quantization for an obscure number system involving power-of-two division, which can be implemented as bitwise right-shifting.
- Sign bit manipulation: Various optimizations are possible by direct bitwise operations on the sign bit of integers or floating point numbers.
- Low-bit integer quantization: When quantized to only a few bits, inference can use bitwise arithmetic and bitserial operations to replace multiply-accumulate. The main examples are binary quantization and ternary quantization, both of which avoid multiplications in favor of bitwise operations (or addition) and sign bit handling.
- Shift-add networks: Multiply-and-add (or "multiply-accumulate") can be replaced with bitshift-and-add. See Shift-add networks.
- Floating point bit operations: The bits of the numeric representations of IEEE 754 floating-point numbers, or the Google bfloat16 type, include a sign bit, an exponent, and a mantissa, which can be manipulated by floating point bitwise operations.
- Table look-ups: Algorithms that use table lookups for speed improvement typically involve bitwise shifts in computing the table offset. See Lookup Tables.
- Data structures: Some data structures used in the optimization of neural networks that involve bits include hashing and Bloom filters.
- Bit arithmetic neural networks. There are neural networks where the neurons operate as bitwise operations. For example, see "XNOR networks" and Weightless Neural Networks (WNNs).
- Arithmetic computation algorithms: Bit tricks are used in optimizations of multiplication operations (including faster approximate multiplication methods) and addition operations. Such methods are primarily used in hardware nowadays.
- RELU activation function: The RELU activation function tests for negatives, which are changed to zero, but positive values are unchanged. This can be implemented efficiently as a sign bit test.
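As an example of the last point, here is a minimal sign-bit RELU sketch for 32-bit floats (the function name is hypothetical, and memcpy is used for the type-pun to avoid strict-aliasing problems):

#include <cstdint>
#include <cstring>

float relu_signbit(float f)   // RELU via the IEEE 754 sign bit
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);         // reinterpret the float's bits as an integer
    return (u & 0x80000000u) ? 0.0f : f;   // sign bit set means negative: clamp to zero
}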
XNOR Networks
XNOR neural networks are similar to binarized networks. Their internal operations rely on the bitwise XNOR operation. The idea is that XNOR is actually an implementation of the multiplication operation on binarized values: if each bit encodes +1 or -1, then the XNOR of two bits matches the sign of their product. XNOR is an uncommonly used bitwise operation, as can be seen from the fact that there's no builtin C++ operator for binary XNOR. However, some hardware instruction sets provide XNOR directly, or can synthesize it in a single instruction.
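As a sketch of why XNOR acts as binarized multiplication, here is a dot product over 64 packed weights, where a 1 bit encodes +1 and a 0 bit encodes -1 (assuming C++20's std::popcount; the function name is hypothetical):

#include <bit>
#include <cstdint>

int binary_dot64(uint64_t a, uint64_t b)
{
    uint64_t agree = ~(a ^ b);            // XNOR: 1 wherever the two signs match
    int matches = std::popcount(agree);   // number of agreeing positions
    return 2 * matches - 64;              // matches minus mismatches
}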
Research papers on XNOR networks include:
- Rastegari, M.; Ordonez, V.; Redmon, J.; Farhadi, A. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 525–542. http://dx.doi.org/10.1007/978-3-319-46493-0_32, https://arxiv.org/abs/1603.05279 (This is the seminal XNOR network paper from 2016.)
- Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report
- Kim, M.; Smaragdis, P., Bitwise Neural Networks for Efficient Single-Channel Source Separation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 701–705. http://dx.doi.org/10.1109/ICASSP.2018.8461824, PDF: https://paris.cs.illinois.edu/pubs/kim-icassp2018.pdf
- M Rastegari, V Ordonez, J Redmon, 2020, Enabling AI at the edge with XNOR-networks, Communications of the ACM, Volume 63, Issue 12, December 2020, pp 83–90, https://doi.org/10.1145/3429945, https://dl.acm.org/doi/abs/10.1145/3429945
- M Mohaimenuzzaman, C Bergmeir, B Meyer, 2022, Pruning vs XNOR-net: A comprehensive study of deep learning for audio classification on edge-devices, IEEE Access, Volume 10, Pages 6696-6707, Jan 2022, https://ieeexplore.ieee.org/abstract/document/9672158/, PDF: https://ieeexplore.ieee.org/iel7/6287639/6514899/09672158.pdf
- L Jiang, M Kim, W Wen, D Wang, 2017, 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), XNOR-POP: A processing-in-memory architecture for binary Convolutional Neural Networks in Wide-IO2 DRAMs, https://ieeexplore.ieee.org/abstract/document/8009163/, PDF: https://saige.sice.indiana.edu/wp-content/uploads/islped2017_ljiang.pdf
- F Conti, PD Schiavone, L Benini, 2018, XNOR neural engine: A hardware accelerator IP for 21.6-fJ/op binary neural network inference, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 37, Issue 11, November 2018, https://ieeexplore.ieee.org/abstract/document/8412533/, PDF: https://arxiv.org/pdf/1807.03010
- A Bulat, G Tzimiropoulos, 2019, Xnor-net++: Improved binary neural networks, arXiv preprint arXiv:1909.13863, https://arxiv.org/abs/1909.13863
- B Zhang, S Xu, M Lin, T Wang, D Doermann, 2023 Binary Neural Networks: Algorithms, Architectures, and Applications, https://books.google.com.au/books?hl=en&lr=&id=r9jeEAAAQBAJ&oi=fnd&pg=PT3&ots=e7uJtDHhPl&sig=IrLor3Mh44oIbvTHWgKqoqoYE3U&redir_esc=y#v=onepage&q&f=false
- J Sun, AP Fard, MH Mahoor, 2023, Xnodr and xnidr: Two accurate and fast fully connected layers for convolutional neural networks Journal of Intelligent & Robotic Systems, https://link.springer.com/article/10.1007/s10846-023-01952-w https://arxiv.org/pdf/2111.10854
Bitserial Operations
Bitserial operations are bitwise operations applied across all of the bits of an integer or bit vector. For example, the "popcount" operation counts how many 1's are set in the bits of an unsigned integer. Bitserial operations can be useful in neural network inference for computing the vector dot products in binary quantization or 2-bit quantization.
Papers on bitserial operations in neural networks include:
- S Ashfaq, A Hoffman, S Mitra, S Sah, Sep 2023, DeepliteRT: Computer Vision at the Edge, https://arxiv.org/pdf/2309.10878.pdf (Uses bitwise and bitserial operations for dot products for binary and 2-bit quantization.)
- Meghan Cowan, Thierry Moreau, Tianqi Chen, and Luis Ceze, Oct 2018, Automating generation of low precision deep learning operators. CoRR, abs/1810.11066, 2018. http://arxiv.org/abs/1810.11066 (Does bitserial dot products.)
Floating Point Bitwise Operations
Normal bitwise arithmetic operators cannot be applied to floating point numbers; the C++ bitwise and bitshift operators only work on integer types. However, floating point numbers are really just bit patterns underneath, so there are various tricky ways that bitwise operators can be used on the underlying IEEE 754 standard representations of floating point numbers. See more about floating-point bitwise representation tricks.
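For example, here is a minimal sketch that extracts the biased exponent field from a 32-bit IEEE 754 float (the function name is hypothetical; memcpy avoids strict-aliasing issues):

#include <cstdint>
#include <cstring>

int float_exponent_field(float f)   // Returns the raw biased exponent (0..255)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);     // reinterpret the float's bit pattern
    return (int)((u >> 23) & 0xFFu);   // bits 23..30 hold the biased exponent
}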
Bitwise Neural Network Research
Research papers on the use of bitwise operations directly to implement neural networks:
- Minje Kim, Paris Smaragdis, Jan 2016, Bitwise Neural Networks, https://arxiv.org/abs/1601.06071
- Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report
- Noureen Talpur, Said Jadid Abdulkadir, Emelia Akashah Patah Akhir, Mohd Hilmi Hasan, Hitham Alhussian, Mohd Hafizul Afifi Abdullah, A novel bitwise arithmetic optimization algorithm for the rule base optimization of deep neuro-fuzzy system, Journal of King Saud University - Computer and Information Sciences, Volume 35, Issue 2, February 2023, Pages 821-842, https://www.sciencedirect.com/science/article/pii/S1319157823000289
- Kim, M.; Smaragdis, P., Bitwise Neural Networks for Efficient Single-Channel Source Separation. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 701–705. http://dx.doi.org/10.1109/ICASSP.2018.8461824, PDF: https://paris.cs.illinois.edu/pubs/kim-icassp2018.pdf (Uses a binary XNOR-network.)
- Zhou, S.; Ni, Z.; Zhou, X.; Wen, H.; Wu, Y.; Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv 2016, arXiv:1606.06160. https://arxiv.org/abs/1606.06160 (Uses a convolution kernel involving bitwise operations.)
General Research on Bitwise Operations
- Sean Eron Anderson 2005, Bit Twiddling Hacks, Stanford University, https://graphics.stanford.edu/~seander/bithacks.html
- Ian Brayoni, 2020, https://github.com/ianbrayoni/bithacks (Python code inspired by Sean Eron Anderson's Bit Twiddling Hacks.)
- Henry S Warren, 2012, Hacker's Delight, 2nd Edition, Addison-Wesley Professional, https://www.amazon.com/Hackers-Delight-2nd-Henry-Warren/dp/0321842685 Code: https://github.com/hcs0/Hackers-Delight
- Antonio Gulli, 2014, A Collection of Bit Programming Interview Questions solved in C++ Kindle Edition, https://www.amazon.com/Collection-Programming-Interview-Questions-solved-ebook/dp/B00KIIDPUG/
- Wikipedia, 2023, Find first set, https://en.wikipedia.org/wiki/Find_first_set
- Microsoft, 2023 (accessed), _BitScanReverse, _BitScanReverse64, https://learn.microsoft.com/en-us/cpp/intrinsics/bitscanreverse-bitscanreverse64
- Microsoft, 2023 (accessed), x86 intrinsics list, https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list
- Jörg Arndt, 2010, Matters Computational: Ideas, Algorithms, Source Code, https://dl.acm.org/doi/10.5555/1941953, https://www.jjj.de/fxt/fxtpage.html#fxtbook, Code: https://www.jjj.de/bitwizardry/bitwizardrypage.html
- Sigrid/Jasper Neuman, 2023, Programming pages, http://programming.sirrida.de/
- Harold, Bits, Math and Performance, Sep 2023, http://bitmath.blogspot.com/
- Henry Gordon Dietz, 2023 (accessed), The Aggregate Magic Algorithms, Aggregate.Org online technical report, University of Kentucky, http://aggregate.org/MAGIC/
- Agner Fog, 2023 (accessed), Software optimization resources, https://www.agner.org/optimize/ (List of links and articles on software optimization.)
- Agner Fog, 2023, Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms, PDF: https://www.agner.org/optimize/optimizing_cpp.pdf
- Wojciech Muła, Nathan Kurz, Daniel Lemire, Sep 2018, Faster Population Counts Using AVX2 Instructions, https://arxiv.org/abs/1611.07612, Code: https://github.com/CountOnes/hamming_weight (Finds that the Harley-Seal algorithm outperforms various intrinsics for popcount.)
- Stephan Brumme, 2023, The bit twiddler, https://bits.stephan-brumme.com/
- Mark Lefler, 2023 (accessed), Population Count, Chess Programming Wiki, https://www.chessprogramming.org/Population_Count (Some amazing bit tricks to compute "popcount" and Hamming distance of integers, complete with code and reference citations.)
- Nicolai M. Josuttis, 2012, The C++ Standard Library: A Tutorial and Reference, Second Edition, Supplementary Chapter, https://www.amazon.com/Standard-Library-Tutorial-Reference-2nd/dp/0321623215, PDF (extra chapter): http://www.cppstdlib.com/cppstdlib_supplementary.pdf (C++ optimizations such as bit sets and user-defined memory allocators.)
- Wikipedia, 2023 (accessed), x86 Bit manipulation instruction set, https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set
- Intel, May 10th, 2023, Intel Intrinsics Guide, Version 3.6.6, https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html
- Félix Cloutier, 2023, x86 and amd64 instruction reference, https://www.felixcloutier.com/x86/
- Intel, 2023, Intel C++ Compiler Classic Developer Guide, version 2021.10, https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-10/overview.html, PDF: https://cdrdv2.intel.com/v1/dl/getContent/781922?fileName=cpp-compiler_developer-guide-reference_2021.10-767249-781922.pdf
- Gurmeet Manku, 5 Aug 2008, Fast Bit Counting, https://gurmeet.net/puzzles/fast-bit-counting-routines/
- S Ashfaq, A Hoffman, S Mitra, S Sah, Sep 2023, DeepliteRT: Computer Vision at the Edge, https://arxiv.org/pdf/2309.10878.pdf
- Braddock Gaskill, 18 Oct 2019, The Bitwise Hashing Trick for Personalized Search, https://arxiv.org/abs/1910.08646
- Alexander Golynski, Alessio Orlandi, Rajeev Raman, S. Srinivasa Rao, 10 Aug 2011, Optimal Indexes for Sparse Bit Vectors, https://arxiv.org/abs/1108.2157
- NM Ho, DT Nguyen, JL Gustafson, WF Wong, 2023, Bedot: Bit Efficient Dot Product for Deep Generative Models, CoNGA 2023: Next Generation Arithmetic, pp. 19–37, https://link.springer.com/chapter/10.1007/978-3-031-32180-1_2 https://www.comp.nus.edu.sg/~wongwf/papers/CONGA23-Bedot.pdf
- David Spuler, March 2024, Chapter 8. Bitwise Operations, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- David Spuler, March 2024, Bitwise AI Applications, in Generative AI in C++, https://www.aussieai.com/book/ch8-bitwise-ai-applications
- Duy-Thanh Nguyen, Abhiroop Bhattacharjee, Abhishek Moitra, Priyadarshini Panda, 9 Feb 2023, DeepCAM: A Fully CAM-based Inference Accelerator with Variable Hash Lengths for Energy-efficient Deep Neural Networks, https://arxiv.org/abs/2302.04712
- Chang Liu, Rui Zhang, Xishan Zhang, Yifan Hao, Zidong Du, Xing Hu, Ling Li, Qi Guo, 28 Feb 2023, Ultra-low Precision Multiplication-free Training for Deep Neural Networks, https://arxiv.org/abs/2302.14458
- Felix Petersen, Hilde Kuehne, Christian Borgelt, Julian Welzel, Stefano Ermon, 7 Nov 2024, Convolutional Differentiable Logic Gate Networks, 38th Conference on Neural Information Processing Systems (NeurIPS 2024), https://arxiv.org/abs/2411.04732
More AI Research
Read more about:
- Multiplication operator
- Addition operator
- Division operator
- Zero-Multiplication Models
- Advanced AI Mathematics
- Matrix Algebra
- Logarithmic Models
- Approximate Computing
- Inference Optimizations
- Loop Optimizations
- Code Optimizations