Aussie AI
50. Adaptive Inference
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
“Adaptability is not imitation.
It means power of resistance and assimilation.”
— Mahatma Gandhi.
What is Adaptive Inference?
The default execution of AI inference is a brute-force computation using all the weights. The same huge computation is repeated for every token, regardless of what's in the user input string.
Adaptive inference tries to shake that up by adding dynamic choices to this simple algorithm, so that the model uses different computations for different user inputs. The method adds various dynamic tests that change how the computations progress, rather than brute-forcing everything.
The first thing to understand about adaptive inference is that it is not the default. Although AI engines produce different outputs according to different prompts, the steps they go through are largely fixed. Each encoder or decoder runs through a fixed number of layers, with fixed sets of precomputed weights from the model file, and all of these weights are used in a brute-force computation. There's only a small amount of variability in the decoding algorithm to create some creativity in responses (e.g. randomly picking from the top-50 possible words).
Although it's a huge amount of runtime computation, there's something about the whole inference algorithm that is inherently static. As I've said before, it's as if the code has no “if” statements, and always goes through a fixed sequence of steps. With adaptive inference methods, the AI engine modifies its inference algorithm to operate differently in ways that depend on the user's input prompt.
Types of Adaptive Inference
Adaptive inference is not yet part of the mainstream optimizations of an AI engine. However, there are many different research areas where this general idea is applied in a particular way to speed up inference. The main examples where the computation path “adapts” to the user inputs include:
- Early exit (dynamic layer pruning)
- Layer skipping
- Layer reordering
- Dynamic width pruning
- Token pruning (dynamic length pruning)
- Prompt compression
- Semantic caching
- Cascades
- Stochastic algorithms (intentional randomness)
For example, “early exiting” is a type of “dynamic layer pruning” where some of the layers of weights are simply skipped at runtime. Instead of going through all 12 layers of GPT-2, it might decide to only run through 9 layers for some inputs.
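To make this concrete, here is a minimal C++ sketch of an early-exit loop over the layers. The Layer type and the confidence heuristic are hypothetical placeholders (not the code of any particular engine); the adaptive part is the conditional break that skips the remaining layers for easy inputs.

```cpp
#include <algorithm>
#include <vector>

struct Layer {
    // Placeholder "layer": a real Transformer layer would do attention and FFN work here.
    std::vector<float> forward(const std::vector<float>& x) const { return x; }
};

// Placeholder confidence heuristic (e.g. the largest activation value).
static float confidence_estimate(const std::vector<float>& v) {
    return v.empty() ? 0.0f : *std::max_element(v.begin(), v.end());
}

std::vector<float> run_with_early_exit(const std::vector<Layer>& layers,
                                       std::vector<float> act,
                                       float exit_threshold) {
    for (const Layer& layer : layers) {
        act = layer.forward(act);
        // Adaptive step: if the intermediate result already looks confident,
        // skip the remaining layers for this "easy" input.
        if (confidence_estimate(act) >= exit_threshold)
            break;
    }
    return act;
}
```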
Adaptive inference research. The above specific techniques are covered in other chapters with pointers to research papers. Research papers on adaptive inference in general include:
- Anil Kag, 2023, Novel neural architectures & algorithms for efficient inference, Ph.D. thesis, College of Engineering, Boston University, https://open.bu.edu/handle/2144/46649, PDF: https://open.bu.edu/bitstream/handle/2144/46649/Kag_bu_0017E_18472.pdf?sequence=8&isAllowed=y (See Chapter 10, “Input Hardness Adaptive Models” for methods of running faster on easy image classification problems.)
- Bolukbasi, T., Wang, J., Dekel, O., and Saligrama, V. 2017. Adaptive neural networks for efficient inference, In Proceedings of the 34th International Conference on Machine Learning-Volume 70, Proceedings of Machine Learning Research, pages 527–536. https://arxiv.org/abs/1702.07811, http://proceedings.mlr.press/v70/bolukbasi17a.html
- Nan, F. and Saligrama, V. 2017. Adaptive classification for prediction under a budget, In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (editors), 2017, Advances in Neural Information Processing Systems, volume 30, pages 4727–4737. Curran Associates, https://arxiv.org/abs/1705.10194 PDF: https://proceedings.neurips.cc/paper_files/paper/2017/file/d9ff90f4000eacd3a6c9cb27f78994cf-Paper.pdf
- Nan, F. and Saligrama, V., 2017. Dynamic model selection for prediction under a budget, arXiv preprint arXiv:1704.07505. https://arxiv.org/abs/1704.07505
- Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. 2022. Dynamic neural networks: A survey, IEEE Transactions on Pattern Analysis and Machine Intelligence, volume 44, pages 7436–7456, Los Alamitos, CA, USA. IEEE Computer Society. https://arxiv.org/abs/2102.04906, https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837 (Survey of dynamic inference techniques, where the engine is adaptive to the input.)
- Graves, A. (2016). Adaptive computation time for recurrent neural networks, CoRR, abs/1603.08983. http://arxiv.org/abs/1603.08983
- Jernite, Y., Grave, E., Joulin, A., and Mikolov, T. (2017). Variable computation in recurrent neural networks, In International Conference on Learning Representations. https://openreview.net/forum?id=S1LVSrcge, https://arxiv.org/abs/1611.06188
- D. Stamoulis, T.-W. Chin, A. K. Prakash, H. Fang, S. Sajja, M. Bognar, and D. Marculescu, 2018, Designing adaptive neural networks for energy-constrained image classification, in Proceedings of the International Conference on Computer-Aided Design, 2018, pp. 1–8. https://ieeexplore.ieee.org/document/8587753
- Xin Wang, Fisher Yu, Zi-Yi Dou, Trevor Darrell, Joseph E. Gonzalez, 2018, Skipnet: Learning dynamic routing in convolutional networks, in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 409–424. https://arxiv.org/abs/1711.09485
- Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang, Yingyan Lin, 2020, Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference, IEEE Journal of Selected Topics in Signal Processing, 2020. https://arxiv.org/abs/1907.04523
- P. Panda, A. Sengupta, and K. Roy, 2016, Conditional deep learning for energy-efficient and enhanced pattern recognition, in Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2016. https://arxiv.org/abs/1509.08971
- Berestizshevsky, K., Even, G., 2019, Dynamically sacrificing accuracy for reduced computation: Cascaded inference based on softmax confidence, In: Lecture Notes in Computer Science, pp. 306–320. Springer International Publishing (2019). https://doi.org/10.1007/978-3-030-30484-3_26
- Jayakodi, N.K., Chatterjee, A., Choi, W., Doppa, J.R., Pande, P.P., Trading-off accuracy and energy of deep inference on embedded systems: A co-design approach, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(11), 2881–2893 (2018). https://doi.org/10.1109/tcad.2018.2857338, https://arxiv.org/abs/1901.10584
- Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Input-dependent inference optimization via layer-wise weight clustering and early exit based on a termination condition.)
- Maedeh Hemmat; Azadeh Davoodi, March 2019, Dynamic Reconfiguration of CNNs for Input-Dependent Approximation, 20th International Symposium on Quality Electronic Design (ISQED), https://ieeexplore.ieee.org/document/8697843 (Dynamically decides how many clusters of similar weights to use, depending on input.)
- B Wójcik, M Przewiȩźlikowski, F Szatkowski, Oct 2023, Zero time waste in pre-trained early exit neural networks, Neural Networks, https://www.sciencedirect.com/science/article/pii/S0893608023005555, PDF: https://papers.nips.cc/paper/2021/file/149ef6419512be56a93169cd5e6fa8fd-Paper.pdf (Attempts to quickly handle easy inputs by caching classifications from prior layers for early exit decisions.)
For more general research on adaptive inference, refer to https://www.aussieai.com/research/inference-optimization#dynamic.
Non-adaptive methods. Note that there are numerous other ways to speed up inference at runtime that do not actually change the sequence of computations for different inputs. They are not “adaptive” because they still do the same inference calculations each time, only faster. As some examples, these methods are inference optimizations, but not true adaptive inference:
- Quantization
- Model compression
- Static pruning
- Sparsity
- Knowledge distillation
- Inference caching
- KV Caching
- Integer algorithms
- Zero-multiplication models
Easy vs Hard Queries
One type of adaptive inference is to use a heuristic to determine whether a user query is easy or hard. An “easy” query can be processed faster using some simpler method or a small model, whereas a “hard” query has to be processed fully by a large model. There are various multi-model ensemble architectures that perform adaptive inference with this type of approach of choosing between two or more models:
- Model selection algorithms
- Big-little architectures
- Mixture-of-Experts (MoE)
- Speculative decoding
This is not the same as caching, since one of the models is always executed, but the two ideas can be combined (i.e., cache first, then multi-model). These heuristic decision methods are discussed under multi-model methods in Chapter 54.
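As a simple illustration of the routing idea, here is a minimal C++ sketch. The two model functions and the hardness heuristic are hypothetical stand-ins; a real system would use something better than prompt length, such as a trained router or a confidence score from the small model.

```cpp
#include <string>

// Hypothetical stand-ins for a small and a large model.
static std::string run_small_model(const std::string& prompt) { return "small: " + prompt; }
static std::string run_large_model(const std::string& prompt) { return "large: " + prompt; }

// Crude hardness heuristic for illustration only: short prompts count as "easy".
static bool query_looks_easy(const std::string& prompt) {
    return prompt.size() < 200;
}

std::string answer(const std::string& prompt) {
    // Adaptive step: easy queries go to the cheap model, hard ones to the big model.
    return query_looks_easy(prompt) ? run_small_model(prompt)
                                    : run_large_model(prompt);
}
```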
Zero Skipping
Zero skipping is a particular type of adaptive inference that involves avoiding multiplications by zero weights. This idea can be applied at a low level or a high level of the inference algorithm.
Low-Level Zero Skipping. At a low level, zero skipping means testing a single weight to see whether it is zero, thereby avoiding a wasteful multiplication-by-zero operation. Testing a register against zero is much faster than a multiplication, and the multiplication hardware doesn't run any faster just because an operand is zero, so this is a “simple case first” optimization.
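Here is what the idea looks like in a simple C++ dot product loop, as a sketch only. Whether the extra branch actually pays off depends on the hardware and the fraction of zero weights; on GPUs or vectorized CPU code, the test can easily cost more than the multiplication it saves.

```cpp
// Dot product with low-level zero skipping: a "simple case first" test per weight.
float dot_product_zero_skip(const float* weights, const float* activations, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (weights[i] == 0.0f)        // skip the wasted multiply-by-zero
            continue;
        sum += weights[i] * activations[i];
    }
    return sum;
}
```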
Note that there's a whole class of research called “sparse matrices” or “sparsification” which aims to cut out whole swaths of zero-multiplications, but the research below is lower level than this.
There aren't many papers on this low-level topic of “zero skipping” of individual weights, specific to inference arithmetic, and even in some of these papers it's not the central point. That's probably because hardware acceleration makes pre-testing for zeros on a small scale not worthwhile, whereas large-scale avoidance of zero-multiplication appears in research on “sparsification”.
Research papers on low-level zero skipping:
- Y. Chen, J. Emer, and V. Sze, 2016, Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks, In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 367–379, https://ieeexplore.ieee.org/document/7551407
- Dongyoung Kim, Junwhan Ahn, and Sungjoo Yoo, 2018, ZeNA: Zero-aware neural network accelerator, IEEE Design & Test 35, 1 (2018), 39–46, https://doi.org/10.1109/MDAT.2017.2741463
- Xinlin Li, Bang Liu, Rui Heng Yang, Vanessa Courville, Chao Xing, Vahid Partovi Nia, 2022, DenseShift: Towards Accurate and Transferable Low-Bit Shift Network, Aug 2022, https://arxiv.org/abs/2208.09708
- Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan, 2021, GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator, In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA’21), 1110–1123, https://doi.org/10.1109/ISCA52012.2021.00090, https://ieeexplore.ieee.org/document/9499915
- S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, 2016, Cambricon: An instruction set architecture for neural networks, 2016, In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA’16). 393–405, https://ieeexplore.ieee.org/abstract/document/7551409
- Yuxiang Huan, Yifan Qin, Yantian You, Lirong Zheng, and Zhuo Zou. Sep 2016. A multiplication reduction technique with near-zero approximation for embedded learning in IoT devices, 2016 29th IEEE International System-on-Chip Conference (SOCC), 102–107. https://ieeexplore.ieee.org/abstract/document/7905445 (Avoids multiplications of near-zero small values by counting the number of prefix zeros in the floating-point representation using bitwise arithmetic.)
- Minkyu Kim and Jae Sun Seo. 2021. An energy-efficient deep convolutional neural network accelerator featuring conditional computing and low external memory access, IEEE Journal of Solid-State Circuits 56, 3 (2021), 803–813, https://ieeexplore.ieee.org/document/9229157 (Cascades and zero-skipping.)
- R. J. R. Struharik, B. Z. Vukobratović, A. M. Erdeljan, and D. M. Rakanović, 2020, CoNNa–Hardware accelerator for compressed convolutional neural networks, Microprocessors Microsyst., vol. 73, Mar. 2020, Art. no. 102991. https://ieeexplore.ieee.org/document/8491841, https://www.sciencedirect.com/science/article/abs/pii/S0141933119300158
- Y.-H. Chen, T. Krishna, J.-S. Emer and V. Sze, 2016, Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127-138, Nov. 2016, https://ieeexplore.ieee.org/document/7738524 (Uses zero-skipping as part of the improvements.)
- J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N.E. Jerger, A. Moshovos, 2016, Cnvlutin: ineffectual-neuron-free deep neural network computing, in: 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, 2016, pp. 1–13. https://ieeexplore.ieee.org/document/7551378
- Y. Lu, C. Wang, L. Gong, X. Zhou, 2018, SparseNN: a performance-efficient accelerator for large-scale sparse neural networks, Int. J. Parallel Program. 46 (4) (2018) 648–659. https://arxiv.org/abs/1711.01263
High-Level Zero Skipping. At a high level, zero skipping can mean avoiding all of the multiplications from an entire column or row of a matrix, or from an entire structure of the model (e.g. structural model pruning); a small code sketch appears after the list below. Papers on zero skipping at a high level in model structures include:
- C. Gao, D. Neil, E. Ceolini, S.-C. Liu, and T. Delbruck, 2018, DeltaRNN: A power-efficient recurrent neural network accelerator, in Proc. ACM/SIGDA Int. Symp. Field-Program. Gate Arrays, Feb. 2018, pp. 21–30. PDF: https://dl.acm.org/doi/pdf/10.1145/3174243.3174261 (Refers to zero-skipping at a high-level, skipping an entire column or row.)
- M. P. Véstias, R. P. Duarte, J. T. de Sousa, and H. C. Neto, 2019, Fast convolutional neural networks in low density FPGAs using zero-skipping and weight pruning, Electronics, vol. 8, no. 11, p. 1321, Nov. 2019. https://www.mdpi.com/2079-9292/8/11/1321 (High-level zero-skipping of activations with zero weights.)
- Alessandro Aimar, Hesham Mostafa, Enrico Calabrese, Antonio Rios-Navarro, Ricardo Tapiador-Morales, Iulia-Alexandra Lungu, Moritz B. Milde, Federico Corradi, Alejandro Linares-Barranco, Shih-Chii Liu, Tobi Delbruck, 2019, NullHop: A flexible convolutional neural network accelerator based on sparse representations of feature maps, IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 3, pp. 644-656, Mar. 2019. https://arxiv.org/abs/1706.01406
- S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, Y. Chen, 2016, Cambricon-x: an accelerator for sparse neural networks, in: The 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, 2016, p. 20. https://ieeexplore.ieee.org/document/7783723
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M.A. Horowitz, W.J. Dally, 2016, EIE: efficient inference engine on compressed deep neural network, in: Proceedings of the 43rd International Symposium on Computer Architecture, Seoul, 2016, pp. 243–254. https://arxiv.org/abs/1602.01528
- A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, W.J. Dally, 2017, SCNN: an accelerator for compressed-sparse convolutional neural networks, in: 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), Toronto, 2017, pp. 27–40. https://arxiv.org/abs/1708.04485
- D. Kim, J. Ahn and S. Yoo, 2018, ZeNA: Zero-aware neural network accelerator, IEEE Des. Test, vol. 35, no. 1, pp. 39-46, Feb. 2018. https://ieeexplore.ieee.org/document/8013151
- Maedeh Hemmat, Joshua San Miguel, Azadeh Davoodi, 2021, AirNN: A Featherweight Framework for Dynamic Input-Dependent Approximation of CNNs, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.40, no.10, pp.2090-2103, 2021. https://ieeexplore.ieee.org/document/9239327 (Uses a “greedy interleaving” algorithm for processing sparse matrices to avoid zero multiplications.)
- P. Grigoras, P. Burovskiy, E. Hung, and W. Luk. 2015, Accelerating SpMV on FPGAs by compressing nonzero values, In International Symposium on Field Programmable Gate Arrays, pages 64–67, 2015. https://ieeexplore.ieee.org/document/7160041 (Sparse multiplication of non-zero values, skipping zeros.)
- M. Song, J. Zhao, Y. Hu, J. Zhang, and T. Li, 2018, Prediction based execution on deep neural networks, In International Symposium on Computer Architecture, pages 752–763, 2018, https://ieeexplore.ieee.org/document/8416870, PDF: https://www.researchgate.net/profile/Mingcong-Song/publication/326566905_Prediction_Based_Execution_on_Deep_Neural_Networks/links/5bd68551a6fdcc3a8dad72ff/Prediction-Based-Execution-on-Deep-Neural-Networks.pdf (Attempts to predict and avoid zero-valued operands for multiplication in hardware.)
- JA Chen, W Niu, B Ren, Y Wang, X Shen, 2023, Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Survey paper covering various data redundancy optimizations such as skipping or reusing computations for similar data.)
- H Park, D Kim, J Ahn, S Yoo, 2016, Zero and data reuse-aware fast convolution for deep neural networks on GPU, 2016 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), https://dl.acm.org/doi/abs/10.1145/2968456.2968476, https://ieeexplore.ieee.org/document/7750981 (Zero-skipping by prediction of the results.)
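To make the high-level version concrete, here is a small C++ sketch of a matrix-vector multiplication that skips entire all-zero rows of the weight matrix. The row_is_zero flags are hypothetical bookkeeping, computed once from the static weights (see the weight precomputation section later in this chapter).

```cpp
#include <cstddef>
#include <vector>

// Matrix-vector multiply that skips whole rows known to be all zeros
// (e.g. after structural pruning). The row_is_zero flags are precomputed offline.
void matvec_skip_zero_rows(const std::vector<std::vector<float>>& W,
                           const std::vector<bool>& row_is_zero,
                           const std::vector<float>& x,
                           std::vector<float>& y) {
    y.assign(W.size(), 0.0f);
    for (std::size_t r = 0; r < W.size(); ++r) {
        if (row_is_zero[r])            // whole row pruned to zero: output stays 0
            continue;
        float sum = 0.0f;
        for (std::size_t c = 0; c < x.size(); ++c)
            sum += W[r][c] * x[c];
        y[r] = sum;
    }
}
```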
For more research on zero skipping, see also https://www.aussieai.com/research/zero-skipping.
Negative Skipping
Negative skipping is not skipping of negative weights. Instead, negative skipping is an attempt to predict which vector dot product computations will be negative, and skip doing them. In models that use the RELU activation function, any negative results would be zero anyway if sent to RELU. Hence, negative skipping with RELU is a type of zero skipping.
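Here is a rough C++ sketch of the idea, preceding a RELU. The cheap predictor used here (a strided partial sum) and its threshold are illustrative assumptions, not the specific predictors used in the papers below.

```cpp
#include <vector>

// Dot product followed by RELU, with "negative skipping": a cheap partial-sum
// estimate predicts whether the full result will be negative, and if so,
// skips the full computation because RELU would zero it anyway.
float relu_dot_with_negative_skip(const std::vector<float>& w,
                                  const std::vector<float>& x) {
    const std::size_t n = w.size();
    // Cheap estimate: sum every 4th term and scale up to approximate the full sum.
    float estimate = 0.0f;
    for (std::size_t i = 0; i < n; i += 4)
        estimate += w[i] * x[i];
    estimate *= 4.0f;
    if (estimate < -1.0f)              // confidently negative: RELU output would be zero
        return 0.0f;
    // Otherwise do the full dot product followed by RELU.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += w[i] * x[i];
    return sum > 0.0f ? sum : 0.0f;
}
```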
Research papers on negative skipping:
- Duvindu Piyasena, Rukshan Wickramasinghe, Debdeep Paul, Siew Kei Lam, and Meiqing Wu. 2019. Reducing dynamic power in streaming CNN hardware accelerators by exploiting computational redundancies, Proceedings 29th International Conference on Field-Programmable Logic and Applications, FPL 2019 (9 2019), 354–359, https://ieeexplore.ieee.org/document/8891989 PDF: https://siewkeilam.github.io/ei-research-group/Paper/2019H-Duvindu-FPL.pdf (This is “negative skipping”, similar to zero-skipping, where cheap estimates avoid computations that would be negative, which would thereby be reduced to zero by RELU activation.)
- T. Ujiie, M. Hiromoto, and T. Sato. 2016. Approximated Prediction Strategy for Reducing Power Consumption of Convolutional Neural Network Processor, Conf. on Comp. Vision and Pattern Recog. Workshops (CVPRW), 870–876. https://ieeexplore.ieee.org/document/7789603 https://openaccess.thecvf.com/content_cvpr_2016_workshops/w14/papers/Ujiie_Approximated_Prediction_Strategy_CVPR_2016_paper.pdf (Does “negative skipping” by quickly approximating the value of a convolution to skip it entirely if expected to be negative.)
For more research on negative skipping, see also https://www.aussieai.com/research/zero-skipping#negative.
Zero Padding Removal
One zero skipping technique for speeding up Transformer inference is to avoid using zero padding in the input vectors. The need for padding arises in some architectures where it is helpful to keep vectors the same size, because that consistency helps with pipelining calculations through the GPU. However, research has shown that padding can also lead to inefficiency from performing redundant computations whose results are never used, and various papers have advocated removing the zero padding bytes.
An alternative approach is to use packing of input sequences to avoid or reduce padding bytes. This is effective for training sets, or for multiple streams of inference queries.
And it's worth noting that not all padding bytes are evil. Some of them are quite charismatic if you take them out for a cup of tea. In fact, the need for padding removal in Transformers arose for good reason, from well-intentioned optimization by professional programmers using very nice and hospitable padding zeros. The use of padding is a positive optimization in numerous situations, particularly when GPUs are involved. See Chapter 49 for more about padding byte optimizations.
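As a simple sketch of the padding-avoidance idea in C++, the loop below processes only the real tokens of each sequence in a padded batch, never touching the trailing padding. The structure and function names are illustrative, not from any particular engine.

```cpp
#include <cstddef>
#include <vector>

struct PaddedBatch {
    std::vector<std::vector<int>> tokens;    // each sequence padded to the same maximum length
    std::vector<std::size_t> real_length;    // actual (unpadded) length of each sequence
};

// Stand-in for the real per-token work (embedding lookup, attention, etc.).
static void process_token(int /*token_id*/) { }

void process_batch_without_padding(const PaddedBatch& batch) {
    for (std::size_t s = 0; s < batch.tokens.size(); ++s) {
        // Loop only over the real tokens; the trailing zero padding is never processed.
        for (std::size_t t = 0; t < batch.real_length[s]; ++t)
            process_token(batch.tokens[s][t]);
    }
}
```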
Research papers on zero padding avoidance:
- Intel, 2023, Optimizing Transformer Model Inference on Intel Processors, April 2023, https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-transformer-model-inference-processors.html (One of the optimizations suggested is to avoid computations involving zero padding bytes.)
- Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu, 2023, ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs, https://arxiv.org/abs/2210.03052 (Removing zero-padding inputs is one of the major optimizations in this paper.)
- J Du, J Jiang, J Zheng, H Zhang, D Huang, Y Lu, August 2023, Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs, ACM Transactions on Architecture and Code Optimization, https://dl.acm.org/doi/10.1145/3617689, PDF: https://dl.acm.org/doi/pdf/10.1145/3617689
- H Peng, S Huang, S Chen, B Li, T Geng, A Li, 2022, A length adaptive algorithm-hardware co-design of transformer on fpga through sparse attention and dynamic pipelining, DAC '22: Proceedings of the 59th ACM/IEEE Design Automation Conference, July 2022, Pages 1135–1140, https://doi.org/10.1145/3489517.3530585, https://dl.acm.org/doi/10.1145/3489517.3530585 https://arxiv.org/pdf/2208.03646
- Taylor Simons and Dah-Jye Lee, 2019, A Review of Binarized Neural Networks, Electronics 2019, 8, 661; doi:10.3390/electronics8060661, MDPI, https://www.mdpi.com/2079-9292/8/6/661/review_report (Includes an interesting review of practical problems with zero padding in binarized networks, where the weights are only -1 and +1.)
- Zhai, Yujia, 2023, Ph.D. thesis, Architectural-Aware Performance Optimization: From the Foundational Math Library to Cutting-Edge Applications, Computer Science, University of California, Riverside, https://escholarship.org/content/qt8s28g07q/qt8s28g07q.pdf (Includes examination of padding-free algorithms such as ByteTransformer.)
- Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao, 2022, Easy and Efficient Transformer: Scalable Inference Solution For large NLP model, May 2022, https://arxiv.org/abs/2104.12470 (Optimizations include avoiding padding computations in the attention heads.)
- Ashraf Eassa, Bo Yang Hsueh, Brian Pharris, Zhihan Jiang and Ashwin Nanjappa, Sep 08, 2022, Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA, NVIDIA Technical Blog, https://developer.nvidia.com/blog/full-stack-innovation-fuels-highest-mlperf-inference-2-1-results-for-nvidia/ (The NVIDIA Bert submission used zero-padding removal and also various kernel fusions.)
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034, Code: https://github.com/JonasGeiping/cramming (Used packing of sequences in training with a SEP separator token rather than CLS. Note: code uses deprecated nvFuser compiler.)
For more research on zero padding, see also https://www.aussieai.com/research/zero-padding.
Weight Precomputations
Weights are static during inference, so why not fiddle with them before we start? Of course, that's exactly the underlying idea of quantization and static pruning. Quantization precomputes new versions of the weights that are quantized to integers or lower precision floating-point. Pruning removes weights by changing some of them to zero.
However, this section looks at other precomputation ideas in general. What useful information can we discern by preprocessing the weights and doing precomputations? Since the weight data is available after training, we can do these weight calculations “offline,” before deployment and without affecting inference speed, and then use the precomputed data in some way to speed up dynamic runtime inference thereafter.
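As a simple example of the idea, the C++ sketch below scans the static weight matrix once, offline, and records which rows are entirely zero, so that later inference calls can skip those rows (as in the zero-skipping matrix-vector sketch earlier in this chapter). The data structure is illustrative only.

```cpp
#include <vector>

// Offline precomputation: record which rows of the static weight matrix are all zeros.
struct PrecomputedWeightInfo {
    std::vector<bool> row_is_zero;   // true if every weight in the row is zero
};

PrecomputedWeightInfo precompute_weight_info(const std::vector<std::vector<float>>& W) {
    PrecomputedWeightInfo info;
    info.row_is_zero.reserve(W.size());
    for (const auto& row : W) {
        bool all_zero = true;
        for (float w : row) {
            if (w != 0.0f) { all_zero = false; break; }
        }
        info.row_is_zero.push_back(all_zero);
    }
    return info;   // computed once after loading the model, reused on every inference call
}
```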
Research papers on weight precomputation:
- T. J. Ham, S. J. Jung, S. Kim et al., 2020, A3: Accelerating attention mechanisms in neural networks with approximation, in Proc. of HPCA. IEEE, 2020, pp. 328–341. https://arxiv.org/abs/2002.10941 (Preprocessing of the key matrix in attention, with focus on large positive and negative values.)
- Q. Chen, C. Sun, Z. Lu, and C. Gao, 2022, Enabling energy-efficient inference for self-attention mechanisms in neural networks, in IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2022, pp. 25–28, https://ieeexplore.ieee.org/document/9869924
- Tae Jun Ham; Yejin Lee; Seong Hoon Seo; Soosung Kim; Hyunji Choi; Sung Jun Jung; Jae W. Lee, 2021, ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/abstract/document/9499860/, https://taejunham.github.io/data/elsa_isca21.pdf (Precomputations involve the key and value matrices including dot products, hashing, and similarity checking.)
For research on weight precomputations, see also https://www.aussieai.com/research/weight-precomputations.