Aussie AI
Weight Precomputations
-
Last Updated 14 September, 2024
-
by David Spuler, Ph.D.
Weights are static during inference, so why not fiddle with them before we start? Of course, that's exactly the underlying idea of quantization and static pruning. Quantization precomputes new versions of the weights, converted to integers or lower-precision floating-point numbers. Static pruning removes weights by setting some of them to zero.
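For example, here is a minimal C++ sketch of offline weight quantization, converting a float weight array to 8-bit integers with a single per-tensor scale. The names and the simplistic per-tensor scaling are illustrative assumptions only; production quantizers typically use per-channel scales, calibration data, and more careful rounding.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    // Offline symmetric per-tensor quantization: precompute int8 weights plus
    // a single scale factor. Illustrative sketch only.
    struct QuantizedWeights {
        std::vector<int8_t> q;   // quantized weights
        float scale;             // dequantization: w is approximately q * scale
    };

    QuantizedWeights quantize_weights_offline(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        QuantizedWeights out;
        out.scale = (maxabs > 0.0f) ? (maxabs / 127.0f) : 1.0f;
        out.q.reserve(w.size());
        for (float x : w) {
            long v = std::lround(x / out.scale);
            if (v > 127) v = 127;     // clamp to symmetric int8 range
            if (v < -127) v = -127;
            out.q.push_back(static_cast<int8_t>(v));
        }
        return out;
    }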
However, this section looks at other precomputation ideas. What useful information can we discern by preprocessing the weights and doing precomputations? Since the weight data is available after training, we can make these changes "offline" without affecting inference speed, and then use the precomputed data in some way to speed up inference thereafter.
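As a simple illustration of the idea, the hypothetical C++ sketch below makes one offline pass over a weight matrix and precomputes the maximum absolute weight in each row. An inference kernel could later consult these bounds to decide which rows to compute exactly and which to skip or approximate. The function name and the skipping strategy are assumptions for illustration, not a particular published algorithm.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Offline pass over a row-major weight matrix (rows x cols): precompute
    // the maximum absolute weight in each row. An inference kernel could use
    // these bounds to skip or approximate rows with only small weights
    // (the skipping policy itself is hypothetical).
    std::vector<float> precompute_row_max_magnitudes(const float* weights,
                                                     int rows, int cols) {
        std::vector<float> rowmax(rows, 0.0f);
        for (int r = 0; r < rows; ++r) {
            const float* row = weights + static_cast<std::size_t>(r) * cols;
            for (int c = 0; c < cols; ++c) {
                float a = std::fabs(row[c]);
                if (a > rowmax[r]) rowmax[r] = a;
            }
        }
        return rowmax;
    }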
Research on Weight Precomputations
Some of the papers that examine or precompute weight data ahead of time to speed up inference include:
- T. J. Ham, S. J. Jung, S. Kim et al., “A3: Accelerating attention mechanisms in neural networks with approximation,” in Proc. of HPCA. IEEE, 2020, pp. 328–341. https://arxiv.org/abs/2002.10941 (Preprocessing of the key matrix in attention, with focus on large positive and negative values.)
- Q. Chen, C. Sun, Z. Lu, and C. Gao, “Enabling energy-efficient inference for self-attention mechanisms in neural networks,” in IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), 2022, pp. 25–28, https://ieeexplore.ieee.org/document/9869924
- Tae Jun Ham, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, Jae W. Lee, 2021, ELSA: Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks, 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), https://ieeexplore.ieee.org/abstract/document/9499860/, https://taejunham.github.io/data/elsa_isca21.pdf (Precomputations involve the key and value matrices, including dot products, hashing, and similarity checking.)
- J. Rae, J. J. Hunt, I. Danihelka, T. Harley, A. W. Senior, G. Wayne, A. Graves, and T. Lillicrap, “Scaling memory-augmented neural networks with sparse reads and writes,” in International Conference on Neural Information Processing Systems, NIPS, 2016. https://arxiv.org/abs/1610.09027
- Z. Qu, L. Liu, F. Tu, Z. Chen, Y. Ding, Y. Xie, 2022, DOTA: Detect and omit weak attentions for scalable transformer acceleration, https://dl.acm.org/doi/pdf/10.1145/3503222.3507738
- David Spuler, March 2024, Chapter 50. Adaptive Inference, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Nils Graef, 12 Mar 2024 (v3), Transformer tricks: Precomputing the first layer, https://arxiv.org/abs/2402.13388 Code: https://github.com/OpenMachine-ai/transformer-tricks (Because the first layer depends only on the embeddings, it can be precomputed; see the sketch after this list.)
- SZ Lin, YC Chen, YH Chang, TW Kuo, HP Li, 2024, LUTIN: Efficient Neural Network Inference with Table Lookup, ISLPED ’24, August 5-7, 2024, Newport Beach, CA, USA, https://dl.acm.org/doi/pdf/10.1145/3665314.3670804
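As an illustration of the first-layer precomputation idea in the Graef (2024) paper above, here is a hypothetical C++ sketch that builds a per-token projection table once, offline. It assumes the first layer's linear projection of a token depends only on that token's embedding (e.g., when positional information is applied later in the layer, as with RoPE); the names and flat array layout are assumptions, not the paper's reference code.

    #include <cstddef>
    #include <vector>

    // Hypothetical offline precomputation of a first-layer projection table.
    // Each vocabulary entry's projection is computed once and stored, so that
    // inference can replace a per-token matrix multiply with a table lookup.
    struct FirstLayerTable {
        int hidden = 0;             // projection width
        std::vector<float> table;   // vocab x hidden, row-major
    };

    FirstLayerTable precompute_first_layer(
            const std::vector<float>& embeddings, // vocab x d_model, row-major
            const std::vector<float>& proj,       // d_model x hidden, row-major
            int vocab, int d_model, int hidden) {
        FirstLayerTable t;
        t.hidden = hidden;
        t.table.assign(static_cast<std::size_t>(vocab) * hidden, 0.0f);
        for (int v = 0; v < vocab; ++v) {
            for (int h = 0; h < hidden; ++h) {
                float sum = 0.0f;
                for (int d = 0; d < d_model; ++d) {
                    sum += embeddings[static_cast<std::size_t>(v) * d_model + d]
                         * proj[static_cast<std::size_t>(d) * hidden + h];
                }
                t.table[static_cast<std::size_t>(v) * hidden + h] = sum;
            }
        }
        return t;
    }

    // At inference time, the precomputed projection of token id "tok" is a lookup:
    //   const float* row = &t.table[static_cast<std::size_t>(tok) * t.hidden];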