Aussie AI
Skipping Optimizations
-
Last Updated 30 August, 2024
-
by David Spuler, Ph.D.
Skipping calculations is a powerful optimization whenever it can be achieved. And neural network inference is a morass of redundant calculation, so there is plenty to be skipped. There are many types of "skipping" that can be used to improve AI inference speed, from the top to the bottom of the AI stack.
Structural component-level skipping methods include:
- Layer skipping
- Layer fusion
- Early exit (dynamic layer skipping; see the code sketch after this list)
- Layer pruning
- Cascades (pathway skipping)
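Early exit is the easiest of these to illustrate in code. Below is a minimal C++ sketch of dynamic layer skipping, where the remaining layers are skipped once an intermediate confidence test is satisfied. The Layer type and the confidence() heuristic here are hypothetical stand-ins; a real early-exit model would use trained layers and a small per-layer exit classifier.

```cpp
// Minimal sketch of early exit (dynamic layer skipping).
#include <cstdio>
#include <vector>

struct Layer {
    // Stand-in for a real layer's forward pass.
    void forward(std::vector<float>& activations) const {
        for (float& a : activations) a = a * 0.9f + 0.1f;
    }
};

// Stand-in confidence estimate: the maximum activation value.
static float confidence(const std::vector<float>& activations) {
    float best = 0.0f;
    for (float a : activations) if (a > best) best = a;
    return best;
}

// Run layers in order, but stop as soon as confidence reaches the threshold,
// skipping all remaining layers for this input.
static void run_with_early_exit(const std::vector<Layer>& layers,
                                std::vector<float>& activations,
                                float threshold) {
    for (size_t i = 0; i < layers.size(); ++i) {
        layers[i].forward(activations);
        if (confidence(activations) >= threshold) {
            std::printf("Early exit after layer %zu of %zu\n",
                        i + 1, layers.size());
            return;  // the remaining layers are skipped entirely
        }
    }
    std::printf("No early exit: all %zu layers run\n", layers.size());
}

int main() {
    std::vector<Layer> layers(12);             // e.g. a 12-layer model
    std::vector<float> activations(8, 0.5f);   // dummy input activations
    run_with_early_exit(layers, activations, 0.8f);
    return 0;
}
```

The same loop structure covers static layer skipping or layer pruning, where a fixed subset of layers is bypassed regardless of the input.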
Transformer-specific types of structural "skipping" are also possible, such as skipping attention heads, FFN sub-layers, or input tokens (see the pruning topics under More AI Research below).
Calculation skipping is possible at various levels, in both structured and unstructured forms:
- Zero skipping (see the code sketch after this list)
- Negative skipping
- Conditional computation
- Caching and calculation re-use
- Vector dot product computation reuse
- Zero padding calculation skipping
- Loop perforation (probabilistic skipping of loop iterations)
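Two of these are easy to see in a vector dot product kernel. Below is a minimal C++ sketch, assuming a plain float loop: zero skipping avoids the multiply-add whenever the weight is zero (which pays off for sparse or pruned weights), and loop perforation probabilistically skips a fraction of the iterations, trading a small amount of accuracy for speed.

```cpp
// Minimal sketches of zero skipping and loop perforation in a dot product.
#include <cstdio>
#include <cstdlib>
#include <vector>

// Zero skipping: skip the multiply-add whenever the weight is zero.
float dot_zero_skip(const std::vector<float>& weights,
                    const std::vector<float>& inputs) {
    float sum = 0.0f;
    for (size_t i = 0; i < weights.size(); ++i) {
        if (weights[i] == 0.0f) continue;  // skipped calculation
        sum += weights[i] * inputs[i];
    }
    return sum;
}

// Loop perforation: probabilistically skip a fraction of the iterations,
// accepting an approximate result in exchange for fewer operations.
float dot_perforated(const std::vector<float>& weights,
                     const std::vector<float>& inputs,
                     float skip_probability) {
    float sum = 0.0f;
    for (size_t i = 0; i < weights.size(); ++i) {
        float r = static_cast<float>(std::rand()) / RAND_MAX;
        if (r < skip_probability) continue;  // skipped iteration
        sum += weights[i] * inputs[i];
    }
    // Rescale to compensate for the skipped terms (on average).
    return sum / (1.0f - skip_probability);
}

int main() {
    std::vector<float> w = {0.0f, 1.5f, 0.0f, -2.0f, 0.0f, 0.5f};
    std::vector<float> x = {3.0f, 2.0f, 7.0f,  1.0f, 4.0f, 2.0f};
    std::printf("zero-skip dot = %g\n", dot_zero_skip(w, x));        // 2
    std::printf("perforated dot ~ %g\n", dot_perforated(w, x, 0.25f));
    return 0;
}
```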
At the top level, the entire inference phase of a big model can be skipped, in favor of a smaller model or a stored result:
- Inference cache (storing the entire result of a prior query for re-use; see the code sketch after this list)
- Big-little architecture (routing "easy" queries to the "small" model)
- Speculative decoding (a small model "speculates" about the output)
- Ensemble inference (e.g. swarms of small models)
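The inference cache is the simplest of these. The C++ sketch below shows an exact-match cache: a repeated query returns the stored result and the big model is never invoked. The run_big_model() function is a hypothetical stand-in for full model inference, and real inference caches typically use semantic (vector) matching rather than exact string keys.

```cpp
// Minimal sketch of an inference cache that skips the big model entirely
// on a repeated query. run_big_model() is a hypothetical stand-in.
#include <cstdio>
#include <string>
#include <unordered_map>

// Stand-in for expensive full-model inference.
static std::string run_big_model(const std::string& query) {
    return "answer to: " + query;
}

class InferenceCache {
public:
    std::string infer(const std::string& query) {
        auto it = cache_.find(query);
        if (it != cache_.end()) {
            std::printf("cache hit: big model skipped\n");
            return it->second;            // whole inference phase skipped
        }
        std::string result = run_big_model(query);
        cache_.emplace(query, result);    // store for later re-use
        return result;
    }

private:
    std::unordered_map<std::string, std::string> cache_;
};

int main() {
    InferenceCache cache;
    cache.infer("What is zero skipping?");  // miss: runs the big model
    cache.infer("What is zero skipping?");  // hit: inference skipped
    return 0;
}
```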
General Papers on Skipping Optimizations
Papers with skipping algorithm theory include:
- Sparsh Mittal. 2016. A survey of techniques for approximate computing. ACM Computing Surveys (CSUR) 48, 4 (2016), 1–33. https://dl.acm.org/doi/10.1145/2893356
- Shujian Zhang, Korawat Tanwisuth, Chengyue Gong, Pengcheng He, Mingyuan Zhou, 7 May 2024, Switchable Decision: Dynamic Neural Generation Networks, https://arxiv.org/abs/2405.04513 (Switching and skipping of sub-layer components such as attention heads, FFNs, or input tokens, with decisions based on allocating computation resources.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying Wei, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor) Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Ajay Jaiswal, Bodun Hu, Lu Yin, Yeonju Ro, Shiwei Liu, Tianlong Chen, Aditya Akella, 5 Apr 2024, FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping, https://arxiv.org/abs/2404.03865
More AI Research
Read more about:
- Layer skipping
- Zero skipping
- Layer pruning
- Token pruning
- Attention head pruning
- Embeddings pruning
- FFN pruning
- Shallow decoder architecture
- Inference Optimizations
- Loop Optimizations
- Code Optimizations