Aussie AI
Pyramid Inference
-
Last Updated 19 January, 2025
-
by David Spuler, Ph.D.
What is Pyramid Inference?
Pyramid inference is an LLM efficiency optimization based on adaptive inference: processing is dynamically reduced along two dimensions, narrowing to a "peak" of small, focused computation at the end. One way to implement pyramid inference is via dual pruning optimizations, with adaptive pruning on two dimensions (e.g., combining layer-based depth pruning with attention-head width pruning). Computation begins, as usual for LLM inference, with a broad set of data along the three tensor dimensions (length, depth, and width), but is reduced along two of them as inference progresses (e.g., through the layers), so that the final steps of inference consider only a small subset of the original area. This yields a pyramid-shaped computation structure: a broad base at the start and a narrow, sharp peak at the end of inference.
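The shrinking-on-two-dimensions idea can be sketched in a few lines of code. The toy function below is a minimal illustration, not any published pyramid-inference implementation: all names, the linear shrinkage schedule, and the norm-based token scoring are assumptions chosen for clarity. At each "layer" it prunes both the sequence length (length pruning) and the number of active attention heads (width pruning), so later layers operate on a progressively smaller slice of the original tensor.

```python
import numpy as np

def pyramid_inference(x, num_layers=8, num_heads=8, min_tokens=4, min_heads=1):
    """Toy sketch of dual-pruning pyramid inference (illustrative only).

    x: activations of shape (seq_len, hidden_dim).
    At each layer, a linearly shrinking budget prunes both tokens
    (length dimension) and attention heads (width dimension).
    """
    for layer in range(num_layers):
        frac = 1.0 - layer / num_layers          # shrinking compute budget
        keep_tokens = max(min_tokens, int(x.shape[0] * frac))
        keep_heads = max(min_heads, int(num_heads * frac))

        # Length pruning: keep the tokens with the largest L2 norm
        # (a stand-in for a learned token-importance score).
        scores = np.linalg.norm(x, axis=1)
        idx = np.sort(np.argsort(scores)[-keep_tokens:])  # preserve order
        x = x[idx]

        # Width pruning: zero the slices belonging to pruned heads
        # (a stand-in for skipping those heads' computation entirely).
        head_dim = x.shape[1] // num_heads
        x[:, keep_heads * head_dim:] = 0.0
    return x

# Example: 32 tokens shrink toward min_tokens; head slices are zeroed out.
out = pyramid_inference(np.random.rand(32, 64))
```

A real implementation would instead drop the pruned heads' weight matrices from the matrix multiplications (rather than zeroing activations), which is where the actual speedup comes from.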
Research on Pyramid Inference
Research papers on pyramid LLM inference optimizations:
- K. He, X. Zhang, S. Ren, J. Sun, 2015, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9):1904-1916, https://dx.doi.org/10.1109/TPAMI.2015.2389824
- Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin, 22 Oct 2024, PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction, https://arxiv.org/abs/2410.17247
- Yipeng Zhang, Yifan Liu, Zonghao Guo, Yidan Zhang, Xuesong Yang, Chi Chen, Jun Song, Bo Zheng, Yuan Yao, Zhiyuan Liu, Tat-Seng Chua, Maosong Sun, 18 Dec 2024, LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer, https://arxiv.org/abs/2412.13871
- Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi, 30 Oct 2021, Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning, https://arxiv.org/abs/2111.00230
- Zhaokai Wang, Xizhou Zhu, Xue Yang, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai, 14 Jan 2025, Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding, https://arxiv.org/abs/2501.07783
More Research on Pruning Types
- Depth pruning (overview)
— Static layer pruning
— Layer pruning
— Early exit
— Dynamic layer pruning
— Layer skipping
— Layer approximation
— Shallow decoder architecture
— Layer reordering
— Layer importance
- Width pruning (overview)
— Attention head pruning
— Slimmable networks (width pruning)
— FFN pruning
— Channel pruning
— Filter pruning
- Length pruning (longitudinal/input/end-to-end)
— Token pruning (input pruning)
— Dynamic token pruning
— Prompt compression
— Context compression
— Token merging
— Token skipping
— Token dropping
— Zero padding removal
- Embedding-dimension pruning
— Embedding pruning
— Embedding matrix compression (embedding pruning)
— Embedding low-rank matrix factorization
— Unembedding matrix (output embeddings)
- Multi-dimensional pruning
— Dual pruning
— Triple pruning
— Quadruple pruning
— 3D CNN model pruning
More AI Research
Read more about:
- Inference Optimizations
- Dual Pruning Optimizations
- Loop Optimizations
- Code Optimizations