Aussie AI
Inference Budget
-
Last Updated 7 March, 2025
-
by David Spuler, Ph.D.
What is an Inference Budget?
An inference budget is a form of adaptive inference, an LLM inference optimization whereby the processing of a query is given a "budget" that specifies how much processing may be used to compute the answer. The budget might be a "compute budget" in terms of allowed GPU resources, or a "time budget" that aims to meet latency constraints for Service Level Objectives (SLOs).
Various methods of budget-based inference have been tried. Typically, they optimize a single inference step using adaptive inference techniques. To do so, one of the many possible LLM adaptive inference optimizations is triggered when the budget is exhausted, such as:
- Early exiting of layers
- Depth pruning
- Width pruning (e.g., attention head pruning)
- Length pruning (e.g., dynamic token pruning)
In some cases, the LLM can immediately return whatever interim results it has (e.g., early exit); in other cases, it must continue processing, but in a reduced-resource mode for the remainder of the completion.
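The early-exit case above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (not any particular library's API): it assumes a list of layer functions and a time budget, and simply stops running layers once the deadline passes, decoding from whatever interim representation it has. Real early-exit models attach trained exit heads so that interim layers produce usable logits.

```python
import time

def forward_with_time_budget(layers, hidden, classifier, budget_seconds):
    """Run model layers until the time budget expires, then early-exit
    with whatever interim representation has been computed so far.
    Hypothetical sketch: `layers` is a list of callables, `classifier`
    maps the final hidden state to an output."""
    deadline = time.monotonic() + budget_seconds
    for layer in layers:
        if time.monotonic() >= deadline:
            break  # budget exhausted: skip remaining layers (early exit)
        hidden = layer(hidden)
    return classifier(hidden)  # decode from the last computed layer

# Toy usage: four "layers" that each add 1 to a scalar state.
layers = [lambda h: h + 1 for _ in range(4)]
full = forward_with_time_budget(layers, 0, lambda h: h, budget_seconds=10.0)
cut = forward_with_time_budget(layers, 0, lambda h: h, budget_seconds=0.0)
```

With a generous budget all four layers run (`full` is 4); with a zero budget the loop exits immediately and the classifier sees the untouched input (`cut` is 0).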
More recently, the idea of budgets has been applied to multi-step inference reasoning methods such as Chain-of-Thought. The optimizations to cut short multi-step reasoning in CoT include options like:
- CoT token reduction
- CoT path reduction
- CoT tree-based path pruning
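CoT token reduction, the simplest of these, can be sketched as a hard cap on the number of reasoning tokens generated. This is a hypothetical illustration, not the method of any specific paper: `generate_step` is an assumed callable that returns the next token (or `None` when the model finishes on its own), and the loop truncates the reasoning chain once the token budget is consumed.

```python
def generate_with_token_budget(generate_step, prompt, max_reasoning_tokens):
    """Cap the number of reasoning tokens in a CoT-style generation.
    Hypothetical sketch: `generate_step(prompt, tokens)` is assumed to
    return the next token, or None at end-of-sequence."""
    tokens = []
    while len(tokens) < max_reasoning_tokens:
        tok = generate_step(prompt, tokens)
        if tok is None:
            break  # model finished its reasoning within the budget
        tokens.append(tok)
    return tokens  # reasoning chain, truncated at the token budget

# Toy usage: a "model" that would reason forever gets cut off at 5 tokens.
endless = lambda prompt, toks: "step"
chain = generate_with_token_budget(endless, "2+2?", max_reasoning_tokens=5)
```

Path-based and tree-based pruning generalize this idea from a single chain to multiple candidate reasoning paths, dropping low-scoring paths to stay within the overall budget.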
Research on Inference Budgets
Research papers that use an inference budget, usually in combination with some other LLM inference optimization:
- Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou, 16 Aug 2024 (v3), Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference, https://arxiv.org/abs/2407.11550
- Youva Addad, Alexis Lechervy, Frédéric Jurie, 2024, Balancing Accuracy and Efficiency in Budget-Aware Early-Exiting Neural Networks, https://lechervy.users.greyc.fr/publi/C/publi_pdf/icpr24.pdf
- Michael Tschannen, Aran Khanna, Anima Anandkumar, 2018, StrassenNets: Deep Learning with a Multiplication Budget, International Conference on Machine Learning, pp. 4985-4994 (PMLR, 2018), https://arxiv.org/abs/1712.03942
- Shubham Gandhi, Manasi Patwardhan, Lovekesh Vig, Gautam Shroff, 12 Nov 2024, BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks, https://arxiv.org/abs/2411.07464
- Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang, 12 Dec 2024, ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty, https://arxiv.org/abs/2412.09036 (KV cache compression on a layerwise budget, similar to token-based eviction, giving a kind of "dual-dimension" KV cache compression on depth and length.)
- Ç. Yeşil, B. T. Ay, F. A. Ak, Ö. B. Mercan and O. Nefesoğlu, "Adaptive Batch Budget for LLM Inference," 2024 9th International Conference on Computer Science and Engineering (UBMK), Antalya, Turkiye, 2024, pp. 219-223, doi: 10.1109/UBMK63289.2024.10773573. https://ieeexplore.ieee.org/abstract/document/10773573
- Tingxu Han, Chunrong Fang, Shiyu Zhao, Shiqing Ma, Zhenyu Chen, Zhenting Wang, 30 Dec 2024 (v2), Token-Budget-Aware LLM Reasoning, https://arxiv.org/abs/2412.18547 https://github.com/GeniusHTX/TALE
- Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi, 14 Aug 2024 (v2), CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference, https://arxiv.org/abs/2310.10845
- Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343 (Combine RAG and multi-step inference, controlling token cost via budgeting allocations.)
- Xingyang He, Jie Liu, Shaowei Chen, 25 Jan 2025, Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads, https://arxiv.org/abs/2501.15113 (Dynamic KV cache compression based on budgets.)
- Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto, 3 Feb 2025 (v2), s1: Simple test-time scaling, https://arxiv.org/abs/2501.19393 https://github.com/simplescaling/s1 (Method of "budget forcing" that allows either shortening or lengthening multi-step reasoning sequences.)
- Zijian He, Reyna Abhyankar, Vikranth Srivatsa, Yiying Zhang, 12 Feb 2025, Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning, https://arxiv.org/abs/2502.08056
- Daman Arora, Andrea Zanette, 11 Feb 2025 (v2), Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 https://github.com/Zanette-Labs/efficient-reasoning
- Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, 29 Jan 2025 (v2), O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, https://arxiv.org/abs/2501.12570 https://github.com/StarDewXXX/O1-Pruner
- Ahmed Burak Gulhan, Krishna Teja Chitty-Venkata, Murali Emani, Mahmut Kandemir, Venkatram Vishwanath, 18 Feb 2025, BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference, https://arxiv.org/abs/2502.13176
- Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li, 24 Feb 2025, DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance, https://arxiv.org/abs/2502.16886
- Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He, 25 Feb 2025, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600 (Concise CoT method using a per-step inference budget.)
- Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter, 31 Jul 2024 (v3), How to Rent GPUs on a Budget, https://arxiv.org/abs/2406.15560
- Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou, 28 Feb 2025, FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference, https://arxiv.org/abs/2502.20766 (Prefill optimization that dynamically applies different attention patterns, including sparse attention, for KV computations, based on the input query.)
More AI Research
Read more about: