Aussie AI

Inference Budget

  • Last Updated 7 March, 2025
  • by David Spuler, Ph.D.

What is an Inference Budget?

An inference budget is a type of adaptive inference used as an LLM inference optimization, whereby the processing of a query is given a "budget" that limits how much processing can be spent computing the answer. The budget might be a "compute budget" expressed in allowed GPU work, or a "time budget" aimed at meeting latency constraints for Service Level Objectives (SLOs).
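As a concrete illustration of a time budget, here is a minimal Python sketch of a decoding loop that stops generating once a latency deadline passes. The generate_next_token callable is a hypothetical stand-in for one decoder step of a real LLM; it is an assumption for illustration, not a real API.

    import time

    def generate_with_time_budget(prompt, generate_next_token, budget_seconds):
        """Decode tokens until end-of-sequence or the time budget expires."""
        deadline = time.monotonic() + budget_seconds
        tokens = []
        while time.monotonic() < deadline:
            # Hypothetical single decoder step: returns the next token,
            # or None when the model emits end-of-sequence.
            token = generate_next_token(prompt, tokens)
            if token is None:
                break  # finished within the budget
            tokens.append(token)
        return tokens  # may be truncated if the budget ran out first

The same loop structure works for a compute budget by counting decoder steps (or FLOPs) instead of elapsed wall-clock time.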

Various methods of budget-based inference have been tried. Typically, they optimize a single inference step: one of the many possible LLM adaptive inference optimizations is triggered at the budget point, such as early exit, layer skipping, or dynamic pruning.

In some cases, the LLM can immediately return whatever interim results it has (e.g., early exit), but in others it must continue processing, albeit with reduced resources for the remainder of the completion. The early-exit case is sketched below.
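Here is a minimal sketch of budget-triggered early exit, assuming each transformer layer is a callable mapping a hidden state to a hidden state, and lm_head produces logits from an interim hidden state (all names are hypothetical illustrations, not a real framework API).

    def forward_with_layer_budget(x, layers, layer_budget, lm_head):
        """Run layers until a per-query layer budget is exhausted, then exit."""
        for i, layer in enumerate(layers):
            if i >= layer_budget:
                break  # budget point reached: early exit from this depth
            x = layer(x)
        # Decode interim results from whatever depth was actually reached.
        return lm_head(x)

A real early-exit model additionally needs per-layer classifiers or a shared output head trained to decode from intermediate layers; this sketch only shows where the budget check sits.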

More recently, the idea of budgets has been applied to multi-step inference reasoning methods such as Chain-of-Thought (CoT). Optimizations that cut short multi-step CoT reasoning include token-budget-aware prompting, "budget forcing" to shorten or lengthen reasoning sequences, and concise per-step drafting methods such as Chain of Draft (see the research papers below).
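In the spirit of token-budget-aware reasoning methods such as TALE (cited below), here is a hedged Python sketch that states the budget in the prompt as a soft constraint and also enforces it as a hard cap on generated tokens. The llm callable is a hypothetical completion API, assumed for illustration only.

    def budgeted_cot(llm, question, token_budget):
        """Chain-of-Thought query constrained by a token budget."""
        prompt = (
            f"{question}\n"
            f"Think step by step, but use at most {token_budget} tokens "
            f"of reasoning, then state the final answer."
        )
        # Hard cap as a backstop in case the model ignores the instruction.
        return llm(prompt, max_tokens=token_budget)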

Research on Inference Budgets

Research papers that use an inference budget, usually in combination with some other LLM inference optimization:

  • Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou, 16 Aug 2024 (v3), Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference, https://arxiv.org/abs/2407.11550
  • Youva Addad, Alexis Lechervy, Frédéric Jurie, 2024, Balancing Accuracy and Efficiency in Budget-Aware Early-Exiting Neural Networks, https://lechervy.users.greyc.fr/publi/C/publi_pdf/icpr24.pdf
  • Michael Tschannen, Aran Khanna, Anima Anandkumar, 2018, StrassenNets: Deep Learning with a Multiplication Budget, International Conference on Machine Learning, 4985-4994 (PMLR), https://arxiv.org/abs/1712.03942
  • Shubham Gandhi, Manasi Patwardhan, Lovekesh Vig, Gautam Shroff, 12 Nov 2024, BudgetMLAgent: A Cost-Effective LLM Multi-Agent system for Automating Machine Learning Tasks, https://arxiv.org/abs/2411.07464
  • Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang, 12 Dec 2024, ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based on Layer Uncertainty, https://arxiv.org/abs/2412.09036 (KV cache compression on a layerwise budget, similar to token-based eviction, giving a kind of "dual-dimension" KV cache compression on depth and length.)
  • Ç. Yeşil, B. T. Ay, F. A. Ak, Ö. B. Mercan, O. Nefesoğlu, 2024, Adaptive Batch Budget for LLM Inference, 9th International Conference on Computer Science and Engineering (UBMK), Antalya, Türkiye, pp. 219-223, doi: 10.1109/UBMK63289.2024.10773573, https://ieeexplore.ieee.org/abstract/document/10773573
  • Tingxu Han, Chunrong Fang, Shiyu Zhao, Shiqing Ma, Zhenyu Chen, Zhenting Wang, 30 Dec 2024 (v2), Token-Budget-Aware LLM Reasoning, https://arxiv.org/abs/2412.18547 https://github.com/GeniusHTX/TALE
  • Amirkeivan Mohtashami, Matteo Pagliardini, Martin Jaggi, 14 Aug 2024 (v2), CoTFormer: A Chain-of-Thought Driven Architecture with Budget-Adaptive Computation Cost at Inference, https://arxiv.org/abs/2310.10845
  • Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky, 6 Oct 2024, Inference Scaling for Long-Context Retrieval Augmented Generation, https://arxiv.org/abs/2410.04343 (Combines RAG and multi-step inference, controlling token cost via budgeting allocations.)
  • Xingyang He, Jie Liu, Shaowei Chen, 25 Jan 2025, Task-KV: Task-aware KV Cache Optimization via Semantic Differentiation of Attention Heads, https://arxiv.org/abs/2501.15113 (Dynamic KV cache compression based on budgets.)
  • Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto, 3 Feb 2025 (v2), s1: Simple test-time scaling, https://arxiv.org/abs/2501.19393 https://github.com/simplescaling/s1 (Method of "budget forcing" that allows either shortening or lengthening multi-step reasoning sequences.)
  • Zijian He, Reyna Abhyankar, Vikranth Srivatsa, Yiying Zhang, 12 Feb 2025, Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning, https://arxiv.org/abs/2502.08056
  • Daman Arora, Andrea Zanette, 11 Feb 2025 (v2), Training Language Models to Reason Efficiently, https://arxiv.org/abs/2502.04463 https://github.com/Zanette-Labs/efficient-reasoning
  • Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, 29 Jan 2025 (v2), O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning, https://arxiv.org/abs/2501.12570 https://github.com/StarDewXXX/O1-Pruner
  • Ahmed Burak Gulhan, Krishna Teja Chitty-Venkata, Murali Emani, Mahmut Kandemir, Venkatram Vishwanath, 18 Feb 2025, BaKlaVa -- Budgeted Allocation of KV cache for Long-context Inference, https://arxiv.org/abs/2502.13176
  • Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li, 24 Feb 2025, DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance, https://arxiv.org/abs/2502.16886
  • Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He, 25 Feb 2025, Chain of Draft: Thinking Faster by Writing Less, https://arxiv.org/abs/2502.18600 (Concise CoT method using a per-step inference budget.)
  • Zhouzi Li, Benjamin Berg, Arpan Mukhopadhyay, Mor Harchol-Balter, 31 Jul 2024 (v3), How to Rent GPUs on a Budget, https://arxiv.org/abs/2406.15560
  • Xunhao Lai, Jianqiao Lu, Yao Luo, Yiyuan Ma, Xun Zhou, 28 Feb 2025, FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference, https://arxiv.org/abs/2502.20766 (Prefill optimization that dynamically applies different attention patterns, including sparse attention, for KV computations, based on the input query.)
