Aussie AI
In-Flight Batching
-
Last Updated 13 December, 2024
-
by David Spuler, Ph.D.
In-Flight Batching (IFB) is an LLM inference optimization that splits queries into "batches" of tokens to increase the throughput of processing. These batches are groups of tokens with a fixed size chosen for strong throughput characteristics on the GPU. In-flight batching is also known in research papers as "continuous batching," although NVIDIA, a leader in this area, prefers the IFB terminology.
Note that "batching" is not referring to the "batch API" capabilities, such as the OpenAI Batch API. That is a different type of batching of user queries at a higher-level of processing, rather than their underlying low-level tokens as processed by the GPU.
The idea behind in-flight batching is to use fixed-size chunks as a "batch," chosen to be an optimal size for parallel GPU processing. It is mainly effective in the "prefill" or "prompt processing" phase, which refers to the initial processing of the "context" tokens by the LLM. Hence, this idea is similar to "chunked prefill" and other methods of prefill optimization.
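As a concrete illustration of the chunking step, here is a minimal Python sketch that splits a long prompt's token IDs into fixed-size batches. The chunk size of 512 and the function name are illustrative assumptions, not the API of any particular inference framework.

```python
# Minimal sketch: split a long prompt into fixed-size prefill chunks.
# The chunk size (512) is an assumed value tuned for GPU occupancy.

CHUNK_SIZE = 512

def split_prefill(prompt_tokens: list[int], chunk_size: int = CHUNK_SIZE) -> list[list[int]]:
    """Split prompt token IDs into fixed-size chunks for batched prefill."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]

# Example: a 1,300-token prompt becomes two full chunks plus a 276-token leftover.
chunks = split_prefill(list(range(1300)))
print([len(c) for c in chunks])  # [512, 512, 276]
```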
The effect of IFB is that large prompt inputs are split up into smaller chunks. Short prompts, or uneven leftover chunks, can also be combined into one "batch" for processing (see the scheduling sketch after this list). Note that even short user queries of a few words often have a much longer hidden context. There is typically a prepended set of extra tokens, including:
- System instructions (prepended to all queries by the LLM vendors).
- Global instructions (per-user account).
- RAG data chunks (note a different meaning of "chunks" here).
- Conversational history (e.g., chatbots or Q&A sessions).
- Document context (e.g., for document summarization).
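To illustrate how short prompts and leftover chunks from different in-flight requests can be packed together, here is a hedged Python sketch of a simple batching loop. The queue structure, the 512-token budget, and the packing function are assumptions for illustration only, not the scheduler of TensorRT-LLM or any other engine.

```python
from collections import deque

BATCH_TOKEN_BUDGET = 512  # assumed fixed batch size

def pack_batch(pending: deque) -> list[tuple[int, list[int]]]:
    """Pack chunks from multiple in-flight requests into one fixed-size batch.

    `pending` holds (request_id, remaining_tokens) pairs; short prompts and
    leftover chunks are combined until the token budget is reached.
    """
    batch, used = [], 0
    while pending and used < BATCH_TOKEN_BUDGET:
        req_id, tokens = pending[0]
        take = min(len(tokens), BATCH_TOKEN_BUDGET - used)
        batch.append((req_id, tokens[:take]))
        used += take
        if take == len(tokens):
            pending.popleft()                     # request's prefill fully scheduled
        else:
            pending[0] = (req_id, tokens[take:])  # leftover waits for the next batch
    return batch

# Example: three requests with prompts of 300, 100, and 400 tokens.
queue = deque([(1, [0] * 300), (2, [0] * 100), (3, [0] * 400)])
first = pack_batch(queue)
print([(rid, len(toks)) for rid, toks in first])  # [(1, 300), (2, 100), (3, 112)]
```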
Processing of these prepended context tokens, which are mostly hidden from users yet present in every prompt, can also be optimized using "prefix caching," which is even more efficient than in-flight batching.
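The following is a hedged sketch of how prefix caching can complement in-flight batching: the shared prepended tokens are hashed once and their computed attention state reused across requests. The cache key, hashing scheme, and KVCache placeholder are illustrative assumptions, not a real framework's implementation.

```python
import hashlib

class KVCache:
    """Illustrative placeholder for cached key/value attention state."""
    def __init__(self, tokens: list[int]):
        self.tokens = tokens  # in a real engine this would hold GPU tensors

_prefix_cache: dict[str, KVCache] = {}

def prefix_key(prefix_tokens: list[int]) -> str:
    """Hash the shared prepended tokens (system prompt, RAG chunks, history)."""
    return hashlib.sha256(str(prefix_tokens).encode("utf-8")).hexdigest()

def get_or_build_prefix(prefix_tokens: list[int]) -> KVCache:
    """Reuse the cached state for a previously seen prefix instead of re-running prefill."""
    key = prefix_key(prefix_tokens)
    if key not in _prefix_cache:
        _prefix_cache[key] = KVCache(prefix_tokens)  # prefill would run only once here
    return _prefix_cache[key]
```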
Research on In-Flight Batching
Articles and research papers on in-flight batching or continuous batching:
- Shixiaowei02, Oct 2024, TensorRT-LLM 0.13.0 Release, https://github.com/NVIDIA/TensorRT-LLM/releases/tag/v0.13.0
- Anjali Shah, Kshitiz Gupta, Jiahong Liu and Haohang Huang, Dec 11, 2024, NVIDIA TensorRT-LLM Now Accelerates Encoder-Decoder Models with In-Flight Batching, https://developer.nvidia.com/blog/nvidia-tensorrt-llm-now-accelerates-encoder-decoder-models-with-in-flight-batching/
- Abhimanyu Bambhaniya, Ritik Raj, Geonhwa Jeong, Souvik Kundu, Sudarshan Srinivasan, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna, 3 Jun 2024, Demystifying Platform Requirements for Diverse LLM Inference Use Cases, https://arxiv.org/abs/2406.01698 Code: https://github.com/abhibambhaniya/GenZ-LLM-Analyzer (Analysis of cost of serving LLMs, including separate profiles of prefill versus decoding phases, and the cost of extra prompt processing in RAG architectures with prepended information.)
- Theia Vogel, December 18, 2023, How to make LLMs go fast, https://vgel.me/posts/faster-inference/
- Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, Yiying Zhang, 2 Feb 2024, APIServe: Efficient API Support for Large-Language Model Inferencing, https://arxiv.org/abs/2402.01869
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang, 20 Jun 2024, Optimizing Speculative Decoding for Serving Large Language Models Using Goodput, https://arxiv.org/abs/2406.14066 (Estimation of the draft length for increased acceptance to improve overall performance.)
- Yuqing Yang, Yuedong Xu, Lei Jiao, 7 Jul 2024, A Queueing Theoretic Perspective on Low-Latency LLM Inference with Variable Token Length, https://arxiv.org/abs/2407.05347
- Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis, 12 Jul 2024, Inference Optimization of Foundation Models on AI Accelerators, KDD’24, August 25–29, 2024, Barcelona, Spain, https://arxiv.org/abs/2407.09111
- Sergei Savvov, Jun 27, 2023, 7 Ways To Speed Up Inference of Your Hosted LLMs, https://betterprogramming.pub/speed-up-llm-inference-83653aa24c47
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, 6 Jun 2024 (v2), SGLang: Efficient Execution of Structured Language Model Programs, https://arxiv.org/abs/2312.07104 https://github.com/sgl-project/sglang
- Sharada Yeluri, Feb 20, 2024, LLM Inference: HW/SW Optimizations, https://community.juniper.net/blogs/sharada-yeluri/2024/02/20/llm-inference-hw-sw-optimizations
- Cade Daniel, Chen Shen, Eric Liang and Richard Liaw, June 22, 2023, How continuous batching enables 23x throughput in LLM inference while reducing p50 latency, https://www.anyscale.com/blog/continuous-batching-llm-inference
- Hugging Face, 2024, Text Generation Inference, https://huggingface.co/docs/text-generation-inference/index
- Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn, 2 Sep 2024, Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching, https://arxiv.org/abs/2409.01141
- OpenVINO-toolkit, Oct 1, 2024, Introducing OpenVINO™ 2024.4, https://medium.com/openvino-toolkit/introducing-openvino-2024-4-28578870b264
- Yuhang Yao, Han Jin, Alay Dilipbhai Shah, Shanshan Han, Zijian Hu, Yide Ran, Dimitris Stripelis, Zhaozhuo Xu, Salman Avestimehr, Chaoyang He, 10 Sep 2024 (v2), ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency, https://arxiv.org/abs/2408.00008
- Yuzhe Yang, Yipeng Du, Ahmad Farhan, Claudio Angione, Yue Zhao, Harry Yang, Fielding Johnston, James Buban, Patrick Colangelo, 28 Oct 2024, Meta-Learning for Speeding Up Large Model Inference in Decentralized Environments, https://arxiv.org/abs/2410.21340 (Choosing between multiple acceleration techniques).
- OpenVINO™ toolkit, Sep 26, 2024, How To Efficiently Serve Today’s Large Language Models, https://medium.com/openvino-toolkit/how-to-efficiently-serve-todays-large-language-models-b3f1e8d33fdf
- NVIDIA, Dec 2024, Multi-Head, Multi-Query, and Group-Query Attention, https://nvidia.github.io/TensorRT-LLM/advanced/gpt-attention.html#kv-cache