Aussie AI

In-Flight Batching

  • Last Updated 13 December, 2024
  • by David Spuler, Ph.D.

In-Flight Batching (IFB) is an LLM inference optimization that splits queries into "batches" of tokens to increase processing throughput. These batches are fixed-size groups of tokens, with the size chosen for strong throughput characteristics. In-flight batching is also known in research papers as "continuous batching," though AI leader NVIDIA prefers the IFB terminology.

Note that "batching" here does not refer to "batch API" capabilities, such as the OpenAI Batch API. That is a different type of batching, grouping user queries at a higher level of processing, rather than batching their underlying low-level tokens as processed by the GPU.

The idea behind in-flight batching is to use fixed-size chunks as a "batch," chosen to be an optimal size for parallel GPU processing. It is mainly effective in the "prefill" or "prompt processing" phase, which refers to the initial processing of the "context" tokens by the LLM. Hence, this idea is similar to "chunked prefill" and other methods of prefill optimization.
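As a rough sketch, chunking a long prompt's token IDs into fixed-size prefill batches might look like the following (the batch size and function name are illustrative, not from any particular framework):

```python
def split_into_batches(tokens, batch_size=512):
    """Split a long prompt's token IDs into fixed-size chunks for prefill.

    The final chunk may be shorter than batch_size; a scheduler could pad
    it or pack it together with tokens from other pending requests.
    """
    return [tokens[i:i + batch_size] for i in range(0, len(tokens), batch_size)]

# Example: a 1200-token prompt becomes two full batches plus a leftover.
prompt = list(range(1200))
batches = split_into_batches(prompt, batch_size=512)
print([len(b) for b in batches])  # [512, 512, 176]
```

Each full batch then runs through the GPU as one well-sized parallel unit, rather than processing the whole 1200-token prompt as a single irregular workload.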

The effect of IFB is that large prompt inputs are split into smaller chunks. Short prompts, or uneven leftover chunks, can also be combined into one "batch" for processing. Note that even short user queries of a few words often have a much longer hidden context. There is typically a prepended set of extra tokens, including:

  • System instructions (prepended to all queries by the LLM vendors).
  • Global instructions (per-user account).
  • RAG data chunks (note a different meaning of "chunks" here).
  • Conversational history (e.g., chatbots or Q&A sessions).
  • Document context (e.g., for document summarization).

Processing of these initial prepended tokens, mostly hidden from users in every prompt, can also be optimized using "prefix caching," which is even more efficient than in-flight batching.
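Prefix caching can be sketched as memoizing the computed state for a shared token prefix. Here a plain dictionary keyed by the prefix tokens stands in; real implementations cache GPU-resident KV tensors, and `compute_state` is a placeholder for the expensive attention computation:

```python
prefix_cache = {}

def compute_state(tokens):
    # Placeholder for the real per-token KV computation.
    return [t * 2 for t in tokens]

def prefill_with_prefix_cache(query_tokens, system_prefix):
    """Reuse cached work for a shared prefix (e.g., system instructions).

    The prefix state is computed once and reused; only the per-query
    suffix is computed on later requests with the same prefix.
    """
    key = tuple(system_prefix)
    if key not in prefix_cache:
        prefix_cache[key] = compute_state(system_prefix)   # done once
    return prefix_cache[key] + compute_state(query_tokens)  # suffix only
```

Two queries sharing the same system instructions would then trigger only one prefix computation, which is why prefix caching can beat batching alone when the prepended context dominates the prompt.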

Research on In-Flight Batching

Articles and research papers on in-flight batching or continuous batching:
