Aussie AI Blog

Hot Inference Optimization Techniques

  • 25th August, 2024
  • by David Spuler, Ph.D.

Inference optimization has become a hot area of research as the industry evolves to the point where inference costs are about 95% of overall compute. This is a change from the early days, when training expense far exceeded inference costs. This trend is driven by:

    (a) more users, which means more queries, which means more inference computations,

    (b) commercial and open source pre-trained models (rather than training your own),

    (c) faster training and fine-tuning methods (e.g. LoRA and multi-LoRA), and

    (d) RAG architectures replacing fine-tuning.

This change has spawned a deluge of research papers on speeding up inference, aiming to offer lower latency to users and reduced costs. Some of the hottest research sub-areas for speeding up inference include:

  • Hardware optimizations. The biggest opportunity for inference speedup is probably in hardware rather than software. There's the upcoming NVIDIA Blackwell architecture, which is apparently delayed as I write this, along with several AI-specific hardware startups such as Groq and Etched receiving large funding rounds. I'm not an expert on the hardware opportunities, so I'll leave it there.
  • KV cache compression. The KV cache was initially a speedup for inference, but it has become a memory hog, especially for long context processing. Hence, there are numerous research papers on making the KV cache data use less memory (see KV cache compression research). In particular, KV cache quantization is becoming standard in industry framework implementations, such as the 4-bit quantized KV cache data used by Character.AI and Apple Intelligence (a simplified quantization sketch appears after this list). There are several fancier types of KV cache compression in the research, such as KV cache layer pruning (depth dimension) and KV cache token pruning (input prompt length dimension). Notably, an implementation of KV cache layer fusion is used by Character.AI's inference backend for companionbots.
  • Context caching. The simplest cache is a text-to-text full "inference cache," and there's also semantic caching based on embedding vector similarity (see the toy semantic cache sketch after this list). However, the idea of saving the KV cache but re-running the decoding phase has various advantages, and it is gaining attention in both research and industry. This is usually termed "context caching" or "prompt caching" in the industry. Google has recently released "context caching" features, Anthropic has added "prompt caching," and this type of caching is also appearing in other inference frameworks, such as vLLM and DeepSeek. Expect many more to follow! See: context caching research.
  • Prefix KV caching. There are many cases where Transformers are re-processing the same prefix of tokens, such as chatbot multi-turn conversational context, global system instructions (prepended), RAG chunks (prepended), and re-used documents. Instead, you can simply load the KV cache data from a prefix KV cache, so the prefill latency is minimal and only the last few non-prefix tokens need processing (see the sketch after this list). Prefix KV caching is also getting implemented in frameworks, including vLLM, DeepSeek, and Character.AI's backend. Interestingly, DeepSeek offers lower pricing for "cached tokens," which reflects the lower cost.
  • Multi-LoRA. The idea of using multiple LoRA adapters to efficiently support multiple fine-tuned models got a massive boost from Apple Intelligence (a minimal multi-adapter sketch appears after this list). There are many research papers now focused on further optimizing the load-time and inference characteristics of multi-LoRA architectures and other types of Parameter-Efficient Fine-Tuning (PEFT).
  • Memory-efficient attention algorithms. The two leading contenders for optimizing attention via its memory access patterns are Flash Attention and Paged Attention, and you can even combine them! There are also their precursors, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA), which are still in use and still being researched (a simplified GQA sketch appears after this list). See memory-efficient attention optimization.
  • Linear attention. Another way to reduce memory cost is to simply access memory less! Algorithms like this include local attention and other types of linear attention (see the sliding-window sketch after this list). As a recent example in industry, Character.AI's inference backend uses a hybrid layerwise attention scheme that alternates between local and global attention across different layers. There's a lot of research happening in optimizing the attention mechanism, because of its annoying quadratic complexity. See research on attention optimization.
  • Zero-multiplication models. MIT researchers released a model architecture that replaces matrix multiplication with element-wise multiplication, known as the "Hadamard product." Basic matrix multiplication is O(n^3) whereas Hadamard computations are O(n^2), so it's potentially an n-fold reduction in multiplications (see the operation-count example after this list), and also a simpler algorithm that's more amenable to follow-up kernel optimizations like kernel fusion. See Hadamard multiplication models. There are actually at least ten other types of zero-multiplication models in the literature (e.g., adder models, shift-add, logarithmic, power-of-two, max-plus, min-max, weightless neural networks, etc.). There's also the well-known method of avoiding multiplication with low-bit quantization. Both binary quantization and ternary quantization can be implemented via addition, albeit with accuracy loss.
  • Speculative decoding. Increased parallelization of the decoding algorithm via speculative decoding is a perennially hot area of research, and it's a speedup that has long been used in production backends (see the draft-and-verify sketch after this list). Various generalizations have been discovered, such as generalized speculative decoding, heuristic speculative decoding, self-speculative decoding, retrieval lookup decoding, prompt lookup decoding, and several other methods.
  • Multi-token generation. Generalizing the decoding algorithm to output multiple tokens in parallel is a clear gain in efficiency, and some research is starting to show promise. These methods require an entirely different type of model architecture for both training and inference. There are also some multi-token drafting methods starting to be used to optimize speculative decoding algorithms. See: parallel decoding research.
  • Prefill optimizations. There has been a burst of new research examining the cost of the prefill operation, which creates the KV cache and is the reason for the initial latency before the first token is output. Hence, prefill time is important for user responsiveness in any interactive use case. In particular, research has found that prefill is compute-bound, whereas the decoding phase is memory-bound. Hence, there is much research on prefill phase optimizations, chunked prefill (see the sketch after this list), and disaggregated scheduling of the prefill and decoding phases on GPU platforms. Note also that the KV caching methods discussed above can optimize prefill by avoiding it completely!
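
To make some of these techniques more concrete, the rest of this post walks through simplified Python sketches. They are illustrative toys under stated assumptions, not production implementations. First, KV cache quantization: the sketch below quantizes a cached key tensor to a signed 4-bit range using a single per-tensor scale (real kernels typically use per-channel or per-group scales, and pack two 4-bit values per byte rather than storing int8).

    import numpy as np

    def quantize_kv_4bit(kv):
        """Symmetric 4-bit quantization with one per-tensor scale (a simplification)."""
        scale = max(float(np.abs(kv).max()) / 7.0, 1e-8)   # signed 4-bit range is [-8, 7]
        q = np.clip(np.round(kv / scale), -8, 7).astype(np.int8)
        return q, scale

    def dequantize_kv_4bit(q, scale):
        return q.astype(np.float32) * scale

    # Example: a (layers, heads, seq_len, head_dim) slice of cached keys.
    keys = np.random.randn(2, 8, 128, 64).astype(np.float32)
    q_keys, scale = quantize_kv_4bit(keys)
    approx = dequantize_kv_4bit(q_keys, scale)
    print("max abs error:", float(np.abs(keys - approx).max()))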
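
Next, caching. Below is a toy semantic cache: it returns a stored answer when a new query's embedding is close enough (by cosine similarity) to a previously seen query. The _embed() function is a hypothetical stand-in for a real embedding model, and the 0.9 threshold is an arbitrary choice.

    import numpy as np

    class SemanticCache:
        """Toy semantic cache keyed on embedding similarity."""

        def __init__(self, threshold=0.9):
            self.threshold = threshold
            self.entries = []   # list of (embedding, answer)

        def _embed(self, text):
            # Hypothetical stand-in: hash characters into a vector and normalize.
            vec = np.zeros(64)
            for i, ch in enumerate(text.lower()):
                vec[(i + ord(ch)) % 64] += 1.0
            return vec / (np.linalg.norm(vec) + 1e-9)

        def lookup(self, query):
            q = self._embed(query)
            for emb, answer in self.entries:
                if float(np.dot(q, emb)) >= self.threshold:
                    return answer   # cache hit: skip inference entirely
            return None

        def store(self, query, answer):
            self.entries.append((self._embed(query), answer))

    cache = SemanticCache()
    cache.store("What is the capital of France?", "Paris")
    print(cache.lookup("what is the capital of france"))   # likely a hit
    print(cache.lookup("Explain KV caching"))               # miss -> None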
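
The essence of prefix KV caching is a lookup keyed on the shared prefix tokens. In this sketch, compute_kv() is a hypothetical placeholder for the real prefill computation; the point is simply that the prefix is prefilled once and then reused across requests.

    import hashlib

    prefix_cache = {}   # maps a hash of the prefix tokens to its precomputed KV data

    def compute_kv(token_ids):
        # Hypothetical stand-in for the real prefill computation over these tokens.
        return {"num_tokens": len(token_ids)}

    def kv_for_prompt(prefix_tokens, suffix_tokens):
        key = hashlib.sha256(str(prefix_tokens).encode()).hexdigest()
        if key not in prefix_cache:
            prefix_cache[key] = compute_kv(prefix_tokens)   # slow path: full prefill
        prefix_kv = prefix_cache[key]                        # fast path: reuse
        suffix_kv = compute_kv(suffix_tokens)                # only the new tokens
        return prefix_kv, suffix_kv

    system_prompt = list(range(100))             # pretend token IDs for a long system prompt
    kv1 = kv_for_prompt(system_prompt, [7, 8, 9])
    kv2 = kv_for_prompt(system_prompt, [1, 2])   # prefix KV reused, no re-prefill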
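
For multi-LoRA, the key property is that many small low-rank adapters can share one set of frozen base weights, so switching fine-tunes is cheap. The adapter names, shapes, and scaling below are illustrative assumptions only.

    import numpy as np

    d_model, rank = 512, 8
    W_base = np.random.randn(d_model, d_model)   # frozen base weight, shared by all adapters

    # Two independent low-rank adapters (names are hypothetical).
    adapters = {
        "summarization": (np.random.randn(d_model, rank), np.random.randn(rank, d_model)),
        "translation":   (np.random.randn(d_model, rank), np.random.randn(rank, d_model)),
    }

    def lora_forward(x, adapter_name, alpha=16.0):
        """y = x @ W_base + (alpha/rank) * x @ A @ B, with only A and B adapter-specific."""
        A, B = adapters[adapter_name]
        return x @ W_base + (alpha / rank) * (x @ A) @ B

    x = np.random.randn(4, d_model)
    y1 = lora_forward(x, "summarization")
    y2 = lora_forward(x, "translation")   # same base weights, different tiny adapter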
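
Grouped-Query Attention shrinks the KV cache by letting several query heads share one K/V head. The sketch below is unbatched, unmasked, and deliberately naive; it only shows the head-sharing index arithmetic.

    import numpy as np

    def grouped_query_attention(q, k, v, num_kv_heads):
        """q: (num_q_heads, seq, d); k, v: (num_kv_heads, seq, d).
        Each group of query heads shares one K/V head, shrinking the KV cache
        by a factor of num_q_heads / num_kv_heads."""
        num_q_heads, seq, d = q.shape
        group = num_q_heads // num_kv_heads
        out = np.empty_like(q)
        for h in range(num_q_heads):
            kv = h // group                        # the shared K/V head for this query head
            scores = q[h] @ k[kv].T / np.sqrt(d)   # (seq, seq)
            scores -= scores.max(axis=-1, keepdims=True)
            probs = np.exp(scores)
            probs /= probs.sum(axis=-1, keepdims=True)
            out[h] = probs @ v[kv]
        return out

    q = np.random.randn(8, 16, 64)   # 8 query heads
    k = np.random.randn(2, 16, 64)   # only 2 K/V heads -> 4x smaller KV cache
    v = np.random.randn(2, 16, 64)
    y = grouped_query_attention(q, k, v, num_kv_heads=2)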
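
Local (sliding-window) attention caps how far back each token can attend, so both compute and memory grow linearly with sequence length rather than quadratically. Again, this is a naive unbatched sketch of the idea, not an optimized kernel.

    import numpy as np

    def local_attention(q, k, v, window):
        """Causal sliding-window attention: token i attends only to the previous
        `window` tokens, so the cost per token is bounded by the window size."""
        seq, d = q.shape
        out = np.zeros_like(q)
        for i in range(seq):
            lo = max(0, i - window + 1)
            scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)   # at most `window` scores
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()
            out[i] = probs @ v[lo:i + 1]
        return out

    seq, d = 1024, 64
    q, k, v = (np.random.randn(seq, d) for _ in range(3))
    y = local_attention(q, k, v, window=128)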
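
On the Hadamard point, the arithmetic gap is easy to see by counting multiplications: a naive n x n matrix multiply needs n^3 of them, while an element-wise (Hadamard) product needs only n^2, an n-fold reduction. This tiny example just counts the operations; it is not the MIT architecture itself.

    import numpy as np

    n = 1024
    A = np.random.randn(n, n)
    B = np.random.randn(n, n)

    C_matmul = A @ B      # naive matmul: n^2 dot products of length n
    C_hadamard = A * B    # Hadamard product: one multiply per element

    matmul_mults = n ** 3
    hadamard_mults = n ** 2
    print(f"matmul multiplications:   {matmul_mults:,}")
    print(f"Hadamard multiplications: {hadamard_mults:,}")
    print(f"reduction factor:         {matmul_mults // hadamard_mults}x (= n)")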
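
Speculative decoding is a draft-and-verify loop: a small draft model proposes several tokens cheaply, and the large target model checks them in one parallel pass, keeping the longest accepted prefix. The sketch below uses toy deterministic "models" and the simple greedy-match acceptance rule, not the full rejection-sampling scheme from the literature.

    import random

    def toy_model(context, disagree_prob=0.0):
        """Toy 'model': a deterministic next token, with optional disagreement."""
        token = sum(context) % 100
        return (token + 1) % 100 if random.random() < disagree_prob else token

    def draft_next(context):
        return toy_model(context)                      # small, cheap draft model

    def target_greedy(context):
        return toy_model(context, disagree_prob=0.2)   # big model, agrees ~80% of the time

    def speculative_step(context, k=4):
        # 1. Draft model proposes k tokens, one cheap step at a time.
        drafted, ctx = [], list(context)
        for _ in range(k):
            t = draft_next(ctx)
            drafted.append(t)
            ctx.append(t)
        # 2. Target model verifies the k positions (in practice, one parallel
        #    forward pass); keep the longest prefix matching its own choices.
        accepted, ctx = [], list(context)
        for t in drafted:
            if target_greedy(ctx) != t:
                break
            accepted.append(t)
            ctx.append(t)
        # 3. Always emit at least one token chosen by the target model.
        accepted.append(target_greedy(ctx))
        return accepted

    print(speculative_step([1, 2, 3], k=4))   # often emits several tokens per step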
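
Finally, chunked prefill: instead of processing a long prompt in one compute-bound burst, the prompt is split into chunks and the KV cache is built incrementally, which lets a scheduler interleave memory-bound decode steps from other requests between chunks. The process_chunk() function here is a hypothetical stand-in for the real per-chunk prefill.

    def chunked_prefill(prompt_tokens, chunk_size, process_chunk):
        """Build the KV cache chunk by chunk (single-request sketch)."""
        kv_cache = []
        for start in range(0, len(prompt_tokens), chunk_size):
            chunk = prompt_tokens[start:start + chunk_size]
            kv_cache.append(process_chunk(chunk, kv_cache))  # attends to earlier chunks
            # <-- a real scheduler could run pending decode steps for other users here
        return kv_cache

    # Hypothetical stand-in for the per-chunk prefill computation.
    def fake_process_chunk(chunk, kv_so_far):
        done = sum(c["tokens"] for c in kv_so_far)
        return {"tokens": len(chunk), "context_len": done + len(chunk)}

    prompt = list(range(2000))                         # a long prompt
    cache = chunked_prefill(prompt, 512, fake_process_chunk)
    print([c["context_len"] for c in cache])           # [512, 1024, 1536, 2000]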

More AI Research Topics

Read more about: