Aussie AI Blog

State-of-the-Art LLM Backends

  • 26th August, 2024
  • by David Spuler, Ph.D.

It's somewhat difficult to determine the state of the art in LLM serving backends, as used by the industry's top players. Much of this information is commercially sensitive and no longer appears in public research papers (maybe it's in patents!). Nevertheless, there are some public papers and articles that do reveal details. Let's look at a few of them.

Character.AI companionbots backend. As detailed in their blog post, Character.AI serves a very high volume of traffic to their models. Their inference optimization techniques include:

  • INT8 quantization of weights and activations
  • KV cache quantization (also INT8)
  • MatMul INT8 kernels
  • INT8 training (QAT)
  • Hybrid attention with interleaved layers of local attention and global attention (with global attention for only approximately 1 in every 6 layers)
  • KV cache compression
  • Multi-Query Attention (MQA)
  • KV cache layer fusion
  • Session KV caching (for chat sessions)
  • Prefix KV caching
  • Sticky sessions (avoids copying session caches)

They cite a 33X reduction in serving cost compared to when they began optimizing, and estimate that using leading commercial model hosting APIs would cost 13.5X more than serving the models themselves.
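
As a rough illustration of the INT8 weight quantization listed above, here is a minimal sketch of symmetric per-channel INT8 weight quantization in Python with NumPy; the function names and scaling scheme are illustrative assumptions, not Character.AI's actual kernels.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a 2D weight matrix."""
    # One scale per output channel (row), chosen so the largest magnitude maps to 127.
    scales = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_int8(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover approximate FP32 weights from INT8 values and per-channel scales."""
    return q.astype(np.float32) * scales

# Example: quantize a small random weight matrix and measure the rounding error.
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
print("max abs error:", np.abs(w - w_hat).max())
```

Activations can be quantized with a similar scale-and-round step at runtime, so that the matrix multiplications themselves run in INT8 kernels.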

Apple Intelligence for On-Device Inference. In announcing their "Apple Intelligence" initiative in June 2024, Apple released certain information about the platform, specifically in relation to on-device execution of LLMs on iPhones and Macs. The exact details are somewhat opaque, but some aspects include:

  • Apple Silicon chips (A-series and M-series) with Neural Engine (NPU) capabilities
  • 3B base LLM (with 16-bit precision)
  • LoRA adapters for fine-tuning (with 16-bit parameters, sized "in the tens of millions")
  • Multi-LoRA inference
  • Grouped Query Attention (GQA)
  • Low-bit quantization for some parameters (mixed 2-bit and 4-bit quantizations)
  • Talaria (Apple's interactive tool for analyzing model latency and power)
  • KV cache quantization (bit precision undisclosed)
  • KV cache optimizations for "KV cache update" (details undisclosed)

With these optimizations, Apple reported that on an iPhone 15 Pro the model achieves a time-to-first-token latency of about 0.6 milliseconds per prompt token, and a decoding rate of 30 tokens per second.
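
To make the LoRA and multi-LoRA bullets above more concrete, here is a minimal PyTorch sketch of a frozen linear layer with multiple swappable low-rank adapters; the class and adapter names are hypothetical, and this is not Apple's implementation.

```python
import torch
import torch.nn as nn

class MultiLoRALinear(nn.Module):
    """A frozen base linear layer plus multiple named low-rank (LoRA) adapters."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)  # base weights stay frozen
        self.adapters = nn.ModuleDict()          # adapter name -> low-rank pair
        self.rank = rank
        self.active = None

    def add_adapter(self, name: str):
        # LoRA update: delta_W = B @ A, with a small rank r.
        a = nn.Linear(self.base.in_features, self.rank, bias=False)
        b = nn.Linear(self.rank, self.base.out_features, bias=False)
        nn.init.zeros_(b.weight)                 # adapter starts as a no-op
        self.adapters[name] = nn.Sequential(a, b)

    def forward(self, x):
        y = self.base(x)
        if self.active is not None:
            y = y + self.adapters[self.active](x)  # add the low-rank correction
        return y

# Example: switch between two task-specific adapters at inference time.
layer = MultiLoRALinear(64, 64, rank=4)
layer.add_adapter("summarize")
layer.add_adapter("proofread")
layer.active = "summarize"
print(layer(torch.randn(2, 64)).shape)
```

Because each adapter is tiny relative to the 3B base model, many adapters can be kept resident and selected per request, which is the essence of multi-LoRA inference.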

Together AI data center networking. In various papers and announcements, Together AI has described details of its backend platform. Some of the software optimizations available include:

  • CUDA backend for NVIDIA GPUs
  • Flash Attention
  • Flash Decoding
  • Medusa decoding (multi-token prediction)
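
As one example of how a serving stack picks up a fused attention kernel, the sketch below calls PyTorch's scaled_dot_product_attention, which can dispatch to a Flash-Attention-style kernel on supported GPUs; the usage here is an illustrative assumption, not Together AI's actual code.

```python
import torch
import torch.nn.functional as F

# Toy dimensions: batch, heads, sequence length, head dimension.
B, H, S, D = 1, 8, 1024, 64
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

q = torch.randn(B, H, S, D, device=device, dtype=dtype)
k = torch.randn(B, H, S, D, device=device, dtype=dtype)
v = torch.randn(B, H, S, D, device=device, dtype=dtype)

# PyTorch selects an available fused backend (e.g., a Flash Attention kernel)
# for causal attention when the hardware and dtype allow it.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (B, H, S, D)
```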

Together AI has also described its GPU cluster management, including the networking aspects and validation steps. This involves techniques and components such as:

  • H100 GPUs
  • InfiniBand networking
  • NVLink and NVSwitch
  • NCCL
  • HPC-X (NVIDIA's HPC software toolkit)
  • SLURM (workload management)
  • Remote Direct Memory Access (RDMA) (GPUDirect RDMA)
  • Telegraf (open source monitoring)
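
To show how NCCL is exercised from the software side, here is a minimal torch.distributed sketch of a multi-GPU all-reduce; the launch command and file name in the comments are assumptions, and this is not Together AI's tooling.

```python
import torch
import torch.distributed as dist

def main():
    # Expects one process per GPU, e.g. launched as:
    #   torchrun --nproc_per_node=8 nccl_check.py   (hypothetical file name)
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes a tensor; all-reduce sums them across GPUs over NCCL.
    x = torch.ones(1024 * 1024, device="cuda") * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("all-reduce result (first element):", x[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```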

Important steps in the process include:

  • GPU validation
  • Network validation
  • Storage validation
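
As a minimal sketch of what the GPU validation step might involve (the specific checks here are assumptions, not Together AI's actual procedure), the Python below runs a matmul burn-in and reports memory and timing for each visible GPU.

```python
import torch

def validate_gpus(size: int = 4096, iters: int = 20) -> None:
    """Run a simple matmul burn-in and report memory/timing for every visible GPU."""
    assert torch.cuda.is_available(), "no CUDA devices found"
    for i in range(torch.cuda.device_count()):
        torch.cuda.set_device(i)
        props = torch.cuda.get_device_properties(i)
        a = torch.randn(size, size, device="cuda", dtype=torch.float16)
        b = torch.randn(size, size, device="cuda", dtype=torch.float16)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            _ = a @ b                    # repeated matmuls as a basic burn-in
        end.record()
        torch.cuda.synchronize()
        ms = start.elapsed_time(end) / iters
        print(f"GPU {i} ({props.name}): {props.total_memory / 2**30:.1f} GiB, "
              f"{ms:.2f} ms per {size}x{size} fp16 matmul")

validate_gpus()
```

Real validation suites go further, for example measuring inter-node bandwidth with NCCL benchmark tools and checking storage throughput, before a cluster is handed over to production workloads.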
