Aussie AI Blog
State-of-the-Art LLM Backends
26th August, 2024
by David Spuler, Ph.D.
It's difficult to determine the state of the art in LLM serving backends as used by the industry's top players. Much of this information is commercially sensitive and rarely appears in public research papers (maybe it's in patents!). Nevertheless, some public papers and blog articles describe aspects of these production systems. Let's look at a few of them.
Character.AI companion chatbot backend. As detailed in their blog post, Character.AI serves a very high volume of traffic to its models. Their inference optimization techniques include:
- INT8 quantization of weights and activations (see the quantization sketch after this list)
- KV cache quantization (also INT8)
- MatMul INT8 kernels
- INT8 training (QAT)
- Hybrid attention: interleaved layers of local (sliding-window) attention and global attention, with global attention in only about 1 of every 6 layers
- KV cache compression
- Multi-Query Attention (MQA)
- KV cache layer fusion
- Session KV caching (for chat sessions)
- Prefix KV caching
- Sticky sessions (routing requests from the same chat session to the same server, to avoid copying session caches)
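As a rough illustration of the INT8 items above, here is a minimal NumPy sketch of symmetric per-tensor quantization of weights and activations, plus an integer matmul with int32 accumulation. This is a generic textbook-style example under assumed scaling choices, not Character.AI's actual kernels.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: x is approximated by scale * q, with q in [-127, 127]."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def matmul_int8(a_q, a_scale, b_q, b_scale):
    """INT8 matrix multiply sketch: accumulate in int32, then rescale back to float."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)
    return acc.astype(np.float32) * (a_scale * b_scale)

# Quantize weights and activations, then run the integer matmul and compare to float.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)   # weights
X = rng.standard_normal((8, 64)).astype(np.float32)    # activations
W_q, w_scale = quantize_int8(W)
X_q, x_scale = quantize_int8(X)
approx = matmul_int8(X_q, x_scale, W_q.T, w_scale)
print("max abs error vs float32:", float(np.max(np.abs(approx - X @ W.T))))
```

In a production backend this arithmetic lives in fused GPU kernels, typically with per-channel scales; the sketch only shows the core rescaling math.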
Character.AI cites a 13.5X cost reduction compared to using leading commercial APIs (i.e., by hosting inference themselves), and a 33X reduction in serving cost compared to when they started optimizing.
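The session and prefix KV caching items are, at heart, a lookup keyed by tokens that have already been processed, so a chat session (or a shared prompt prefix) does not recompute attention keys and values it has already produced. The toy sketch below illustrates only that lookup idea; the data structures and the compute_kv stub are hypothetical, not Character.AI's design.

```python
import hashlib

# Toy prefix KV cache: maps a hash of a token prefix to the K/V entries for that prefix.
kv_cache: dict = {}

def prefix_key(tokens):
    """Stable key for a token prefix."""
    return hashlib.sha256(",".join(map(str, tokens)).encode()).hexdigest()

def compute_kv(token):
    """Stand-in for the real per-token attention K/V computation (hypothetical)."""
    return (token * 2, token * 3)

def kv_with_prefix_cache(tokens):
    """Reuse cached K/V for the longest previously seen prefix; compute only the new suffix."""
    best = 0
    for i in range(len(tokens), 0, -1):          # find the longest cached prefix
        if prefix_key(tokens[:i]) in kv_cache:
            best = i
            break
    kv = list(kv_cache[prefix_key(tokens[:best])]) if best else []
    for j in range(best, len(tokens)):           # extend and cache the new prefixes
        kv.append(compute_kv(tokens[j]))
        kv_cache[prefix_key(tokens[:j + 1])] = list(kv)
    return kv

turn1 = [101, 7, 8, 9]             # first turn of a chat session
turn2 = [101, 7, 8, 9, 15, 16]     # follow-up turn: same session prefix plus two new tokens
kv_with_prefix_cache(turn1)
kv_with_prefix_cache(turn2)        # only tokens 15 and 16 need new K/V computation
```

Sticky sessions then route a follow-up request to the server that already holds those cached entries.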
Apple Intelligence for On-Device Inference. In announcing its "Apple Intelligence" initiative in June 2024, Apple released some details about the platform, specifically in relation to on-device execution of LLMs on iPhones and Macs. The exact details are somewhat opaque, but disclosed aspects include:
- Apple silicon chips (A-series and M-series) with NPU capabilities (the Apple Neural Engine)
- 3B base LLM (with 16-bit precision)
- LoRA adapters for fine-tuning (with 16-bit parameters, sized "in the tens of millions")
- Multi-LoRA inference
- Grouped Query Attention (GQA)
- Low-bit quantization for some parameters (mixed 2-bit and 4-bit quantizations)
- Talaria (Apple's interactive tool for analyzing model latency and power)
- KV cache quantization (bit precision undisclosed)
- KV cache update optimizations (details undisclosed)
With these optimizations, Apple reported that an iPhone 15 Pro achieves a time-to-first-token latency of about 0.6 milliseconds per prompt token, and a decoding rate of 30 tokens per second.
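The LoRA adapter and multi-LoRA items can be illustrated with the standard LoRA formulation: a frozen base weight matrix W plus a low-rank update BA scaled by alpha/r, so y = xW^T + (alpha/r)(xA^T)B^T. The NumPy sketch below shows that math with two hypothetical task adapters selectable per request; the shapes, names, and values are illustrative assumptions, not Apple's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 8, 16.0

# Frozen base weight matrix (16-bit in Apple's description; float32 here for simplicity).
W = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Two hypothetical task adapters, each a low-rank (A, B) pair; B starts at zero (standard LoRA init).
adapters = {
    "summarize": (0.01 * rng.standard_normal((rank, d_in)).astype(np.float32),
                  np.zeros((d_out, rank), dtype=np.float32)),
    "rewrite":   (0.01 * rng.standard_normal((rank, d_in)).astype(np.float32),
                  np.zeros((d_out, rank), dtype=np.float32)),
}

def lora_forward(x: np.ndarray, task: str) -> np.ndarray:
    """y = x W^T + (alpha / rank) * (x A^T) B^T: frozen base plus a low-rank task-specific update."""
    A, B = adapters[task]
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in)).astype(np.float32)
y = lora_forward(x, "summarize")   # switching `task` swaps adapters without touching W
print(y.shape)                     # (4, 64)
```

Because each adapter is only a small fraction of the base model's size, many adapters can stay resident and be swapped per request, which is the point of multi-LoRA inference.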
Together AI data center networking. In various papers and announcements, Together AI has described details of its backend platform. Example software optimizations include:
- CUDA backend for NVIDIA GPUs
- Flash Attention (its core online-softmax trick is sketched after this list)
- Flash Decoding
- Medusa decoding (multi-token prediction)
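The idea shared by Flash Attention and Flash Decoding is to compute softmax attention block-by-block over the keys and values, carrying a running maximum and running sum so the full score matrix is never materialized (Flash Decoding additionally processes these KV blocks in parallel during decoding). Here is a minimal single-query NumPy sketch of that online-softmax accumulation; the shapes and block size are arbitrary choices for illustration, not Together AI's kernels.

```python
import numpy as np

def attention_online_softmax(q, K, V, block=128):
    """Single-query attention computed in K/V blocks with online softmax rescaling."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of attention scores
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros(V.shape[-1])                   # unnormalized output accumulator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d)                   # scores for this block only
        m_new = max(m, float(np.max(s)))
        correction = np.exp(m - m_new) if m != -np.inf else 0.0
        p = np.exp(s - m_new)
        l = l * correction + float(np.sum(p))     # rescale old sum, add new block
        acc = acc * correction + p @ Vb           # rescale old output, add new block
        m = m_new
    return acc / l

# Verify against the naive full-softmax attention.
rng = np.random.default_rng(0)
q, K, V = rng.standard_normal(64), rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
scores = K @ q / np.sqrt(64)
weights = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
print(np.allclose(attention_online_softmax(q, K, V), weights @ V))   # True
```

The real kernels do this tiling in on-chip SRAM and fuse it with the surrounding matrix multiplies; the point here is only that blockwise rescaling reproduces the exact softmax result.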
Together AI has also described its GPU cluster management, including the networking architecture and validation steps. This involves techniques and components such as:
- H100 GPUs
- InfiniBand networking
- NVLink and NVSwitch
- NCCL (NVIDIA Collective Communications Library; see the sketch after this list)
- HPC-X
- SLURM (workload management)
- Remote Direct Memory Access (RDMA), including GPUDirect RDMA
- Telegraf (open source monitoring)
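To make the NCCL item concrete, below is a minimal PyTorch sketch of an all-reduce over the NCCL backend, the kind of collective that runs over NVLink/NVSwitch within a node and over InfiniBand with GPUDirect RDMA across nodes. It would normally be launched under a workload manager such as SLURM (e.g., via torchrun or srun); this is a generic example, not Together AI's code.

```python
import torch
import torch.distributed as dist

def main() -> None:
    # NCCL backend for GPU collectives; the launcher (torchrun or srun) provides rank and world size.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes its own tensor; all_reduce sums them across every GPU in the job.
    x = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if rank == 0:
        expected = sum(range(dist.get_world_size()))
        print(f"all_reduce OK: got {x[0].item()}, expected {expected}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Run with, for example, `torchrun --nproc_per_node=8 allreduce_check.py` on each node (the script name is just an example); the launcher supplies the rank and world-size environment variables.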
Important steps in the process include:
- GPU validation (a trivial smoke-test sketch appears after this list)
- Network validation
- Storage validation
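As a trivial illustration of the GPU validation step, the snippet below runs a matmul smoke test on every visible GPU and checks it against a CPU reference. Real cluster burn-in, as described in Together AI's guide, goes much further (sustained load, memory checks, NCCL bandwidth tests across the fabric); this sketch is a generic placeholder, not their validation suite.

```python
import torch

def validate_gpus(n: int = 2048) -> None:
    """Quick per-GPU smoke test: run a matmul on each device and compare to a CPU reference."""
    assert torch.cuda.is_available(), "no CUDA devices visible"
    a = torch.randn(n, n)
    b = torch.randn(n, n)
    ref = a @ b                                   # CPU reference result
    for i in range(torch.cuda.device_count()):
        out = (a.to(f"cuda:{i}") @ b.to(f"cuda:{i}")).cpu()
        # Loose tolerances allow for float32/TF32 differences between CPU and GPU matmuls.
        ok = torch.allclose(out, ref, rtol=1e-2, atol=1e-2)
        print(f"GPU {i} ({torch.cuda.get_device_name(i)}): {'OK' if ok else 'FAILED'}")
        assert ok, f"matmul mismatch on GPU {i}"

if __name__ == "__main__":
    validate_gpus()
```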
References
- Character.AI, June 20, 2024, Optimizing AI Inference at Character.AI, https://research.character.ai/optimizing-inference/
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- Together AI, Nov 13, 2023, Announcing Together Inference Engine – the fastest inference available, https://www.together.ai/blog/together-inference-engine-v1
- Ryan Lucchese, Niki Birkner, Yaron Hagai, Virginia Adams, August 13, 2024, Together AI, A practitioner's guide to testing and running large GPU clusters for training generative AI models, https://www.together.ai/blog/a-practitioners-guide-to-testing-and-running-large-gpu-clusters-for-training-generative-ai-models