Aussie AI

Serving and Deployment

  • Last Updated 12 December, 2024
  • by David Spuler, Ph.D.

Serving

Serving is the practical matter of how to architect the full production application around the LLM. Other components may include a web server, application server, RAG datastore, retriever, load balancer, and more. In addition, several techniques affect the speed of inference (a simple serving-loop sketch follows this list):

  • Batching
  • Prefill versus decoding phase
  • Scheduling
  • Load balancing
  • Frameworks (backend)
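
To make these ideas concrete, here is a minimal sketch, in Python, of a serving loop that batches queued requests and separates the prefill phase (processing the whole prompt once) from the decode phase (generating one token at a time). All names are hypothetical and the model is stubbed out, so this illustrates the control flow rather than a real inference engine or any particular framework's API.

    # Minimal sketch of a batched LLM serving loop (hypothetical names throughout).
    # The "model" is stubbed out; a real deployment would call an inference backend.
    import queue
    import threading
    import time
    from dataclasses import dataclass, field

    @dataclass
    class Request:
        prompt: str
        max_new_tokens: int = 8
        tokens: list = field(default_factory=list)  # generated tokens accumulate here

    def prefill(batch):
        # Prefill phase: process each request's full prompt once.
        # (In a real engine this is one forward pass that builds the KV cache.)
        for req in batch:
            req.context = req.prompt.split()  # stand-in for the cached prompt state

    def decode_step(batch):
        # Decode phase: generate one token per request per step, batched together.
        for req in batch:
            req.tokens.append(f"tok{len(req.tokens)}")  # stubbed "model" output

    def serve(request_queue, batch_size=4, batch_timeout=0.01):
        while True:
            batch = []
            deadline = time.time() + batch_timeout
            # Batching: collect up to batch_size requests, or whatever arrives
            # before the timeout, so one forward pass can serve several users.
            while len(batch) < batch_size and time.time() < deadline:
                try:
                    batch.append(request_queue.get(timeout=batch_timeout))
                except queue.Empty:
                    break
            if not batch:
                continue
            prefill(batch)
            # Static batching: the whole batch decodes together until every
            # request is finished (continuous batching would admit new requests
            # between decode steps instead).
            while any(len(r.tokens) < r.max_new_tokens for r in batch):
                decode_step(batch)
            for req in batch:
                print(req.prompt, "->", " ".join(req.tokens))

    if __name__ == "__main__":
        q = queue.Queue()
        threading.Thread(target=serve, args=(q,), daemon=True).start()
        for p in ["hello world", "what is batching?", "explain prefill"]:
            q.put(Request(prompt=p, max_new_tokens=4))
        time.sleep(1)  # give the background server thread time to finish

The scheduling and load-balancing topics below sit one level up from this loop: they decide which requests reach which replica of such a server, and in what order.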

Research on LLM Serving

Recently, there has been an explosion of papers about the practical aspects of deployment, orchestration, and serving of LLM inference. Here are some of the papers:

Deployment

Research on LLM deployment:

Batching

Research papers on batching:

Continuous Batching

Research papers on continuous batching:

Frameworks

Research on inference frameworks as part of serving:

Serverless

Research papers on serverless LLM inference:

Scheduling

Research papers on scheduling of LLM inference requests:

Load Balancing

Research papers on AI load balancing:

Networking

Research papers on networking optimizations for LLMs:

AI Tech Stack

Research on AI tech stacks:

More AI Research

Read more about: