Aussie AI

7. Deployment Architecture

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Most people overestimate what they can do in one year
and underestimate what they can do in ten years.”

— Bill Gates

 

 

Backend Server Architecture

Although it's wonderful to see your AI engine running on your dev box, there's still a lot to do before your users can see it. This is called “deployment” of your application, including its AI model and whatever application-specific logic you're adding on top. Your deployment architecture typically consists of these main components:

  • Web server (e.g. Apache, Nginx)
  • Backend server
  • Application logic server
  • AI Request Server
  • AI Engine

You can merge several of these types of server components, but it's simpler to do them separately, or at least to think about them conceptually as separate.

The AI engine is not the first part of the backend deployment architecture. There needs to be a simpler request-handling server that receives the user's input from the client. This may involve one or more server processes behind the scenes.

For example, in a simple browser-based Q&A service, the user would input their question or prompt from a web browser. This browser request is then handled by a basic HTTPD server such as Apache or Nginx, which then forwards the user's prompt to another application-specific server that processes the request.

The request processing server could be the AI engine directly in a small architecture, but in a more realistic production architecture it would be a simpler server that multiplexes a stream of input requests, farming out the requests to multiple AI servers.

The backend server takes the user request and passes it to the application logic server, which performs whatever high-end services you are providing, decides what AI requests are needed, and then sends those requests along to the AI request server to handle.

The AI request server has to multiplex the AI requests across multiple AI engines, and then, for any complex queries or multi-engine requests, collate the results back together. Neither of these components is trivial, but at least they're not as big a C++ project as writing a whole AI engine. Various commercial off-the-shelf servers already exist for both of these components, so that's probably your best option. But the application logic server should be your own brilliance expressed in C++ code.
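
As a rough illustration, here's a minimal C++ sketch of the multiplexing part of an AI request server. The Engine type and its send() call are hypothetical stand-ins for whatever RPC or HTTP client you actually use to talk to one AI engine instance, and the dispatch is simple round-robin:

    #include <atomic>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical engine endpoint; send() stands in for your real RPC
    // or HTTP call to one AI engine instance.
    struct Engine {
        std::string host;  // e.g. "10.0.0.21:8000"
        std::string send(const std::string& prompt) {
            // Real code would open a connection to 'host' and stream the prompt.
            return "response from " + host + " for: " + prompt;
        }
    };

    // Round-robin multiplexer: each incoming prompt goes to the next engine.
    class AIRequestServer {
    public:
        explicit AIRequestServer(std::vector<Engine> engines)
            : engines_(std::move(engines)) {}

        std::string handle(const std::string& prompt) {
            size_t i = next_.fetch_add(1) % engines_.size();
            return engines_[i].send(prompt);
        }

    private:
        std::vector<Engine> engines_;
        std::atomic<size_t> next_{0};
    };

A production version would also need time-outs, engine health checks, and the result collation for multi-engine requests mentioned above.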

AI Server Hosting Options

How are you running your AI engines? If you're calling a commercial AI API, at least there's an SEP field wrapped around that issue. But if you're running your own model or open source models, then you have various options:

  • Cloud server hosting to rent boxes with GPUs (e.g. AWS, Azure, GCP, OVH, etc.)
  • GPU-specific hosting companies (the big companies and several newer startups).
  • Hourly GPU rental (from the same GPU hosting companies).
  • Model hosting (not just GPUs, and again, choose between companies big and small)

Note that GPUs are not your only concern. You will need some non-GPU boxes to run the other components of your AI production architecture, such as basic Apache/Nginx servers, backend servers, and application logic servers. The AI request servers might run near your AI servers, or could be on separate boxes.

Various ancillary servers may also be needed for your operations, such as:

  • Testbed servers (GPU and non-GPU)
  • Deployment servers (e.g. marshaling new releases)
  • Static file cache servers
  • DNS servers (if you DIY)

You need a way to test your new production architecture before it goes live. The full “deployment” procedure may need a server to manage it, and to run a rollback procedure when it fails. You might also need extra boxes to DIY a cache of static files (e.g. images, scripts), or you can use a Content Delivery Network (CDN) commercial provider.

Hosting Server Specs

Irrespective of whether you're hosting on your own servers, or using a major cloud service provider, you need to consider the specs for an AI backend server. Firstly, note that not all servers are “AI servers” and most of the basic servers don't need GPUs at all (e.g. Apache servers, ancillary servers, etc.). The main specs for a non-AI server are the usual suspects for an Internet server:

  • CPU
  • RAM
  • Disk speed
  • Network connectivity

Personally, for the basic servers, I recommend going reasonably low-end or mid-tier in terms of specs, but running a lot of them. In particular, you don't need a lot of disk space for many of these basic servers, so focus on getting enough RAM and a CPU with enough cores. If you're running multiple servers, you also don't need high network bandwidth on a per-server basis, because the traffic is distributed across multiple servers. But you do need to consider your method of dispersing traffic across multiple HTTPD servers (e.g. round-robin DNS or a load balancing method).

Also, try to set up your architecture so that you don't need those gold-plated extras from your hosting provider. You don't need fault tolerance and failover for an architecture with multiple web server boxes. You also don't need backup of these cloned production servers in this case, but only for those servers with important logs, user management databases, or user document datastores. The idea is to run multiple identical servers, and then shut down any that start being problematic, which occurs rarely anyway. Instead, you need a streamlined process for deploying a new server, whether it's renting a new bare metal server or auto-spinning up a new VM. Hence, the DevOps software processes are almost more important than the exact choice of server specs for many utility servers.

What you do need is a monitoring system to detect any problems. And you also need a fast network between all the servers so you can copy over an entire server deployment quickly. Note that you can often have faster “private” network connections than the public ones if the servers support multiple network cards.

Disk specs. Although your GPU choice and RAM size are more important, you should also consider your disk speed. This applies to your options when setting up a virtual machine, or to the choice of disk storage for a bare metal server.

You need a large amount of disk storage for model files, which are larger than your average bear. This might tempt you to go for the cheaper and larger HDD disks. On the other hand, SSD disks are much faster to load from, and not that expensive any more. If you want fast startup of your engine with its full model, I think SSD, such as NVMe disks, is the way to go. Also, it's a kind of fallback in case you mess up the server process configurations, and the machine starts paging, which is much faster if the disk is SSD.

GPU Specs

Sadly, you will need to rent some GPUs for your rapacious AI engines, and this will skyrocket your hosting bills. There are several important considerations related to GPUs:

  • GPU choice
  • GPU RAM (VRAM)
  • GPU billing methods

GPU Choice. Which brand of GPU should you choose? For a data center backend execution of AI inference or training, the leader of the pack is currently NVIDIA. Alternatively, as your second choice, you could try a GPU made by NVIDIA. Or if you can't afford that, then you really should pass the hat around and save up to pay for a GPU from NVIDIA. Your basic data center GPU options from NVIDIA, sorted by GPU RAM (and cost), include:

  • P100 — 12GB or 16GB
  • V100 — 16GB or 32GB
  • A100 — 40GB or 80GB
  • H100 — 80GB

Okay, yes, there are some other options. There is the Google TPU and some data center GPUs from AMD and Intel that you can consider.

If it's not in the data center, such as running a smaller model on a high-end PC (e.g. a “gaming” desktop), then there are more options, and many more GPU RAM sizes to consider. You can choose between various NVIDIA RTX series, AMD Radeons, and several other GPU vendors.

GPU RAM. The amount of RAM on a GPU is important and directly impacts the performance of AI inference. This is sometimes called “VRAM” for “Video RAM” in a somewhat outdated reference to using GPUs for video games, but it's often just called “GPU RAM” when used for AI engines. The “G” in “GPU” used to stand for “Graphics,” but now it just means “Gigantic” or “Generative” or something.

How much GPU RAM is needed? Short answer: at least 12GB for smaller models (e.g. 3B), ideally 24GB to run a 7B model or a 13B model. Quantization is also helpful in making models small enough to fit in a GPU.

Typically, for open source models, you want the entire model to fit inside a single GPU's RAM. The GPU RAM also limits how many instances of the AI engine can run on a single server with one GPU: the GPU needs only one static copy of the model, but it must store the interim activation calculations separately for each instance. If you're running an open source 7B model, multiple copies fit inside a decent GPU. Less so for 13B models, and trying to run a 70B model in an 80GB GPU gets a touch more difficult. Quantized models are your friend.
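
As a back-of-envelope sanity check on those numbers, the model weights alone need roughly the parameter count multiplied by the bytes per weight, ignoring activations, KV cache, and framework overhead. A tiny C++ sketch of the arithmetic:

    #include <cstdio>

    // Rough estimate of GPU RAM for the model weights only.
    // Real usage is higher: activations, KV cache, and framework overhead
    // add more memory for each running engine instance.
    double model_gigabytes(double billions_of_params, double bytes_per_weight) {
        return billions_of_params * bytes_per_weight;  // (1e9 * params * bytes) / (1e9 bytes per GB)
    }

    int main() {
        std::printf("7B  FP16  (2 bytes)  : %6.1f GB\n", model_gigabytes(7, 2.0));   // ~14 GB
        std::printf("7B  4-bit (0.5 byte) : %6.1f GB\n", model_gigabytes(7, 0.5));   // ~3.5 GB
        std::printf("13B FP16  (2 bytes)  : %6.1f GB\n", model_gigabytes(13, 2.0));  // ~26 GB
        std::printf("70B FP16  (2 bytes)  : %6.1f GB\n", model_gigabytes(70, 2.0));  // ~140 GB
        return 0;
    }

These weight-only numbers also show why quantization matters: a 13B model in FP16 already overflows a 24GB card, but the same model at 4 bits fits with room to spare.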

NVIDIA 3080Ti has 12GB and works for 3B/7B models, mainly for POC development and researchy type stuff. NVIDIA 3090 has 24GB and works well for 3B/7B and you can toy around with 13B if careful. NVIDIA 4070Ti (12GB) is similar to a 3080Ti; NVIDIA 4080 has 16GB and NVIDIA 4090 has 24GB. For bigger models requiring more GPU RAM, you're looking at V100, A100, or H100.

The above discussion mainly relates to small and medium-size open source models. Running a big commercial mega-model isn't really possible with only a single GPU. The big companies are running H100's by the bucketload with complex multi-GPU scheduling algorithms. If you want to know about that stuff, send them a resume (or read research papers).

Note that there's not really the concept of “virtual memory” or “context switching” or “paging” of GPU RAM. The operating system isn't going to help you here. Managing GPU RAM is a low-level programming task and you basically cram the entire model into the GPU and hope to never unload it again.

You will need more than one box with a GPU, even for smaller models, assuming multiple model instances per GPU. To get a decent response time, you want a model instance on a GPU to be immediately available to run a user's query. How many total instances you need depends on your user load, and whether they like watching blue spinning circles.

GPU Billing. There are various billing methods for GPUs, and you have to compare providers. A lot of GPU power is billed on an hourly basis, with monthly options, and managing this expense can make a big difference overall. The load profile differs for inference versus training, and also obviously depends on user patterns in the production application.

Online Architecture Optimization

The AI engine and its model are only part of your production architecture. An online website version needs a backend server that receives user requests and marshals them to an API, which then sends them off to the AI engine. The AI engine shouldn't be running on the same box as your basic web server, so the requests are sent remotely whether it's your own AI engine or a commercial API wrapper architecture.

Website Optimization: There's a whole bag of jobs needed to optimize the user response for a website, and most of that is well-known and unrelated to AI. Some of the issues include:

  • Apache versus Nginx
  • DNS speed (DIY vs use a commercial provider)
  • Image sizes (i.e. low-resolution images)
  • Script sizes (e.g. minifying JavaScript)
  • HTML page sizes
  • File cache settings
  • Etags
  • SSL/HTTPS certificates (e.g. LetsEncrypt is free)
  • Third-party scripts (e.g. Google AdSense, Google Analytics)
  • Cookie management (or like Mater: “to not to”)

Some of the broader issues include:

  • User account management
  • CDN usage
  • Analytics
  • Ad serving scripts
  • Cloud hosting servers (GPU and non-GPU)
  • Multi-server management

Having a website run small and fast is a whole tech discipline in itself. This book does not cover many of these non-AI-specific website optimization issues in detail.

API Wrapper Architecture Optimizations

If your architecture is wrapping a commercial API, then you can't do much with the model or its engine. However, in addition to optimizing your backend server architecture, you also have control over what user requests get sent through to the commercial API. Some of the optimizations include:

  • Filter dubious queries with heuristics (e.g. blanks only, punctuation only, single cuss words, etc.)
  • Use an “inference cache” of full responses to identical previously-seen queries. Consider caching multiple slightly-different responses.
  • Use a “semantic cache” with a vector database that does “nearest-neighbor” lookups to cache responses to close-enough queries.
  • Context compression of chatbot conversation history or RAG document chunks.

Chapter 29 examines caching techniques in more detail. If you are wrapping a commercial API, your speed improvements are limited to this type of pre-API caching, along with speedups to your basic deployment architecture (e.g. Apache/Nginx, backend servers, application logic, etc.).
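
As a concrete illustration of the “inference cache” idea above, here's a minimal C++ sketch of an exact-match cache. The EngineCall callback is a stand-in for whatever commercial API or engine call your architecture makes, and the normalization is deliberately crude:

    #include <algorithm>
    #include <cctype>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <utility>

    // Exact-match "inference cache": identical prompts return the stored
    // response instead of calling the AI engine (or commercial API) again.
    class InferenceCache {
    public:
        using EngineCall = std::function<std::string(const std::string&)>;
        explicit InferenceCache(EngineCall call) : call_engine_(std::move(call)) {}

        std::string query(const std::string& prompt) {
            std::string key = normalize(prompt);
            auto it = cache_.find(key);
            if (it != cache_.end()) return it->second;    // cache hit: no engine cost
            std::string response = call_engine_(prompt);  // cache miss: pay for inference
            cache_[key] = response;
            return response;
        }

    private:
        static std::string normalize(std::string s) {     // crude normalization only
            std::transform(s.begin(), s.end(), s.begin(),
                           [](unsigned char c) { return std::tolower(c); });
            return s;
        }
        EngineCall call_engine_;
        std::unordered_map<std::string, std::string> cache_;
    };

A semantic cache is the same idea, but replaces the exact-match lookup with a nearest-neighbor search in a vector database.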

To some extent, these caching optimizations also apply to your own non-wrapped AI engine architecture as a way to reduce the GPU compute costs of your own hosting platform. You can reduce load on your own AI engine, such as an open source engine running an open source model, by using these caching techniques. However, you can also speed up your own AI engine using many of the other techniques in this book.

Request Queue Architecture

Assuming you have an incoming stream of AI prompt requests, how do you send them to be processed? There are two issues to consider:

    (a) Session tracking, and

    (b) Prompt history.

Session tracking refers to having multiple users with different sessions, who may be either logged in or running in a guest session. The responses from the web server need to be consistent with the session, and may need to process information using the session. For example, a logged in user may be working on a document that is stored in their online account.

Prompt history refers to whether the AI engine can answer a prompt in isolation, or whether it must keep track of the “history” of recent prompts from the one user. Does the AI engine need “context” between two prompts from the same user? If not, then the AI engine may simply answer a prompt in a “stateless” manner. But if history is required in a multi-prompt conversation with the user (e.g. chatbot or Q&A session use cases), there are more issues to consider.

In the simplest architecture, without needing context from the prompt history, the session tracking is done near the web server. In other words, the request handling server can handle any session-related requests, and then access any session-specific documents from a data store. It can then forward this data to the AI engine, and the AI engine can simply receive and handle prompt requests without knowing from where they came. The AI engine is thereby operating in a stateless manner, simply processing input prompts without any additional user context.
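
Here's a minimal C++ sketch of that stateless hand-off. The lookup_session_document() and forward_to_engine() functions are hypothetical placeholders for your data store lookup and your engine connection:

    #include <string>

    // The AI engine sees only this self-contained request; it has no
    // notion of users, sessions, or logins.
    struct AIRequest {
        std::string prompt;         // the user's question
        std::string document_text;  // any session document, already fetched
    };

    // Hypothetical placeholders for the data store and the engine connection.
    std::string lookup_session_document(const std::string& session_id);
    std::string forward_to_engine(const AIRequest& request);

    // The request-handling server resolves everything session-specific here,
    // then forwards a stateless request to the AI engine.
    std::string handle_user_request(const std::string& session_id,
                                    const std::string& prompt) {
        AIRequest request;
        request.prompt = prompt;
        request.document_text = lookup_session_document(session_id);
        return forward_to_engine(request);  // the engine never sees the session id
    }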

There are many ways to implement a simple request queue in a server. This component is an “application server” that sits behind your web server. You can build your own, or use a production tool such as:

  • Uvicorn with FastAPI
  • Gunicorn with Flask
  • Apache Tomcat
  • Microsoft IIS

There are many other options here, and you can add your favorite application server to the list. Tomcat is a well-known application server typically used as an add-on to Apache on Linux. Microsoft IIS started as a web server and has gradually evolved into a combined web and application server.

Uvicorn combined with FastAPI, or Gunicorn with Flask, are similar Python-specific architectures. Either is a reasonably simple production option if you want a Python-based layer that interfaces with AI engines.
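
If you do build your own, the core of a DIY request queue is little more than a thread-safe queue with worker threads. Here's a minimal C++ sketch, where process_prompt() is a placeholder for the real AI engine call:

    #include <chrono>
    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>
    #include <utility>

    // Thread-safe queue of incoming prompts.
    class RequestQueue {
    public:
        void push(std::string prompt) {
            {
                std::lock_guard<std::mutex> lock(m_);
                q_.push(std::move(prompt));
            }
            cv_.notify_one();
        }
        std::string pop() {  // blocks until a request is available
            std::unique_lock<std::mutex> lock(m_);
            cv_.wait(lock, [this] { return !q_.empty(); });
            std::string p = std::move(q_.front());
            q_.pop();
            return p;
        }
    private:
        std::mutex m_;
        std::condition_variable cv_;
        std::queue<std::string> q_;
    };

    void process_prompt(const std::string& prompt) {  // placeholder for the AI engine call
        std::cout << "Processing: " << prompt << "\n";
    }

    int main() {
        RequestQueue queue;
        std::thread worker([&queue] {      // one worker per available engine instance
            for (;;) process_prompt(queue.pop());
        });
        queue.push("What is a Transformer?");
        queue.push("Summarize this document.");
        std::this_thread::sleep_for(std::chrono::milliseconds(100));  // demo only
        worker.detach();                   // a real server needs proper shutdown handling
        return 0;
    }

The production tools above handle everything this sketch ignores: HTTP parsing, time-outs, back-pressure, and graceful shutdown.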

Load Balancing

If your goal is a high volume of user requests, then you need to consider higher-end scalable architectures with load-balancing and fault tolerance. Some of the technologies to consider with load balancing include:

  • Round Robin DNS
  • Load Balancer Network Devices
  • Apache Kafka
  • Apache Load Balancer
  • Nginx load balancing

Round robin DNS, or RR DNS, is a simple way to distribute incoming requests to multiple servers, but it isn't true load balancing because it doesn't consider load or availability of the server connections. On the upside, it requires no extra server components and can be done simply by manipulating your domain DNS records.

Kafka is a more scalable production tool with advanced features such as clustering. The advantages of using Kafka in a large architecture are many: it is a pre-built tool that is purpose-designed for handling a high-volume incoming event stream. It has a highly efficient distributed architecture, where you send requests to a Kafka cluster, and multiple listeners can be created to process incoming requests. For each input prompt, the Kafka listener would dequeue the request, and then forward the prompt text to its associated AI engine.

Apache Load Balancer is a freeware load-balancing add-on. For more information, see the mod_proxy and mod_proxy_balancer Apache modules. Nginx also supports multiple different load balancing approaches such as round robin and least connections. Refer to the Nginx documentation for details.

Prompt History and Context

The architecture gets more complicated when the use case requires the AI engine to incorporate the user's history of prompts in their current conversation. For example, if it's a chatbot to take your food order, it needs to know what you've already ordered so that it can annoyingly push you to buy more stuff than you need.

The main way is to use your existing AI engine, but simply prepend the prior conversation to the user's new prompt. In other words, the AI engine simply receives each request in a stateless style, but each request includes all of the necessary prior context.

Implementing this architecture requires that the current session's history of prompts and responses is stored in a session-specific data store. This might be a temporary store for guest sessions and/or a permanent store for signed-in users. Either way, the main point is that the text of the prompts and engine responses is available to be used for the next incoming request. The new prompt is then appended to the end of the conversation, and the whole conversation can be passed to the AI engine.
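
Here's a minimal C++ sketch of this prepending approach. The Turn structure and the “User:”/“Assistant:” layout are illustrative assumptions, not any particular engine's required prompt format:

    #include <string>
    #include <vector>

    // One prior exchange in the conversation.
    struct Turn {
        std::string user_text;
        std::string ai_response;
    };

    // Prepend the whole conversation history so a stateless AI engine
    // still "remembers" what was said earlier.
    std::string build_prompt(const std::vector<Turn>& history,
                             const std::string& new_prompt) {
        std::string full;
        for (const Turn& t : history) {
            full += "User: " + t.user_text + "\n";
            full += "Assistant: " + t.ai_response + "\n";
        }
        full += "User: " + new_prompt + "\nAssistant: ";
        return full;  // send this whole string to the AI engine
    }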

There are some downsides to this simple approach. Firstly, it's not always that effective, and may require some fancier prompt engineering before it works well. Some AI engines are beginning to have options to explicitly send these two inputs separately in the same API request, which may improve this situation. Secondly, it's sending a lot of extra tokens to the AI engine, which are expensive to process, whether it's extra dollars in the billing statement for a commercial fee-based engine (e.g. OpenAI's API) or the hidden cost of increased load on your own GPU hosting.

One idea to reduce costs is to store a “summary” of the prior conversation, rather than all of it, so fewer tokens are prepended. Summarization is a whole research area in itself, and there are various approaches. For example, this could be achieved via some types of simple heuristics (e.g. just remove unimportant “stop words”) or via an AI-based summarization algorithm (although that extra expense probably defeats the purpose). The research areas of “prompt compression” and “document summarization” may be relevant here.
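
As a trivial illustration of the stop-word heuristic (not a serious summarizer), here's a C++ sketch with a tiny, purely illustrative stop-word list:

    #include <sstream>
    #include <string>
    #include <unordered_set>

    // Shrink stored conversation text by dropping common filler words,
    // so fewer tokens get prepended to the next prompt. A real version
    // would also lower-case words and strip punctuation first.
    std::string strip_stop_words(const std::string& text) {
        static const std::unordered_set<std::string> stop_words = {
            "the", "a", "an", "of", "to", "and", "is", "it", "that", "in"
        };
        std::istringstream in(text);
        std::string word, out;
        while (in >> word) {
            if (stop_words.count(word) == 0) {  // keep only non-stop words
                if (!out.empty()) out += ' ';
                out += word;
            }
        }
        return out;  // fewer tokens to send to the engine
    }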

More advanced approaches than prepending the prior conversation are possible to handle an incoming request with history. There are various ways to store and handle the “context” of a long user conversation with prompt and answer history. This area is called “conversational AI” and may require changes to your AI engine architecture.

Finally, this area is changing rapidly and the above may be outdated by the time you read this. It's a fairly obvious extension to a commercial API for the provider to track the context for you, rather than impolitely foisting that requirement onto every programmer. Admittedly, it's also not a cheap capability to add, because the API provider would need to store a huge amount of extra context data, in return for getting paid less because you'd be sending them fewer tokens. Nevertheless, I expect to see commercial APIs having this functionality soon.

 
