Aussie AI

Request Queue Architecture

Book Excerpt from "Generative AI in C++"

by David Spuler, Ph.D.

Assuming you have an incoming stream of AI prompt requests, how do you send them to be processed? There are two issues to consider:

(a) Session tracking, and

(b) Prompt history.

Session tracking refers to having multiple users with different sessions, who may be either logged in or running in a guest session. The responses from the web server need to be consistent with the session, and may need to draw on information tied to that session. For example, a logged-in user may be working on a document that is stored in their online account.

Prompt history refers to whether the AI engine can answer a prompt in isolation, or whether it must keep track of the “history” of recent prompts from the same user. Does the AI engine need “context” between two prompts from the same user? If not, then the AI engine may simply answer each prompt in a “stateless” manner. But if history is required in a multi-prompt conversation with the user (e.g. chatbot or Q&A session use cases), there are more issues to consider.
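The difference between the two modes can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the class name PromptHistory, its methods, and the newline-joined context format are all hypothetical, and a real chatbot would usually record the engine's responses as well as the user's prompts.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>

// Hypothetical helper (not from the book): remembers the most recent
// prompts for each session so they can be replayed as context.
class PromptHistory {
public:
    explicit PromptHistory(std::size_t max_turns) : max_turns_(max_turns) {}

    // Stateless mode: the prompt goes to the engine as-is.
    static std::string build_stateless(const std::string& prompt) {
        return prompt;
    }

    // Context mode: prepend the earlier prompts from the same session,
    // then record the new prompt for later turns.
    std::string build_with_context(const std::string& session_id,
                                   const std::string& prompt) {
        std::deque<std::string>& turns = history_[session_id];
        std::string full;
        for (const std::string& earlier : turns) {
            full += earlier;
            full += '\n';
        }
        full += prompt;

        turns.push_back(prompt);
        if (turns.size() > max_turns_) {
            turns.pop_front();  // forget the oldest prompt
        }
        return full;
    }

private:
    std::size_t max_turns_;  // how many past prompts to keep per session
    std::unordered_map<std::string, std::deque<std::string>> history_;
};
```

In stateless mode nothing is stored, so any engine instance can answer any prompt. In context mode something has to own the per-session history, which is one of the extra issues that a multi-prompt conversation introduces.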

In the simplest architecture, without needing context from the prompt history, the session tracking is done near the web server. In other words, the request handling server can handle any session-related requests, and then access any session-specific documents from a data store. It can then forward this data to the AI engine, and the AI engine can simply receive and handle prompt requests without knowing from where they came. The AI engine is thereby operating in a stateless manner, simply processing input prompts without any additional user context.
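As a rough sketch of this split, the request handler below resolves the session, looks up any session-specific document, and hands the engine a request that carries only text. All of the names here (WebRequest, EngineRequest, DocumentStore, prepare_for_engine) are assumptions made for illustration, and the in-memory map merely stands in for a real data store.

```cpp
#include <string>
#include <unordered_map>

// What arrives from the web server: a prompt plus a session identifier.
struct WebRequest {
    std::string session_id;
    std::string prompt;
};

// What the AI engine receives: text only, with no session or user details.
struct EngineRequest {
    std::string prompt;
    std::string document;  // empty if the session has no stored document
};

// Hypothetical in-memory stand-in for the data store of session documents.
class DocumentStore {
public:
    void put(const std::string& session_id, const std::string& doc) {
        docs_[session_id] = doc;
    }

    // Returns the stored document, or an empty string if there is none.
    std::string get(const std::string& session_id) const {
        auto it = docs_.find(session_id);
        return it == docs_.end() ? std::string() : it->second;
    }

private:
    std::unordered_map<std::string, std::string> docs_;
};

// Session handling happens here, near the web server. The session ID is
// used for the lookup and then dropped before the engine sees the request.
EngineRequest prepare_for_engine(const WebRequest& req,
                                 const DocumentStore& store) {
    EngineRequest out;
    out.prompt = req.prompt;
    out.document = store.get(req.session_id);
    return out;
}
```

Because EngineRequest has no session field, the engine side cannot depend on who sent the prompt, which is what keeps it stateless in this sketch.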

There are many ways to implement a simple request queue in a server. The component that does this is an “application server” that sits behind your web server. You can build your own, or use a production tool such as:

Uvicorn with FastAPI
Gunicorn with Flask
Apache Tomcat
Microsoft IIS

There are many other options, and you can add your favorite application server to the list. Tomcat is a well-known application server, typically used as an add-on to Apache on Linux. Microsoft IIS started as a web server and has gradually evolved into a combined web and application server.

Uvicorn with FastAPI and Gunicorn with Flask are similar Python-specific architectures. Either is a reasonably simple production option if you want a Python-based layer that interfaces with AI engines.
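For the build-your-own route, the core of a request queue in C++ is a thread-safe FIFO shared between the request-handling threads and the engine worker threads. The following is a minimal sketch using only the standard library. The RequestQueue class and its push, pop, and close methods are hypothetical names, and a production version would also need bounds on queue length, timeouts, and error handling.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <utility>

// Hypothetical thread-safe FIFO queue of prompt strings.
class RequestQueue {
public:
    // Called by request-handling threads to enqueue a prompt.
    void push(std::string prompt) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(prompt));
        }
        ready_.notify_one();  // wake one waiting worker
    }

    // Called by engine worker threads. Blocks until a prompt is available
    // or the queue is closed. Returns false once closed and drained.
    bool pop(std::string& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) {
            return false;  // closed, and nothing left to process
        }
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

    // Signals shutdown so that blocked workers can exit their loops.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::string> queue_;
    bool closed_ = false;
};
```

A worker thread would typically loop on pop() and pass each prompt to the engine until pop() returns false. Prompts pushed before close() are still delivered, so a shutdown drains the queue rather than dropping pending requests.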
