Aussie AI

API Wrapper Architecture Optimizations

Book Excerpt from "Generative AI in C++"

by David Spuler, Ph.D.

If your architecture is wrapping a commercial API, then you can't do much with the model or its engine. However, in addition to optimizing your backend server architecture, you also have control over what user requests get sent through to the commercial API. Some of the optimizations include:

Filter dubious queries with heuristics (e.g. blanks only, punctuation only, or a single cuss word) so they never reach the API.
Use an “inference cache” of full responses to identical previously-seen queries. Consider caching multiple slightly-different responses.
Use a “semantic cache” with a vector database that does “nearest-neighbor” lookups to cache responses to close-enough queries.
Compress the context, such as chatbot conversation history or RAG document chunks.
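
The first two items above can be sketched in a few lines of C++. This is only a minimal illustration under stated assumptions, not the book's implementation: the names is_dubious_query, normalize_query, and InferenceCache are hypothetical, the "no letters or digits" test stands in for whatever heuristics suit your application, and the Backend callback is a placeholder for the real call to the commercial API. Collapsing whitespace before the cache lookup is also an assumption; a stricter design would key on the exact query text.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical heuristic: a query with no letters or digits is dubious.
// This covers empty, blanks-only, and punctuation-only inputs.
bool is_dubious_query(const std::string& query) {
    return std::none_of(query.begin(), query.end(), [](unsigned char c) {
        return std::isalnum(c) != 0;
    });
}

// Assumed normalization for the cache key: trim leading/trailing
// whitespace and collapse internal runs of whitespace to one space.
std::string normalize_query(const std::string& query) {
    std::string out;
    bool pending_space = false;
    for (unsigned char c : query) {
        if (std::isspace(c)) {
            pending_space = !out.empty();  // ignore leading whitespace
        } else {
            if (pending_space) out.push_back(' ');
            pending_space = false;
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}

// Exact-match inference cache placed in front of the wrapped API.
class InferenceCache {
public:
    // Placeholder for the real commercial API call (an assumption).
    using Backend = std::function<std::string(const std::string&)>;

    explicit InferenceCache(Backend backend) : backend_(std::move(backend)) {}

    // Returns std::nullopt when the query is filtered out; otherwise the
    // cached response, or a fresh response from the backend on a miss.
    std::optional<std::string> respond(const std::string& raw_query) {
        if (is_dubious_query(raw_query)) return std::nullopt;
        const std::string key = normalize_query(raw_query);
        auto it = cache_.find(key);
        if (it != cache_.end()) {
            ++hits_;
            return it->second;
        }
        std::string response = backend_(raw_query);
        cache_.emplace(key, response);
        return response;
    }

    std::size_t hits() const { return hits_; }

private:
    Backend backend_;
    std::unordered_map<std::string, std::string> cache_;
    std::size_t hits_ = 0;
};
```

A caller would construct InferenceCache with a lambda that performs the API request, then route every user query through respond(). A production version would also need an eviction policy and thread safety, both omitted here for brevity.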

Chapter 29 examines caching techniques in more detail. If you are wrapping a commercial API, your speed improvements are limited to this type of pre-API caching, along with speedups to your basic deployment architecture (e.g. Apache/Nginx, backend servers, and application logic).
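
As a rough illustration of the semantic cache idea, the sketch below does a brute-force nearest-neighbor search using cosine similarity. It is a stand-in, not a real vector database: the SemanticCache class, the Embedding alias, and the similarity threshold are all hypothetical, and the embedding vectors are assumed to come from some external embedding model that is not shown.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Assumption: embeddings are produced elsewhere by an embedding model.
using Embedding = std::vector<float>;

// Cosine similarity; returns 0 for mismatched sizes or zero-length vectors.
float cosine_similarity(const Embedding& a, const Embedding& b) {
    if (a.size() != b.size() || a.empty()) return 0.0f;
    float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0f || norm_b == 0.0f) return 0.0f;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Linear-scan semantic cache; a vector database would replace the loop
// in lookup() with an indexed nearest-neighbor query.
class SemanticCache {
private:
    struct Entry {
        Embedding embedding;
        std::string response;
    };

public:
    // threshold: minimum similarity for a "close-enough" hit (hypothetical
    // tuning parameter; the right value depends on the embedding model).
    explicit SemanticCache(float threshold) : threshold_(threshold) {}

    void insert(Embedding embedding, std::string response) {
        entries_.push_back({std::move(embedding), std::move(response)});
    }

    // Returns the response of the most similar cached query, or
    // std::nullopt if nothing reaches the threshold.
    std::optional<std::string> lookup(const Embedding& query) const {
        const Entry* best = nullptr;
        float best_score = threshold_;
        for (const Entry& entry : entries_) {
            const float score = cosine_similarity(query, entry.embedding);
            if (score >= best_score) {
                best = &entry;
                best_score = score;
            }
        }
        if (best == nullptr) return std::nullopt;
        return best->response;
    }

private:
    float threshold_;
    std::vector<Entry> entries_;
};
```

On a miss, the application would send the query to the API and then insert the query embedding with the new response. The threshold trades hit rate against the risk of returning an answer to a question that is only superficially similar.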

To some extent, these caching optimizations also apply to your own non-wrapped AI engine architecture, where they reduce the GPU compute costs of your hosting platform. For example, they can reduce the load on an open-source engine running an open-source model. Beyond caching, you can also speed up your own AI engine using many of the other techniques in this book.


