Aussie AI

API Wrapper Architecture Optimizations

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

If your architecture wraps a commercial API, you can't do much with the model or its engine. However, in addition to optimizing your backend server architecture, you also control which user requests get sent through to the commercial API. Some of the optimizations include:

  • Filter dubious queries with heuristics (e.g. blanks only, punctuation only, single cuss words).
  • Use an “inference cache” of full responses to identical previously-seen queries. Consider caching multiple slightly different responses.
  • Use a “semantic cache” with a vector database that does “nearest-neighbor” lookups to cache responses to close-enough queries.
  • Compress the context, such as chatbot conversation history or RAG document chunks.
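
The first two items can be sketched in a few lines of C++. This is a minimal illustration rather than a production design: `is_worth_sending`, `normalize_query`, and `CachedApiClient` are hypothetical names, the commercial API is stood in for by a caller-supplied callback, and normalizing the cache key (trimming, lowercasing, collapsing whitespace) is an assumption about what counts as an "identical" query.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Heuristic filter (an assumption): a query is worth sending only if it has
// at least one ASCII letter or digit, or a non-ASCII byte (e.g. UTF-8 text).
// This rejects empty, blank-only, and punctuation-only input.
bool is_worth_sending(const std::string& query) {
    return std::any_of(query.begin(), query.end(), [](unsigned char c) {
        return std::isalnum(c) != 0 || c >= 0x80;
    });
}

// Build a cache key: trim, lowercase ASCII, collapse runs of whitespace.
std::string normalize_query(const std::string& query) {
    std::string key;
    bool pending_space = false;
    for (unsigned char c : query) {
        if (std::isspace(c)) {
            pending_space = !key.empty();   // ignore leading whitespace
        } else {
            if (pending_space) key.push_back(' ');
            pending_space = false;
            key.push_back(static_cast<char>(std::tolower(c)));
        }
    }
    return key;
}

// Wraps a (hypothetical) commercial API call with a filter and an exact-match cache.
class CachedApiClient {
public:
    using ApiCall = std::function<std::string(const std::string&)>;

    explicit CachedApiClient(ApiCall api_call) : api_call_(std::move(api_call)) {
        assert(api_call_ && "an API callback is required");
    }

    // Returns std::nullopt when the query is filtered out before the API.
    std::optional<std::string> ask(const std::string& query) {
        if (!is_worth_sending(query)) return std::nullopt;
        const std::string key = normalize_query(query);
        auto it = cache_.find(key);
        if (it != cache_.end()) return it->second;  // cache hit: no API call
        std::string response = api_call_(query);    // cache miss: pay for the call
        cache_.emplace(key, response);
        return response;
    }

private:
    ApiCall api_call_;
    std::unordered_map<std::string, std::string> cache_;
};
```

A real deployment would also need cache eviction, expiry, and thread safety, and could store several responses per key to return varied answers, as the list suggests.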

Chapter 29 examines caching techniques in more detail. If you are wrapping a commercial API, your speed improvements are limited to this type of pre-API filtering, caching, and compression, along with speedups to your basic deployment architecture (e.g. Apache/Nginx, backend servers, and application logic).
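
As a rough illustration of the semantic cache idea, the sketch below returns a stored response when a new query's embedding is close enough to a previously seen one. Several things here are assumptions: embeddings are taken as given (computing them would need a separate embedding model), a brute-force linear scan with cosine similarity stands in for a vector database, and `SemanticCache` and its `min_similarity` threshold are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Embedding = std::vector<float>;

// Cosine similarity of two equal-length vectors; 0 if either has zero length.
float cosine_similarity(const Embedding& a, const Embedding& b) {
    assert(a.size() == b.size() && "embeddings must have the same dimension");
    float dot = 0.0f, norm_a = 0.0f, norm_b = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    if (norm_a == 0.0f || norm_b == 0.0f) return 0.0f;
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

class SemanticCache {
public:
    // min_similarity is the "close enough" threshold (a tuning assumption).
    explicit SemanticCache(float min_similarity) : min_similarity_(min_similarity) {}

    void insert(Embedding query_embedding, std::string response) {
        entries_.push_back({std::move(query_embedding), std::move(response)});
    }

    // Nearest-neighbor lookup by linear scan; a vector database would
    // replace this loop with an indexed search.
    std::optional<std::string> lookup(const Embedding& query_embedding) const {
        const Entry* best = nullptr;
        float best_score = min_similarity_;
        for (const Entry& e : entries_) {
            const float score = cosine_similarity(query_embedding, e.embedding);
            if (score >= best_score) {
                best = &e;
                best_score = score;
            }
        }
        if (best == nullptr) return std::nullopt;  // nothing close enough
        return best->response;
    }

private:
    struct Entry {
        Embedding embedding;
        std::string response;
    };
    float min_similarity_;
    std::vector<Entry> entries_;
};
```

The threshold is the main design choice: set too low, users receive answers to questions they did not ask; set too high, the cache rarely hits.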

To some extent, these caching optimizations also apply to your own non-wrapped AI engine architecture, where they reduce the GPU compute costs of your own hosting platform. For example, they can reduce load on an open source engine running an open source model. However, you can also speed up your own AI engine using many of the other techniques in this book.
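
Context compression, from the list of optimizations, helps in either setup because fewer prompt tokens reach the model. The sketch below shows only the crudest form, which is dropping the oldest chatbot messages until the history fits a budget. It is an illustration under stated assumptions: `Message` and `trim_history` are hypothetical names, and the budget is counted in characters as a stand-in for tokens. Summarizing older messages instead of dropping them is another option not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Message {
    std::string role;  // e.g. "user" or "assistant"
    std::string text;
};

// Keep the most recent messages whose combined text fits within max_chars.
// Older messages are dropped whole; the order of kept messages is preserved.
std::vector<Message> trim_history(const std::vector<Message>& history,
                                  std::size_t max_chars) {
    std::vector<Message> kept;
    std::size_t used = 0;
    // Walk backwards from the newest message and stop at the first
    // message that would exceed the budget.
    for (auto it = history.rbegin(); it != history.rend(); ++it) {
        if (used + it->text.size() > max_chars) break;
        used += it->text.size();
        kept.push_back(*it);
    }
    assert(used <= max_chars);
    std::reverse(kept.begin(), kept.end());  // restore chronological order
    return kept;
}
```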

 

