Aussie AI
AI Middleware
-
Last Updated 12 December, 2024
-
by David Spuler, Ph.D.
AI middleware is the layer in the tech stack above the LLMs. It can provide services such as prompt extensions, conversational history management, and other relatively low-level functionality. As such, middleware can operate as a wrapper around remote AI API services, or can run alongside a self-hosted open source LLM on the same local servers.
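To make this concrete, here is a minimal Python sketch of the wrapper idea, using hypothetical class names and stubbed backends in place of a real remote AI API client or a real local inference engine:

# Middleware as a thin wrapper: one uniform interface in front of either
# a remote AI API service or a self-hosted local LLM. All names here are
# illustrative stand-ins, not any vendor's actual client library.

from abc import ABC, abstractmethod

class LLMBackend(ABC):
    """Uniform interface that the middleware layer calls into."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class RemoteAPIBackend(LLMBackend):
    """Would send the prompt to a hosted AI API service."""
    def complete(self, prompt: str) -> str:
        return f"[remote completion for: {prompt!r}]"  # stubbed network call

class LocalModelBackend(LLMBackend):
    """Would run inference on a self-hosted open source LLM."""
    def complete(self, prompt: str) -> str:
        return f"[local completion for: {prompt!r}]"  # stubbed local inference

class Middleware:
    """The layer above the LLM: callers use the same interface
    regardless of where the model actually runs."""
    def __init__(self, backend: LLMBackend):
        self.backend = backend

    def query(self, prompt: str) -> str:
        return self.backend.complete(prompt)

mw = Middleware(RemoteAPIBackend())
print(mw.query("What is AI middleware?"))

Swapping RemoteAPIBackend for LocalModelBackend changes where inference happens without touching any calling code, which is the essential point of the wrapper layer.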
Features of LLM Middleware
Typical features that middleware components provide in the layer above individual LLM inference include (see the sketch after this list):
- Multi-LLM access (helping to avoid "vendor lock-in")
- Prompt templating (e.g., adding global instructions)
- Programmatic prompting (i.e., automatic prompt improvement)
- Conversational history management (for chatbots and Q&A sessions)
- Prompt caching (e.g., prefix KV caching)
- Logging
- Monitoring and observability
- Reporting and statistics tracking
- User identity and security credential management
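To illustrate several of these features working together, here is a small sketch combining prompt templating, conversational history management, a naive exact-match prompt cache, and logging. The class and method names are hypothetical, and a production system would use something like prefix KV caching rather than exact-match lookup:

# Sketch of several middleware features from the list above: prompt
# templating, conversational history management, prompt caching, and
# logging. All names are illustrative, not a real library's API.

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("middleware")

class ChatMiddleware:
    def __init__(self, llm_fn, system_prompt: str):
        self.llm_fn = llm_fn                 # underlying LLM call (any provider)
        self.system_prompt = system_prompt   # global instructions (templating)
        self.history: list[str] = []         # conversational history management
        self.cache: dict[str, str] = {}      # naive exact-match prompt cache

    def query(self, user_prompt: str) -> str:
        # Prompt templating: prepend global instructions and prior turns.
        full_prompt = "\n".join([self.system_prompt, *self.history, user_prompt])

        # Prompt caching: reuse an earlier answer to an identical prompt.
        if full_prompt in self.cache:
            log.info("cache hit")
            return self.cache[full_prompt]

        log.info("cache miss; calling LLM")
        answer = self.llm_fn(full_prompt)

        # Record the turn so later queries carry the conversation context.
        self.history.append(f"User: {user_prompt}")
        self.history.append(f"Assistant: {answer}")
        self.cache[full_prompt] = answer
        return answer

# Usage with a stub LLM; a real deployment would call an AI API here.
mw = ChatMiddleware(lambda p: f"[answer to {len(p)}-char prompt]",
                    system_prompt="You are a helpful assistant.")
print(mw.query("Hello"))
print(mw.query("Hello"))  # cache miss: the history has grown, so the full prompt differs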
Generally speaking, most of the RAG architecture is not considered to fall under the category of "middleware." Components such as vector databases, rerankers, packers, and other RAG components have their own category in the RAG stack.
However, recent advances in Chain-of-Thought and other multi-step, inference-based reasoning algorithms have spawned another use of AI middleware at a much higher level. An AI middleware layer can wrap individual LLM queries into sequences of multiple steps, thereby implementing reasoning algorithms such as:
- Reflection
- LLM as Judge
- Chain-of-Thought
- Best-of-N
- Skeleton-of-Thought
And there are many more such multi-step reasoning algorithms.
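As a sketch of how middleware can implement one of these algorithms, here is a hypothetical Best-of-N wrapper that uses an LLM-as-Judge step to select among the candidate answers. The llm and judge callables are stubs standing in for real model queries:

# Middleware-level multi-step reasoning: Best-of-N sampling with an
# LLM-as-Judge selection step. Both model calls are hypothetical stubs.

import random

def best_of_n(llm, judge, prompt: str, n: int = 3) -> str:
    """Generate n candidate answers, then pick the one the judge scores highest."""
    candidates = [llm(prompt) for _ in range(n)]
    scores = [judge(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]

# Usage with stub models; in a real system, judge() would itself be an
# LLM query that rates each candidate answer against the prompt.
stub_llm = lambda p: f"candidate-{random.randint(0, 999)}"
stub_judge = lambda p, c: random.random()
print(best_of_n(stub_llm, stub_judge, "Why is the sky blue?"))

The same wrapper pattern generalizes to the other algorithms in the list above: each is a loop or pipeline of individual LLM queries orchestrated a layer above the inference engine.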
Research on AI Middleware
Research papers on AI middleware components and the overall AI tech stack:
- Asankhaya Sharma (codelion), Sep 2024, Optillm: Optimizing inference proxy for LLMs, https://github.com/codelion/optillm
- Noah Martin, Abdullah Bin Faisal, Hiba Eltigani, Rukhshan Haroon, Swaminathan Lamelas, Fahad Dogar, 4 Oct 2024, LLMProxy: Reducing Cost to Access Large Language Models, https://arxiv.org/abs/2410.11857 (Deploying a proxy between user and LLM, with handling of conversational history context and caching.)
- Narcisa Guran, Florian Knauf, Man Ngo, Stefan Petrescu, Jan S. Rellermeyer, 21 Nov 2024, Towards a Middleware for Large Language Models, https://arxiv.org/abs/2411.14513
- Andrew Ng, Nov 2024, Simple, unified interface to multiple Generative AI providers, https://github.com/andrewyng/aisuite
- Asif Razzaq, November 29, 2024, Andrew Ng’s Team Releases ‘aisuite’: A New Open Source Python Library for Generative AI, https://www.marktechpost.com/2024/11/29/andrew-ngs-team-releases-aisuite-a-new-open-source-python-library-for-generative-ai/
- Ian Drosos, Jack Williams, Advait Sarkar, Nicholas Wilson, 3 Dec 2024, Dynamic Prompt Middleware: Contextual Prompt Refinement Controls for Comprehension Tasks, https://arxiv.org/abs/2412.02357
- Stephen MacNeil, Andrew Tran, Joanne Kim, Ziheng Huang, Seth Bernstein, Dan Mogil, 3 Jul 2023, Prompt Middleware: Mapping Prompts for Large Language Models to UI Affordances, https://arxiv.org/abs/2307.01142
- Yu Gu, Yiheng Shu, Hao Yu, Xiao Liu, Yuxiao Dong, Jie Tang, Jayanth Srinivasa, Hugo Latapie, Yu Su, 4 Oct 2024 (v2), Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments, https://arxiv.org/abs/2402.14672
- Jiejun Tan, Zhicheng Dou, Yutao Zhu, Peidong Guo, Kun Fang, Ji-Rong Wen, 30 May 2024 (v3), Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs, https://arxiv.org/abs/2402.12052 https://github.com/plageon/SlimPLM