Aussie AI Blog
Reasoning is the New AI Middleware
December 9th, 2024
by David Spuler, Ph.D.
Strawberries are Smarter
A strange thing has been happening to the AI industry of late in the race to smarter reasoning models: training is out; inference is in.
It used to be that if you wanted to build a smarter model, you added a few (hundred) billion extra model parameters, a few trillion extra training tokens, and a few (hundred) thousand extra GPUs. After a few weeks, enough coolant water to fill the Sahara, and the GDP of a small country in electricity, voila, a smarter frontier LLM.
The mantra has always been that, in terms of LLM intelligence and reasoning ability, bigger is better. This idea, dubbed the "scaling law," has long underpinned AI: to get a smarter model, train a bigger one, because scaling up the model's size also scales up its smarts.
Lately, not so much.
The change really became noticeable with the release of the "o1" model by OpenAI, also known as Project Strawberry, in September 2024. Whereas GPT-4o in May 2024 had been extended to "multimodal" uses, its gains still came from training. What was notable about the "o1" model was that it was smarter not because of better training, but because it used extra steps of inference. The underlying algorithm is called "Chain-of-Thought" (or "CoT" if you want to save your fingers), and the key trick is that it uses multiple LLM queries instead of one.
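To make that difference concrete, here's a minimal sketch in Python. The `ask_llm()` function is a made-up placeholder for a single inference call, and the prompts are illustrative only; the point is simply that one user question fans out into several LLM calls whose outputs feed each other, not that this is how o1 actually works internally.

```python
# A minimal sketch of the shift from one-shot inference to multi-step inference.
# ask_llm() is a hypothetical stand-in for a single inference call, not o1's internals.

def ask_llm(prompt: str) -> str:
    """Hypothetical one-shot LLM call; swap in your real inference API."""
    return f"[model output for: {prompt[:40]}...]"

def one_shot(question: str) -> str:
    # The old way: one query, one answer.
    return ask_llm(question)

def multi_step(question: str) -> str:
    # The strawberry way: several queries whose outputs feed each other.
    steps = ask_llm("List the reasoning steps needed to solve:\n" + question)
    working = ask_llm(f"Question: {question}\nWork through these steps:\n{steps}")
    return ask_llm(f"Working:\n{working}\nNow state just the final answer.")
```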
AI Middleware
There's always been a "middleware" part of the AI tech stack. The overall layers look something like this:
- User Interface
- Application Layer
- Middleware
- LLMs and AI Engine (Transformers)
- Hardware layer (GPUs)
The key point about answering a user's query: one-shot LLM inference. The LLM is only asked to respond once.
Traditionally, the AI middleware layer has been about managing LLM requests from the application layer and routing them to the appropriate inference server. Hence, the strawberry-free AI middleware has features such as the following (there's a minimal code sketch after the list):
- LLM request routing
- Error handling and retries
- Session management (e.g., conversational history of a chatbot session)
- Logging
- Security credential management
- RAG datastore retrieval of "chunks" of documents
- Observability and instrumentation (e.g., monitoring, performance analysis)
- Reporting and statistics tracking
- Billing management
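To give a feel for that layer, here's a heavily simplified Python sketch of a strawberry-free middleware wrapper covering a few of the bullets above (routing, retries, logging, session history). The `backends` registry and `call_backend()` function are made-up illustrations under assumed names, not any particular framework's API.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ai-middleware")

# Hypothetical registry of inference servers (model name -> endpoint).
backends = {
    "small-chat": "http://inference-1.internal/v1",
    "big-reasoner": "http://inference-2.internal/v1",
}

def call_backend(endpoint: str, prompt: str) -> str:
    """Stand-in for the real HTTP call to an inference server."""
    return f"[response from {endpoint}]"

def handle_request(session: list, user_prompt: str, model: str = "small-chat",
                   max_retries: int = 3) -> str:
    """One-shot, strawberry-free middleware: route, retry, log, track history."""
    endpoint = backends[model]                      # LLM request routing
    prompt = "\n".join(session + [user_prompt])     # session management (chat history)
    for attempt in range(1, max_retries + 1):
        try:
            start = time.time()
            reply = call_backend(endpoint, prompt)  # one LLM call per user query
            log.info("model=%s latency=%.3fs", model, time.time() - start)  # observability
            session.extend([user_prompt, reply])    # update conversational history
            return reply
        except Exception as err:                    # error handling and retries
            log.warning("attempt %d failed: %s", attempt, err)
    raise RuntimeError("LLM backend unavailable after retries")
```

Note that none of this touches the model itself; it's all orchestration sitting one layer above the LLM.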
In truth, there's always been more than one LLM request behind the scenes. Responding to a user's request often needs a preliminary planning query, to decide things such as the following (see the planning sketch after this list):
- Whether an external data source is needed (e.g., search the web).
- Whether a RAG data chunk is needed or not.
- Whether a tool "function call" is needed (e.g., clocks, calculators).
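A rough sketch of that planning step might look like this: one cheap preliminary LLM call classifies the request, and the middleware then decides whether to hit the web, the RAG datastore, or a tool before the main inference call. The `ask_llm()`, `search_web()`, `retrieve_chunks()`, and `run_tool()` names are hypothetical placeholders.

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical single LLM call; replace with your inference API."""
    return "RAG"  # pretend the planner asked for document retrieval

def search_web(query: str) -> str: return "[web results]"       # external data source
def retrieve_chunks(query: str) -> str: return "[RAG chunks]"   # RAG datastore lookup
def run_tool(query: str) -> str: return "[tool output]"         # e.g., clock, calculator

def answer_with_planning(user_query: str) -> str:
    # Preliminary planning request: one extra LLM call before the "real" one.
    plan = ask_llm(
        "Classify what this request needs: WEB, RAG, TOOL, or NONE.\n" + user_query
    ).strip().upper()

    context = ""
    if "WEB" in plan:
        context = search_web(user_query)
    elif "RAG" in plan:
        context = retrieve_chunks(user_query)
    elif "TOOL" in plan:
        context = run_tool(user_query)

    # Main inference request, with whatever extra context the plan called for.
    return ask_llm(f"Context:\n{context}\n\nAnswer the user:\n{user_query}")
```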
But now there's a new thing that middleware has to handle: multi-step inference algorithms for advanced reasoning.
Reasoning as Middleware
The new reasoning algorithms no longer use a one-shot inference method. Instead, they use "multi-shot reasoning" or "multi-step inference" methods. So we've got this weird new layer in the AI stack, whereby the multi-step reasoning algorithms act as a kind of middleware.
The Chain-of-Thought and other reasoning algorithms sit above the raw LLM layer. There are numerous AI reasoning algorithms that involve sending multiple requests to the LLM before combining the results into a final answer that gets sent back to the user. Here's a list of some of the multi-step inference methods for smarter LLM reasoning:
- Chain-of-Thought (in the "o1" model)
- Self-reflection
- Skeleton-of-Thought
- Tree-of-Thought
- Graph-of-Thought
And there are probably more. In fact, there are probably dozens more in the research papers, if you care to look for them. Actually, yeah, I forgot a few of the main ones: Best-of-N (BoN), LLM-as-Judge, and others. One such combination is sketched below.
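As one concrete example, here's a rough Python sketch combining Best-of-N with LLM-as-Judge: the reasoning layer fires off N independent attempts at the same question, then uses one more LLM call as the judge to pick a winner. As before, `ask_llm()` is a hypothetical stand-in for a real inference call; a production version would sample the candidates in parallel and parse the judge's verdict far more carefully.

```python
import random

def ask_llm(prompt: str) -> str:
    """Hypothetical single LLM call; swap in your real inference API."""
    return f"[candidate answer {random.randint(1, 1000)}]"

def best_of_n(question: str, n: int = 5) -> str:
    # Multi-step inference: N independent attempts at the same question...
    candidates = [ask_llm(question) for _ in range(n)]

    # ...then one more LLM call acting as the judge to pick a winner.
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    verdict = ask_llm(
        f"Question: {question}\n"
        f"Candidate answers:\n{numbered}\n"
        "Reply with the number of the best answer."
    )

    # In a real system you'd parse the judge's reply robustly; this is a sketch.
    try:
        choice = int("".join(ch for ch in verdict if ch.isdigit())) - 1
    except ValueError:
        choice = 0
    return candidates[choice if 0 <= choice < n else 0]
```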
Training Fights Back?
Sam Altman is on record saying "there is no wall," by which he means there's nothing blocking more advanced training from leading to smarter models. This remains to be seen, but he's probably in the best position to know, so I wouldn't bet against it.
Nevertheless, there are some reasons to think that training and its scaling laws could be under threat:
- Training data scarcity — much of the open source training data has already been used, although there are new pockets of both free and paid data.
- Electricity and water constraints for data centers.
- Capital costs of all those GPU chips.
- Inherent limits of the Transformer engine architecture.
- Underwhelming training performance — some recent media reports state that new model training has resulted in only incremental improvements in the reasoning ability of next-gen models (unconfirmed).
- Public pronouncements — e.g., Microsoft CEO Satya Nadella saying the "low-hanging fruit" is gone.
On the other hand, there are plenty of AI research areas aiming to improve training:
- Hardware advances — from NVIDIA Blackwell GPUs and many other hardware vendors and startups.
- Network hardware — sending data between multiple GPUs faster.
- Data center improvements — e.g., modular nuclear power.
- Synthetic data — using LLM-generated data to train LLMs.
- Derivative data — automatically extending data size (e.g., by synonyms).
- Training software optimizations — e.g., better computations, communication scheduling, and more.
- Resiliency improvements — e.g., in-memory checkpointing, handling of silent data corruption (SDC).
Personally, I wouldn't like to bet against human ingenuity.