Aussie AI Blog

What's Hot in LLM Inference Optimization in 2025?

  • March 3rd, 2025
  • by David Spuler, Ph.D.

Inference Optimization News in 2025

Surprisingly, 2025 started with a huge focus on LLM efficiency, thanks to the release of DeepSeek's R1, an advanced reasoning model that matched or outpaced the frontier models on many benchmarks at a fraction of the cost.

DeepSeek's Efficiency Advancements

Although it was the reportedly cheap training cost of DeepSeek R1 that sent tremors through NVIDIA's stock price, the DeepSeek architecture improvements spanned several categories, including several advancements to inference efficiency in reasoning models.

Some of these algorithms had actually appeared in the earlier V3 model, and were then carried over to the R1 reasoning model. Read more here: DeepSeek's research advancements.

Neural Magic Acquisition

Another industry change related to inference efficiency that received a lot less attention was the acquisition of Neural Magic, a Boston-based AI inference startup, by IBM's Red Hat in late 2024. As one of the few pure-play inference optimization startups, they had raised over $50m in venture capital, and have now exited (price undisclosed). With a focus on the software stack, and especially on dynamic sparsity, they achieved significant advances in inference efficiency.

New Research on Reasoning Efficiency

Chain-of-Thought efficiency. The rise of reasoning models from several major AI companies (not just DeepSeek) has led to a rash of follow-on papers on improving the efficiency of LLM reasoning algorithms. Reasoning models, especially ones using multi-step inference, are very costly to run.

In particular, Chain-of-Thought tends to emit a huge number of tokens as it "talks to itself" to achieve good reasoning results. This is true of both single-step CoT "long answer" models (e.g. DeepSeek R1) and multi-step CoT versions with "test-time compute" (e.g. OpenAI's o3 and Google's Gemini reasoning models). Hence, a number of different techniques have been tested in the research to reduce the token processing cost of CoT. These range from high-level changes to the CoT algorithm, such as skipping steps and pruning redundant reasoning paths (in multi-step CoT), down to much lower-level optimizations such as token compression.
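As one concrete illustration of pruning redundant reasoning paths, here is a toy sketch of a pruning pass for a multi-step CoT search that drops candidate reasoning steps too similar to steps already kept. The `embed` helper and the similarity threshold are illustrative assumptions, not a specific published method.

    import numpy as np

    def prune_redundant_paths(candidate_steps, embed, threshold=0.9):
        # Toy pruning pass for multi-step CoT: drop candidate reasoning steps
        # whose embedding is nearly identical to one already kept, so the search
        # does not spend tokens re-exploring the same line of reasoning.
        # `embed` is a hypothetical function mapping a text step to a vector.
        kept, kept_vecs = [], []
        for step in candidate_steps:
            v = np.asarray(embed(step), dtype=float)
            v = v / (np.linalg.norm(v) + 1e-12)   # normalize for cosine similarity
            if all(float(np.dot(v, u)) < threshold for u in kept_vecs):
                kept.append(step)
                kept_vecs.append(v)
        return kept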

In general, almost all of the 500+ LLM optimization techniques could theoretically be used in the special case of reasoning algorithms, but so far only a small number of these methods have been tested in Chain-of-Thought. For more information, see:

Small Reasoning Models. Another related area of research is "Small Reasoning Models," which is the use of smaller models with reasoning capabilities. This has been approached in various ways:

  • Multi-step CoT algorithms wrapped around smaller base models.
  • Improved training and fine-tuning of reasoning techniques applied to small models.
  • Distillation of large reasoning models into smaller student models (see the sketch after this list).
  • Model compression of larger reasoning models (e.g. quantization of DeepSeek R1).
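To illustrate the distillation item, here is a minimal sketch of a single distillation step that matches a small student's next-token distributions against a large reasoning model's outputs on reasoning traces. It assumes Hugging Face-style causal LMs (the `.logits` attribute) and is a generic recipe for illustration only, not any particular lab's procedure.

    import torch
    import torch.nn.functional as F

    def distillation_step(student, teacher, input_ids, optimizer, temperature=2.0):
        # One training step distilling a large reasoning model (teacher) into a
        # small model (student) by matching next-token distributions.
        # Assumes causal LMs whose forward pass returns .logits of shape
        # (batch, seq, vocab); input_ids would be CoT traces from the teacher.
        with torch.no_grad():
            teacher_logits = teacher(input_ids).logits
        student_logits = student(input_ids).logits
        loss = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * (temperature ** 2)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()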

Endlessly Hot Research Areas

There continues to be an endless flow of research papers on these LLM efficiency optimizations:

Quantization: Ever the hottest area, quantization is widespread throughout industry, not just in research labs. Some of the latest hot areas include:

Speculative decoding: Improvements to speculative decoding have included more accurate draft models and better use of multi-machine "distributed speculative decoding." In another extension, prompt lookup decoding has been generalized to look beyond the current prompt to the entire history of prompts across multiple queries. (See the blog article "What's New in Speculative Decoding?")
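To show the prompt-lookup idea, here is a toy sketch of a drafting function that matches the most recent n-gram of the generated output against the current context and a history of earlier prompts; the drafted tokens would then be verified in parallel by the target model, as in ordinary speculative decoding. Function and parameter names are illustrative, not from any particular library.

    def draft_from_history(generated, history, ngram=3, max_draft=8):
        # Prompt-lookup drafting: propose draft tokens by matching the last
        # n-gram of the generated token sequence against the current context
        # and, per the generalization above, earlier prompts as well.
        if len(generated) < ngram:
            return []
        key = tuple(generated[-ngram:])
        for source in [generated] + list(history):   # current context, then past prompts
            for i in range(len(source) - ngram):
                if tuple(source[i:i + ngram]) == key:
                    # Copy the continuation after the match as the draft sequence.
                    return list(source[i + ngram : i + ngram + max_draft])
        return []                                    # no match: fall back to normal decoding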

KV cache optimizations: KV cache compression remains hot because the KV cache is the main bottleneck for long context processing. In fact, we're now into the realm of "ultra-long contexts" of over 1M tokens, which requires KV cache optimizations. Quantization of KV cache data to INT4 or INT2 is now common, but there's also research on numerous other ways to compress the KV cache, such as token pruning and head fusion. Here are some subareas:

Generally speaking, almost all of the LLM model compression optimizations (to model parameters or activations) can be re-applied in the new realm of the KV cache, so there's no shortage of future paper topics.
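To make that concrete, here is a minimal sketch of per-token symmetric INT8 quantization applied to a cached K or V tensor (one scale per cached token per head). Real INT4/INT2 schemes add group-wise scaling, outlier handling, and packed storage, so treat this as the basic idea only.

    import torch

    def quantize_kv_int8(kv):
        # Symmetric INT8 quantization of a KV cache tensor, e.g. of shape
        # (layers, heads, seq, head_dim). Reducing over the last dim gives one
        # scale per cached token per head.
        scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
        q = torch.clamp(torch.round(kv / scale), -127, 127).to(torch.int8)
        return q, scale   # store int8 values plus the scales

    def dequantize_kv_int8(q, scale):
        # Reconstruct approximate float values before the attention matmul.
        return q.to(scale.dtype) * scale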

Fused KV caching. My favorite area, and the one I feel is poised for a breakout, is the generalization of prefix KV caching to non-prefix strings by concatenating the KV caches together and making adjustments. Recent research has examined the accuracy problems and made improvements to the handling of positional encoding adjustments and cross-chunk attention; see substring/fused/concatenated KV cache. Yes, it needs a better name!
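To sketch only the easy part of the idea: suppose each text chunk has a KV cache computed with RoPE as if the chunk started at position zero. Because rotary rotations compose additively, shifting a chunk to its global offset amounts to one extra rotation of its keys, while values need no change. This is a simplified sketch under that assumption (using the "half-split" RoPE layout); the cross-chunk attention corrections studied in the papers are not shown.

    import torch

    def rope_rotate(x, positions, base=10000.0):
        # Apply a rotary position rotation to x of shape (seq, head_dim) at the
        # given integer positions, using the "half-split" pairing convention.
        half = x.shape[-1] // 2
        freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
        angles = positions[:, None].float() * freqs[None, :]
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., :half], x[..., half:]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

    def fuse_kv_chunks(chunks):
        # chunks: list of (K, V) pairs of shape (seq, head_dim), each cached as
        # if its chunk started at position 0. Rotate each chunk's keys by its
        # global offset and concatenate; values are concatenated unchanged.
        fused_k, fused_v, offset = [], [], 0
        for k, v in chunks:
            seq = k.shape[0]
            shift = torch.full((seq,), offset)
            fused_k.append(rope_rotate(k, shift))
            fused_v.append(v)
            offset += seq
        return torch.cat(fused_k, dim=0), torch.cat(fused_v, dim=0)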

New attention algorithms: Another way to achieve long context processing is to alter the attention algorithm itself. Every paper on this topic, and there are many, chooses a new name for its approach. Hence, there are numerous new candidates:

We seem to be stuck at version 3 of Flash Attention, which was released eons ago in July 2024. There also hasn't been much added recently to the literature on Paged Attention. Some of the other newer attention algorithms have been combined with these two major memory-efficient attention algorithms, but there hasn't been a core advance in these techniques lately.

More AI Research Topics

Read more about: