Aussie AI Blog

Inference Optimization Research Ideas

  • 26th August, 2024
  • by David Spuler, Ph.D.

AI Dissertation Topic Suggestions

Here are some more ideas for research papers or dissertation topics in the sub-area of "inference optimization." This is an important and active research area, because inference accounts for around 95% of the overall cost of generative AI in modern workloads.

KV cache model compression methods. A smaller KV cache is desirable because the cache is heavy in memory usage, which also costs processing time. In principle, all of the model compression methods can be applied to KV cache data, analogous to compressing the model weights, but only some of them have been researched. There's a lot of work on KV cache quantization, and some work on pruning and sparsification, such as KV cache layer pruning (depthwise pruning) or KV cache token pruning (lengthwise pruning). However, there has been little work on other types of dynamic pruning, such as width-wise techniques analogous to attention head pruning. What about less-known model compression methods like weight clustering, layer skipping, parameter sharing, or low-rank factorization/tensor decomposition? What about knowledge distillation? See KV caching research.
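
As a rough illustration of what width-wise KV cache pruning might look like, here is a minimal Python/NumPy sketch, assuming the cache for one layer is stored as [batch, heads, seq_len, head_dim] tensors. The per-head importance score and the keep ratio are placeholder choices for illustration, not a published method.

    # Sketch: width-wise (head) pruning of one layer's KV cache.
    import numpy as np

    def prune_kv_heads(keys, values, head_scores, keep_ratio=0.5):
        """Keep only the highest-scoring attention heads in the KV cache."""
        num_heads = keys.shape[1]
        num_keep = max(1, int(num_heads * keep_ratio))
        keep_idx = np.argsort(head_scores)[-num_keep:]   # indices of the top heads
        keep_idx.sort()
        # Slice the head dimension; downstream attention must use the same indices.
        return keys[:, keep_idx, :, :], values[:, keep_idx, :, :], keep_idx

    # Example: one layer's cache with 8 heads, 128 cached tokens, head size 64.
    keys = np.random.randn(1, 8, 128, 64).astype(np.float32)
    values = np.random.randn(1, 8, 128, 64).astype(np.float32)
    scores = np.abs(keys).mean(axis=(0, 2, 3))      # crude per-head importance proxy
    k_small, v_small, kept = prune_kv_heads(keys, values, scores, keep_ratio=0.5)
    print(k_small.shape, kept)                      # (1, 4, 128, 64) plus the kept head indices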

Semantic caching with KV data caching. Is it possible to combine semantic caching for similar but non-identical queries with storage of KV cache data? This is desirable because it avoids the entire prefill cost, but won't output identical text for every user entering a similar query. KV caching is a well-known optimization across multiple queries with the exact same token sequence, or a shared prefix of the token sequence. The problem with semantic caching is that the two token sequences are different, and often of different lengths. Is there a way to re-use all of the KV cache data, or is there an equivalent of the prefix KV caching algorithm? A simple idea is to discard the actual input query's token sequence and start decoding from the token sequence and KV cache data as they appear in the semantic cache. Does this work? Is it possible to do better?
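
To make the idea concrete, here is a minimal sketch of a semantic cache that stores both the cached token sequence and its prefill KV data, keyed by query embedding similarity. The embed(), prefill(), and decode_from_kv() functions in the usage comments are hypothetical stand-ins for whatever the serving stack provides.

    # Sketch: semantic cache entries carry (embedding, tokens, kv_cache).
    import numpy as np

    class SemanticKVCache:
        def __init__(self, threshold=0.95):
            self.threshold = threshold
            self.entries = []   # list of (embedding, tokens, kv_cache)

        def lookup(self, query_embedding):
            """Return the cached (tokens, kv_cache) of the most similar query, if close enough."""
            best, best_sim = None, -1.0
            for emb, tokens, kv in self.entries:
                sim = float(np.dot(emb, query_embedding) /
                            (np.linalg.norm(emb) * np.linalg.norm(query_embedding)))
                if sim > best_sim:
                    best, best_sim = (tokens, kv), sim
            return best if best_sim >= self.threshold else None

        def store(self, query_embedding, tokens, kv_cache):
            self.entries.append((query_embedding, tokens, kv_cache))

    # Usage idea (around hypothetical model functions):
    #   hit = cache.lookup(embed(query))
    #   if hit is not None:
    #       tokens, kv = hit
    #       output = decode_from_kv(tokens, kv)        # skip prefill entirely
    #   else:
    #       tokens, kv = prefill(query)
    #       cache.store(embed(query), tokens, kv)
    #       output = decode_from_kv(tokens, kv)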

LoRA inference without weight updates. The use of multiple LoRA adapters is important, since it has been chosen to underpin the on-device inference capability of Apple Intelligence for iPhone and Mac. Currently, a LoRA adapter is loaded by adding its deltas to all the weights in the underlying large model, and later unloaded by subtracting them again. There are already a few papers on reducing the cost of this load-and-unload sequence, which has to touch every weight twice. When swapping LoRA adapters, we do a subtraction and then an addition of weight differences. Can these be combined into a single addition of a combined set of differences? Even better, can it be avoided completely? Is there a way to run inference without updating the main model's weights at all? The idea is to run inference with the main weights unchanged, run a separate set of computations on just the LoRA weights, and then combine them at the end. Sounds simple when you say it like that.
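
One way to see why this might work: with a LoRA adapter, the merged layer computes (W + BA)x, which is mathematically identical to Wx + B(Ax), so the low-rank path can be computed on the side and added to the base output without touching the base weights. Here is a minimal NumPy sketch of that equivalence; the shapes are illustrative and the usual LoRA scaling factor is omitted.

    import numpy as np

    d_out, d_in, rank = 16, 32, 4
    W = np.random.randn(d_out, d_in).astype(np.float32)    # frozen base weights
    A = np.random.randn(rank, d_in).astype(np.float32)     # LoRA down-projection
    B = np.random.randn(d_out, rank).astype(np.float32)    # LoRA up-projection
    x = np.random.randn(d_in).astype(np.float32)

    # Merged approach: touches every base weight on load (and again on unload).
    y_merged = (W + B @ A) @ x

    # Unmerged approach: base weights untouched; extra cost is two small matmuls.
    y_side = W @ x + B @ (A @ x)

    print(np.allclose(y_merged, y_side, atol=1e-4))         # True: same result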

Multi-LoRA parallel multi-model inference. Current usage of multiple LoRA adapters is to load and unload them, using one at a time. But what if you wanted the results from two of them? Currently, you'd have to run inference twice, using two sets of memory and two computations. Running two full executions in parallel is difficult for on-device inference, such as on AI phones or AI PCs. Can you do better? Is there an inference optimization whereby multiple LoRA adapters can be run on the same base model in the same memory, but without full parallel computation? Can you somehow run partial computations of multiple LoRA adapters using fewer weights?
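
Continuing the unmerged-LoRA idea above, here is a minimal NumPy sketch where the shared base projection is computed once and each adapter only contributes its own small low-rank correction. The adapter count, rank, and shapes are arbitrary illustrative values, not a real serving setup.

    import numpy as np

    d_out, d_in, rank, num_adapters = 16, 32, 4, 3
    W = np.random.randn(d_out, d_in).astype(np.float32)              # one shared base model
    adapters = [(np.random.randn(rank, d_in).astype(np.float32),     # A_i
                 np.random.randn(d_out, rank).astype(np.float32))    # B_i
                for _ in range(num_adapters)]
    x = np.random.randn(d_in).astype(np.float32)

    base = W @ x                                             # shared: computed and stored once
    outputs = [base + B @ (A @ x) for (A, B) in adapters]    # one cheap low-rank pass per adapter

    for i, y in enumerate(outputs):
        print(f"adapter {i}: output shape {y.shape}")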

Granular quantization methods (layers or blocks). There's a bazillion papers on quantization, but not many on low-level granular quantization, with different quantization parameters for each low-level structure. Vector-level quantization has a lot of papers, since it's a long-standing technique from classic ML, but there's less research on layerwise quantization and blockwise quantization.
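
For reference, here is a minimal NumPy sketch of blockwise quantization, where every fixed-size block of a weight tensor gets its own INT8 scale rather than one scale for the whole tensor. The block size of 64 is an arbitrary illustrative choice.

    import numpy as np

    def quantize_blockwise(weights, block_size=64):
        flat = weights.reshape(-1)
        pad = (-len(flat)) % block_size
        flat = np.pad(flat, (0, pad))               # pad to a whole number of blocks
        blocks = flat.reshape(-1, block_size)
        scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
        scales[scales == 0] = 1.0                   # avoid divide-by-zero for all-zero blocks
        q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
        return q, scales.astype(np.float32)

    def dequantize_blockwise(q, scales, shape):
        flat = (q.astype(np.float32) * scales).reshape(-1)
        return flat[:np.prod(shape)].reshape(shape)

    W = np.random.randn(128, 96).astype(np.float32)
    q, s = quantize_blockwise(W)
    W_hat = dequantize_blockwise(q, s, W.shape)
    print("max abs error:", float(np.abs(W - W_hat).max()))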

Quadruple pruning. I'm still "weighting" for someone to combine all four of the possible pruning dimensions for an adaptive inference optimization: (a) depthwise pruning (e.g., layer pruning, early exit, layer skipping, layer fusion), (b) widthwise pruning (e.g., attention head pruning, slimmable networks), (c) lengthwise pruning (e.g., input token pruning, token merging, prompt compression), and (d) embedding dimension pruning. There have been a handful of papers with two or three pruning methods combined. I might be waiting a while for four.
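
Just to show how the four dimensions might sit together, here is a tiny illustrative Python sketch of a combined pruning configuration. The field names, default ratios, and the crude multiplicative compute estimate are placeholders, not an existing framework or a validated cost model.

    from dataclasses import dataclass

    @dataclass
    class QuadruplePruningConfig:
        depth_keep_layers: float = 0.75    # depthwise: fraction of layers kept (early exit, skipping)
        width_keep_heads: float = 0.75     # widthwise: fraction of attention heads kept
        length_keep_tokens: float = 0.5    # lengthwise: fraction of input tokens kept (pruning/merging)
        embed_keep_dims: float = 0.9       # embedding dimension: fraction of channels kept

        def compute_fraction(self):
            """Very rough multiplicative estimate of remaining compute after all four prunings."""
            return (self.depth_keep_layers * self.width_keep_heads *
                    self.length_keep_tokens * self.embed_keep_dims)

    cfg = QuadruplePruningConfig()
    print(f"approx. fraction of original compute: {cfg.compute_fraction():.2f}")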

More AI Research Topics

Read more about: