Aussie AI
Research Topic Ideas
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Need a topic for a research paper or a thesis? Here are some thoughts within the “inference optimization” subarea of AI research, which excludes the areas of training research and making AI smarter.
Our research focus is on optimizing these algorithms so that AI models respond quickly to users (low “latency”) and have a high overall throughput so as to scale efficiently. They need to be much faster, not only to reduce data center GPU costs, but also to run efficiently on your smartphone or AI laptop.
Below are some suggestions for research topics, based on observations about areas of AI research that seem to be under-researched. These topics are primarily in inference optimization for neural networks and large language models. Original research topics include:
- Phone AI. Smartphone-level inference efficiency remains elusive. The general research area of running AI models on low-resource platforms is called “edge computing.” This whole area has lots of research subareas, but it is still waiting for a major breakthrough.
- Integer-only-arithmetic quantization. Much of the quantization research focuses on model size reduction but still uses floating-point multiplications for inference, although they can be avoided. This area seems under-researched when you consider its potential (see the integer-only dot product sketch after this list).
- Non-shift bitwise zero-multiplication models. Zero-multiplication inference using bitwise-OR or bitwise-AND operations. Papers in this area have used addition, but the bitwise operations might work too, and each has slightly different characteristics compared to addition (see the bitwise-AND sketch after this list).
- Matrix algebra. Advanced matrix algebra has promise to reduce FLOPs. Sparse matrices, butterfly matrices, Monarch matrices, low-rank matrix factorization, tensor decomposition, etc.
- Hybrid inference optimizations. There are many research papers on using two or more model compression optimizations together (e.g. quantization and pruning), but this is a combinatorially large space, many combinations remain unresearched, and a thorough overview of all of the possible combinations, with citations to related research, also seems to be missing (presumably because it would be a huge paper).
- Layer skipping. Skipping individual layers during inference (dynamic layer pruning), rather than just “early exiting” to skip all of the remaining layers. There are various papers, but there's room for more research (see the layer-skipping loop sketch after this list).
- Layer reordering. This is a technique that seems like it shouldn't work well. What happens if you run each layer twice? Is it more accurate? (Obviously, it's slower.) What happens if you run all the layers in reverse? Is it true that the initial layers do the general, broader understanding and the later layers do the finessing?
- Approximate multipliers. Use of approximate arithmetic multiplication algorithms in software for edge platforms with no hardware acceleration, or with different types of limited hardware acceleration (see the approximate multiplier sketch after this list).
- Tokenizer and vocabulary theory. Tokenizer word-to-token ratios and their impact on inference latency via model size (vocabulary size) and input sequence length. Tokenization with larger tokens, such as multi-word tokens, would mitigate the auto-regression latency issue, but would increase the vocabulary size (thereby massively increasing model size). Is there a worthwhile trade-off?
- Multi-AI. Multi-model algorithms are an interesting area (often called “ensemble algorithms”). Two might be better than one for AI engines. There's a lot of research already (e.g. big-little architectures, collaborative inference, consensus-based decoding, speculative decoding, etc.), but there is much room for advancement here. What new advanced capabilities are possible by leveraging two or more AI engines?
- Logarithmic quantization. Power-of-two quantization with bitwise-shift inference seems to be a fast inference method with many papers, but model accuracy remains problematic (see the power-of-two dot product sketch after this list).
- Double-bit power-of-two quantization. Two shifts might be better than one (in terms of model accuracy). There are a few papers, but there's room for innovation (see the double power-of-two sketch after this list).
- Granular quantization. Fine-granularity quantization at a level finer than per-tensor scaling, such as per-channel or per-group quantization. This is hard to implement efficiently, so it might have to be done in deep learning compilers.
- Lookup tables (LUTs). Zero-multiplication inference is possible using table lookups. A simple idea that trades extra space for lower latency, looks effective, and has had relatively little research attention (see the lookup-table multiplication sketch after this list).
- Hashing and vector databases. There are quite a lot of papers, but there seems to be room for more. Hashing is an O(1) lookup operation if it succeeds, but hashing a vector of numbers, or comparing two vectors for similarity via their hashes, is quite tricky. Probably more to come in this area (see the sign-hash sketch after this list).
- Bloom filters. A data structure that extends hashing and bit vectors with O(k) complexity, and which has occasionally been used in neural network papers (see the Bloom filter sketch after this list).
- Data structures. As already mentioned for hashing and Bloom filters, an overall review and comprehensive theoretical basis for the various (non-hardware) data structures used in AI is needed, both for the individual data structures and across all of them.
- Dyadic quantization. It's interesting and might have some promise because it replaces multiplication with addition and bitshifts, but isn't as constrained in terms of unique weights as power-of-two quantization, so it's probably more accurate.
- Odd quantization bit sizes. There are plenty of papers on 4-bit and 8-bit integer quantization (and binary or ternary), a few papers on 2-bit quantization, but hardly any papers focused on 3-bit or 5-bit quantization (or 6-bit or 7-bit). Why not? They seem to work and offer good trade-offs in space versus model accuracy.
- Double-byte quantization (9 to 15 bits). There are not many papers on 9-bit to 15-bit integer quantization, with more focus on 8-bit or lower bitwidths, or on full 16-bit integer quantization (for understandable reasons). Obviously, there's less benefit to memory utilization from using more bits, and a consequent extra CPU/GPU cost of data processing, but the models should be more accurate than with 8-bit quantization.
- Non-ternary 2-bit quantization. This method has four weight values, -1, 0, +1, and +2, thereby using zero, one, or two additions/subtractions rather than a multiplication. It hasn't received as much attention as binary or ternary quantization, but 2-bit quantization might be more accurate than ternary quantization with the same space usage, and it allows a zero-multiplication model (see the 2-bit dot product sketch after this list).
- Streaming of model inference. Is it possible to start model inference before the entire model has been downloaded? Initial thoughts are that you can't run half a model, but you know what, you actually can, and there are two main ways. I have yet to see a paper on this idea; there are papers on “streaming” of models, but they're about using a model on a stream of data, not streaming the model itself.
- Model-specific file compression algorithms. Whether the standard Zip and Bzip algorithms for file compression can be improved upon for the byte-wise compression of model files. What are the compression characteristics of model files, including the various quantized file formats with different bit sizes? Specific applications are for (a) transmission over the internet, and/or (b) efficient file storage on devices such as smartphones (with the need to quickly uncompress the file to load it fully into RAM).
- NAS for Dynamic Inference Hyper-Parameters. There are a lot of dynamic (adaptive) inference optimization strategies, and they all have hyper-parameters. For example, early exit has hyper-parameters such as the minimum number of layers executed before early exiting is considered, and the configuration of the decision algorithm that chooses whether to exit at a given layer. Searching for an optimal set of such dynamic optimization hyper-parameters is an extension of NAS that I call “dynamic NAS.”
- Quadruple axis pruning. Multi-dimensional pruning is not yet fully researched. There are several papers on dual pruning (depth and width) and only a handful of research papers on triple pruning (adding length pruning), but apparently there's not yet a fourth dimension of pruning.
- Pruning Positional Embedding. Some papers have found that positional embedding (also called positional encoding) can be completely removed (abbreviated as “NoPE”). Somehow, the AI engine learns inter-token positional information without needing a positional encoding module. This is poorly understood and a new area with few papers.
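To make a few of these ideas more concrete, the rest of this page gives some small, illustrative C++ sketches. They are minimal sketches under simplifying assumptions, not reference implementations, and all of the type and function names are invented for illustration. The first sketch shows the core of integer-only-arithmetic quantization: an 8-bit integer dot product accumulating into a 32-bit integer, with the floating-point rescaling assumed to be applied (or folded into fixed-point arithmetic) afterwards by the caller.

```cpp
#include <cstdint>
#include <cstddef>

// Integer-only dot product: int8 weights and activations, int32 accumulator.
// The caller applies the combined scale factor afterwards, e.g.
// result = acc * weight_scale * activation_scale (or folds it into fixed point).
int32_t dot_product_int8(const int8_t* weights, const int8_t* activations, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += static_cast<int32_t>(weights[i]) * static_cast<int32_t>(activations[i]);
    }
    return acc;
}
```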
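The next sketch is purely speculative: it shows what a zero-multiplication “dot product” might look like if the elementwise multiply were replaced by a bitwise AND of unsigned quantized magnitudes. Whether a model could be trained to acceptable accuracy with such an operation is exactly the open question in the non-shift bitwise zero-multiplication topic above.

```cpp
#include <cstdint>
#include <cstddef>

// Speculative zero-multiplication "dot product": the elementwise multiply is
// replaced by a bitwise AND of unsigned quantized magnitudes. This illustrates
// the operation only; accuracy after training is the open research question.
int32_t bitwise_and_dot(const uint8_t* weights, const uint8_t* activations, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += static_cast<int32_t>(weights[i] & activations[i]);  // AND instead of multiply
    }
    return acc;
}
```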
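For the layer skipping topic, here is a sketch of the inference loop structure, assuming hypothetical Layer and HiddenState types and a placeholder should_skip decision function; the research interest is in what that decision function should be.

```cpp
#include <vector>

// Hypothetical minimal types for illustration only.
struct HiddenState { std::vector<float> values; };

struct Layer {
    // A real layer would do attention/FFN work; this placeholder is the identity.
    HiddenState forward(const HiddenState& in) const { return in; }
};

// Placeholder skip decision; real research would use a learned or
// heuristic confidence measure here.
bool should_skip(const HiddenState& /*state*/, int /*layer_index*/) {
    return false;
}

// Dynamic layer skipping: individual layers may be skipped, unlike early
// exit, which stops at one layer and skips all of the remaining layers.
HiddenState run_with_layer_skipping(const std::vector<Layer>& layers, HiddenState state) {
    for (size_t i = 0; i < layers.size(); ++i) {
        if (should_skip(state, static_cast<int>(i))) {
            continue;  // skip this layer entirely, keeping the state unchanged
        }
        state = layers[i].forward(state);
    }
    return state;
}
```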
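For the approximate multipliers topic, this sketch implements Mitchell's classic logarithm-based approximate multiplication in pure software for unsigned 16-bit operands, using a fixed-point approximation of log2. It is one possible choice of algorithm, not the only one; its error can approach roughly 10%, which is the kind of trade-off that needs evaluating against inference accuracy.

```cpp
#include <cstdint>

// Mitchell's logarithm-based approximate multiplication (software sketch).
// log2(x) is approximated as k + f, where k is the position of the leading
// 1 bit and f is the remaining fraction; the logs are added and converted back.

static int highest_bit(uint32_t x) {   // position of the leading 1 bit
    int k = 0;
    while (x >>= 1) ++k;
    return k;
}

uint32_t approx_multiply(uint16_t a, uint16_t b) {
    if (a == 0 || b == 0) return 0;
    const int FRAC = 16;               // fixed-point fraction bits for the log domain
    int ka = highest_bit(a);
    int kb = highest_bit(b);
    // Fractional parts of the approximate logs, in Q16 fixed point.
    uint32_t fa = (static_cast<uint32_t>(a - (1u << ka)) << FRAC) >> ka;
    uint32_t fb = (static_cast<uint32_t>(b - (1u << kb)) << FRAC) >> kb;
    uint32_t logsum = (static_cast<uint32_t>(ka + kb) << FRAC) + fa + fb;
    int k = static_cast<int>(logsum >> FRAC);   // integer part of the summed log
    uint64_t f = logsum & ((1u << FRAC) - 1);   // fractional part
    // Antilog: 2^(k + f) is approximated as 2^k * (1 + f).
    return static_cast<uint32_t>((((uint64_t{1} << FRAC) + f) << k) >> FRAC);
}
```

For example, approx_multiply(7, 7) returns 48 rather than the exact 49, which is the characteristic underestimation of Mitchell's method.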
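For logarithmic (power-of-two) quantization, this sketch assumes the activations are already in integer fixed-point form and each weight is stored as a sign plus an exponent, so the multiply in the dot product becomes a right shift; the struct layout and names are illustrative only.

```cpp
#include <cstdint>
#include <cstddef>

// Power-of-two (logarithmic) quantization sketch: each weight is a sign plus an
// exponent, with |weight| = 2^(-exponent), so each multiply in the dot product
// becomes a right shift of the (fixed-point integer) activation.
// Assumes arithmetic right shift for negative activations (true on mainstream compilers).
struct Pow2Weight {
    int8_t sign;       // +1 or -1
    uint8_t exponent;  // weight magnitude is 2^(-exponent)
};

int32_t dot_product_pow2(const Pow2Weight* weights, const int32_t* activations, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t shifted = activations[i] >> weights[i].exponent;  // multiply by 2^(-exponent)
        acc += (weights[i].sign >= 0) ? shifted : -shifted;
    }
    return acc;
}
```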
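The double-bit power-of-two idea extends the previous sketch: each weight magnitude is assumed to be the sum of two powers of two, 2^(-e1) + 2^(-e2), so the representable weight values are denser, but each multiply still reduces to two shifts and an addition. Again, the encoding and names are assumptions for illustration.

```cpp
#include <cstdint>
#include <cstddef>

// Double power-of-two sketch: each weight magnitude is 2^(-e1) + 2^(-e2),
// giving a denser set of representable weights than a single power of two,
// while each multiply still reduces to two shifts and one addition.
struct DoublePow2Weight {
    int8_t sign;  // +1 or -1
    uint8_t e1;   // first exponent
    uint8_t e2;   // second exponent
};

int32_t dot_product_double_pow2(const DoublePow2Weight* weights,
                                const int32_t* activations, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t x = activations[i];
        int32_t term = (x >> weights[i].e1) + (x >> weights[i].e2);  // two shifts, one add
        acc += (weights[i].sign >= 0) ? term : -term;
    }
    return acc;
}
```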
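For the lookup-table topic, this sketch precomputes all 256x256 products of signed 8-bit values once (multiplication is only used at setup time), so the inner loop of the dot product is a pure table lookup; whether the 256 KB table is actually faster than a hardware multiply on a given platform is the research question.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Lookup-table "multiplication": all 256x256 products of signed 8-bit values
// are precomputed once (multiplication is only used at setup time), so the
// inference inner loop is pure table lookups and additions.
class MulTable {
public:
    MulTable() : table_(256 * 256) {
        for (int a = -128; a <= 127; ++a) {
            for (int b = -128; b <= 127; ++b) {
                table_[index(static_cast<int8_t>(a), static_cast<int8_t>(b))] = a * b;
            }
        }
    }
    int32_t mul(int8_t a, int8_t b) const { return table_[index(a, b)]; }

private:
    static size_t index(int8_t a, int8_t b) {
        // Map each signed byte to 0..255 and combine into a single 16-bit index.
        return (static_cast<size_t>(static_cast<uint8_t>(a)) << 8) |
               static_cast<size_t>(static_cast<uint8_t>(b));
    }
    std::vector<int32_t> table_;  // 65,536 entries (256 KB with 32-bit entries)
};

int32_t dot_product_lut(const MulTable& lut, const int8_t* w, const int8_t* x, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += lut.mul(w[i], x[i]);  // table lookup instead of a multiplication
    }
    return acc;
}
```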
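For the hashing topic, here is a sketch of one standard way to hash a vector of numbers: sign-based locality-sensitive hashing, where each random hyperplane contributes one bit of the hash code, so that vectors pointing in similar directions tend to land in the same bucket. The class name and parameters are illustrative choices.

```cpp
#include <cstdint>
#include <cstddef>
#include <random>
#include <vector>

// Sign-based locality-sensitive hashing: each random hyperplane contributes
// one bit of the hash code (the sign of a dot product), so vectors that point
// in similar directions tend to get the same code. Requires num_bits <= 64.
class SignHash {
public:
    SignHash(size_t dim, int num_bits, uint32_t seed = 42)
        : dim_(dim), num_bits_(num_bits), planes_(static_cast<size_t>(num_bits) * dim) {
        std::mt19937 rng(seed);
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        for (float& p : planes_) p = gauss(rng);  // random hyperplane normals
    }

    uint64_t hash(const std::vector<float>& v) const {
        uint64_t code = 0;
        for (int b = 0; b < num_bits_; ++b) {
            float dot = 0.0f;
            for (size_t i = 0; i < dim_; ++i) {
                dot += planes_[static_cast<size_t>(b) * dim_ + i] * v[i];
            }
            if (dot >= 0.0f) code |= (uint64_t{1} << b);  // one sign bit per hyperplane
        }
        return code;  // usable as a key into a hash table of stored vectors
    }

private:
    size_t dim_;
    int num_bits_;
    std::vector<float> planes_;
};
```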
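For the Bloom filter topic, this is a minimal textbook Bloom filter with k hash probes into a bit vector, giving O(k) inserts and membership tests with possible false positives but no false negatives; the double-hashing trick used to derive the k probes is a common simplification, not anything AI-specific.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Minimal Bloom filter: k hash probes into a bit vector give O(k) inserts and
// membership tests, with possible false positives but no false negatives.
class BloomFilter {
public:
    BloomFilter(size_t num_bits, int num_hashes)
        : bits_(num_bits, false), num_hashes_(num_hashes) {}

    void insert(const std::string& key) {
        for (int i = 0; i < num_hashes_; ++i) bits_[probe(key, i)] = true;
    }

    bool possibly_contains(const std::string& key) const {
        for (int i = 0; i < num_hashes_; ++i) {
            if (!bits_[probe(key, i)]) return false;  // definitely not present
        }
        return true;  // possibly present (false positives can occur)
    }

private:
    size_t probe(const std::string& key, int i) const {
        // Double hashing: two hash values simulate k independent hash functions.
        size_t h1 = std::hash<std::string>{}(key);
        size_t h2 = std::hash<std::string>{}(key + "#salt");
        return (h1 + static_cast<size_t>(i) * h2) % bits_.size();
    }

    std::vector<bool> bits_;
    int num_hashes_;
};
```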
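Finally, for non-ternary 2-bit quantization, this sketch packs four 2-bit weight codes per byte with the value set {-1, 0, +1, +2}, so each weight contributes zero, one, or two additions/subtractions to the dot product instead of a multiplication; the packing format and code assignments are arbitrary choices for illustration.

```cpp
#include <cstdint>
#include <cstddef>

// Non-ternary 2-bit quantization sketch: four weights are packed per byte,
// with 2-bit codes mapping to {0, +1, -1, +2}, so each weight costs zero,
// one, or two additions/subtractions instead of a multiplication.
int32_t dot_product_2bit(const uint8_t* packed_weights, const int32_t* activations, size_t n) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        uint8_t code = (packed_weights[i / 4] >> ((i % 4) * 2)) & 0x3;  // extract 2-bit code
        int32_t x = activations[i];
        switch (code) {
            case 0: break;                // weight  0: add nothing
            case 1: acc += x; break;      // weight +1: one addition
            case 2: acc -= x; break;      // weight -1: one subtraction
            case 3: acc += x + x; break;  // weight +2: two additions
        }
    }
    return acc;
}
```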