Yoryck AI
AI Research Topics
Last Updated 26th August, 2024
by David Spuler, Ph.D.
Writing a thesis or a scholarly research paper?
This section presents various ideas you might want to use for your research or as a dissertation topic.
They aren't necessarily areas that we are working on.
List of AI Research Topics
Below are some suggestions for research topics, based on observations about areas of AI research that seem to be under-researched. These topics are primarily in inference optimization for neural networks and large language models. Original research topics include:
- Phone AI. Smartphone-level inference efficiency remains elusive. For various thoughts on this, see phone inference optimization and also AI PCs (the general research area of running AI models on low-resource platforms is often called "edge computing").
- Integer-only-arithmetic quantization. Much of the quantization research focuses on model size reduction but still uses floating-point multiplications for inference, although they can be avoided. This area seems under-researched when you consider its potential; see quantization (a minimal sketch appears after this list).
- Non-shift bitwise zero-multiplication. Zero-multiplication inference using bitwise-or or bitwise-and operations. Papers in this area have used addition instead of multiplication, but the bitwise operations might also work, and each has slightly different characteristics compared with addition.
- Matrix algebra. Advanced matrix algebra shows promise for reducing FLOPs: sparse matrices, butterfly matrices, Monarch matrices, low-rank matrix factorization, tensor decomposition, and more. Lots of fun with advanced matrix and vector math; see matrix algebra research (a low-rank factorization sketch appears after this list).
- Hybrid inference optimizations. There are many papers on using two or more model optimizations together (e.g. quantization and pruning), but this is a combinatorially large space: many combinations remain unresearched, and a thorough overview of all the possible combinations, with citations to related research, also seems to be missing (because it would be a huge paper).
- Layer skipping. Skipping individual layers during inference, rather than just "early exit" of all remaining layers (dynamic layer pruning). There are various papers, but there's room for more research; see layer skipping (a toy sketch of the control flow appears after this list).
- Layer reordering. This is a technique that seems like it shouldn't work well. What happens if you run each layer twice? Is it more accurate? (Obviously, it's slower.) What happens if you run all the layers in reverse? Is it true that the initial layers do the broader, general understanding and the upper layers do the finessing? (In which case, early exit makes sense.) See layer reordering.
- Approximate multipliers. Use of approximate arithmetic multiplication algorithms in software, for edge platforms with no hardware acceleration or with different types of limited hardware acceleration. See advanced mathematics research (a sketch of one classic approximate multiplier appears after this list).
- Tokenizer and vocabulary theory. Tokenizer word-to-token ratios and their impact on inference latency via model size (vocabulary size) and input sequence length. Tokenization with larger tokens, such as multi-word tokens, would mitigate the auto-regression latency issue, but increase the vocabulary size (thereby massively increasing model size). Is there a worthwhile trade-off? See tokenization and vocabulary research (a back-of-envelope calculation appears after this list).
- Multi-AI. Multi-model algorithms are an interesting area (often called "ensemble algorithms"). Two might be better than one for AI engines. There's a lot of research already (e.g. big-little architectures, collaborative inference, consensus-based decoding, speculative decoding, etc.), but there is much room for advancement here. What new advanced capabilities are possible by leveraging two or more AI engines? (see ensemble AI research).
- Logarithmic quantization. Power-of-two quantization with bitwise-shift inference seems to be a fast inference method with many papers, but model accuracy remains problematic. See logarithmic quantization research (a shift-based sketch appears after this list).
- Double-bit power-of-two quantization. Two shifts might be better than one (in terms of model accuracy). Some papers were found, but there's room for innovation. See power-of-two quantization research (the shift-based sketch after this list notes the two-shift variant).
- Granular quantization. Fine-granularity quantization, such as per-channel or per-tensor quantization. This is hard to implement efficiently, so it might have to be done in deep learning compilers. See granular quantization research.
- Lookup tables (LUTs). Zero-multiplication inference is possible using table lookups. A simple idea that trades space for lower latency, looks effective, and has received relatively little research attention. See zero-multiplication inference research (a small LUT sketch appears after this list).
- Hashing. There are quite a lot of papers, but there seems to be room for more. Hashing is an O(1) lookup operation when it succeeds, but hashing a vector of numbers, or comparing two vectors for similarity via hashing, is quite tricky. Probably more to come in this area. See hashing research (a locality-sensitive hashing sketch appears after this list).
- Bloom filters. A data structure that is an extension of hashing with O(k) complexity, and has occasionally been used in neural network papers; see Bloom filter research (a minimal sketch appears after this list).
- Data structures. As already mentioned for hashing and Bloom filters, an overall review and a comprehensive theoretical basis for the various (non-hardware) data structures used in AI is needed, both for the individual data structures and across all of them.
- Dyadic quantization. It's interesting and might have some promise because it replaces multiplication with addition and bitshifts, but isn't as constrained in terms of unique weights as power-of-two quantization, so it's probably more accurate. See dyadic quantization research.
- Odd quantization bit sizes. There are plenty of papers on 4-bit and 8-bit integer quantization (and binary or ternary), a few papers on 2-bit quantization, but hardly any focused on 3-bit or 5-bit quantization (or 6-bit or 7-bit). Why not? They seem to work and offer good trade-offs between space and model accuracy. See quantization research.
- Double-byte quantization (9 to 15 bits). There are not many papers on 9-bit to 15-bit integer quantization; the focus is more on 8-bit or lower bitwidths, or on full 16-bit integer quantization (for understandable reasons). Obviously, more bits mean less benefit to memory utilization and extra CPU/GPU cost for data processing, but the models should be more accurate than with 8-bit quantization.
- Non-ternary 2-bit quantization (e.g. with the 4 weights -1, 0, +1, +2, thereby using zero, one, or two additions/subtractions rather than multiplication). This hasn't received as much attention as binary or ternary quantization, but 2-bit quantization might be more accurate than ternary quantization with the same space usage, and it allows a zero-multiplication model. See 2-bit integer quantization research (a small sketch appears after this list).
- Streaming of model inference. Is it possible to start model inference before the entire model has been downloaded? Initial thoughts are that you can't run half a model, but you know what, you actually can, and there are two main ways. Yet to see a paper on this idea; there are papers on "streaming" of models, but they're about using a model on a stream, not streaming the model itself.
- Model-specific file compression algorithms. Whether the standard Zip and Bzip algorithms for file compression can be improved upon for the byte-wise compression of model files. What are the compression characteristics of model files, including the various quantized file formats with different bit sizes? Specific applications are (a) transmission over the internet, and/or (b) efficient file storage on devices such as smartphones (with the need to quickly uncompress the file to load it fully into RAM).
- NAS for Dynamic Inference Hyper-Parameters. There are a lot of dynamic inference optimization strategies, and they all have hyper-parameters. For example, early exiting of layers has hyper-parameters, such as the minimum number of layers executed before early exiting is considered, and the decision algorithm used to decide whether to exit at a given layer. Searching for an optimal set of such dynamic optimization hyper-parameters is an extension of NAS. See some emerging research at Dynamic NAS.
- Quadruple axis pruning. Multi-dimensional pruning is not yet fully researched. There are several papers on dual pruning and only a handful of research papers on triple pruning, but apparently there's not yet a 4th dimension of pruning.
- Pruning Positional Embedding. Some papers have found that positional embedding (also called positional encoding) can be completely removed (abbreviated as "NoPE"). Somehow, the AI engine learns inter-token positional information without needing a positional embedding module. This is poorly understood and a new area with few papers. See Positional embeddings pruning.
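Example Code Sketches
The Python sketches below illustrate a few of the topics above. They are simplified, hedged examples of the general techniques, not reference implementations from any particular paper, and all function names, scales, and parameter choices are illustrative assumptions.
For the integer-only-arithmetic quantization topic, this sketch shows an int8 dot product in which both the accumulation and the final rescale avoid floating-point multiplication; the power-of-two scales (2^-5 here) are an assumption that makes the rescale a bitshift.
```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric int8 quantization: q = round(x / scale), clipped to [-127, 127]."""
    q = np.round(x / scale).astype(np.int32)
    return np.clip(q, -127, 127).astype(np.int8)

def int_only_dot(q_w, q_x, rescale_shift):
    """Dot product using integer arithmetic only.

    The int8 inputs are widened to int32 for accumulation; because the weight
    and activation scales are powers of two, rescaling down to the output's
    quantized scale is a right bitshift instead of a float multiply.
    """
    acc = int(np.dot(q_w.astype(np.int32), q_x.astype(np.int32)))
    return acc >> rescale_shift

# Illustrative usage: weight and activation scales of 2^-5, output scale 2^-5,
# so the combined rescale is a right shift by 5 (all scale choices assumed).
w = np.random.randn(64).astype(np.float32)
x = np.random.randn(64).astype(np.float32)
q_w = quantize_int8(w, 2.0 ** -5)
q_x = quantize_int8(x, 2.0 ** -5)
q_out = int_only_dot(q_w, q_x, rescale_shift=5)
print(q_out * 2.0 ** -5, "vs float dot", float(np.dot(w, x)))  # dequantized only to check
```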
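For the matrix algebra topic, here is a minimal low-rank factorization sketch: replacing a dense d x d weight matrix W with a rank-r product U V cuts the per-vector multiply count from d^2 to 2dr. The truncated-SVD construction and the sizes are standard illustrative choices, not a specific published method.
```python
import numpy as np

d, r = 512, 32
W = np.random.randn(d, d).astype(np.float32)
x = np.random.randn(d).astype(np.float32)

# Rank-r factorization via truncated SVD: W is approximated by U @ V,
# with U of shape (d, r) and V of shape (r, d).
U_full, S, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * S[:r]        # fold the singular values into U
V = Vt[:r, :]

y_full = W @ x                   # d*d   = 262,144 multiplies
y_lowrank = U @ (V @ x)          # 2*d*r =  32,768 multiplies (about 8x fewer)

# A random matrix is nearly full-rank, so this error is large; trained weight
# matrices often have faster-decaying spectra and factorize more gracefully.
print(np.linalg.norm(y_full - y_lowrank) / np.linalg.norm(y_full))
```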
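For the layer skipping topic, this toy sketch shows the control flow of skipping individual layers mid-stack (as opposed to early exit, which stops all remaining layers). The gate function and the identity-plus-noise "layers" are purely illustrative assumptions; the research papers use learned gates or confidence estimates instead.
```python
import numpy as np

def forward_with_layer_skipping(layers, x, gate):
    """Run a layer stack, skipping individual layers dynamically.

    gate(i, x) returns False to skip layer i entirely; execution then continues
    with layer i+1, unlike early exit, which stops all remaining layers.
    """
    for i, layer in enumerate(layers):
        if gate(i, x):
            x = layer(x)        # execute this layer
        # else: skip layer i and fall through to layer i+1
    return x

# Toy usage: twelve identity-plus-noise "layers", and a gate that always runs
# the first four layers, then only every second layer (to show the control flow).
layers = [lambda h: h + 0.01 * np.random.randn(*h.shape) for _ in range(12)]
gate = lambda i, h: i < 4 or i % 2 == 0
print(forward_with_layer_skipping(layers, np.zeros(8), gate).shape)
```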
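For the approximate multipliers topic, this is a software sketch of one classic approximate multiplier, Mitchell-style logarithmic multiplication: each operand's log2 is approximated by its leading-one position plus a linear fraction, the logs are added, and the result is converted back with shifts, so no hardware multiplier is needed. The 16-bit fixed-point fraction width is an arbitrary choice for illustration; the method underestimates the true product, with a worst-case error of roughly 11%.
```python
def mitchell_multiply(a: int, b: int) -> int:
    """Approximate multiplication of non-negative integers using only
    shifts, adds, and comparisons (Mitchell-style logarithmic multiply)."""
    if a == 0 or b == 0:
        return 0
    ka, kb = a.bit_length() - 1, b.bit_length() - 1   # leading-one positions
    ma, mb = a - (1 << ka), b - (1 << kb)             # a = 2^ka * (1 + ma/2^ka)
    # Align both fractional parts to a common 16-bit fixed point.
    fa = ma << (16 - ka) if ka <= 16 else ma >> (ka - 16)
    fb = mb << (16 - kb) if kb <= 16 else mb >> (kb - 16)
    fsum = fa + fb                                    # sum of the log2 fractions
    if fsum < (1 << 16):
        result = ((1 << 16) + fsum) << (ka + kb)      # 2^(ka+kb) * (1 + fraction)
    else:
        result = fsum << (ka + kb + 1)                # fraction carried into exponent
    return result >> 16

print(mitchell_multiply(100, 200), "vs exact", 100 * 200)   # 18432 vs 20000
```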
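For the tokenizer and vocabulary theory topic, a back-of-envelope calculation of the trade-off: a larger (e.g. multi-word) vocabulary adds embedding and unembedding parameters, but lowers the token count and hence the number of autoregressive decoding steps. All of the numbers here (d_model, vocabulary sizes, tokens-per-word ratios) are illustrative assumptions, not measurements.
```python
# Rough trade-off for larger (e.g. multi-word) tokens; all numbers are assumptions.
d_model = 4096

def embedding_params(vocab_size):
    # Input embedding matrix plus the unembedding (output projection) matrix.
    return 2 * vocab_size * d_model

def decode_steps(num_words, tokens_per_word):
    # Autoregressive decoding costs roughly one forward pass per output token.
    return num_words * tokens_per_word

for vocab, tpw in [(32_000, 1.3), (256_000, 0.8)]:   # bigger vocab, fewer tokens per word
    print(f"vocab={vocab:>7,}  embedding params={embedding_params(vocab) / 1e6:6.0f}M  "
          f"decode steps for 500 words={decode_steps(500, tpw):5.0f}")
```
Under these assumed numbers, the larger vocabulary costs roughly 1.8 billion extra embedding parameters to save about 38% of the decoding steps, which is the kind of trade-off this topic asks about.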
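For the logarithmic quantization topic, this sketch quantizes weights to signed powers of two so that each multiplication becomes a bitshift, accumulated in fixed point with a single final shift. The exponent range and activation scale are assumptions, and the coarseness of the approximation is exactly the accuracy problem the topic mentions. The double-bit (two-shift) variant simply represents each weight as a sum of two powers of two and adds two shifted terms per weight instead of one.
```python
import numpy as np

def quantize_pow2(w, e_min=-8, e_max=0):
    """Logarithmic (power-of-two) quantization: each weight becomes sign * 2^e."""
    sign = np.where(w >= 0, 1, -1)
    e = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), e_min, e_max).astype(int)
    return sign, e

def pow2_dot(sign, e, x_int, e_min=-8):
    """Zero-multiplication dot product for power-of-two weights: each product
    is a left shift by (e - e_min), with one final right shift at the end."""
    acc = 0
    for s, ei, xi in zip(sign, e, x_int):
        term = int(xi) << (int(ei) - e_min)     # bitshift replaces multiplication
        acc += term if s > 0 else -term
    return acc >> -e_min                        # undo the fixed-point scaling

# Illustrative usage: activations quantized to integers at an assumed scale of 1/64.
w = np.random.uniform(-1, 1, 16)
x = np.random.uniform(-1, 1, 16)
x_int = np.round(x * 64).astype(int)
sign, e = quantize_pow2(w)
print(pow2_dot(sign, e, x_int) / 64.0, "vs float dot", float(np.dot(w, x)))
```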
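For the lookup tables topic, this sketch precomputes all 256 x 256 int8 products once (a 256 KB int32 table) and then replaces every multiplication in the dot product with an O(1) table lookup, trading space for latency. A per-scalar lookup is the simplest possible version; published LUT schemes typically look up whole sub-byte groups or vector chunks at once.
```python
import numpy as np

# One-time setup (this does use multiplication, but only once, offline):
# LUT[a + 128, b + 128] == a * b for all signed 8-bit values a and b.
LUT = np.outer(np.arange(-128, 128, dtype=np.int32),
               np.arange(-128, 128, dtype=np.int32)).astype(np.int32)

def lut_dot(q_w, q_x):
    """Dot product of two int8 vectors using table lookups instead of multiplies."""
    acc = 0
    for wi, xi in zip(q_w, q_x):
        acc += int(LUT[int(wi) + 128, int(xi) + 128])   # O(1) lookup per product
    return acc

q_w = np.array([3, -7, 100, -128], dtype=np.int8)
q_x = np.array([5, 2, -4, 7], dtype=np.int8)
print(lut_dot(q_w, q_x), "vs exact", int(np.dot(q_w.astype(int), q_x.astype(int))))
```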
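For the hashing topic, this sketch shows one standard way to hash a vector of numbers so that similar vectors collide: random-hyperplane locality-sensitive hashing (SimHash), with one hash bit per hyperplane. The hash width and dimensions are arbitrary illustrative choices, and this is just one well-known technique, not a claim about any particular paper.
```python
import numpy as np

def simhash(v, planes):
    """Locality-sensitive hash of a vector: one bit per random hyperplane.
    Vectors separated by a small angle agree on most bits."""
    bits = (planes @ v) >= 0
    return int("".join("1" if b else "0" for b in bits), 2)

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 64))      # 16-bit hashes of 64-dim vectors

a = rng.standard_normal(64)
b = a + 0.05 * rng.standard_normal(64)      # near-duplicate of a
c = rng.standard_normal(64)                 # unrelated vector

ha, hb, hc = simhash(a, planes), simhash(b, planes), simhash(c, planes)
print(bin(ha ^ hb).count("1"), "bits differ (similar vectors)")
print(bin(ha ^ hc).count("1"), "bits differ (unrelated vectors)")
```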
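For the Bloom filters topic, a minimal sketch of the data structure itself: k hash probes into an m-slot array, giving O(k) membership tests with possible false positives but no false negatives. The sizes and the SHA-256-based probe derivation are illustrative choices (a real implementation would pack bits and use cheaper hash functions).
```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash probes into an m-slot array."""
    def __init__(self, m_bits=1024, k=4):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits)  # one byte per "bit"

    def _positions(self, item):
        # Derive k probe positions from k salted hashes of the item.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter()
bf.add("token_1234")
print("token_1234" in bf, "token_9999" in bf)   # True, almost certainly False
```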
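For the non-ternary 2-bit quantization topic, a small sketch using the four weight values -1, 0, +1, and +2 mentioned above: the dot product then needs at most two additions or subtractions per weight and no multiplications. The scale choice is an illustrative assumption, and the asymmetric level set makes the approximation coarse, as expected for 2 bits.
```python
import numpy as np

LEVELS = np.array([-1, 0, 1, 2])    # the four 2-bit weight values from the topic above

def quantize_2bit(w, scale):
    """Map each weight to the nearest of {-1, 0, +1, +2} (in units of `scale`)."""
    idx = np.abs(w[:, None] / scale - LEVELS[None, :]).argmin(axis=1)
    return LEVELS[idx]

def dot_2bit(q_w, x):
    """Zero-multiplication dot product: -1/0/+1 need at most one add or subtract
    per weight, and +2 needs two additions."""
    acc = 0.0
    for wi, xi in zip(q_w, x):
        if wi == 1:
            acc += xi
        elif wi == -1:
            acc -= xi
        elif wi == 2:
            acc += xi + xi       # two additions, still no multiplication
        # wi == 0 contributes nothing
    return acc

w = np.random.randn(32)
x = np.random.randn(32)
scale = float(np.abs(w).mean())             # assumed per-vector scale
q_w = quantize_2bit(w, scale)
print(scale * dot_2bit(q_w, x), "vs float dot", float(np.dot(w, x)))
```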
More AI Research
Read more about:
- Inference Optimizations
- Advanced AI Mathematics
- Zero-Multiplication Models
- Matrix Algebra
- Logarithmic Models
- Approximate Computing
- Loop Optimizations
- Code Optimizations
(For feedback, suggestions or corrections, please email research@yoryck.com.)