Aussie AI

Research Goals

  • Last Updated 3rd September, 2023
  • by David Spuler, Ph.D.

The primary focus of research at Aussie AI is on optimizing LLM inference algorithms (i.e. "running" the model after training or fine-tuning), and our research is directed toward the following aims:

  • Fast on-device model inference algorithms, specifically for smartphones and AI PCs
  • Scaling inference algorithms to large volumes of requests
  • Efficient GPU inference algorithms (hardware acceleration)
  • Non-GPU inference optimization algorithms (see inference optimization)

Optimization of algorithms for training and fine-tuning of models is also a laudable goal that has received much attention in the literature, but it is less important for a scalable architecture and is not a primary focus of our research. Model training and fine-tuning can always be done "offline" and in batch mode, without requiring a fast response time, albeit usually requiring significant GPU computing power in the cloud.

Why is AI Slow?

The computing power required by AI algorithms is legendary. The cost of training big models is prohibitive, with literally millions of dollars being spent, and getting even small models to run fast is problematic. But why?

The bottleneck is the humble multiplication. All AI models use "weights", which are numbers, often quite small fractions, that encode how likely or desirable a particular feature is. In an LLM, a weight might encode the probability that a particular word is the correct next word. For example, simplifying considerably, a weight of "2.0" for the word "dog" would mean making that word twice as likely to be the next word, and a weight of "0.5" for "cat" would mean halving the probability of outputting that word. And each of these weights is multiplied against other probabilities in many of the nodes in a neural network.
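
As a toy illustration of that example (the starting likelihoods here are invented, and real models work on vectors of numbers rather than named words), the effect of a weight is just one multiplication per value:

    # Toy illustration: a weight scales the likelihood of each candidate next word.
    # The starting likelihoods are made up for this sketch; a real LLM scales
    # thousands of such values for every token it emits.
    likelihoods = {"dog": 0.10, "cat": 0.10}
    weights = {"dog": 2.0, "cat": 0.5}

    scaled = {word: likelihoods[word] * weights[word] for word in likelihoods}
    print(scaled)  # {'dog': 0.2, 'cat': 0.05} -- "dog" is now four times as likely as "cat"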

How many multiplications? Lots! By which we mean billions every time it runs. A model of size 3B has 3 billion weights, each of which needs multiplications to do its work. GPT-3, as used by ChatGPT, had 175B weights, and GPT-4 apparently has more (the exact number is confidential).

Why so many weights? Simplifying, a typical LLM maintains a vector representation of words (the model's "vocabulary"), where each number is the probability of emitting that word next. In practice it's more complicated, with "embeddings" used as an indirect representation of the words, but conceptually the idea is to track word probabilities. To process these word tokens (or embeddings), the model has a set of "weights", also called "parameters", which are typically counted in the billions in advanced LLMs (e.g. a 3B model is considered "small" these days, and OpenAI's GPT-3 had 175B). Each node of the neural network inside the LLM does floating-point multiplications across its vocabulary (embeddings) using the weights, where multiplication by a weight either increases or decreases the likelihood of an output. A single layer of an LLM contains many such nodes, and the model contains multiple layers, each with another set of those nodes. And all of that is just to spit out one word of a sentence in a response. Eventually, the combinatorial explosion in the sheer number of multiplication operations catches up with reality and overwhelms the poor CPU.
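
To make the scale concrete, here is a rough back-of-the-envelope sketch; all of the sizes are illustrative assumptions rather than the dimensions of any particular model, and the attention computations are ignored entirely:

    # Rough sketch: counting multiplications for ONE output token of a toy LLM.
    # All sizes here are illustrative assumptions, not any real model's dimensions.
    embedding_size = 4096                 # width of each token's embedding vector
    hidden_size = 4 * embedding_size      # typical feed-forward expansion factor
    num_layers = 32

    # A matrix-vector multiply with an (m x n) matrix costs m * n multiplications.
    per_layer = (embedding_size * hidden_size) + (hidden_size * embedding_size)
    total = per_layer * num_layers

    print(f"{total:,} multiplications per token")  # about 4.3 billion, ignoring attention

And that count repeats for every single token of the response.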

Specific Inference Optimization Research Areas

Within the above list of general goals for faster inference, these are our specific topics of active interest for theoretical research:

  • End-to-end logarithmic models. If everything can be stored as a logarithmic value in the "log-domain", then floating-point multiplication changes to floating-point addition. This is called the Logarithmic Number System (LNS). The advantage of this method over quantization is that it should be more precise, since the log-domain weights are not quantized or bunched together. The main obstacle is that addition, which is cheap in the normal domain, becomes the expensive operation in the log-domain. There are various approximations, and LNS hardware accelerators have existed for decades (a small sketch of the log-domain idea appears after this list). Read more about: Logarithmic models.
  • Additive and zero-multiplication neural networks. There are various ways that neural networks can avoid multiplications, using additions, bitshifts, or other methods. There are many such algorithms in the literature, but model accuracy trade-offs seem to be the main concern (a bitshift sketch appears after this list). Read more about: Zero multiplication inference.
  • Approximate multiplication algorithms. The arithmetic bottleneck of inference is multiplication. Various approximate multiplication algorithms might be faster, although can they beat hardware-accelerated multiplication? And how inaccurate are they, since they're approximate, and does that affect inference accuracy and overall model results? Could some type of approximation-aware re-training fix it? (One classic approximation is sketched after this list.) Read more about: Approximate multiplication.
  • Adaptive inference with hybrid pruning. The typical model runs through the same multi-billion calculations for every inference, regardless of input. This seems inefficient, and suggests improvements from dynamic inference, also called "adaptive inference". There are many, many research papers. Example ideas include early exit of layers and length pruning, such as token pruning (an early-exit sketch appears after this list).
  • Precomputations based on weights. All of the weights are known and fixed at inference time. What can be discerned about them, and used for faster inference? This idea is conceptually similar to quantization, but not the same. There hasn't been much research on this idea, but there are a few papers. Read more about: Weight precomputations.
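
Here is the minimal log-domain sketch promised above, in plain Python with made-up values; it ignores the awkward cases of zero and negative numbers, which real LNS designs handle with sign and zero flags:

    import math

    # Minimal Logarithmic Number System (LNS) sketch: multiplication becomes addition.
    # Zero and negative values need special handling (sign bits, a zero flag), omitted here.
    w, x = 0.75, 1.25                          # a weight and an activation, both positive
    log_w, log_x = math.log2(w), math.log2(x)  # the values as stored in the "log-domain"

    product = 2 ** (log_w + log_x)             # multiply by adding the logarithms
    print(product, w * x)                      # both print 0.9375

    # The catch: adding two log-domain numbers means computing something like
    # log2(2**a + 2**b), which is no longer a simple hardware operation.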
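
The bitshift sketch for zero-multiplication networks: if a weight is restricted to a power of two (the values here are hypothetical), multiplying by it reduces to an integer shift:

    # Sketch of shift-based "multiplication" with power-of-two weights.
    activation = 13        # hypothetical integer activation
    weight_exponent = 3    # represents a weight of 2**3 = 8

    shifted = activation << weight_exponent
    print(shifted, activation * 8)   # both print 104

    # Real power-of-two (logarithmic) quantization also handles signs and
    # fractional weights (right shifts), trading some accuracy for cheaper arithmetic.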
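
One classic example of an approximate multiplier from the hardware literature is Mitchell's logarithmic approximation; this is a simplified software version, for positive numbers only, just to show the flavour of the accuracy trade-off:

    import math

    def mitchell_multiply(a, b):
        """Mitchell's logarithmic approximate multiplication (positive floats only)."""
        # Decompose each operand as 2**e * (1 + f) with f in [0, 1).
        m_a, e_a = math.frexp(a)   # frexp gives a mantissa in [0.5, 1), so rescale
        m_b, e_b = math.frexp(b)
        f_a, e_a = m_a * 2 - 1, e_a - 1
        f_b, e_b = m_b * 2 - 1, e_b - 1
        # Approximate log2(2**e * (1 + f)) as e + f, and add the two logs.
        log_sum = (e_a + f_a) + (e_b + f_b)
        # Approximate the antilog 2**(k + f) as 2**k * (1 + f).
        k = math.floor(log_sum)
        return (2.0 ** k) * (1.0 + (log_sum - k))

    print(mitchell_multiply(0.75, 1.25), 0.75 * 1.25)  # about 0.875 vs the exact 0.9375

Mitchell's method is cheap because it never multiplies, but its worst-case error is roughly 11%, which is exactly the kind of accuracy question raised above.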
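
Finally, the early-exit sketch for adaptive inference: the loop below stops running layers once an exit criterion is satisfied. Everything in it (the layer function, the confidence measure, the threshold) is a hypothetical stand-in rather than any particular system's API:

    import random

    def fake_layer(state):
        """Stand-in for a transformer layer; a real layer is mostly big matrix multiplies."""
        return state + random.random()

    def confidence(state):
        """Stand-in for a confidence estimate, e.g. the top softmax probability at an exit head."""
        return min(1.0, state / 10.0)

    def early_exit_inference(num_layers=32, threshold=0.9):
        state = 0.0
        for layer in range(num_layers):
            state = fake_layer(state)
            if confidence(state) >= threshold:
                return state, layer + 1   # exited early: the remaining layers are never computed
        return state, num_layers

    state, layers_used = early_exit_inference()
    print(f"Used {layers_used} of 32 layers")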

Ideas for AI Research Topics

Looking for research topic ideas for your thesis or dissertation? See our List of AI Research Topics and AI areas of research.

Deep AI Research Areas

Read more about: