Aussie AI
Research on AI Phones
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
There's no shortage of research papers on getting AI engines to run their inference calculations directly on low-resource devices. The general class of research covers “edge” devices, which includes not just phones but also even less powerful hardware, such as IoT-capable network devices and security cameras processing images.
There are quite a few articles showing that you can run AI models on a smartphone, although most of these are enthusiast and experimentation pieces. As of this writing, I'm not aware of a commercial product based on a native AI model running in an app on either the Android or iPhone platform.
I've been looking into smartphone AI research directly. Generally, I feel that we can get to fast native execution of AI models on a phone (or a PC without a GPU) through a judicious combination of the various possible software optimizations, such as the approaches in this book, and the following research areas:
- End-to-end logarithmic models. If everything is stored as a logarithmic value in the “log-domain”, then floating-point multiplication becomes floating-point addition. This is called the Logarithmic Number System (LNS). The advantage of this method over quantization is that it should be more precise, since the log-domain weights are not quantized or bunched together. The main obstacle is that addition turns into a big problem (ironically). There are various approximations, and LNS hardware accelerators have existed for decades. (A small sketch of log-domain multiplication appears after this list.)
- Additive and zero-multiplication neural networks. There are various ways that neural networks can avoid multiplications, using additions, bitshifts, or other methods (see the bitshift sketch below). There are many algorithms in the literature, but the model accuracy trade-offs seem to be the main concern.
- Approximate multiplication algorithms. The arithmetic bottleneck of inference is multiplication. Various approximate multiplication algorithms might be faster, but can they beat hardware-accelerated multiplication? And how inaccurate are they, since they're approximate, and does the error hurt inference accuracy and overall model results? Could some type of approximation-aware re-training fix it? (A bit-level approximation sketch appears below.)
- Adaptive inference with hybrid pruning. The typical model runs the same billions of calculations for every inference, regardless of the input. This seems inefficient, and suggests improvements from dynamic inference, also called “adaptive inference”. There are many, many research papers. Example ideas include early exit of layers and length pruning, such as token pruning (an early-exit skeleton appears below).
- Precomputations based on weights. All of the weights are known and fixed at inference time. What can be discerned about them, and used for faster inference? This idea is conceptually similar to quantization, but not the same. ML compilers already do a lot of work using this idea, but I think there are some possible extensions. (A lookup-table sketch appears below.)
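To make the log-domain idea concrete, here is a minimal C++ sketch of my own (a toy illustration, not a full LNS implementation): each value stores its base-2 logarithm, so a multiplication becomes a floating-point addition. Sign handling, zero, and the genuinely hard part, log-domain addition, are deliberately left out.

#include <cmath>
#include <cstdio>

// Toy log-domain number: stores log2(x) for a positive value x.
// A real LNS also needs a sign bit, a zero flag, and an approximation
// scheme for log-domain addition.
struct LogNum {
    float log2val;   // log2 of the represented (positive) value
};

LogNum to_log(float x)    { return { std::log2(x) }; }
float from_log(LogNum a)  { return std::exp2(a.log2val); }

// Multiplication in the log domain is just addition of the logarithms.
LogNum log_mul(LogNum a, LogNum b) { return { a.log2val + b.log2val }; }

int main() {
    LogNum w = to_log(0.75f);   // a weight
    LogNum x = to_log(2.5f);    // an activation
    LogNum p = log_mul(w, x);   // multiply without a multiply
    std::printf("0.75 * 2.5 = %f\n", from_log(p));   // prints 1.875
    return 0;
}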
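For the zero-multiplication idea, the simplest case is power-of-two weights: if a weight's magnitude is 2^k, then multiplying an integer activation by it is just a bitshift by k. Here is a hypothetical C++ sketch (my own, assuming non-negative quantized activations and weights already converted to shift/sign pairs):

#include <cstdint>
#include <cstdio>

// Power-of-two weight: magnitude is 2^shift, sign stored separately.
struct Pow2Weight {
    int8_t shift;     // weight magnitude = 2^shift (negative shift = fraction)
    bool   negative;  // sign of the weight
};

// Multiply a non-negative integer activation by a power-of-two weight
// using a shift instead of a multiplication.
int32_t shift_mul(int32_t activation, Pow2Weight w) {
    int32_t v = (w.shift >= 0) ? (activation << w.shift)
                               : (activation >> -w.shift);
    return w.negative ? -v : v;
}

// Dot product with no multiplications: only shifts, negations, and adds.
int32_t shift_dot(const int32_t* act, const Pow2Weight* w, int n) {
    int32_t sum = 0;
    for (int i = 0; i < n; ++i)
        sum += shift_mul(act[i], w[i]);
    return sum;
}

int main() {
    int32_t act[3] = { 10, 20, 30 };
    Pow2Weight w[3] = { {1, false}, {2, true}, {0, false} };  // weights +2, -4, +1
    std::printf("dot = %d\n", shift_dot(act, w, 3));  // 10*2 - 20*4 + 30*1 = -30
    return 0;
}

The accuracy question is whether restricting weights to powers of two (or sums of a few powers of two) loses too much model quality; that trade-off is what the research papers argue about.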
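One well-known family of approximate multipliers is Mitchell-style logarithmic approximation. Below is a minimal C++ sketch of the bit-level variant for positive IEEE-754 floats: because a float's bit pattern is roughly a scaled and offset log2 of its value, adding the two bit patterns (and subtracting the bits of 1.0f) approximates the product, with a worst-case relative error of roughly 11%. This is only an illustration of the idea, not a recommendation:

#include <cstdint>
#include <cstring>
#include <cstdio>

// Approximate multiplication of two positive normal floats via the
// logarithmic bit trick: treat the bit pattern as an approximate log2
// and add. The result never overestimates the true product.
float approx_mul(float a, float b) {
    uint32_t ia, ib;
    std::memcpy(&ia, &a, sizeof ia);
    std::memcpy(&ib, &b, sizeof ib);
    uint32_t ir = ia + ib - 0x3F800000u;   // subtract the bit pattern of 1.0f
    float r;
    std::memcpy(&r, &ir, sizeof r);
    return r;
}

int main() {
    std::printf("2.0 * 3.0 ~= %f (exact 6.0)\n", approx_mul(2.0f, 3.0f));   // happens to be exact
    std::printf("1.5 * 1.5 ~= %f (exact 2.25)\n", approx_mul(1.5f, 1.5f));  // prints 2.0, showing the error
    return 0;
}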
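To illustrate adaptive inference, here is a toy early-exit skeleton in C++. The layer and confidence functions are invented placeholders; the point is only the control flow, which skips all remaining layers once a cheap confidence check passes a threshold:

#include <vector>
#include <cstdio>

using Vec = std::vector<float>;

// Placeholder for a real transformer layer.
Vec toy_layer(const Vec& x, int layer) {
    Vec y(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = x[i] * 0.9f + 0.1f * layer;
    return y;
}

// Placeholder for a cheap confidence estimate (e.g. the top probability).
float toy_confidence(const Vec& x) {
    float maxv = 0.0f;
    for (float v : x)
        if (v > maxv) maxv = v;
    return maxv;
}

// Run layers one at a time; exit early once confident enough.
Vec early_exit_inference(Vec x, int num_layers, float threshold, int min_layers) {
    for (int layer = 0; layer < num_layers; ++layer) {
        x = toy_layer(x, layer);
        if (layer + 1 >= min_layers && toy_confidence(x) >= threshold) {
            std::printf("Early exit after layer %d of %d\n", layer + 1, num_layers);
            return x;
        }
    }
    std::printf("Ran all %d layers\n", num_layers);
    return x;
}

int main() {
    Vec input = { 0.2f, 0.5f, 0.3f };
    early_exit_inference(input, 32, 1.0f, 4);   // exits after 5 of 32 layers here
    return 0;
}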
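As one example of what can be precomputed from fixed weights, here is a toy C++ sketch of a per-weight lookup table: an 8-bit quantized activation has only 256 possible values, so every product involving a given weight can be computed ahead of time and replaced by a table lookup at inference. Whether a memory lookup actually beats a hardware multiply, and whether the table storage is affordable, are exactly the open questions:

#include <array>
#include <cstdint>
#include <cstdio>

// Precomputed products for one fixed weight against all 256 possible
// 8-bit quantized activation values.
struct WeightLUT {
    std::array<float, 256> table;

    // Toy dequantization: map 0..255 to the range [-1, +1).
    static float dequantize(uint8_t q) { return (q - 128) / 128.0f; }

    explicit WeightLUT(float weight) {
        for (int q = 0; q < 256; ++q)
            table[q] = weight * dequantize(static_cast<uint8_t>(q));
    }

    // "Multiply" at inference time is just an array index.
    float mul(uint8_t quantized_activation) const {
        return table[quantized_activation];
    }
};

int main() {
    WeightLUT w(0.75f);
    uint8_t act = 192;   // dequantizes to (192 - 128) / 128 = 0.5
    std::printf("0.75 * 0.5 = %f\n", w.mul(act));   // prints 0.375
    return 0;
}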
The new AI programming book by Aussie AI co-founders:
Get your copy from Amazon: Generative AI in C++