3. AI Phones

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“E.T. phone home.”

E.T. the Extra-Terrestrial, 1982.

Native Smartphone AI

Can an AI model run fast enough on your phone? I'm not talking about having your phone talk to some anonymous server in the cloud to do its AI. I'm wondering whether it's possible to run the actual C++ engine natively on the phone's CPU.

This is an area of research that is of personal interest to me. As goals go, it's quite an ambitious one: run a big AI model that's usually thirsty for GPUs, on a small platform without a GPU.

Much of the early research that is relevant to fast phone execution of models relates to another type of computer, which you might know as a “car.” The need for computer vision models for automated or assisted driving has similar requirements to running on a phone, such as low latency and small storage. The general term is an “embedded” system or “real-time” system.

Obstacles to Smartphone AI

If it were possible, we'd already have many native AI apps. Although there are already plenty of “AI apps” available to install on your phone, these are almost certainly all sending the requests over the network to an AI engine in the cloud. Running an AI model directly on your phone is problematic for several reasons:

  • Too slow to run — response times are too long.
  • Hardware acceleration — phones lack a GPU and have only limited CPU acceleration.
  • Storage size — e.g. even a “small” 3B model with 32-bit weights needs 12 Gigabytes of storage. On the other hand, with high-end phones now offering 512GB or more, storing even a 13B model in 52GB seems reasonable (the size arithmetic is sketched in code after this list).
  • Memory usage — an entire model is loaded into RAM for inference. The obstacle is more the time cost of accessing this memory than the storage size.
  • Transmission size — installing a huge model over your phone's 4G or WiFi connection is slow.
  • Battery depletion — computations max out the phone's CPU and chew cycles.
  • Heat generation — water-cooled phones are not a thing.
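
To make the storage bullet concrete, here is a small C++ sketch of the size arithmetic: raw weight storage is just the parameter count times the bits per weight. The numbers are plain arithmetic, not measurements of any particular engine or file format.

    // Rough storage needed for a model's weights at various bit widths.
    #include <cstdint>
    #include <cstdio>

    // Bytes needed to store 'params' weights at 'bits_per_weight' bits each.
    uint64_t model_bytes(uint64_t params, int bits_per_weight) {
        return params * bits_per_weight / 8;
    }

    int main() {
        const double GB = 1e9;  // decimal gigabytes, as used in the text
        const uint64_t param_counts[] = { 3000000000ull, 13000000000ull };  // 3B and 13B
        const int bit_widths[] = { 32, 16, 8, 4 };  // FP32 down to 4-bit quantization
        for (uint64_t p : param_counts) {
            for (int b : bit_widths) {
                printf("%2lluB params at %2d bits: %5.1f GB\n",
                       (unsigned long long)(p / 1000000000ull), b,
                       model_bytes(p, b) / GB);
            }
        }
        return 0;
    }

The 3B model at 32 bits comes out at 12 GB and the 13B model at 52 GB, matching the figures above; 8-bit quantization divides both by four.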

For these reasons, it's still faster to send AI requests off to a bigger server with lots of GPUs that's running in the cloud, even though it's a roundtrip network message. Before you see any truly “native” generative AI models in your app store, research is required to overcome all of the above obstacles.

Near-Term Technology Trends

Over time, some of the obstacles to running inference natively on phones will diminish:

  • Better phone CPUs with hardware acceleration are already here (e.g. Apple Neural Engine since iPhone X, Qualcomm Snapdragon), with more on the way. Future phones will be much more AI-capable.
  • GPU phones will surely be coming to a store near you very soon.
  • Phone storage capacities are also increasing, with terabyte storage now common.
  • 5G network connectivity will reduce concerns about transmission sizes.
  • Data compression algorithms can lower transmission sizes, and also possibly storage sizes.
  • Quantized models and other inference optimizations can improve speed and reduce model size, giving reduced CPU usage, faster response times, lower storage requirements, and smaller transmission sizes (but with some accuracy loss). A minimal quantization sketch appears after this list.
  • Training and fine-tuning of models doesn't need to happen on a phone (phew!).
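
As a minimal sketch of the quantization idea mentioned above: map each 32-bit float weight to an 8-bit integer plus one shared scale factor, cutting storage roughly four-fold. This is an illustrative assumption of what a simple symmetric quantizer looks like, not the code of any particular engine; production quantizers usually work per block or per channel and handle outliers more carefully.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct QuantizedWeights {
        std::vector<int8_t> q;  // quantized weights, one byte each
        float scale;            // shared scale factor for dequantization
    };

    // Symmetric 8-bit quantization of a block of FP32 weights.
    QuantizedWeights quantize_int8(const std::vector<float>& w) {
        float max_abs = 0.0f;
        for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
        QuantizedWeights out;
        out.scale = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
        out.q.reserve(w.size());
        for (float x : w) {
            int v = (int)std::lround(x / out.scale);
            v = std::max(-127, std::min(127, v));  // clamp against rounding overflow
            out.q.push_back((int8_t)v);
        }
        return out;
    }

    // Recover an approximation of the original weight.
    inline float dequantize(const QuantizedWeights& qw, size_t i) {
        return qw.q[i] * qw.scale;
    }

Each dequantized weight differs from the original by at most half a quantization step, which is the accuracy trade-off noted in the list.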

But... you really need a “big” model, not a “small” model, if you want the app to be great with lots of happy users. And getting a big model running efficiently on a phone may take a while to come to fruition.

Speeding Up Smartphone AI

Okay, so let's say you want to run a “big” model on a “small” phone. Why? Lots of reasons, which we won't explore here. So, you want what you want, which is to run the latest open source AI model on a phone.

The first question is: do you even need to? Why not just use the AI engines in the cloud, sending requests back-and-forth from the phone? Response times on modern networks are fast, message sizes are small, and users may not notice or even care. But there are reasons beyond speed: privacy and security come to mind.

Another piece of good news: you don't need to “build” the model on your phone. Those GPU-expensive tasks of training or fine-tuning can be done in the cloud. For native execution, the user only needs to run “inference” of the model on their phone.

Assuming you have your reasons to want to do this, let's examine each of the obstacles for native phone execution of LLM model inference.

  • Speed and response time. The AI engine on the phone needs fast “inference” (running the model quickly). And it probably cannot rely on a GPU, since there are already billions of phones out there without a GPU. Hardware acceleration in phone CPUs is limited. The main way that models run without a GPU on a phone or PC is to use inference optimizations, of which the most popular at the moment is definitely quantization. Other supplemental techniques that might be needed include integer-only arithmetic and pruning (model compression). And there's a whole host of lesser-known inference optimization techniques that might need to be combined together. For example, maybe the bottleneck of “auto-regression” will need to get bypassed so the AI engine can crank out multiple words at a time, without running the whole glob of a model for every single word.
  • Network transmission size. Users need to download your 13B Llama-2 model to their phone? Uncompressed, it's about 52GB. There's already a lot known about compression algorithms (e.g. for video), and model files are just multi-gigabyte data files, so perhaps they can be compressed to a size that's adequately small. But before we even use those network compression algorithms, the first thing to try is model compression, such as quantization. For example, quantization to 8-bit would reduce the original 32-bit model size four-fold down to 13GB, for a slight loss in accuracy (probably acceptable). Binary quantization would reduce it by a factor of 32, but then the inference accuracy goes south. 5G bandwidth will help a lot, but remember there are a lot of users (billions) out there with non-5G-compatible phones. Pruning and other model compression techniques can also reduce the total size. But the whole model is required. There's no such thing as half an AI model. And you can't stream an AI model so that it starts running before it's all loaded (although whether that might be possible is actually an interesting research question).
  • Storage size. The whole model needs to be permanently stored on the device. The same comments about model compression techniques apply: it can be stored uncompressed if the phone has enough storage space, or perhaps stored in compressed form and only decompressed when it's needed. But it'll be needed all the time, because, well, it's AI you know, so everybody needs it for everything.
  • Memory size. The inference algorithm needs the whole model, uncompressed, available to use in RAM. Not necessarily all at once, but it will definitely need to swap the entire model (uncompressed) in and out of memory to process all those model weights. For each word. That's a fair chunk of RAM (e.g. 52GB), and the bottleneck is as much the processing cost of swapping that data in and out as the raw capacity. And it occurs for every word generated. Again, model compression seems key to cutting down the original 52GB size of the model (e.g. 8-bit quantization cuts it to 13GB). One possible mitigation, memory-mapping the model file so that weights are paged in on demand, is sketched after this list.
  • Battery depletion and heat generation. A model with 13B weights needs to do 13 billion multiplications for every word it outputs. That's a lot of power usage and reducing resource utilization means using the above-mentioned optimizations of the inference algorithm (e.g. quantization, pruning, non-auto-regression, etc.).
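
One possible way to soften the memory problem, sketched here under the assumption of a hypothetical flat file of FP32 weights called “model.bin”: memory-map the file so the operating system pages weights in on demand instead of copying the whole model into RAM up front. Real model files have headers and many separate tensors, and this sketch is POSIX-only, so treat it as an illustration rather than a loader.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const char* path = "model.bin";  // hypothetical flat weight file
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

        // Map the file read-only; pages are faulted in lazily as weights are touched.
        void* base = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

        const float* weights = (const float*)base;
        size_t count = (size_t)st.st_size / sizeof(float);
        printf("Mapped %zu weights (%.1f GB) without reading them all into RAM; first = %f\n",
               count, (double)st.st_size / 1e9, count ? weights[0] : 0.0f);

        munmap(base, (size_t)st.st_size);
        close(fd);
        return 0;
    }

The trade-off, as noted above, is that the paging cost is paid repeatedly during generation, so memory-mapping helps more with capacity than with speed.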

When will native phone LLMs appear in the wild? The short answer is that multiple optimization techniques probably need to be combined, and that success is several breakthroughs away. It might not even be possible to realistically run large LLMs natively on today's phones. But solving any of the above-mentioned problems is certainly valuable standalone, in that it will reduce the cost of running AI models on GPUs in server farms that are growing in the cloud, and maybe even make it possible to run AI natively on desktop PCs.

AI Phone Apps

There's an obvious opportunity to add AI functionality to phone apps. We've already seen Microsoft quick out of the gate in adding AI functionality to numerous products in their portfolio, some of which relates to accessing AI engines from your smartphone. However, at the time of writing, there hasn't been a lot of press about adding AI functionality to Google Android or Apple iPhone apps.

Probably they're not working on it at all. I mean, they're trillion-dollar companies, so they might as well rest on their laurels with their feet up on the desk, reading the newspaper.

Somewhat more likely is that we'll see a range of AI functionality coming to the key apps on your phone sooner rather than later. Apple is notoriously secretive about what it's planning, but there have been numerous hints that they're spending big on AI. Google's capabilities in the AI space are, of course, on full display with its PaLM models and the recently released Google Gemini. I expect to see AI functionality for phones in roughly this schedule:

  • Core app features
  • Secondary app features
  • Developer toolkit features

The first steps will be AI functionality in the core apps from the vendors. For example, Google has announced that the Pixel 8 Pro is powered by the new Gemini Nano model.

On Apple iPhones we might see AI completions in iMessage or photo apps or enhancements of this type. Apple tends to make large advances in functionality without hyping the underlying technology, and Tim Cook has been more reluctant than the average CEO to utter the phrase “AI” in earnings calls. Nevertheless, expect AI to be under the hood in many upcoming product upgrades.

Note that this won't be native execution. Instead, this will be AI requests going up into the cloud from your phone, and the processing being done on high-end GPU systems. This is an expensive and massive-scale project for the phone vendors to complete. Microsoft has shown that it can be done with generative AI, and the other major tech vendors will follow soon enough.

The next level will be adding AI capabilities to third-party apps. Expect both Android Studio and Apple Xcode developer platforms to start offering AI capabilities on a service model. There are various rumors that Apple is working on something like this, and it makes strategic sense. On the other hand, adding AI functionality for third-party developers adds another level of complexity to the release (e.g. security, privacy, safety, regulatory compliance, etc.), so it will likely lag behind the appearance of AI functionality in the core phone apps that come directly from Google and Apple.

Research on AI Phones

There's no shortage of research papers on getting AI engines to run their inference calculations directly on low-resource devices. The general class of research is about “edge” devices, and it isn't just phones, but also even less powerful devices like IoT-capable network devices and security cameras processing images.

There are quite a few articles showing that you can run AI models on a smartphone, but most of these are enthusiast or experimental write-ups. As of this writing, I'm not aware of a commercial product based on a native AI model in an app for either the Android or iPhone platform.

I've actually been looking into smartphone AI research directly. Generally, I feel that we can get to fast native execution of AI models on a phone (or a PC without any GPU) through a judicious combination of the various possible software optimizations, such as the approaches in this book and the following research areas:

  • End-to-end logarithmic models. If everything can be stored as a logarithmic value in the “log-domain”, then floating-point multiplication changes to floating-point addition. This is called the Logarithmic Number System (LNS). The advantage of this method over quantization is that it should be more precise, since the log-domain weights are not quantized or bunched together. The main obstacle is that addition changes into a big problem (ironically). There are various approximations, and LNS hardware accelerators have existed for decades. A minimal log-domain sketch appears after this list.
  • Additive and zero-multiplication neural networks. There are various ways that neural networks can avoid multiplications, using additions, bitshifts, or other methods. There are many algorithms in the literature, but model accuracy trade-offs seem to be the main concern.
  • Approximate multiplication algorithms. The arithmetic bottleneck of inference is multiplication. Various approximate multiplication algorithms might be faster, although can they beat hardware-accelerated multiplication? And how inaccurate are they, since they're approximate, and does it affect inference accuracy and overall model results? Could some type of approximation-aware re-training fix it?
  • Adaptive inference with hybrid pruning. The typical model runs through the same multi-billion calculations for every inference, regardless of input. This seems inefficient, and suggests improvements to dynamic inference, also called “adaptive inference”. There are many, many research papers. Example ideas include early exiting layers and length pruning, such as token pruning.
  • Precomputations based on weights. All of the weights are known and fixed at inference time. What can be discerned about them, and used for faster inference? This idea is conceptually similar to quantization, but not the same. ML compilers do a lot of work using this idea, but I think there are some possible extensions.
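
To illustrate the log-domain idea in the first bullet above, here is a minimal C++ sketch of an LNS-style representation in which multiplication becomes addition of exponents. The type and function names are illustrative only; the genuinely hard part, addition in the log domain, is deliberately omitted, and zero values need special handling that is skipped here.

    #include <cmath>
    #include <cstdio>

    struct LnsValue {
        float log2_mag;  // log2 of the absolute value
        bool  negative;  // sign stored separately
    };

    LnsValue to_lns(float x) {
        // Note: x must be non-zero; log2(0) is -infinity.
        return { std::log2(std::fabs(x)), x < 0.0f };
    }

    float from_lns(const LnsValue& v) {
        float mag = std::exp2(v.log2_mag);
        return v.negative ? -mag : mag;
    }

    // Multiplication in the log domain is just an addition (plus a sign XOR).
    LnsValue lns_multiply(const LnsValue& a, const LnsValue& b) {
        return { a.log2_mag + b.log2_mag, a.negative != b.negative };
    }

    int main() {
        LnsValue a = to_lns(0.5f);
        LnsValue b = to_lns(-3.0f);
        printf("0.5 * -3.0 = %f\n", from_lns(lns_multiply(a, b)));  // prints about -1.500000
        return 0;
    }

A similar shift-and-add flavor underlies some of the approximate multiplication algorithms mentioned above, such as Mitchell-style logarithmic multipliers.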

 

