Aussie AI

LLM Phone Research

  • Last Updated 29 November, 2024
  • by David Spuler, Ph.D.

AI is going to be on your phone, and it's early in this trend; see GenAI market research. There are also going to be AI PCs on your desk.

Can an AI model run fast enough on your phone? Much of the early research relevant to fast phone execution of models relates to another type of computer, which you might know as a "car". Computer vision models for automated or assisted driving have similar requirements to models running on a phone, such as low latency and small storage. The general term is an "embedded" or "real-time" system.

LLMs on Your Smartphone

There are already plenty of "AI" apps available to put on your phone, but these are almost certainly all sending the requests over the network to an AI engine in the cloud. Running an AI model directly on your phone is problematic for several reasons:

  • Too slow to run (response times will be long)
  • Phones don't have a GPU, and their non-GPU hardware acceleration is limited compared to servers
  • Storage size (e.g. a "small" 3B model with 32-bit weights needs 12 Gigabytes of storage; see the size sketch after this list)
  • Memory usage (not only do models need to be permanently stored, they are also loaded into RAM for inference)
  • Transmission size (e.g. before you can "run" it, you need to download a 12-Gigabyte model over your phone's 4G or WiFi connection)
  • Battery depletion and heat generation (i.e. all of those matrix multiplications will max out the phone's CPU and chew up cycles)
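
To make those storage and memory numbers concrete, here is a back-of-the-envelope size calculation in Python (a sketch only: it counts the weights alone at the usual bytes-per-parameter for each precision, and real model files add some overhead for metadata and tokenizers).

```python
# Approximate on-device footprint of a model's weights alone (sketch only).
BYTES_PER_WEIGHT = {
    "float32": 4.0,   # full precision
    "float16": 2.0,   # half precision
    "int8":    1.0,   # 8-bit quantization
    "int4":    0.5,   # 4-bit quantization
}

def model_size_gb(num_params_billions: float, dtype: str) -> float:
    """Size in gigabytes of the weights at the given precision."""
    return num_params_billions * 1e9 * BYTES_PER_WEIGHT[dtype] / 1e9

if __name__ == "__main__":
    for params in (3, 7, 13):
        for dtype in ("float32", "float16", "int8", "int4"):
            print(f"{params}B model, {dtype:7s}: ~{model_size_gb(params, dtype):5.1f} GB")
```

A 3B model with 32-bit weights comes out at the 12 Gigabytes mentioned above, and the same arithmetic scales up for larger models.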

For these reasons, it's still faster to send AI requests off to a bigger server with lots of GPUs that's running in the cloud, even though it's a roundtrip network message. Before you see any truly "native" AI models in your app store, research is required to overcome all of the above obstacles.

Future of AI Models on Phones

Over time, some of the obstacles to natively executing inference on phones will diminish:

  • Better phone CPUs with hardware acceleration are already here (e.g. Qualcomm Snapdragon), with more on the way. Future phones will be more AI-capable.
  • "AI Phones" with GPUs will surely be coming to a store near you.
  • Phone storage sizes are also increasing.
  • 5G network connectivity will reduce concerns about transmission sizes.
  • Data compression algorithms can lower transmission sizes, and also possibly storage sizes.
  • Quantized models and other inference optimizations can improve speed and reduce model size, giving lower CPU usage, faster response times, smaller storage, and smaller transmission sizes (but with some accuracy loss); see the quantization sketch after this list.
  • Training and fine-tuning of models doesn't need to happen on a phone (phew!).
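
To show what quantization actually does to a model, here is a minimal sketch of symmetric 8-bit weight quantization using NumPy (an illustrative toy, not a production scheme: real quantizers typically work per-channel or per-group and handle outlier weights more carefully).

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric quantization: map float32 weights to int8 plus one float scale."""
    scale = np.max(np.abs(weights)) / 127.0          # largest magnitude maps to +/-127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights for use during inference."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4096, 4096).astype(np.float32)   # one toy weight matrix
    q, scale = quantize_int8(w)
    w_hat = dequantize_int8(q, scale)
    print("bytes before:", w.nbytes, "bytes after:", q.nbytes)    # 4x smaller
    print("mean absolute error:", float(np.mean(np.abs(w - w_hat))))
```

The int8 version is a quarter of the size, which is the trade-off noted in the list above: less storage, less transmission, and cheaper arithmetic, at the cost of a small approximation error in every weight.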

But... you really need a "big" model, not a "small" model, if you want the app to be great with lots of happy users. And getting a big model running efficiently on a phone may take a while.

What's Needed?

Okay, so let's say you want to run a "big" model on a "small" phone. Why? Lots of reasons, which we won't explore here. So you want what you want, which is to run the open-source Llama 2 13B model on a phone.

The first question is: do you even need to? Why not just use the AI engines in the cloud, and send requests back and forth between the phone and the cloud? Response times on modern networks are fast, message sizes are small, and users may not notice or even care. But there are reasons beyond speed: privacy and security come to mind.

Another piece of good news: you don't need to "build" the model on your phone. Those GPU-expensive tasks of training or fine-tuning can be done in the cloud. For native execution, the user only needs to run "inference" of the model on their phone.
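
As a rough illustration of what inference-only execution looks like, here is a sketch using the open-source llama-cpp-python bindings to load a pre-trained, pre-quantized model file and generate text (the model path and parameters are placeholders, and this assumes the bindings run on the device and that there is enough RAM for the quantized weights).

```python
# Inference only: the model was trained and quantized elsewhere; the device just runs it.
from llama_cpp import Llama   # assumes the llama-cpp-python package is available

llm = Llama(
    model_path="./models/llama-2-13b.Q4_K_M.gguf",  # placeholder path to a 4-bit quantized file
    n_ctx=2048,                                     # context window size
)

output = llm("Q: What is the capital of Australia? A:", max_tokens=32)
print(output["choices"][0]["text"])
```

No gradients, no optimizer state, and no training data ever touch the phone; the only heavy lifting is the forward pass through the weights.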

Assuming you have your reasons to want to do this, let's examine each of the obstacles for native phone execution of LLM model inference.

  • Speed and response time. The AI engine on the phone needs fast "inference" (running the model quickly). And it probably cannot rely on a GPU, since there are already billions of phones out there without one. Hardware acceleration in phone CPUs is limited. The main way that models run without a GPU on a phone or PC is to use inference optimizations, of which the most popular at the moment is definitely quantization. Other supplemental techniques that might be needed include integer-only arithmetic and pruning (model compression). And there's a whole host of lesser-known inference optimization techniques that might need to be combined together. For example, maybe the bottleneck of "auto-regression" will need to be bypassed so the AI engine can crank out multiple words at a time, without running the whole glob of a model for every single word.
  • Network transmission size. Users need to download your 13B Llama 2 model to their phone? Uncompressed, it's about 52GB (see the back-of-the-envelope sketch after this list). There's already a lot known about compression algorithms (e.g. for video), and model files are just multi-gigabyte data files, so perhaps they can be compressed to an adequately small size. But before we even use those network compression algorithms, the first thing to try is model compression, such as quantization or pruning. For example, quantization to 8-bit would reduce the original 32-bit model size four-fold, down to 13GB, for a slight loss in accuracy (probably acceptable). Binary quantization would reduce it by a factor of 32, but then the inference accuracy goes south. 5G bandwidth will help a lot, but remember there are a lot of users (billions) out there with non-5G phones. Either way, the whole model is required. There's no such thing as half an AI model. And you can't stream an AI model so it starts running before it's all loaded (although it's actually an interesting research question as to whether that might be possible).
  • Storage size. The whole model needs to be permanently stored on the device. The same comments about model compression techniques apply: it can be stored uncompressed if the phone has enough storage space, or perhaps stored in a compressed form that is only uncompressed when it's needed. But it'll be needed all the time, because, well, it's AI you know, so everybody needs it for everything.
  • Memory size. The inference algorithm needs the whole uncompressed model available to use in RAM. Not necessarily all of it at the same time, but it will definitely need to swap the entire model (uncompressed) in and out of memory to process all those model weights. For each word. That's either a lot of RAM (do you have a spare 52GB of RAM on your phone?), or a lot of processing cost from swapping data in and out. And that occurs for every word it generates. Again, model compression seems key to cutting down the original 52GB size of the model (e.g. 8-bit quantization cuts it to 13GB).
  • Battery depletion and heat generation. A model with 13B weights needs to do 13B multiplications for every word it outputs (the sketch after this list puts rough numbers on this). That's a lot of power usage. So getting the resource utilization lower means some of the above-mentioned optimizations of the inference algorithm (e.g. quantization, pruning, non-auto-regression, etc.).
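
To put rough numbers on the transmission, memory, and compute obstacles above, here is a back-of-the-envelope sketch for a 13B-parameter model (an approximation only: it counts the weights alone and assumes one multiply-accumulate per weight per generated word, ignoring attention caches and activation memory).

```python
# Rough numbers for a 13B-parameter model on a phone (approximation only).
PARAMS = 13e9

# Download / storage / RAM footprint of the weights at different precisions.
for name, bytes_per_weight in [("32-bit float", 4.0), ("8-bit quantized", 1.0),
                               ("4-bit quantized", 0.5), ("binary (1-bit)", 0.125)]:
    print(f"{name:16s}: ~{PARAMS * bytes_per_weight / 1e9:5.1f} GB")

# Compute per response: roughly one multiply-accumulate per weight per generated word.
words = 200                                   # a short answer of a couple of paragraphs
total_macs = PARAMS * words
print(f"~{total_macs:.1e} multiply-accumulates for a {words}-word response")
```

That works out to the 52GB and 13GB figures used above, plus trillions of arithmetic operations per response, which is why battery drain and heat are on the list of obstacles.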

The short answer is that multiple optimization techniques probably need to be combined, and that success is several breakthroughs away, before native phone LLMs appear in the wild.

It might not even be possible to realistically run AI models natively on today's phones. But solving any of the above-mentioned problems is certainly valuable on its own, because it will also reduce the cost of running AI models on GPUs in the server farms growing in the cloud.

Articles and Press on AI Phones

The drumbeat of press articles and PR releases has begun for "AI Phones" (and also AI PCs):

Survey Papers on AI Phones

Research survey papers about putting models onto a smartphone:

AI Phone Models Research

Research on smartphone AI applications:

On-Device Inference

For more about on-device inference on PCs and phones, see on-device inference research.

More AI Research

Read more about: