Aussie AI

Speeding Up Smartphone AI

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

Okay, so let's say you want to run a “big” model on a “small” phone. Why? Lots of reasons, which we won't explore here. So, you want what you want, which is to run the latest open-source AI model on a phone.

First question is: do you even need to? Why not just use the AI engines in the cloud, and send requests back-and-forth between the phone and the cloud? Response time of modern networks is fast, message sizes are small, and users may not notice or even care. Still, there are reasons beyond speed to run natively: privacy and security come to mind.

Another piece of good news: you don't need to “build” the model on your phone. Those GPU-expensive tasks of training or fine-tuning can be done in the cloud. For native execution, the user only needs to run “inference” of the model on their phone.

Assuming you have your reasons to want to do this, let's examine each of the obstacles to native phone execution of LLM inference.

  • Speed and response time. The AI engine on the phone needs fast “inference” (running the model quickly). And it probably cannot rely on a GPU, since there are already billions of phones out there without a GPU. Hardware acceleration in phone CPUs is limited. The main way to run models without a GPU on a phone or PC is to use inference optimizations, of which the most popular at the moment is definitely quantization. Other supplemental techniques that might be needed include integer-only arithmetic and pruning (model compression). And there's a whole host of lesser-known inference optimization techniques that might need to be combined. For example, maybe the bottleneck of “auto-regression” will need to get bypassed so the AI engine can crank out multiple words at a time, without running the whole glob of a model for every single word.
  • Network transmission size. Users need to download your Llama-2 13B model to their phone? Uncompressed, at 32 bits per weight, it's about 52GB. There's already a lot known about compression algorithms (e.g. for video), and model files are just multi-gigabyte data files, so perhaps it can be compressed to a size that's adequately small. But before we even use those network compression algorithms, the first thing to try is model compression, such as quantization or pruning. For example, using quantization to 8-bit would reduce the original 32-bit model size four-fold down to 13GB, for a slight loss in accuracy (probably acceptable). Binary quantization would reduce it by a factor of 32 (to under 2GB), but then the inference accuracy goes south. 5G bandwidth will help a lot, but remember there are a lot of users (billions) out there with non-5G compatible phones. And the whole model is required. There's no such thing as half an AI model. And you can't stream an AI model so it starts running before it's all loaded (although that's actually an interesting research question as to whether it might be possible).
  • Storage size. The whole model needs to be permanently stored on the device. The same comments about model compression techniques apply. It can either be stored uncompressed if the phone has a bigger storage space, or perhaps it can be stored in compressed form, and only decompressed when it's needed. But it'll be needed all the time, because, well, it's AI you know, so everybody needs it for everything.
  • Memory size. The inference algorithm needs the whole model, uncompressed, available to use in RAM. Not necessarily all at the same time, but if it doesn't all fit, the engine will definitely need to swap the entire model (uncompressed) in and out of memory to process all those model weights. And that occurs for every word it generates. That's a fair chunk of RAM (e.g. 52GB), and the processing cost of swapping data in and out is a bottleneck too. Again, model compression seems key to cut down the original 52GB size of the model (e.g. 8-bit quantization cuts it to 13GB).
  • Battery depletion and heat generation. A model with 13B weights needs to do roughly 13 billion multiplications for every word it outputs. That's a lot of power usage. Reducing resource utilization means using the above-mentioned optimizations of the inference algorithm (e.g. quantization, pruning, non-auto-regression, etc.).
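
To make the size arithmetic and the quantization idea above a bit more concrete, here's a minimal C++ sketch. It is not code from the book: the names model_size_gb, QuantizedTensor, quantize_int8 and dequantize are hypothetical, and it assumes the simplest scheme (symmetric 8-bit quantization with a single scale factor for the whole tensor). Real engines typically use finer-grained scales and smarter rounding.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Approximate size of a model's weights in gigabytes (10^9 bytes).
// Assumption: every weight uses the same number of bits; file headers
// and any other metadata are ignored.
double model_size_gb(double num_weights, int bits_per_weight)
{
    return num_weights * bits_per_weight / 8.0 / 1e9;
}

// Hypothetical container for a quantized tensor (illustration only):
// 8-bit integer values plus one shared floating-point scale.
struct QuantizedTensor {
    std::vector<std::int8_t> values;
    float scale;  // original weight is approximately values[i] * scale
};

// Symmetric per-tensor 8-bit quantization: map the range
// [-max_abs, +max_abs] onto the integers [-127, 127].
QuantizedTensor quantize_int8(const std::vector<float>& weights)
{
    float max_abs = 0.0f;
    for (float w : weights) {
        max_abs = std::max(max_abs, std::fabs(w));
    }

    QuantizedTensor q;
    // Avoid dividing by zero if every weight is zero.
    q.scale = (max_abs > 0.0f) ? (max_abs / 127.0f) : 1.0f;
    q.values.reserve(weights.size());

    for (float w : weights) {
        long r = std::lround(w / q.scale);     // round to nearest integer
        r = std::max(-127L, std::min(127L, r)); // clamp against rounding drift
        q.values.push_back(static_cast<std::int8_t>(r));
    }
    return q;
}

// Recover an approximation of the original 32-bit weight.
float dequantize(const QuantizedTensor& q, std::size_t i)
{
    return static_cast<float>(q.values[i]) * q.scale;
}
```

With these definitions, model_size_gb(13e9, 32) gives the 52GB figure, model_size_gb(13e9, 8) gives 13GB, and model_size_gb(13e9, 1) gives about 1.6GB for binary quantization. Each quantized weight takes one byte instead of four, and the round trip through quantize_int8 and dequantize is where the slight loss of accuracy comes from: every weight gets snapped to the nearest of 255 evenly spaced values.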

When will native phone LLMs appear in the wild? The short answer is that multiple optimization techniques probably need to be combined, and that success is several breakthroughs away. It might not even be possible to realistically run large LLMs natively on today's phones. But solving any of the above-mentioned problems is certainly valuable standalone, in that it will reduce the cost of running AI models on GPUs in server farms that are growing in the cloud, and maybe even make it possible to run AI natively on desktop PCs.

 

