Aussie AI

On-Device Inference

  • Last Updated 11 December, 2024
  • by David Spuler, Ph.D.

What is On-Device Inference?

On-device inference refers to running an LLM's inference phase directly on the physical device, such as a phone or a PC. This is one of the main architectures receiving attention for building AI Phones and AI PCs.

Note that there are actually three main architectures for AI Phones and AI PCs:

  • On-device inference (running the model "natively")
  • Cloud LLM (sending queries to an AI engine on a cloud server)
  • Hybrid cloud and on-device architectures

The first AI phone apps have been entirely cloud-based. For example, there are many ChatGPT-based apps on the phone. It seems likely that most of these send all queries across the internet to remote cloud-based inference servers (e.g., via the OpenAI API). Running inference on-device is likely still too slow and too resource-intensive for these apps, even allowing for the extra cost of a round-trip network message in a cloud-based architecture.
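
As a rough sketch of the cloud-based architecture, the app simply sends each query over the network and waits for the reply. The example below uses the OpenAI Chat Completions API; the model name and prompt are illustrative only.

    # Minimal sketch of the cloud-based architecture: every query is a
    # round-trip HTTP request to a remote inference server.
    import os
    import requests

    def cloud_query(prompt: str) -> str:
        response = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            json={
                "model": "gpt-4o-mini",  # illustrative model name
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]

    print(cloud_query("What is on-device inference?"))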

Android Phone On-Device Inference

Research papers for on-device inference on Android phones:

iPhone On-Device Inference

Apple has been coy about its AI plans, and there hasn't even been much leaked about on-device AI models for the iPhone. Several pundits expect that on-device inference will be important for Apple, given its focus on privacy, and there was an expectation of big announcements at Apple WWDC in June 2024. By comparison, Google has already released an SDK for Android on-device inference.

Online articles: industry articles and press releases about iPhone inference:

Research papers: various research on on-device inference for iPhones:

Research Papers on On-Device Inference (Generally)

Running LLM inference directly on a phone or a PC is an area of intense research. Local execution of an LLM has advantages in terms of speed and privacy.

Hybrid Cloud-Device Architectures

A combination of LLM inference on the physical device (on-device) with queries sent over the network to cloud servers is also possible. This is called hybrid cloud-on-device inference.
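
As an illustrative sketch (not taken from any particular product), a hybrid engine can route each query either to a small on-device model or to a larger cloud model. The two backends and the routing rule below are hypothetical placeholders.

    # Illustrative hybrid routing policy: short or offline queries run on the
    # local model, everything else goes over the network to a cloud LLM.

    def local_generate(prompt: str) -> str:   # hypothetical on-device engine
        return f"[on-device answer to: {prompt}]"

    def cloud_generate(prompt: str) -> str:   # hypothetical cloud API wrapper
        return f"[cloud answer to: {prompt}]"

    def hybrid_generate(prompt: str, online: bool, max_local_words: int = 64) -> str:
        # Route to the small on-device model when offline or when the query is
        # short; otherwise pay the round-trip cost for the larger cloud model.
        if not online or len(prompt.split()) <= max_local_words:
            return local_generate(prompt)
        return cloud_generate(prompt)

    print(hybrid_generate("What's on my calendar today?", online=True))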

Research papers on hybrid cloud-on-device inference:

Estimating On-Device Throughput from TOPS

NOTE: This analysis seems mostly bogus, but at least it's a starting point. I don't think I've seen a paper that addresses this estimation issue in benchmarking.

TOPS stands for Tera Operations Per Second, i.e., trillions of computations per second (whether these are integer or floating-point operations varies by vendor). This section attempts a naive estimate of throughput rates for phone inference using the reported TOPS numbers and model weight counts, but it doesn't seem very accurate.

Let us examine the TOPS ratings for some of Apple's chips. If we assume that one TOPS means one trillion floating-point operations per second, we get an estimate that goes like this:

  • Apple A16 Bionic (in the iPhone 14 Pro and iPhone 15) has a rating of about 17 TOPS.
  • Transformer engines touch every weight in an inference computation.
  • Autoregressive default architectures repeat this for every token.

As example models, consider GPT-2 with about 1.5B weights, or GPT-4's rumored 176B weights per expert (reportedly an 8-model MoE architecture, where each inference uses only one expert model). The estimated computations then give:

  • 17 trillion divided by GPT-2's 1.5 billion gives about 11,333 tokens per second.
  • 17 trillion divided by GPT-4's single-expert 176 billion gives about 96 tokens per second.
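
The same arithmetic as a minimal Python sketch; the function name is just for illustration, and the formula assumes one operation per weight per decoded token (per the list above):

    # Naive throughput estimate from the TOPS rating, using only the numbers
    # quoted in the text above.

    def naive_tokens_per_second(tops: float, num_weights: float) -> float:
        # tops: rated trillions of operations per second
        # num_weights: model parameter count
        return (tops * 1e12) / num_weights

    print(naive_tokens_per_second(17, 1.5e9))   # GPT-2 1.5B   -> ~11,333 tokens/sec
    print(naive_tokens_per_second(17, 176e9))   # 176B expert  -> ~96 tokens/sec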

This seems way too high (or is it?), but it's not clear what's happening. According to this, running GPT-2 on an iPhone should really fly, but that isn't what's reported in the research papers. Maybe the TOPS metrics don't reflect actual floating-point operations in the A16 chip, or maybe the real cost of model inference is much higher than one operation per weight for each decoded token, due to factors such as prefill costs and memory access costs (inference engines are "memory-bound"). We also haven't accounted for practical problems such as battery depletion, unresponsive phones (spinning due to computations), and physical temperature increases (AI is hot, literally).
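
One way to see why the naive TOPS estimate is too optimistic is to redo it against memory bandwidth rather than compute, since decoding tends to be memory-bound. The sketch below assumes FP16 weights and a hypothetical 50 GB/s of memory bandwidth for a phone-class chip; both figures are illustrative assumptions, not measured A16 numbers.

    # Memory-bound back-of-the-envelope: each decoded token must stream
    # (roughly) all the weights from memory once, so throughput is capped by
    # memory bandwidth rather than by the TOPS rating.

    def memory_bound_tokens_per_second(bandwidth_gb_s: float,
                                       num_weights: float,
                                       bytes_per_weight: float = 2.0) -> float:
        model_bytes = num_weights * bytes_per_weight   # e.g. FP16 = 2 bytes/weight
        return (bandwidth_gb_s * 1e9) / model_bytes

    print(memory_bound_tokens_per_second(50, 1.5e9))   # GPT-2 1.5B  -> ~17 tokens/sec
    print(memory_bound_tokens_per_second(50, 176e9))   # 176B expert -> ~0.14 tokens/sec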

More AI Research

Read more about: