Aussie AI
On-Device Inference
-
Last Updated 11 December, 2024
-
by David Spuler, Ph.D.
What is On-Device Inference?
On-device inference refers to running an LLM's inference phase directly on the physical device, such as a phone or a PC. This is one of the main architectures receiving attention for building AI Phones and AI PCs.
Note that there are actually three main architectures for AI Phones and AI PCs:
- On-device inference (running the model "natively").
- Cloud LLM (sending queries to an AI engine on a cloud server).
- Hybrid cloud and on-device architectures.
The first AI phone apps have been entirely cloud-based. For example, there are many ChatGPT-based apps on the phone, and it seems likely that most of these send every query across the internet to remote cloud-based inference servers (e.g., via the OpenAI API). For such apps, on-device inference is likely still too expensive and too slow, even allowing for the extra cost of a round-trip network message in the cloud-based architecture.
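To make the three options concrete, here is a minimal C++ sketch of the dispatch logic an AI phone app might use. It is illustrative only: the function names (run_local_model, send_to_cloud_api) and the hybrid length threshold are hypothetical placeholders, not any real SDK's API.
```cpp
#include <iostream>
#include <string>

// Hypothetical stubs standing in for the two inference paths
// (placeholders only -- not any real SDK's API).
std::string run_local_model(const std::string& query) {
    return "[on-device answer to: " + query + "]";  // native NPU/CPU inference
}

std::string send_to_cloud_api(const std::string& query) {
    return "[cloud answer to: " + query + "]";  // round-trip to a remote server
}

enum class Architecture { OnDevice, Cloud, Hybrid };

std::string answer(const std::string& query, Architecture arch) {
    switch (arch) {
        case Architecture::OnDevice:  // run the model "natively" on the device
            return run_local_model(query);
        case Architecture::Cloud:     // every query crosses the network
            return send_to_cloud_api(query);
        case Architecture::Hybrid:    // naive example policy: short queries stay local
            return query.size() < 100 ? run_local_model(query)
                                      : send_to_cloud_api(query);
    }
    return "";  // unreachable; silences compiler warnings
}

int main() {
    std::cout << answer("What is on-device inference?", Architecture::Hybrid) << "\n";
    return 0;
}
```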
Android Phone On-Device Inference
Research papers for on-device inference on Android phones:
- Google, Get started with Gemini Nano on Android (on-device), March 30, 2024 (accessed), https://ai.google.dev/tutorials/android_aicore
- Google, LLM Inference guide for iOS, March 30, 2024 (accessed), https://developers.google.com/mediapipe/solutions/genai/llm_inference/ios
- Google, LLM Inference guide for Android, March 30, 2024 (accessed), https://developers.google.com/mediapipe/solutions/genai/llm_inference/android
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Google, 2024, https://developers.google.com/mediapipe/solutions/genai/llm_inference (Experimental MediaPipe method for on-device phone LLM inference using the Gemma 2B model family, and also Phi-2, Falcon-RW-1B and StableLM-3B.)
- Google, 2024, Get started with the Gemini API in Android apps (client SDK). https://ai.google.dev/tutorials/get_started_android (Cloud-based use of the Gemini API for round-trip AI in Android apps.)
- Google, 2024, Get started with Gemini Nano on Android (on-device), Google AI for Developers, https://ai.google.dev/tutorials/android_aicore
- Dave Burke, 06 December 2023, Google Blog, https://android-developers.googleblog.com/2023/12/a-new-foundation-for-ai-on-android.html (Gemini Nano for on-device inference on Android phones with Android AICore platform.)
- Android Developers, 2024, Android AICore, https://developer.android.com/ml/aicore (AI platform on Android using Gemini Nano.)
- Google for Developers Blog, March 07, 2024, Large Language Models On-Device with MediaPipe and TensorFlow Lite, https://developers.googleblog.com/2024/03/running-large-language-models-on-device-with-mediapipe-andtensorflow-lite.html
- Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu, 9 Mar 2024, AutoDroid: LLM-powered Task Automation in Android (v4), https://arxiv.org/abs/2308.15272 Code: https://autodroid-sys.github.io/ (Integrates both on-device Vicuna and cloud-based GPT-4/GPT-3.5 into an Android phone app called AutoDroid.)
iPhone On-Device Inference
Apple was long coy about its AI plans, and there wasn't much leaking about on-device AI models for the iPhone. Pundits expected on-device inference to be important for Apple, given its focus on privacy, and big announcements were anticipated at Apple WWDC in June 2024, where Apple duly introduced Apple Intelligence and its on-device foundation models (see the references below). By comparison, Google had already released an SDK for Android on-device inference.
Industry articles and press releases about iPhone on-device inference:
- Tim Hardwick, December 21, 2023, Apple Develops Breakthrough Method for Running LLMs on iPhones, Mac Rumors, https://www.macrumors.com/2023/12/21/apple-ai-researchers-run-llms-iphones/
- James Bentley, January 25, 2024, Apple's new 'boost' to generative AI flags a very different approach to its competitors — on-device AI support could set the iPhone 16 apart, iMore, https://www.imore.com/iphone/apples-new-boost-to-generative-ai-flags-a-very-different-approach-to-its-competitors-on-device-ai-support-could-set-the-iphone-16-apart
- Ben Dickson, Feb 24, 2024, What is Apple’s generative AI strategy? Venture Beat, https://venturebeat.com/ai/what-is-apples-generative-ai-strategy/ (Theorizing that Apple is planning on-device generative AI for iPhone and Apple Watch.)
- James Rogerson, 25 March 2024, The iPhone 16 Pro’s chipset could be designed with AI in mind, https://www.techradar.com/phones/iphone/the-iphone-16-pros-chipset-could-be-designed-with-ai-in-mind (Rumors that the Apple A18 chipset may be designed with on-device AI in mind.)
- Marko Zivkovic, Apr 15, 2024, Apple's iOS 18 AI will be on-device preserving privacy, and not server-side, https://appleinsider.com/articles/24/04/15/apples-ios-18-ai-will-be-on-device-preserving-privacy-and-not-server-side
Research papers on on-device inference for the iPhone:
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Maciek Jędrzejczyk, December 11, 2023, Using LLMs locally on iPad or iPhone, https://www.linkedin.com/pulse/using-llms-locally-ipad-iphone-maciek-j%C4%99drzejczyk-cd0zf/ (Running LLMs such as Mistral 7B with 4-bit quantization on Apple iPad or iPhone using Apple Testflight and LLMFarm.)
- Apple, June 2022, Deploying Transformers on the Apple Neural Engine, Apple Machine Learning Research, https://machinelearning.apple.com/research/neural-engine-transformers Code: https://github.com/apple/ml-ane-transformers (Apple's open-sourced implementation of a Transformer on ANE for Apple devices using PyTorch.)
- Matthias Bastian, Dec 12, 2023, Run LLMs on your M Series with Apple's new MLX machine learning framework, AI in practice, https://the-decoder.com/run-llms-on-your-m-series-with-apples-new-mlx-machine-learning-framework/
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan, 8 Apr 2024, Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs, https://arxiv.org/abs/2404.05719
Research Papers on On-Device Inference (Generally)
Running LLM inference directly on a phone or a PC is an area of intense research. Local execution of an LLM has advantages in terms of speed and privacy.
- Jinliang Yuan, Chen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, Mengwei Xu, 2024, Mobile Foundation Model as Firmware, ACM MobiCom’24, September 30–October 4, 2024, Washington D.C., DC, USA https://xumengwei.github.io/files/MobiCom24-MobileFM.pdf (The use of an LLM foundation model as an underlying OS service on mobile devices.)
- Rongjie Yi, Liwei Guo, Shiyun Wei, Ao Zhou, Shangguang Wang, Mengwei Xu, 28 Aug 2023, EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models, https://arxiv.org/abs/2308.14352
- Hongzhi Yin, Liang Qu, Tong Chen, Wei Yuan, Ruiqi Zheng, Jing Long, Xin Xia, Yuhui Shi, Chengqi Zhang, 15 Feb 2024 (v2), On-Device Recommender Systems: A Comprehensive Survey, https://arxiv.org/abs/2401.11441
- Venkatraman Iyer, Sungho Lee, Semun Lee, Juitem Joonwoo Kim, Hyunjun Kim, Youngjae Shin, 12 December 2023, Automated Backend Allocation for Multi-Model, On-Device AI Inference, Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 7, Issue 3, Article No.: 62, pp 1–33, https://doi.org/10.1145/3626793, https://dl.acm.org/doi/abs/10.1145/3626793
- Daliang Xu, Wangsong Yin, Xin Jin, Ying Zhang, Shiyun Wei, Mengwei Xu, Xuanzhe Liu, 8 Sep 2023, LLMCad: Fast and Scalable On-device Large Language Model Inference, https://arxiv.org/abs/2309.04255
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Pedro Cuenca, August 8, 2023, Releasing Swift Transformers: Run On-Device LLMs in Apple Devices https://huggingface.co/blog/swift-coreml-llm, Code: https://github.com/huggingface/swift-transformers, Code: https://github.com/huggingface/swift-chat, Code: https://huggingface.co/spaces/coreml-projects/transformers-to-coreml (Overview and code called "Swift Transformers" for running LLM models natively, such as Llama2 7B or Falcon 7B, on-device for Apple devices by wrapping CoreML.)
- Tim Hardwick, December 21, 2023, Apple Develops Breakthrough Method for Running LLMs on iPhones, Mac Rumors, https://www.macrumors.com/2023/12/21/apple-ai-researchers-run-llms-iphones/
- Google, Get started with Gemini Nano on Android (on-device), March 30, 2024 (accessed), https://ai.google.dev/tutorials/android_aicore
- Google, LLM Inference guide for iOS, March 30, 2024 (accessed), https://developers.google.com/mediapipe/solutions/genai/llm_inference/ios
- Google, LLM Inference guide for Android, March 30, 2024 (accessed), https://developers.google.com/mediapipe/solutions/genai/llm_inference/android
- David Spuler, Mar 30, 2024, Generative AI in C++: Coding Transformers and LLMs, Aussie AI, https://www.amazon.com/Generative-AI-Coding-Transformers-LLMs-ebook/dp/B0CXJKCWX9/
- Mohit Thakkar, 20 February 2019, Beginning Machine Learning in iOS: CoreML Framework, Apress, https://www.amazon.com/dp/B07NYW5VBQ/
- Daniel Situnayake, 24 January 2023, AI at the Edge: Solving Real-World Problems with Embedded Machine Learning, O'Reilly Media, Inc, USA, https://www.amazon.com/dp/1098120205/
- Google, 2024, https://developers.google.com/mediapipe/solutions/genai/llm_inference (Experimental MediaPipe method for on-device phone LLM inference using the Gemma 2B model family, and also Phi-2, Falcon-RW-1B and StableLM-3B.)
- Google, 2024, Get started with the Gemini API in Android apps (client SDK). https://ai.google.dev/tutorials/get_started_android (Cloud-based use of the Gemini API for round-trip AI in Android apps.)
- Google, 2024, Get started with Gemini Nano on Android (on-device), Google AI for Developers, https://ai.google.dev/tutorials/android_aicore
- Dave Burke, 06 December 2023, Google Blog, https://android-developers.googleblog.com/2023/12/a-new-foundation-for-ai-on-android.html (Gemini Nano for on-device inference on Android phones with Android AICore platform.)
- Android Developers, 2024, Android AICore, https://developer.android.com/ml/aicore (AI platform on Android using Gemini Nano.)
- Google for Developers Blog, March 07, 2024, Large Language Models On-Device with MediaPipe and TensorFlow Lite, https://developers.googleblog.com/2024/03/running-large-language-models-on-device-with-mediapipe-andtensorflow-lite.html
- Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu, 9 Mar 2024, AutoDroid: LLM-powered Task Automation in Android (v4), https://arxiv.org/abs/2308.15272, Code: https://autodroid-sys.github.io/ (Integrates both on-device Vicuna and cloud-based GPT-4/GPT-3.5 into an Android phone app called AutoDroid.)
- Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu, 18 Mar 2024, LLM as a System Service on Mobile Devices, https://arxiv.org/abs/2403.11805 (On-device inference for LLMs, including a stateful on-device AI service LLMaaS, including Llama2 7B and OPT-7B with INT8 quantization, based on improved KV caching on mobile, with pipelining, recomputation and chunk-level KV cache memory management for running on phones.)
- Jinliang Yuan, Chen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, and Mengwei Xu. 2023. Rethinking Mobile AI Ecosystem in the LLM Era. arXiv:2308.14363, https://arxiv.org/abs/2308.14363 (Running a 7B LLM on a phone.)
- James Bentley, January 25, 2024, Apple's new 'boost' to generative AI flags a very different approach to its competitors — on-device AI support could set the iPhone 16 apart, iMore, https://www.imore.com/iphone/apples-new-boost-to-generative-ai-flags-a-very-different-approach-to-its-competitors-on-device-ai-support-could-set-the-iphone-16-apart
- Juhyun Lee, Nikolay Chirkov, Ekaterina Ignasheva, Yury Pisarchyk, Mogan Shieh, Fabio Riccardi, Raman Sarokin, Andrei Kulik, Matthias Grundmann, 3 Jul 2019, On-Device Neural Net Inference with Mobile GPUs, https://arxiv.org/abs/1907.01989 (An older paper from 2019, but interesting.)
- Maciek Jędrzejczyk, December 11, 2023, Using LLMs locally on iPad or iPhone, https://www.linkedin.com/pulse/using-llms-locally-ipad-iphone-maciek-j%C4%99drzejczyk-cd0zf/ (Running LLMs such as Mistral 7B with 4-bit quantization on Apple iPad or iPhone using Apple Testflight and LLMFarm.)
- Apple, June 2022, Deploying Transformers on the Apple Neural Engine, Apple Machine Learning Research, https://machinelearning.apple.com/research/neural-engine-transformers Code: https://github.com/apple/ml-ane-transformers (Apple's open-sourced implementation of a Transformer on ANE for Apple devices using PyTorch.)
- Matthias Bastian, Dec 12, 2023, Run LLMs on your M Series with Apple's new MLX machine learning framework, AI in practice, https://the-decoder.com/run-llms-on-your-m-series-with-apples-new-mlx-machine-learning-framework/
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 29 Mar 2024, Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041 (On-device LLMs via four optimizations: dynamic-tensor-shape inference, FP4 quantization, operator optimizations, and KV cache improvements.)
- Joel Ruben Antony Moniz, Soundarya Krishnan, Melis Ozyildirim, Prathamesh Saraf, Halim Cagri Ates, Yuan Zhang, Hong Yu, Nidhi Rajshree, 29 Mar 2024, ReALM: Reference Resolution As Language Modeling, https://arxiv.org/abs/2403.20329v1 (A paper from Apple with a candidate model for on-device inference.)
- Zac Hall, Apr 1 2024, Apple AI researchers boast useful on-device model that ‘substantially outperforms’ GPT-4, https://9to5mac.com/2024/04/01/apple-ai-gpt-4/
- A16Z, April 2nd, 2024 (accessed), AI Getting Started https://github.com/a16z-infra/ai-getting-started (Javascript wrapper kits for several commercial AI APIs.)
- Yoko Li, April 2nd, 2024 (accessed), Local AI Stack, https://github.com/ykhli/local-ai-stack (Javascript-based example of running local AI.)
- Han Hu, Yujin Huang, Qiuyuan Chen, Terry Yue Zhuo, Chunyang Chen, 2023, A First Look at On-device Models in iOS Apps, ACM Transactions on Software Engineering and Methodology, Volume 33, Issue 1, Article No.: 26, pp 1–30, https://arxiv.org/abs/2307.12328, https://doi.org/10.1145/3617177 (Interesting analysis of traditional non-generative AI modules in apps on Apple iOS, finding the average model size is about half a megabyte, i.e. rather small!)
- Ivan Mehta, April 3, 2024, Opera allows users to download and use LLMs locally, https://techcrunch.com/2024/04/03/opera-will-now-allow-users-download-and-use-llms-locally/ (Opera browser users can download models, and they are run via the Ollama framework running inside the browser.)
- Lucas Mearian, 21 Mar 2024, Microsoft integrates its Copilot chatbot on new devices https://www.computerworld.com/article/2071480/microsoft-integrates-its-copilot-chatbot-across-entire-product-line.html (New Surface laptops with support for ChatGPT-based Copilot.)
- Justine Tunney, March 31st 2024, LLaMA Now Goes Faster on CPUs, https://justine.lol/matmul/, Code: https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/sgemm.cpp, Code: https://github.com/mozilla-Ocho/llamafile (Improved on-device benchmarks for PC CPU platforms, with Intel, AMD or M2 chips, for Mistral 7B models with 8-bit quantization by optimizing MatMul in the llama.cpp inference engine.)
- Victor J.B. Jung, Alessio Burrello, Moritz Scherer, Francesco Conti, Luca Benini, 3 Apr 2024, Optimizing the Deployment of Tiny Transformers on Low-Power MCUs, https://arxiv.org/abs/2404.02945 (Uses an approach called "Fused Weight Self-Attention" that fuses some of the QKV matrices and also tiling in multi-head attention, along with 8-bit integer quantization and integerized Softmax.)
- Steve Dent, Thu, Mar 28, 2024, Microsoft Copilot AI will soon run locally on PCs, https://www.engadget.com/microsoft-copilot-ai-will-soon-run-locally-on-pcs-130642514.html
- MMH Shuvo, SK Islam, J Cheng, 14 December 2022, Efficient acceleration of deep learning inference on resource-constrained edge devices: A review, Proceedings of the IEEE, Volume 111, Issue 1, January 2023, pp. 42-91, https://ieeexplore.ieee.org/abstract/document/9985008 PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9985008
- Wei Chen, Zhiyuan Li, 3 Apr 2024 (v2), Octopus v2: On-device language model for super agent, https://arxiv.org/abs/2404.01744
- AMD AI Staff, How to run a Large Language Model (LLM) on your AMD Ryzen™ AI PC or Radeon Graphics Card, March 2024, AMD Blog, https://community.amd.com/t5/ai/how-to-run-a-large-language-model-llm-on-your-amd-ryzen-ai-pc-or/ba-p/670709
- Matthijs Hollemans, April 2024 (accessed), The Neural Engine — what do we know about it? https://github.com/hollance/neural-engine
- Apple, October 30, 2023, Apple unveils M3, M3 Pro, and M3 Max, the most advanced chips for a personal computer, Apple Press Release, https://www.apple.com/newsroom/2023/10/apple-unveils-m3-m3-pro-and-m3-max-the-most-advanced-chips-for-a-personal-computer/
- Ben Lovejoy, Feb 28 2024, A (very) close look at the A17 Pro chip powering the iPhone 15 Pro models, https://9to5mac.com/2024/02/28/a17-pro-chip-technology/
- Apple, 2023, iPhone 15, https://www.apple.com/iphone-15/specs/
- Victor Hristov Sep 17, 2022 (updated), A16 Bionic explained: what's new in Apple's Pro-grade mobile chip? https://www.phonearena.com/news/A16-Bionic-explained-whats-new_id142438
- Levent Bulusan, Oct 31, 2023, Apple’s M3 Chip and Its Revolutionary Impact on AI Platforms (ChatGPT-4 & Midjourney), Medium, https://medium.com/@lvntblsn/apples-m3-chip-and-its-revolutionary-impact-on-ai-platforms-chatgpt-4-midjourney-842986effbb9
- Sharon Machlis, March 28, 2024, 5 easy ways to run an LLM locally, InfoWorld, https://www.infoworld.com/article/3705035/5-easy-ways-to-run-an-llm-locally.html
- Minghao Yan, Hongyi Wang, Shivaram Venkataraman, 9 Jan 2024 (v2), PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices, https://arxiv.org/abs/2310.19991 (Faster inference with a focus on pipelining and scheduling of hardware acceleration.)
- Dhananjay Saikumar, Blesson Varghese, 4 Mar 2024 (v2), NeuroFlux: Memory-Efficient CNN Training Using Adaptive Local Learning, https://arxiv.org/abs/2402.14139 (On-device execution of training for CNNs.)
- Castrillo, J., Valle, R., Baumela, L. (2024). Efficiency Evaluation of Mobile Vision Transformers. In: Rocha, Á., Ferrás, C., Hochstetter Diez, J., Diéguez Rebolledo, M. (eds) Information Technology and Systems. ICITS 2024. Lecture Notes in Networks and Systems, vol 933. Springer, Cham. https://doi.org/10.1007/978-3-031-54256-5_1 https://link.springer.com/chapter/10.1007/978-3-031-54256-5_1 Code: https://github.com/pcr-upm/icits24_landmarks (Vision transformers on mobile architectures.)
- Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You, 2 Mar 2024, HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices, https://arxiv.org/abs/2403.01164
- Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
- Mustafa Aljadery, 2024 (accessed), Lightning Whisper MLX, https://github.com/mustafaaljadery/lightning-whisper-mlx (Whisper model optimized for Apple MLX hardware acceleration.)
- David Linthicum, Jan 16, 2024, Do you need GPUs for generative AI systems? InfoWorld, https://www.infoworld.com/article/3712134/do-you-need-gpus-for-generative-ai-systems.html
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Ganesh Jawahar, April 2024, Methods for design of efficient on-device natural language processing architectures, Ph.D. thesis, Computer Science, The University of British Columbia (Vancouver) https://open.library.ubc.ca/media/download/pdf/24/1.0441384/4
- Jijoong Moon, Hyeonseok Lee, Jiho Chu, Donghak Park, Seungbaek Hong, Hyungjun Seo, Donghyeon Jeong, Sungsik Kong, Myungjoo Ham, April 2024, A New Frontier of AI: On-Device AI Training and Personalization, ICSE-SEIP '24: Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice, Pages 323–333, https://doi.org/10.1145/3639477.3639716 https://dl.acm.org/doi/abs/10.1145/3639477.3639716
- Xiao Fu, Weiling Yang, Dezun Dong, Xing Su, 03 June 2024, Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs, ICS '24: Proceedings of the 38th ACM International Conference on Supercomputing, May 2024, Pages 137–149, https://doi.org/10.1145/3650200.3656620 https://dl.acm.org/doi/abs/10.1145/3650200.3656620
- Ollama, May 31, 2024, An entirely open-source AI code assistant inside your editor, https://ollama.com/blog/continue-code-assistant
- Computer World, 29 May 2024, In two years, 100% of enterprise PC purchases will be AI computers, https://www.computerworld.com/article/2130275/in-two-years-100-of-enterprise-pc-purchases-will-be-ai-computers.html
- Viviane Potocnik, Luca Colagrande, Tim Fischer, Luca Bertaccini, Daniele Jahier Pagliari, Alessio Burrello, Luca Benini, 29 May 2024, Optimizing Foundation Model Inference on a Many-tiny-core Open-source RISC-V Platform, https://arxiv.org/abs/2405.19284
- Djip007, May 2024, llamafile 0.8.6 CPU benchmark #450, https://github.com/Mozilla-Ocho/llamafile/discussions/450 (Running llamafile at 20 tokens per second on a non-GPU commodity CPU.)
- Ken Yeung, May 21, 2024, Microsoft introduces Phi-Silica, a 3.3B parameter model made for Copilot+ PC NPUs, https://venturebeat.com/ai/microsoft-introduces-phi-silica-a-3-3b-parameter-model-made-for-copilot-pc-npus/
- OpenBMB, May 2024, MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone, https://github.com/OpenBMB/MiniCPM-V
- Mandar Karhade, Dec 13, 2023, Make Any* LLM fit Any GPU in 10 Lines of Code, Towards AI, https://pub.towardsai.net/make-any-llm-fit-any-gpu-in-10-lines-of-code-dba28eebf5ba
- Yash Bhaskar, Feb 22, 2024, Gemma vs. Mistal: Comparison of Smaller AI-Language Models, Cubed, https://blog.cubed.run/gemma-vs-mistal-comparison-of-smaller-ai-language-models-a9482f87b0f2
- Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu, 12 Apr 2024, LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation, https://arxiv.org/abs/2404.16054
- Juyong Lee, Taywon Min, Minyong An, Changyeon Kim, Kimin Lee, 25 Apr 2024, Benchmarking Mobile Device Control Agents across Diverse Configurations, https://arxiv.org/abs/2404.16660 Code: https://b-moca.github.io/
- Minseok Seo, Xuan Truong Nguyen, Seok Joong Hwang, Yongkee Kwon, Guhyun Kim, Chanwook Park, Ilkon Kim, Jaehan Park, Jeongbin Kim, Woojae Shin, Jongsoon Won, Haerang Choi, Kyuyoung Kim, Daehan Kwon, Chunseok Jeong, April 2024, IANUS: Integrated Accelerator based on NPU-PIM Unified Memory System, ASPLOS '24: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, Pages 545–560, https://doi.org/10.1145/3620666.3651324 https://dl.acm.org/doi/abs/10.1145/3620666.3651324
- Christian Guyton, Apr 26, 2024, iOS 18 could be loaded with AI, as Apple reveals 8 new artificial intelligence models that run on-device, Tech Radar, https://www.techradar.com/computing/artificial-intelligence/ios-18-could-be-loaded-with-ai-as-apple-reveals-8-new-artificial-intelligence-models-that-run-on-device
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Multiplatform.AI, Apr 7, 2024, Stanford University’s Octopus v2: Revolutionizing On-Device Language Models for Enhanced Agent Capabilities, https://medium.com/@multiplatform.ai/stanford-universitys-octopus-v2-revolutionizing-on-device-language-models-for-enhanced-agent-c6602d3cc026
- Gavin Li, April 2024, Run the strongest open-source LLM model: Llama3 70B with just a single 4GB GPU!, https://ai.gopubby.com/run-the-strongest-open-source-llm-model-llama3-70b-with-just-a-single-4gb-gpu-7e0ea2ad8ba2 (Run Llamaa3-70B with AirLLM framework on a Macbook.)
- Wei Niu, Md Musfiqur Rahman Sanim, Zhihao Shu, Jiexiong Guan, Xipeng Shen, Miao Yin, Gagan Agrawal, Bin Ren, 21 Apr 2024, SmartMem: Layout Transformation Elimination and Adaptation for Efficient DNN Execution on Mobile, https://arxiv.org/abs/2404.13528 (Choosing optimal tensor memory layouts to optimize low-level operator kernels.)
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, Ziyi Yang, Donghan Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou, 23 Apr 2024 ( v2), Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, https://arxiv.org/abs/2404.14219
- Benj Edwards, 24 April, 2024, Microsoft’s Phi-3 shows the surprising power of small, locally run AI language models, https://arstechnica.com/information-technology/2024/04/microsofts-phi-3-shows-the-surprising-power-of-small-locally-run-ai-language-models/
- Martin Thissen, April 20, 2024, Llama 3 on Your Local Computer | Free GPT-4 Alternative, https://medium.com/@martin-thissen/llama-3-on-your-local-computer-free-gpt-4-alternative-1f533e9abff7 (Llama3-70B with 4-bit quantization using vLLM for inference on NVIDIA RTX 6000 Ada GPU.)
- William Gallagher, Apr 16, 2024, When to expect every Mac to get the AI-based M4 processor, Apple Insider, https://appleinsider.com/articles/24/04/14/when-to-expect-every-mac-to-get-the-ai-based-m4-processor
- Marko Zivkovic, Apr 15, 2024, Apple's iOS 18 AI will be on-device preserving privacy, and not server-side, https://appleinsider.com/articles/24/04/15/apples-ios-18-ai-will-be-on-device-preserving-privacy-and-not-server-side
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan, 8 Apr 2024, Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs, https://arxiv.org/abs/2404.05719
- Kif Leswing, April 9, 2024, Intel unveils latest AI chip as Nvidia competition heats up, CNBC, https://www.cnbc.com/2024/04/09/intel-unveils-gaudi-3-ai-chip-as-nvidia-competition-heats-up-.html (Intel Gaudi 3 chip for high-end datacenter usage, competing with NVIDIA H100.)
- PyTorch Edge Team, October 17, 2023, PyTorch Edge: Enabling On-Device Inference Across Mobile and Edge Devices with ExecuTorch, https://pytorch.org/blog/pytorch-edge/?hss_channel=lcp-78618366
- Vihanga Ashinsana Wijayasekara, Oct 18, 2020, On-Device AI — What I know so far, https://medium.com/@VihangaAW/on-device-ai-what-i-know-so-far-4f541f399f94
- Marat Dukhan and Frank Barchard, November 29, 2023, Half-precision Inference Doubles On-Device Inference Performance, TensorFlow Blog, https://blog.tensorflow.org/2023/11/half-precision-inference-doubles-on-device-inference-performance.html
- Arun Kandoor, August 3, 2022 Efficient Sequence Modeling for On-Device ML, Google Research Blog, https://research.google/blog/efficient-sequence-modeling-for-on-device-ml/
- Shuai Zhu, Thiemo Voigt, JeongGil Ko, Fatemeh Rahimian, 9 May 2023 (v2), On-device Training: A First Overview on Existing Systems, https://arxiv.org/abs/2212.00824
- Seungtae Hong, Gunju Park, Jeong-Si Kim, 9 June 2024, Automated deep-learning model optimization framework for microcontrollers, https://doi.org/10.4218/etrij.2023-0522 https://onlinelibrary.wiley.com/doi/full/10.4218/etrij.2023-0522 (Framework for using quantization and pruning on microcontroller devices.)
- MIT Technical Review, On-Device AI, https://www.technologyreview.com/hub/ubiquitous-on-device-ai/
- Siddhant Sahu, May 30, 2024, Beyond the Cloud: Distributed AI and On-Device Intelligence: Transition of AI workflows from cloud to the edge with specialized chip infrastructure & models, multi-modality and ambience across devices, https://sidstage.substack.com/p/beyond-the-cloud-distributed-ai-and
- Robert Wolfe, Isaac Slaughter, Bin Han, Bingbing Wen, Yiwei Yang, Lucas Rosenblatt, Bernease Herman, Eva Brown, Zening Qu, Nic Weber, and Bill Howe. 2024. Laboratory-Scale AI: Open-Weight Models are Competitive with ChatGPT Even in Low-Resource Settings. In ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT ’24), June 3–6, 2024, Rio de Janeiro, Brazil. ACM, New York, NY, USA, 18 pages. https://doi.org/10.1145/3630106.3658966 https://arxiv.org/pdf/2405.16820
- Shengyuan Ye, Jiangsu Du, Liekang Zeng, Wenzhong Ou, Xiaowen Chu, Yutong Lu, Xu Chen, 27 May 2024, Galaxy: A Resource-Efficient Collaborative Edge AI System for In-situ Transformer Inference, https://arxiv.org/abs/2405.17245
- Anonymous authors, 2024, Distributed Inference Performance Optimizations for LLMs on CPUs, ICLR 2024, https://openreview.net/pdf?id=oEbILBMvDS
- Qualcomm, May 2023, The future of AI is hybrid, Qualcomm White Paper, https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/Whitepaper-The-future-of-AI-is-hybrid-Part-1-Unlocking-the-generative-AI-future-with-on-device-and-hybrid-AI.pdf
- Rocke, F. (2023), Evaluation of C++ SIMD Libraries, Bachelor's Thesis, Institut für Informatik, Ludwig-Maximilians-Universität München, https://www.mnm-team.org/pub/Fopras/rock23/ PDF: https://www.mnm-team.org/pub/Fopras/rock23/PDF-Version/rock23.pdf (Reviewed six SIMD libraries: Highway, Vc, Libsimdpp, NSIMD, SIMD Everywhere, Pure SIMD.)
- Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Computes a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, Dec 2023, LLM in a flash: Efficient Large Language Model Inference with Limited Memory Apple Research, https://arxiv.org/abs/2312.11514
- MyungJoo Ham, Jijoong Moon, Geunsik Lim, Jaeyun Jung, Hyoungjoo Ahn, Wook Song, Sangjung Woo, Parichay Kapoor, Dongju Chae, Gichan Jang, Yongjoo Ahn, and Jihoon Lee. 2021. NNStreamer: Efficient and Agile Development of On-Device AI Systems. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 198--207. https://arxiv.org/abs/2101.06371
- Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E Gonzalez, et al. High-throughput generative inference of large language models with a single gpu. arXiv preprint arXiv:2303.06865, 2023. https://arxiv.org/abs/2303.06865
- Mozilla, June 3, 2024, Announcing Mozilla Builders: 2024 Accelerator Theme: Local AI, https://future.mozilla.org/builders/blog/announcing-mozilla-builders/
- Hesam Sheikh, Jun 1, 2024, Towards AI Build Blog Writer and Researcher AI Agents with Ollama (100% local): Creating AI agents with Crewai and using Ollama to run them 100% locally in 5 very easy steps!, https://pub.towardsai.net/build-your-first-ai-agent-in-5-easy-steps-100-local-2fb771438a8f
- Tom Warren, April 9, 2024, Microsoft is confident Windows on Arm could finally beat Apple, The Verge, https://www.theverge.com/2024/4/8/24116587/microsoft-macbook-air-surface-arm-qualcomm-snapdragon-x-elite
- Ben Dickson, Feb 24, 2024, What is Apple’s generative AI strategy? Venture Beat, https://venturebeat.com/ai/what-is-apples-generative-ai-strategy/ (Theorizing that Apple is planning on-device generative AI for iPhone and Apple Watch.)
- James Rogerson, 25 March 2024, The iPhone 16 Pro’s chipset could be designed with AI in mind, https://www.techradar.com/phones/iphone/the-iphone-16-pros-chipset-could-be-designed-with-ai-in-mind (Rumors that the Apple A18 chipset may be designed with on-device AI in mind.)
- Hasanul Mahmud, Peng Kang, Kevin Desai, Palden Lama, Sushil Prasad, 11 Mar 2024, A Converting Autoencoder Toward Low-latency and Energy-efficient DNN Inference at the Edge, https://arxiv.org/abs/2403.07036 (Hybrid cloud and on-device inference for image analysis.)
- Semaphore, Dec 14, 2023, 6 Ways to Run LLMs Locally, https://semaphoreci.medium.com/6-ways-to-run-llms-locally-fa25be0797e5 (The six ways are HF Transformers, LangChain, Llama.cpp, Llamafile, Ollama, and GPT4All.)
- Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba. arXiv:1905.06874 [cs.IR] https://arxiv.org/abs/1905.06874
- Gavin Li, Nov 19, 2023, Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique, AI Advances https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
- John Borthwick, May 28, 2024, Announcing AI Camp: Native Applications, https://render.betaworks.com/announcing-ai-camp-native-applications-e1358061c601
- C Luo, X He, J Zhan, L Wang, W Gao, J Dai, 2020, Comparison and benchmarking of AI models and frameworks on mobile devices, https://arxiv.org/abs/2005.05085
- Apple, June 10, 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- Ignacio de Gregorio, June 2024, My Thoughts on Apple Intelligence: Leveling the Stakes & Betraying the Essence, https://readmedium.com/en/my-thoughts-on-apple-intelligence-16a793359cb5
- Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
- Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
- Intel, Apr 25, 2024, Deployment of Llama3 on Your AI PC with OpenVINO™, https://medium.com/openvino-toolkit/deployment-of-llama3-on-your-ai-pc-with-openvino-b58e961501d6
- Rithesh Murthy, Liangwei Yang, Juntao Tan, Tulika Manoj Awalgaonkar, Yilun Zhou, Shelby Heinecke, Sachin Desai, Jason Wu, Ran Xu, Sarah Tan, Jianguo Zhang, Zhiwei Liu, Shirley Kokane, Zuxin Liu, Ming Zhu, Huan Wang, Caiming Xiong, Silvio Savarese, 12 Jun 2024, MobileAIBench: Benchmarking LLMs and LMMs for On-Device Use Cases, https://arxiv.org/abs/2406.10290
- Jason Perlow, June 13, 2024, The expensive reason why Apple's upcoming AI features aren't coming to your older iPhone, https://www.zdnet.com/article/the-expensive-reason-why-apples-upcoming-ai-features-arent-coming-to-your-older-iphone/
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey Bigham, Zhile Ren, Cecile Foret, Qi Shan, Xiaoyi Zhang, April 2024, Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference, https://arxiv.org/abs/2404.03085 https://machinelearning.apple.com/research/talaria
- Ignacio de Gregorio, June 2024, How Does Apple Intelligence Really Work? Deep dive into Apple’s newest bet, https://medium.com/@ignacio.de.gregorio.noblejas/how-does-apple-intelligence-really-work-5f79b368c86d
- Rohan Goswami 21 June, 2024, Apple Intelligence won’t launch in EU in 2024 due to antitrust regulation, company says, CNBS, https://www.cnbc.com/2024/06/21/apple-ai-europe-dma-macos.html
- Katie Collins, March 6, 2024, On-Device AI Is a Whole New Way of Experiencing Artificial Intelligence, https://www.cnet.com/tech/mobile/on-device-ai-is-a-whole-new-way-of-experiencing-artificial-intelligence/
- Ben Dickson, June 11, 2024, What we know about Apple’s on-device AI, https://venturebeat.com/ai/what-we-know-about-apples-on-device-ai/
- By Ben Dickson, December 27, 2023, Apple research paper hints at LLMs on iPhones and Macs, https://bdtechtalks.com/2023/12/27/apple-llm-flash-research/
- William Brown, June 23, 2024, ParaLLM: 1300+ tok/s on a MacBook: Batched KV caching for fast parallel LLM inference in MLX, https://willcb.com/blog/parallm/ Code: https://github.com/willccbb/mlx_parallm/tree/main
- Xiang Li, Zhenyan Lu, Dongqi Cai, Xiao Ma, Mengwei Xu, 11 June 2024, Large Language Models on Mobile Devices: Measurements, Analysis, and Insights, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 1 - 6, PDF: https://doi.org/10.1145/3662006.3662059 https://dl.acm.org/doi/abs/10.1145/3662006.3662059 https://dl.acm.org/doi/pdf/10.1145/3662006.3662059
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Kyle Kucharski, June 26, 2024, I saw the future of AI at Qualcomm's headquarters, and Copilot+ PCs were only just the beginning, https://www.zdnet.com/article/i-saw-the-future-of-ai-at-qualcomms-headquarters-and-copilot-pcs-were-only-just-the-beginning/
- Franklin Huang, May 17, 2024, Machine Learning Systems with Reduced Memory Requirements, Masters of Science, Electrical Engineering and Computer Sciences, University of California, Berkeley, Technical Report No. UCB/EECS-2024-120 http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf Code: https://github.com/hongyihuang/spec-mcts/blob/main/triton (Broad paper that examines a lot of different optimizations that reduce memory costs, including quantization, kernel fusion, sparsity, MatMul optimizations, KV cache compression, and various other methods.)
- Kamila Wojciechowska July 2nd, 2024, Exclusive: This is Google AI, and it's coming to the Pixel 9, https://www.androidauthority.com/google-ai-recall-pixel-9-3456399/
- Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang, 25 Jun 2024, T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge, https://arxiv.org/abs/2407.00088 Code: https://github.com/microsoft/T-MAC (Table lookup for low-bit quantization on CPUs.)
- Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
- Ying He, Jingcheng Fang, F. Richard Yu, Victor C. Leung, 2024, Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach, PrePrints pp. 1-12, DOI: 10.1109/TMC.2024.3415661, https://www.computer.org/csdl/journal/tm/5555/01/10591707/1YraFlDdKYo
- Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 27 Jun 2024 (v2), MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, Meta Research, https://arxiv.org/abs/2402.14905 Code: https://github.com/facebookresearch/MobileLLM
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Gang Huang, Mengwei Xu, Xuanzhe Liu, 8 Jul 2024, Empowering 1000 tokens/second on-device LLM prefilling with mllm-NPU, https://arxiv.org/abs/2407.05858
- Esther Shein Jul 9 2024, Anticipating the Year of the AI PC, https://cacm.acm.org/news/anticipating-the-year-of-the-ai-pc/
- Adarsh Prasad Behera, Paulius Daubaris, Iñaki Bravo, José Gallego, Roberto Morabito, Joerg Widmer, Jaya Prakash Varma Champati, 10 Jul 2024, Exploring the Boundaries of On-Device Inference: When Tiny Falls Short, Go Hierarchical, https://arxiv.org/abs/2407.11061
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Dmitriy Pastushenkov, Ria Cheruvu, Max Domeika, Paula Ramos, Apr 20, 2024, AI is coming to the PC — AI PC Essentials, https://medium.com/openvino-toolkit/ai-is-coming-to-the-pc-ai-pc-essentials-ba2aa8686a59
- Arjun Kharpal, July 25, 2024, Samsung hints at new products as it bets on AI to drive upgrades to its latest foldable phones, https://www.cnbc.com/2024/07/26/samsung-tm-roh-interview-galaxy-ai-mixed-reality-and-foldables.html
- Vince Lam, Mar 12, 2024, 50+ Open-Source Options for Running LLMs Locally, https://medium.com/thedeephub/50-open-source-options-for-running-llms-locally-db1ec6f5a54f
- Allison Johnson, Aug 1, 2024, A first look at Apple Intelligence and its (slightly) smarter Siri, The Verge, https://www.theverge.com/2024/7/31/24209910/apple-intelligence-ios-18-preview-siri
- Philip Wiese, Gamze İslamoğlu, Moritz Scherer, Luka Macan, Victor J.B. Jung, Alessio Burrello, Francesco Conti, Luca Benini, 5 Aug 2024, Toward Attention-based TinyML: A Heterogeneous Accelerated Architecture and Automated Deployment Flow, https://arxiv.org/abs/2408.02473
- Jaewook Lee, Yoel Park, Seulki Lee, 7 Aug 2024, Designing Extremely Memory-Efficient CNNs for On-device Vision Tasks, https://arxiv.org/abs/2408.03663
- Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie, 10 Jul 2024, Inference Performance Optimization for Large Language Models on CPUs, https://arxiv.org/abs/2407.07304 Code: https://github.com/intel/xFasterTransformer
- Nithur, Aug 5, 2024, How to Build With the Chrome’s Latest Built-in AI: Setting up Gemini Nano in Your Browser and Building a Practical Use Case With It, https://pub.towardsai.net/how-to-build-with-the-chromes-latest-built-in-ai-cb0f901c0a3e
- Yucheng Ding, Chaoyue Niu, Fan Wu, Shaojie Tang, Chengfei Lyu, Guihai Chen, 24 August 2024, Enhancing On-Device LLM Inference with Historical Cloud-Based LLM Interactions, KDD '24: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Pages 597 - 608, https://doi.org/10.1145/3637528.3671679 https://dl.acm.org/doi/abs/10.1145/3637528.3671679 (External datastore of user interactions to speed up on-device LLM.)
- Raymond Lo, Jul 10, 2024, How to Build Faster GenAI Apps with Fewer Lines of Code using OpenVINO™ GenAI API, https://medium.com/openvino-toolkit/how-to-build-faster-genai-apps-with-fewer-lines-of-code-using-openvino-genai-api-5dd5fcabea17
- Karan Goel, August 27, 2024, The On‑Device Intelligence Update https://cartesia.ai/blog/2024-08-27-on-device (On-device state space models.)
- Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, and Xiaoyi Zhang. 2024. Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 648, 1–19. https://doi.org/10.1145/3613904.3642628 https://dl.acm.org/doi/full/10.1145/3613904.3642628
- Hao Zhou, Chengming Hu, Ye Yuan, Yufei Cui, Yili Jin, Can Chen, Haolun Wu, Dun Yuan, Li Jiang, Di Wu, Xue Liu, Charlie Zhang, Xianbin Wang, Jiangchuan Liu, 17 May 2024, Large Language Model (LLM) for Telecommunications: A Comprehensive Survey on Principles, Key Techniques, and Opportunities, https://arxiv.org/abs/2405.10825
- Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng, 7 Dec 2023 (v2), Efficient LLM Inference on CPUs, https://arxiv.org/abs/2311.00502 https://github.com/intel/intel-extension-for-transformers
- Neural Magic, 2024, DeepSparse: Sparsity-aware deep learning inference runtime for CPUs, https://github.com/neuralmagic/deepsparse https://neuralmagic.com/deepsparse/
- Sujeet Kumar, May 20, 2024, 14 Best Software for Running local LLM, https://scifilogic.com/interface-for-running-local-llm/
- David Spuler, June 2024, Aussie AI, Optimizing On-Device Transformer Inference for Source Code Checking: IP Australia, https://ipsearch.ipaustralia.gov.au/patents/2024901675
- Jiajun Xu, Zhiyuan Li, Wei Chen, Qun Wang, Xin Gao, Qi Cai, Ziyuan Ling, 26 Aug 2024, On-Device Language Models: A Comprehensive Review, https://arxiv.org/abs/2409.00088 https://github.com/NexaAI/Awesome-LLMs-on-device https://www.nexaai.com/models
- Tyler Mullen, August 22, 2024, Unlocking 7B+ language models in your browser: A deep dive with Google AI Edge's MediaPipe, https://research.google/blog/unlocking-7b-language-models-in-your-browser-a-deep-dive-with-google-ai-edges-mediapipe/
- Huan Yang, Deyu Zhang, Yudong Zhao, Yuanchun Li, Yunxin Liu, 6 Sep 2024, A First Look At Efficient And Secure On-Device LLM Inference Against KV Leakage, https://arxiv.org/abs/2409.04040 (Security issues where KV caches can be data leaks as they may contain encodings of private information.)
- Othmane Friha, Mohamed Amine Ferrag, Burak Kantarci, Burak Cakmak, Arda Ozgun, Nassira Ghoualmi-Zine, 2024, LLM-based Edge Intelligence: A Comprehensive Survey on Architectures, Applications, Security and Trustworthiness, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10669603
- Ben Dickson, September 13, 2024, Apple aims for on-device user intent understanding with UI-JEPA models https://venturebeat.com/ai/apple-aims-for-on-device-user-intent-understanding-with-ui-jepa-models/
- Michael Nuñez, September 13, 2024, Microsoft’s Windows Agent Arena: Teaching AI assistants to navigate your PC, https://venturebeat.com/ai/microsofts-windows-agent-arena-teaching-ai-assistants-to-navigate-your-pc/
- Wangsong Yin, Rongjie Yi, Daliang Xu, Gang Huang, Mengwei Xu, Xuanzhe Liu, 8 Sep 2024, ELMS: Elasticized Large Language Models On Mobile Devices, https://arxiv.org/abs/2409.09071
- Steve Kovach, Sep 5 2024, AI gadgets have been a bust so far. Apple aims to change that, https://www.cnbc.com/2024/09/05/ai-gadgets-have-been-a-bust-so-far-apple-aims-to-change-that.html
- Amos Gyamfi, Aug 28, 2024, The 6 Best LLM Tools To Run Models Locally, https://medium.com/@amosgyamfi/the-6-best-llm-tools-to-run-models-locally-eedd0f7c2bbd
- Kif Leswing, Fri, Oct 4 2024, As Apple enters AI race, iPhone maker turns to its army of developers for an edge, https://www.cnbc.com/2024/10/04/apple-is-turning-to-its-army-of-developers-for-an-edge-in-the-ai-race.html
- Yagil Burowski, Alyssa Coghlan, Neil Mehta, Matt Clayton, 2024-10-08, LM Studio 0.3.4 ships with Apple MLX, https://lmstudio.ai/blog/lmstudio-v0.3.4
- Yue Zheng, Yuhao Chen, Bin Qian, Xiufang Shi, Yuanchao Shu, Jiming Chen, 29 Sep 2024, A Review on Edge Large Language Models: Design, Execution, and Applications, https://arxiv.org/abs/2410.11845
- Joshua Xenova, October 22, 2024, Transformers.js v3: WebGPU Support, New Models & Tasks, and More…, https://huggingface.co/blog/transformersjs-v3
- Haoran Lin, Xianzhi Yu, Kang Zhao, Lu Hou, Zongyuan Zhan, Stanislav Kamenev, Han Bao, Ting Hu, Mingkai Wang, Qixin Chang, Siyue Sui, Weihao Sun, Jiaxin Hu, Jun Yao, Zekun Yin, Cheng Qian, Ying Zhang, Yinfei Pan, Yu Yang, Weiguo Liu, 22 Oct 2024, FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs, https://arxiv.org/abs/2410.16663
- David Gewirtz, Oct. 25, 2024, I wrote half this article on Apple Watch, thanks to this under-the-radar iOS 18 feature: Here's how to transform your writing workflow and turn your Apple Watch into a productivity powerhouse, https://www.zdnet.com/article/i-wrote-half-this-article-on-apple-watch-thanks-to-this-under-the-radar-ios-18-feature/
- Meta, October 24, 2024, Introducing quantized Llama models with increased speed and a reduced memory footprint, https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Chris Wellons, November 10, 2024, Everything I've learned so far about running local LLMs, https://nullprogram.com/blog/2024/11/10/
- Justine, Apr 2023, Edge AI Just Got Faster, https://justine.lol/mmap/ (Loading models using mmap.)
- Conner Takehana, Aaryan Singhal, Nov 28, 2024, ThunderMittens For Your ThunderKittens, https://hazyresearch.stanford.edu/blog/2024-11-28-tk-mlx (Porting TK to Apple Metal and MLX on the M2 chips.)
- Simon Willison, Dec 2024, I can now run a GPT-4 class model on my laptop. Meta’s new Llama 3.3 70B is a genuinely GPT-4 class Large Language Model that runs on my laptop. https://simonwillison.net/2024/Dec/9/llama-33-70b/?utm_source=tldrnewsletter
Hybrid Cloud-Device Architectures
A hybrid architecture is also possible, combining LLM inference on the physical device (on-device) with queries sent over the network to cloud servers. This is called hybrid cloud-on-device inference.
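As a rough illustration of one such policy (a sketch only, with hypothetical function names and thresholds), a hybrid app might keep private or short prompts on a small local model and escalate the rest to a larger cloud model, similar in spirit to AutoDroid's mix of on-device Vicuna and cloud GPT-4 (cited above):
```cpp
#include <iostream>
#include <string>

// Hypothetical stubs for the two engines (not a real API).
std::string small_local_llm(const std::string& prompt) {
    return "[local answer to: " + prompt + "]";  // e.g., a quantized 7B model on the NPU
}

std::string large_cloud_llm(const std::string& prompt) {
    return "[cloud answer to: " + prompt + "]";  // e.g., a frontier model behind an API
}

// Illustrative hybrid policy: privacy-sensitive or short prompts stay
// on-device; long prompts go to the more capable cloud model.
std::string hybrid_answer(const std::string& prompt, bool has_private_data) {
    const std::size_t kLocalPromptLimit = 2048;  // assumed on-device context budget
    if (has_private_data || prompt.size() <= kLocalPromptLimit)
        return small_local_llm(prompt);  // no network round-trip, data stays local
    return large_cloud_llm(prompt);      // pays network latency for a bigger model
}

int main() {
    std::cout << hybrid_answer("Summarize my private notes", true) << "\n";
    return 0;
}
```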
Research papers on hybrid cloud-on-device inference:
- Qualcomm, May 2023, The future of AI is hybrid, Qualcomm White Paper, https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/Whitepaper-The-future-of-AI-is-hybrid-Part-1-Unlocking-the-generative-AI-future-with-on-device-and-hybrid-AI.pdf
- Yanming Liu, Xinyue Peng, Jiannan Cao, Le Dai, Xingzu Liu, Weihao Liu, Mingbang Wang, 11 Mar 2024, SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation, https://arxiv.org/abs/2403.07088 (A hybrid cloud and on-device inference method to retain privacy.)
- Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (Computes a GPU-CPU hybrid engine with some "active" neurons run on the GPU and other less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
- C Luo, X He, J Zhan, L Wang, W Gao, J Dai, 2020, Comparison and benchmarking of AI models and frameworks on mobile devices, https://arxiv.org/abs/2005.05085
Estimating On-Device Throughput from TOPS
NOTE: This analysis seems mostly bogus, but at least it's a starting point. I don't think I've seen a paper that addresses this estimation issue in benchmarking.
TOPS stands for Tera Operations Per Second, i.e., trillions of computations per second (the operations may be integer or floating-point, depending on the chip). This section attempts a naive estimate of throughput rates for phone inference using the reported TOPS numbers and model weight counts, but the result doesn't seem very accurate.
Let us examine the TOPS ratings for some of Apple's chips. If we assume that one TOPS means one trillion floating-point operations per second, the estimate goes like this:
- Apple A16 Bionic (in iPhone 14 and iPhone 15) has about 17 TOPS rating.
- Transformer engines touch every weight in an inference computation.
- Default autoregressive architectures repeat this computation for every output token.
Consider GPT-2 with about 1.5B weights, or GPT-4 with 176B weights per expert (reportedly an 8-model MoE architecture, where each inference uses only one expert model). The estimates then become:
- 17 trillion operations per second divided by GPT-2's 1.5 billion weights gives about 11,333 tokens per second.
- 17 trillion divided by GPT-4's single-expert 176 billion weights gives about 96 tokens per second.
These estimates seem far too high (or are they?), and it's not clear exactly what is happening. According to these numbers, running GPT-2 on an iPhone should really fly, but that isn't what's reported in the research papers. Perhaps the TOPS metric doesn't reflect actual floating-point operations in the A16 chip, or perhaps the real cost of model inference is much higher than one operation per weight for each decoded token, due to prefill costs and memory access costs (inference engines are "memory-bound"). We also haven't accounted for practical problems such as battery depletion, non-responsive phones (spinning due to computations), and physical temperature increase (AI is hot, literally).
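Here is a minimal C++ sketch of both back-of-envelope estimates side by side: the compute-bound estimate from the TOPS rating used above, plus a memory-bound estimate based on memory bandwidth. The ~50 GB/s bandwidth figure and the one-operation-per-weight rule are both rough assumptions for illustration, not official specs:
```cpp
#include <cstdio>

// Compute-bound estimate: assumes one operation per weight per decoded
// token, and that the chip's full TOPS rating is achievable in practice.
double compute_bound_tokens_per_sec(double tops, double num_weights) {
    return (tops * 1e12) / num_weights;
}

// Memory-bound estimate: every weight must be streamed from memory for
// each decoded token, so throughput is capped by bandwidth / model bytes.
double memory_bound_tokens_per_sec(double bytes_per_sec, double num_weights,
                                   double bytes_per_weight) {
    return bytes_per_sec / (num_weights * bytes_per_weight);
}

int main() {
    const double a16_tops = 17.0;   // reported A16 Bionic rating
    const double bandwidth = 50e9;  // ~50 GB/s memory bandwidth (assumed)
    const double fp16 = 2.0;        // bytes per weight at FP16

    printf("GPT-2 1.5B:  compute-bound %.0f tok/s, memory-bound %.0f tok/s\n",
           compute_bound_tokens_per_sec(a16_tops, 1.5e9),        // ~11,333
           memory_bound_tokens_per_sec(bandwidth, 1.5e9, fp16)); // ~17
    printf("GPT-4 176B:  compute-bound %.0f tok/s, memory-bound %.2f tok/s\n",
           compute_bound_tokens_per_sec(a16_tops, 176e9),        // ~97
           memory_bound_tokens_per_sec(bandwidth, 176e9, fp16)); // ~0.14
    return 0;
}
```
The memory-bound numbers come out orders of magnitude lower, and much closer to the token rates reported in the mobile benchmarking papers above, which is consistent with the "memory-bound" explanation.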
More AI Research
Read more about: