Aussie AI
LLM Phone Research
-
Last Updated 29 November, 2024
-
by David Spuler, Ph.D.
AI is coming to your phone; see our GenAI market research, and it's still early in this trend. AI PCs are also headed for your desk.
Can an AI model run fast enough on your phone? Much of the early research relevant to fast on-phone model execution comes from another type of computer, which you might know as a "car". Computer vision models for automated or assisted driving have requirements similar to running on a phone, such as low latency and small storage. The general term is an "embedded" or "real-time" system.
LLMs on Your Smartphone
There are already plenty of "AI" apps available to put on your phone, but these are almost certainly all sending the requests over the network to an AI engine in the cloud. Running an AI model directly on your phone is problematic for several reasons:
- Too slow to run (response times will be long)
- Phones typically lack a GPU, and have only limited non-GPU hardware acceleration
- Storage size (e.g. a "small" 3B model with 32-bit weights needs 12 Gigabytes of storage; see the size arithmetic sketch after this list)
- Memory usage (not only do models need to be permanently stored, they are also loaded into RAM for inference)
- Transmission size (e.g. before you can "run" it, you need to download a 12-Gigabyte model over your phone's 4G or WiFi connection)
- Battery depletion and heat generation (i.e. all of those matrix multiplications will max out the phone's CPU and chew through cycles)
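As a rough sanity check on the sizes quoted above, here is a back-of-the-envelope sketch in C++ (the 3B and 13B parameter counts and the bit widths are just the examples used in this article; real model files add some metadata and container overhead):

```cpp
#include <cstdio>

// Approximate model file size: parameters * bits-per-weight / 8 bits-per-byte.
// Ignores metadata and container-format overhead.
static double model_gigabytes(double params_billions, int bits_per_weight) {
    return params_billions * 1e9 * bits_per_weight / 8.0 / 1e9;
}

int main() {
    const int widths[] = {32, 16, 8, 4};
    for (int bits : widths) {
        std::printf("3B model  @ %2d-bit: %5.1f GB\n", bits, model_gigabytes(3.0, bits));
        std::printf("13B model @ %2d-bit: %5.1f GB\n", bits, model_gigabytes(13.0, bits));
    }
    return 0;
}
```

At 32 bits this reproduces the 12GB (3B) and 52GB (13B) figures used throughout this article; 8-bit quantization divides both by four.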
For these reasons, it's still faster to send AI requests off to a bigger server with lots of GPUs that's running in the cloud, even though it's a roundtrip network message. Before you see any truly "native" AI models in your app store, research is required to overcome all of the above obstacles.
Future of AI Models on Phones
Over time, some of the obstacles to natively executing inference on phones will diminish:
- Better phone CPUs with hardware acceleration are already here (e.g. Qualcomm Snapdragon), with more on the way. Future phones will be more AI-capable.
- "AI Phones" with GPUs will surely be coming to a store near you.
- Phone storage sizes are also increasing.
- 5G network connectivity will reduce concerns about transmission sizes.
- Data compression algorithms can lower transmission sizes, and also possibly storage sizes.
- Quantized models and other inference optimizations can reduce CPU usage, speed up response times, and shrink both storage and transmission sizes (but with some accuracy loss); a small quantization sketch follows this list.
- Training and fine-tuning of models doesn't need to happen on a phone (phew!).
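To make the quantization point concrete, here is a minimal sketch of symmetric 8-bit linear quantization of a weight array. This assumes the simplest scheme, one scale per tensor; production engines typically use per-channel or per-group scales, and the type and function names here are illustrative:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric linear quantization: weight ~= scale * q, with q in [-127, 127].
// Cuts storage 4x versus 32-bit floats, at some cost in accuracy.
struct QuantizedTensor {
    std::vector<int8_t> q;
    float scale;
};

QuantizedTensor quantize_int8(const std::vector<float>& weights) {
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    QuantizedTensor out;
    out.scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    out.q.reserve(weights.size());
    for (float w : weights)
        out.q.push_back(static_cast<int8_t>(std::lround(w / out.scale)));
    return out;
}

// Recover an approximate float weight on demand during inference.
inline float dequantize(const QuantizedTensor& t, size_t i) {
    return t.scale * static_cast<float>(t.q[i]);
}
```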
But... you really need a "big" model, not a "small" model, if you want the app to be great with lots of happy users. And getting a big model running efficiently on a phone may take a while to come to fruition.
What's Needed?
Okay, so let's say you want to run a "big" model on a "small" phone. Why? Lots of reasons, which we won't explore here. So you want what you want, which is to run the open-source Llama 2 13B model on a phone.
The first question is: do you even need to? Why not just use the AI engines in the cloud, and send requests back-and-forth between the phone and the cloud? Response times on modern networks are fast, message sizes are small, and users may not notice or even care. But there are reasons beyond speed: privacy and security come to mind.
Another piece of good news: you don't need to "build" the model on your phone. Those GPU-expensive tasks of training or fine-tuning can be done in the cloud. For native execution, the user only needs to run "inference" of the model on their phone.
Assuming you have your reasons to want to do this, let's examine each of the obstacles for native phone execution of LLM model inference.
- Speed and response time. The AI engine on the phone needs fast "inference" (running the model quickly). And it probably cannot rely on a GPU, since there are already billions of phones out there without one, and hardware acceleration in phone CPUs is limited. The main way to run a model without a GPU on a phone or PC is to use inference optimizations, of which the most popular at the moment is definitely quantization. Other supplemental techniques that might be needed include integer-only arithmetic (see the kernel sketch after this list) and pruning (model compression). And there's a whole host of lesser-known inference optimization techniques that might need to be combined. For example, maybe the bottleneck of "auto-regression" will need to be bypassed so the AI engine can crank out multiple words at a time, without running the whole glob of a model for every single word.
- Network transmission size. Users need to download your 13B Llama 2 model to their phone? Uncompressed, it's about 52GB. There's already a lot known about compression algorithms (e.g. for video), and model files are just multi-gigabyte data files, so perhaps they can be compressed to an adequately small size. But before reaching for generic network compression algorithms, the first thing to try is model compression, such as quantization or pruning. For example, quantization to 8-bit reduces the original 32-bit model size four-fold, down to 13GB, for a slight loss of accuracy (probably acceptable). Binary quantization would reduce it by a factor of 32, but then the inference accuracy goes south. 5G bandwidth will help a lot, but remember there are billions of users out there with phones that aren't 5G-compatible. And the whole model is required: there's no such thing as half an AI model, and you can't stream a model so that it starts running before it's fully downloaded (although whether that might be possible is actually an interesting research question).
- Storage size. The whole model needs to be permanently stored on the device. Maybe it can be stored in some compressed form. The same comments about model compression techniques apply. It can either be stored uncompressed if the phone has a bigger storage space, or perhaps it can be stored in compressed form, and only uncompressed when it's needed. But it'll be needed all the time, because, well, it's AI you know, so everybody needs it for everything.
- Memory size. The inference algorithm needs the whole model, uncompressed, available in RAM. If it can't all fit at once, the engine must swap the entire model in and out of memory to process all those weights, for every single word it generates. That's either a lot of RAM (do you have a spare 52GB on your phone?) or a lot of processing cost from swapping data in and out. Again, model compression seems key to cutting down the original 52GB model size (e.g. 8-bit quantization cuts it to 13GB).
- Battery depletion and heat generation. A model with 13B weights needs to do 13 billion multiplications for every word it outputs. That's a lot of power usage. Getting resource utilization down means applying some of the above-mentioned optimizations to the inference algorithm (e.g. quantization, pruning, non-auto-regressive decoding, etc.).
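As one concrete example of the integer-only arithmetic mentioned in the first item above, here is a sketch of the inner loop of a quantized matrix-vector multiply: an 8-bit dot product with 32-bit accumulation. This is a simplified scalar version under the symmetric quantization scheme sketched earlier; real mobile kernels would use SIMD or NPU intrinsics:

```cpp
#include <cstddef>
#include <cstdint>

// Dot product over int8 weights and int8 activations.
// Accumulating in int32 avoids overflow for vectors up to ~2^17 elements;
// the two scales convert the integer result back to a float, once per output.
float dot_int8(const int8_t* w, const int8_t* x, size_t n,
               float w_scale, float x_scale) {
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += static_cast<int32_t>(w[i]) * static_cast<int32_t>(x[i]);
    return static_cast<float>(acc) * w_scale * x_scale;
}
```

The per-word battery cost described above comes from running loops like this billions of times; replacing 32-bit float multiplies with 8-bit integer ones cuts both memory traffic and power.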
The short answer is that multiple optimization techniques will probably need to be combined, and that success is several breakthroughs away, before native phone LLMs appear in the wild.
It might not even be possible to realistically run AI models natively on today's phones. But solving any of the above-mentioned problems is valuable in its own right, because it will also reduce the cost of running AI models on GPUs in the server farms growing in the cloud.
Articles and Press on AI Phones
The drumbeat of news articles and press releases has begun for "AI Phones" (and also AI PCs):
- David Lumb, Aug. 25, 2023, Qualcomm's 'Holy Grail': Generative AI Is Coming to Phones Soon, CNet, https://www.cnet.com/tech/mobile/generative-ai-is-coming-to-phones-next-year-thanks-to-qualcomm-chips/
- Corinne Reichert, July 19, 2023, Apple Has Created Its Own AI Chatbot, Report Says, CNet, https://www.cnet.com/tech/apple-has-created-its-own-ai-chatbot-report-says/
- Mark Gurman, July 20, 2023, Apple Tests ‘Apple GPT,’ Develops Generative AI Tools to Catch OpenAI, Bloomberg, https://www.bloomberg.com/news/articles/2023-07-19/apple-preps-ajax-generative-ai-apple-gpt-to-rival-openai-and-google
- James Laird, August 24, 2023, Top Android Phones Set to Pack Serious AI Power in 2024, tech.co, https://tech.co/news/android-phones-ai-power
- Steve McCaskill, April 15, 2019, Three quarters of smartphones will have AI chip by 2022, TechRadar, https://www.techradar.com/news/three-quarters-of-smartphones-will-have-ai-chip-by-2022 (Seems to have been a little early, but it isn't wrong.)
- Ryan McNeal, Hadlee Simons, July 19, 2023, Qualcomm and Meta will bring on-device AI to flagship phones in 2024, Android Authority, https://www.androidauthority.com/qualcomm-meta-on-device-ai-phones-2024-3346204/
- Qualcomm, July 19, 2023, Qualcomm Works with Meta to Enable On-device AI Applications Using Llama 2, https://www.qualcomm.com/news/releases/2023/07/qualcomm-works-with-meta-to-enable-on-device-ai-applications-usi
- Adamya Sharma, February 24, 2023, Your Android phone might soon be able to generate AI images in seconds, Android Authority, https://www.androidauthority.com/qualcomm-android-ai-image-generation-3289045/
- Ryan Whitwam, August 28, 2023, Qualcomm's Next Smartphone Chip Will Be Built for Generative AI, Extreme Tech, https://www.extremetech.com/mobile/qualcomms-next-smartphone-chip-will-be-built-for-generative-ai
- Joe Rossignol, October 19, 2023, Apple Rumored to Follow ChatGPT With Generative AI Features on iPhone as Soon as iOS 18, Mac Rumors, https://www.macrumors.com/2023/10/19/apple-generative-ai-late-2024-jeff-pu/
- Jasmine Wu, Laura Batchelor, Deirdre Bosa, Apple’s AI killer is... the iPhone, June 15, 2024, CNBC, https://www.cnbc.com/video/2024/06/14/apples-ai-killer-is-the-iphone.html
Survey Papers on AI Phones
Research survey papers about putting models onto a smartphone:
- Praveen Joshi, Mohammed Hasanuzzaman, Chandra Thapa, Haithem Afli, Ted Scully, "Enabling All In-Edge Deep Learning: A Literature Review", IEEE Access, vol.11, pp.3431-3460, 2023. https://ieeexplore.ieee.org/document/10007810, https://arxiv.org/abs/2204.03326 (Extensive survey of edge computing, including deployment architectures and optimizations.)
- X Wang, J Li, Z Ning, Q Song, L Guo, S Guo, July 2023, Wireless powered mobile edge computing networks: A survey, ACM Computing Surveys, Volume 55, Issue 13s, Article No. 263, pp 1–37, https://dl.acm.org/doi/abs/10.1145/3579992 PDF: http://101.43.59.126/static/53.Wireless_Powered_Mobile_Edge_Vomputing_Networks_A_Survey.pdf
- Kah Phooi Seng, Li-Minn Ang, "Embedded Intelligence: State-of-the-Art and Research Challenges", IEEE Access, vol.10, pp.59236-59258, 2022. https://ieeexplore.ieee.org/document/9775683, PDF: https://research.usc.edu.au/esploro/outputs/99640278002621
AI Phone Models Research
Research on smartphone AI applications:
- Benj Edwards, "You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi", Ars Technica, March 14th, 2023, https://arstechnica.com/information-technology/2023/03/you-can-now-run-a-gpt-3-level-ai-model-on-your-laptop-phone-and-raspberry-pi/
- Matthew S. Smith, "The Case for Running AI on CPUs Isn't Dead Yet", IEEE Spectrum, June 1st, 2023, https://spectrum.ieee.org/ai-cpu
- Alfonso Maruccia, "Qualcomm ran a complete Stable Diffusion AI model on an Android phone", Techspot, February 27, 2023, https://www.techspot.com/news/97744-qualcomm-ran-complete-stable-diffusion-ai-model-android.html
- Song Han, Huizi Mao, William J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", arXiv:1510.00149v5 [cs.CV], 15 Feb 2016, https://arxiv.org/abs/1510.00149
- XNNPACK, https://github.com/google/XNNPACK
- Pruning for on-device inference w/ XNNPACK (TensorFlow), https://www.tensorflow.org/model_optimization/guide/pruning/pruning_for_on_device_inference
- Pier Paolo Ippolito, How to Deploy Machine Learning Models on Mobile and Embedded Devices, August 12, 2019, https://www.freecodecamp.org/news/machine-learning-for-mobile-and-embedded-devices/
- TensorFlow Demo on Android phones, https://github.com/tensorflow/examples/tree/master/lite/examples/image_classification/android
- Nimit S. Sohoni, Christopher R. Aberger et al, Low-Memory Neural Network Training: A Technical Report, https://arxiv.org/pdf/1904.10631.pdf
- M. Hollemans, Convolutional neural networks on the iphone with vggnet, 2016, http://machinethink.net/blog/convolutional-neural-networks-on-the-iphone-with-vggnet
- Model optimization (TensorFlow), https://www.tensorflow.org/lite/performance/model_optimization
- Michael Shea, "TensorFlow Lite Inception Model Android Tutorial", https://www.youtube.com/watch?v=8zQsAl2z4iU
- Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, Shufflenet: An extremely efficient convolutional neural network for mobile devices. In CVPR, pages 6848–6856, 2018, https://ieeexplore.ieee.org/abstract/document/8578814
- Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. Mobilebert: a compact task-agnostic bert for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020, https://arxiv.org/abs/2004.02984
- Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, Bin Ren, “PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY, USA: Association for Computing Machinery, Mar. 2020, pp. 907–922. doi:10.1145/3373376.3378534 https://arxiv.org/abs/2001.00138
- E. Li, L. Zeng, Z. Zhou, and X. Chen, “Edge AI: On-Demand Accelerating Deep Neural Network Inference via Edge Computing,” IEEE Transactions on Wireless Communications, vol. 19, no. 1, pp. 447–457, Jan. 2020. doi:10.1109/TWC.2019.2946140, https://arxiv.org/abs/1910.05316
- Manuele Rusci, Marco Fariselli, Alessandro Capotondi, and Luca Benini. Leveraging automated mixed-low-precision quantization for tiny edge microcontrollers. In IoT Streams for Data-Driven Predictive Maintenance and IoT, Edge, and Mobile for Embedded Machine Learning, pages 296–308. Springer, 2020, https://arxiv.org/abs/2008.05124
- Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, Song Han, AMC: AutoML for Model Compression and Acceleration on Mobile Devices, In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018, https://arxiv.org/abs/1802.03494
- Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016, https://arxiv.org/abs/1512.06473
- Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European Conference on Computer Vision (ECCV), pages 285–300, 2018, https://arxiv.org/abs/1804.03230
- Peng Peng, Mingyu You, Weisheng Xu, and Jiaxin Li. Fully integer-based quantization for mobile convolutional neural network inference. Neurocomputing, 432:194–205, 2021, https://www.sciencedirect.com/science/article/abs/pii/S0925231220319354
- J Mao, H Yang, A Li, H Li, Y Chen, TPrune: Efficient Transformer Pruning for Mobile Devices, ACM Trans. Cyber-Phys. Syst., Vol. 5, No. 3, Article 26, March 2021, DOI: https://doi.org/10.1145/3446640, https://dl.acm.org/doi/fullHtml/10.1145/3446640
- Dawei Li, Xiaolong Wang, and Deguang Kong. 2018. DeepRebirth: Accelerating Deep Neural Network Execution on Mobile Devices. In AAAI’18, https://arxiv.org/abs/1708.04728
- Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen, MobileNetV2: Inverted Residuals and Linear Bottlenecks, Mar 2019, https://arxiv.org/abs/1801.04381, Code: https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet
- Jinliang Yuan, Chen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, Mengwei Xu, Aug 2023, Rethinking Mobile AI Ecosystem in the LLM Era http://export.arxiv.org/abs/2308.14363, PDF: https://arxiv.org/pdf/2308.14363.pdf
- Matevž Fabjančič, Octavian Machidon, Hashim Sharif, Yifan Zhao, Saša Misailović, Veljko Pejović, March 2023, Mobiprox: Supporting Dynamic Approximate Computing on Mobiles, https://arxiv.org/abs/2303.11291 (Uses probabilistic approximations, such as perforation, as adaptive inference optimization techniques.)
- Samuel Carreira, Tomás Marques, José Ribeiro, Carlos Grilo, Sep 2023, Revolutionizing Mobile Interaction: Enabling a 3 Billion Parameter GPT LLM on Mobile, arXiv preprint arXiv:2310.01434, https://browse.arxiv.org/abs/2310.01434 (LoRA on a mobile platform.)
- PyTorch Edge Team, October 17, 2023, PyTorch Edge: Enabling On-Device Inference Across Mobile and Edge Devices with ExecuTorch, https://pytorch.org/blog/pytorch-edge/
- S Agrawal, P Ghosh, G Kumar, T Radhika, 2023, Memory Footprint Optimization for Neural Network Inference in Mobile SoCs, 2023 IEEE Women in Technology Conference (WINTECHCON) https://ieeexplore.ieee.org/abstract/document/10277304 (Improved management of memory buffers.)
- OpenBMB, May 2024, MiniCPM-Llama3-V 2.5: A GPT-4V Level Multimodal LLM on Your Phone, https://github.com/OpenBMB/MiniCPM-V
- Li Zhang, Shihe Wang, Xianqing Jia, Zhihan Zheng, Yunhe Yan, Longxi Gao, Yuanchun Li, Mengwei Xu, 12 Apr 2024, LlamaTouch: A Faithful and Scalable Testbed for Mobile UI Automation Task Evaluation, https://arxiv.org/abs/2404.16054
- Juyong Lee, Taywon Min, Minyong An, Changyeon Kim, Kimin Lee, 25 Apr 2024, Benchmarking Mobile Device Control Agents across Diverse Configurations, https://arxiv.org/abs/2404.16660 Code: https://b-moca.github.io/
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 22 Apr 2024, A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- Jiwei Huang, Fangzheng Liu, and Jianbin Zhang, “Multi-dimensional QoS Evaluation and Optimization of Mobile Edge Computing for IoT: A Survey,” Chinese Journal of Electronics, vol. 33, no. 5, pp. 1–16, 2024, doi: 10.23919/cje.2023.00.264, https://cje.ejournal.org.cn/article/doi/10.23919/cje.2023.00.264 (Theory of benchmarking and evaluation of mobile edge computing.)
- Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, Ziyi Yang, Donghan Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou, 23 Apr 2024 ( v2), Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone, https://arxiv.org/abs/2404.14219
- Benj Edwards, 24 April, 2024, Microsoft’s Phi-3 shows the surprising power of small, locally run AI language models, https://arstechnica.com/information-technology/2024/04/microsofts-phi-3-shows-the-surprising-power-of-small-locally-run-ai-language-models/
- William Gallagher, Apr 16, 2024, When to expect every Mac to get the AI-based M4 processor, Apple Insider, https://appleinsider.com/articles/24/04/14/when-to-expect-every-mac-to-get-the-ai-based-m4-processor
- Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan, 8 Apr 2024, Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs, https://arxiv.org/abs/2404.05719
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Tom Morgan-Freelander, 8 March 2024, Best AI phones: which smartphone has the best AI features? https://www.stuff.tv/features/best-ai-phones-which-smartphone-has-the-best-ai-features/
- Stefanos Laskaridis, Kleomenis Katevas, Lorenzo Minto, Hamed Haddadi, 20 Mar 2024 (v2), MELTing point: Mobile Evaluation of Language Transformers, https://arxiv.org/abs/2403.12844 (Survey and benchmarking of SOTA methods for running LLM inference natively on phones including iPhone and Android, with quantization levels, and with measurement of speed and battery depletion.)
- Qualcomm, May 2023, The future of AI is hybrid, Qualcomm White Paper, https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/Whitepaper-The-future-of-AI-is-hybrid-Part-1-Unlocking-the-generative-AI-future-with-on-device-and-hybrid-AI.pdf
- Matthias Bastian, Dec 12, 2023, Run LLMs on your M Series with Apple's new MLX machine learning framework, AI in practice, https://the-decoder.com/run-llms-on-your-m-series-with-apples-new-mlx-machine-learning-framework/
- Matthias Bastian, Jun 6, 2023, Apple doesn't talk about AI, but puts it in everything anyway, AI in practice, https://the-decoder.com/apple-doesnt-talk-about-ai-but-puts-it-in-everything-anyway/
- Pedro Cuenca, August 8, 2023, Releasing Swift Transformers: Run On-Device LLMs in Apple Devices https://huggingface.co/blog/swift-coreml-llm Code: https://github.com/huggingface/swift-transformers Code: https://github.com/huggingface/swift-chat Code: https://huggingface.co/spaces/coreml-projects/transformers-to-coreml (Overview and code called "Swift Transformers" for running LLM models natively, such as Llama2 7B or Falcon 7B, on-device for Apple devices by wrapping CoreML.)
- Tim Hardwick, December 21, 2023, Apple Develops Breakthrough Method for Running LLMs on iPhones, Mac Rumors, https://www.macrumors.com/2023/12/21/apple-ai-researchers-run-llms-iphones/
- Google, Get started with Gemini Nano on Android (on-device), March 30, 2024 (accessed), https://ai.google.dev/tutorials/android_aicore
- Google, LLM Inference guide for iOS, March 30, 2024 (accessed), https://developers.google.com/mediapipe/solutions/genai/llm_inference/ios
- Google, LLM Inference guide for Android, March 30, 2024 (accessed), https://developers.google.com/mediapipe/solutions/genai/llm_inference/android
- David Spuler, Mar 30, 2024, Generative AI in C++: Coding Transformers and LLMs, Yoryck AI, https://www.amazon.com/Generative-AI-Coding-Transformers-LLMs-ebook/dp/B0CXJKCWX9/
- Revathi Gopalakrishnan, Avinash Venkateswarlu, 31 December 2018, Machine Learning for Mobile: Practical guide to building intelligent mobile applications powered by machine learning, Packt Publishing, https://www.amazon.com/dp/B07BJKV4B4/
- Mohit Thakkar, 20 February 2019, Beginning Machine Learning in iOS: CoreML Framework, Apress, https://www.amazon.com/dp/B07NYW5VBQ/
- Daniel Situnayake, 24 January 2023, AI at the Edge: Solving Real-World Problems with Embedded Machine Learning, O'Reilly Media, Inc, USA, https://www.amazon.com/dp/1098120205/
- Pete Warden, 3 January 2020, Tiny ML: Machine Learning with Tensorflow Lite on Arduino and Ultra-Low-Power Microcontrollers, O'Reilly Media, Inc, USA, https://www.amazon.com/Tinyml-Learning-Tensorflow-Ultra-Low-Power-Microcontrollers/dp/1492052043/
- Gian Marco Iodice, 1 April 2022, TinyML Cookbook: Combine artificial intelligence and ultra-low-power embedded devices to make the world smarter, Packt Publishing, https://www.amazon.com/TinyML-Cookbook-artificial-intelligence-ultra-low-power/dp/180181497X/
- Martin Mitrevski, 9 January 2018, Developing Conversational Interfaces for iOS: Add Responsive Voice Control to Your Apps, 1st Edition, Kindle Edition, Apress, https://www.amazon.com/dp/B078X74C37/
- Özgür Sahin, 3 December 2020, Develop Intelligent iOS Apps with Swift: Understand Texts, Classify Sentiments, and Autodetect Answers in Text Using NLP, Apress, https://www.amazon.com/dp/B08PP63KDQ/
- Sajid Ali, 15 May 2023, The integration of Artificial Intelligence (AI) into the iPhone, https://www.amazon.com/dp/B0C5GHTXGQ/
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- Google, 2024, Get started with the Gemini API in Android apps (client SDK). https://ai.google.dev/tutorials/get_started_android (Cloud-based use of the Gemini API for round-trip AI in Android apps.)
- Dave Burke, 06 December 2023, Google Blog, https://android-developers.googleblog.com/2023/12/a-new-foundation-for-ai-on-android.html (Gemini Nano for on-device inference on Android phones with Android AICore platform.)
- Android Developers, 2024, Android AICore, https://developer.android.com/ml/aicore (AI platform on Android using Gemini Nano.)
- Google for Developers Blog, 2024. Large Language Models On-Device with MediaPipe and TensorFlow Lite, March 07, 2024 https://developers.googleblog.com/2024/03/running-large-language-models-on-device-with-mediapipe-andtensorflow-lite.html
- LMDeploy Contributors, 2023, LMDeploy: A Toolkit for Compressing, Deploying, and Serving LLM, Apache 2.0 License, Code: https://github.com/InternLM/lmdeploy
- Jinliang Yuan, Chen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, Mengwei Xu, 12 Mar 2024, Mobile Foundation Model as Firmware (v4), https://arxiv.org/abs/2308.14363 (Runs a 10B LLM named "M4" based on Llama and Llama-2 on a Google Pixel 7 Pro, including use of 4-bit, 8-bit and 16-bit quantized versions of the M4 model.)
- Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, Yunxin Liu, 9 Mar 2024, AutoDroid: LLM-powered Task Automation in Android (v4), https://arxiv.org/abs/2308.15272 Code: https://autodroid-sys.github.io/ (Integrates both on-device Vicuna and cloud-based GPT-4/GPT-3.5 into an Android phone app called AutoDroid.)
- Wangsong Yin, Mengwei Xu, Yuanchun Li, Xuanzhe Liu, 18 Mar 2024, LLM as a System Service on Mobile Devices, https://arxiv.org/abs/2403.11805 (On-device inference for LLMs, including a stateful on-device AI service LLMaaS, including Llama2 7B and OPT-7B with INT8 quantization, based on improved KV caching on mobile, with pipelining, recomputation and chunk-level KV cache memory management for running on phones.)
- James Bentley, January 25, 2024, Apple's new 'boost' to generative AI flags a very different approach to its competitors — on-device AI support could set the iPhone 16 apart, iMore, https://www.imore.com/iphone/apples-new-boost-to-generative-ai-flags-a-very-different-approach-to-its-competitors-on-device-ai-support-could-set-the-iphone-16-apart
- Chris Velazco, February 21, 2024, Phones are getting packed with AI features. But how helpful are they? https://www.washingtonpost.com/technology/2024/02/21/ai-phones-google-samsung-iphone/
- Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar, 4 Jan 2024, LLM in a flash: Efficient Large Language Model Inference with Limited Memory, https://arxiv.org/abs/2312.11514
- P Dong, L Lu, C Wu, C Lyu, G Yuan, H Tang, Y Wang, 2023, PackQViT: Faster Sub-8-bit Vision Transformers via Full and Packed Quantization on the Mobile, https://openreview.net/pdf?id=N56hAiQvot Code: https://github.com/PeiyanFlying/PackQViT
- Bingkun Lai, Jinbo Wen, Jiawen Kang, Hongyang Du, Jiangtian Nie, Changyan Yi, Dong In Kim, Shengli Xie, 19 Dec 2023, Resource-efficient Generative Mobile Edge Networks in 6G Era: Fundamentals, Framework and Case Study, https://arxiv.org/abs/2312.12063
- Fucheng Jia, Shiqi Jiang, Ting Cao, Wei Cui, Tianrui Xia, Xu Cao, Yuanchun Li, Deyu Zhang, Ju Ren, Yunxin Liu, Lili Qiu, Mao Yang, Sep 2023, Accelerating In-Browser Deep Learning Inference on Diverse Edge Clients through Just-in-Time Kernel Optimizations, https://arxiv.org/pdf/2309.08978.pdf
- T. Zhu, L. Kuang, K. Li, J. Zeng, P. Herrero, and P. Georgiou, “Blood glucose prediction in type 1 diabetes using deep learning on the edge,” Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2021, pp. 1–5. https://discovery.ucl.ac.uk/id/eprint/10140471/ PDF: https://discovery.ucl.ac.uk/id/eprint/10140471/1/Blood%20glucose%20prediction%20in%20type%201%20diabetes%20using%20deep%20learning%20on%20the%20edge.pdf
- Shikhar Tuli, Niraj K. Jha, EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms, arXiv preprint arXiv:2303.13745, 2023, https://arxiv.org/abs/2303.13745
- Castrillo, J., Valle, R., Baumela, L. (2024). Efficiency Evaluation of Mobile Vision Transformers. In: Rocha, Á., Ferrás, C., Hochstetter Diez, J., Diéguez Rebolledo, M. (eds) Information Technology and Systems. ICITS 2024. Lecture Notes in Networks and Systems, vol 933. Springer, Cham. https://doi.org/10.1007/978-3-031-54256-5_1 https://link.springer.com/chapter/10.1007/978-3-031-54256-5_1 Code: https://github.com/pcr-upm/icits24_landmarks (Vision transformers on mobile architectures.)
- Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan, 26 Feb 2024, MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT, https://arxiv.org/abs/2402.16840 Code: https://github.com/mbzuai-oryx/MobiLlama
- Jinliang Yuan, Chen Yang, Dongqi Cai, Shihe Wang, Xin Yuan, Zeling Zhang, Xiang Li, Dingge Zhang, Hanzi Mei, Xianqing Jia, Shangguang Wang, Mengwei Xu, 2024, Mobile Foundation Model as Firmware, ACM MobiCom’24, September 30–October 4, 2024, Washington D.C., DC, USA https://xumengwei.github.io/files/MobiCom24-MobileFM.pdf (The use of an LLM foundation model as an underlying OS service on mobile devices.)
- Venkatraman Iyer, Sungho Lee, Semun Lee, Juitem Joonwoo Kim, Hyunjun Kim, Youngjae Shin, 12 December 2023, Automated Backend Allocation for Multi-Model, On-Device AI Inference, Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 7, Issue 3, Article No.: 62, pp 1–33, https://doi.org/10.1145/3626793 https://dl.acm.org/doi/abs/10.1145/3626793
- Jeff Butts, Feb 16th, 2023, What Is the Apple Neural Engine and What Does It Do? https://www.macobserver.com/tips/deep-dive/what-is-apple-neural-engine/
- Maciek Jędrzejczyk, December 11, 2023, Using LLMs locally on iPad or iPhone, https://www.linkedin.com/pulse/using-llms-locally-ipad-iphone-maciek-j%C4%99drzejczyk-cd0zf/ (Running LLMs such as Mistral 7B with 4-bit quantization on Apple iPad or iPhone using Apple Testflight and LLMFarm.)
- Apple, June 2022, Deploying Transformers on the Apple Neural Engine, Apple Machine Learning Research, https://machinelearning.apple.com/research/neural-engine-transformers Code: https://github.com/apple/ml-ane-transformers (Apple's open-sourced implementation of a Transformer on ANE for Apple devices using PyTorch.)
- Mark Wilson, January 08, 2024, Apple's AI upgrades for your iPhone are reportedly on track for 2024 – here's what to expect, https://www.techradar.com/phones/iphone/apples-ai-upgrades-for-your-iphone-are-reportedly-on-track-for-2024-heres-what-to-expect
- Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior Sequence Transformer for E-commerce Recommendation in Alibaba. arXiv:1905.06874 [cs.IR] https://arxiv.org/abs/1905.06874
- Junchen Zhao, Yurun Song, Simeng Liu, Ian G. Harris, Sangeetha Abdu Jyothi, Dec 2023, LinguaLinked: A Distributed Large Language Model Inference System for Mobile Devices, https://arxiv.org/abs/2312.00388 (System for running LLMs across multiple distributed mobile devices.)
- MLC team. 2023. MLC-LLM. https://github.com/mlc-ai/mlc-llm
- Luyuan Wang, Yongyu Deng, Yiwei Zha, Guodong Mao, Qinmin Wang, Tianchen Min, Wei Chen, Shoufa Chen, 12 Jun 2024, MobileAgentBench: An Efficient and User-Friendly Benchmark for Mobile LLM Agents, https://arxiv.org/abs/2406.08184 Project: https://mobileagentbench.github.io/
- Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen, 12 Jun 2024 (v2), PowerInfer-2: Fast Large Language Model Inference on a Smartphone, https://arxiv.org/abs/2406.06282 Project: https://powerinfer.ai/v2/ Code: https://github.com/SJTU-IPADS/PowerInfer (Runs 47B models on phones using neuron cluster approach to matrix multiplication on NPUs and dynamic activation sparsity, with different approaches for prefill versus decoding phases.)
- Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q. V., and Adam, H. (2019). Searching for mobilenetv3. CoRR, abs/1905.02244. URL: http://arxiv.org/abs/1905.02244
- Z Zhang, J Li, 2023, A Review of Artificial Intelligence in Embedded Systems, PDF: https://www.mdpi.com/2072-666X/14/5/897/pdf
- Z Li, M Paolieri, L Golubchik, 2023, Predicting Inference Latency of Neural Architectures on Mobile Devices, PDF: https://dl.acm.org/doi/pdf/10.1145/3578244.3583735
- Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
- David Spuler, March 2024, Chapter 3. AI Phones, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- P Busia, 2023 Optimizing Neural Networks for Embedded Edge-Processing Platforms. https://iris.unica.it/bitstream/11584/357302/2/tesi_di_dottorato_PaolaBusia.pdf
- Kah Phooi Seng, Li-Minn Ang, 2022, "Embedded Intelligence: State-of-the-Art and Research Challenges", IEEE Access, vol.10, pp.59236-59258, 2022. https://ieeexplore.ieee.org/document/9775683 PDF: https://research.usc.edu.au/esploro/outputs/99640278002621
- Shashank Nag, Gourav Datta, Souvik Kundu, Nitin Chandrachoodan, Peter A. Beerel Feb 2023, ViTA: A Vision Transformer Inference Accelerator for Edge Applications, https://arxiv.org/abs/2302.09108
- Sandeep Budki, March 20, 2024, Samsung Galaxy S24 Ultra Review: Committed and Spicing up Relationship with Customers, https://www.themobileindian.com/reviews/samsung-galaxy-s24-ultra-review-committed-and-spicing-up-relationship-with-customers
- Jason Perlow, June 13, 2024, The expensive reason why Apple's upcoming AI features aren't coming to your older iPhone, https://www.zdnet.com/article/the-expensive-reason-why-apples-upcoming-ai-features-arent-coming-to-your-older-iphone/
- Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. arXiv preprint arXiv:1911.11907, 2019, https://arxiv.org/abs/1911.11907
- Katie Collins, March 6, 2024, On-Device AI Is a Whole New Way of Experiencing Artificial Intelligence, https://www.cnet.com/tech/mobile/on-device-ai-is-a-whole-new-way-of-experiencing-artificial-intelligence/
- Xiang Li, Zhenyan Lu, Dongqi Cai, Xiao Ma, Mengwei Xu, 11 June 2024, Large Language Models on Mobile Devices: Measurements, Analysis, and Insights, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 1 - 6, https://doi.org/10.1145/3662006.3662059, https://dl.acm.org/doi/abs/10.1145/3662006.3662059, PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662059
- Daliang Xu, Hao Zhang, Liming Yang, Ruiqi Liu, Mengwei Xu, and Xuanzhe Liu, 11 June 2024, WiP: Efficient LLM Prefilling with Mobile NPU, EdgeFM '24: Proceedings of the Workshop on Edge and Mobile Foundation Models, June 2024, Pages 33 - 35, https://doi.org/10.1145/3662006.3662066 https://dl.acm.org/doi/abs/10.1145/3662006.3662066 PDF: https://dl.acm.org/doi/pdf/10.1145/3662006.3662066 (Faster NPU prefill via chunked prefilling using sequences of tokens, along with INT8 NPU quantization that is aware of outliers and offloads FP32 calculations from NPU back to CPU.)
- Kamila Wojciechowska July 2nd, 2024, Exclusive: This is Google AI, and it's coming to the Pixel 9, https://www.androidauthority.com/google-ai-recall-pixel-9-3456399/
- Dan Peng, Zhihui Fu, Jun Wang, 1 Jul 2024, PocketLLM: Enabling On-Device Fine-Tuning for Personalized LLMs, https://arxiv.org/abs/2407.01031 (Running fine-tuning on a smartphone via a low-memory optimization using a "derivative-free" "zeroth-order" technique called MeZo, with advantages such as privacy.)
- Ying He, Jingcheng Fang, F. Richard Yu, Victor C. Leung, 2024, Large Language Models (LLMs) Inference Offloading and Resource Allocation in Cloud-Edge Computing: An Active Inference Approach, PrePrints pp. 1-12, DOI: 10.1109/TMC.2024.3415661, https://www.computer.org/csdl/journal/tm/5555/01/10591707/1YraFlDdKYo
- Daihang Chen, Yonghui Liu, Mingyi Zhou, Yanjie Zhao, Haoyu Wang, Shuai Wang, Xiao Chen, Tegawendé F. Bissyandé, Jacques Klein, Li Li, 9 Jul 2024, LLM for Mobile: An Initial Roadmap, https://arxiv.org/abs/2407.06573
- Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 27 Jun 2024 (v2), MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, Meta Research, https://arxiv.org/abs/2402.14905 Code: https://github.com/facebookresearch/MobileLLM
- Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie, 5 Jul 2024 (v3), Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs, https://arxiv.org/abs/2403.20041
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Arjun Kharpal, July 25, 2024, Samsung hints at new products as it bets on AI to drive upgrades to its latest foldable phones, https://www.cnbc.com/2024/07/26/samsung-tm-roh-interview-galaxy-ai-mixed-reality-and-foldables.html
- Allison Johnson, Aug 1, 2024,, A first look at Apple Intelligence and its (slightly) smarter Siri, The Verge, https://www.theverge.com/2024/7/31/24209910/apple-intelligence-ios-18-preview-siri
- Y. Zhang, J. Zhang, S. Yue, W. Lu, J. Ren, X. Shen, August 2024, "Mobile Generative AI: Opportunities and Challenges," in IEEE Wireless Communications, vol. 31, no. 4, pp. 58-64, doi: 10.1109/MWC.006.2300576, https://ieeexplore.ieee.org/abstract/document/10628027/
- Kif Leswing, Aug 14 2024 Google’s live demo of Gemini ramps up pressure on Apple as AI reaches smartphone users, https://www.cnbc.com/2024/08/14/google-live-gemini-demo-lifts-pressure-on-apple-as-ai-hits-smartphones.html
- Arjun Kharpal, Thu, Jul 11 2024, Samsung to launch upgraded voice assistant Bixby this year with its own AI, https://www.cnbc.com/2024/07/11/samsung-to-launch-upgraded-bixby-this-year-with-its-own-ai.html
- Jennifer Elias, Aug 13 2024, Google launches first AI-powered Android update and new Pixel 9 phones, https://www.cnbc.com/2024/08/13/google-pixel-9-phones-first-ai-powered-android-update-announced.html
- Arvind Narayanan and Sayash Kapoor, Aug 19, 2024, AI companies are pivoting from creating gods to building products. Good. Turning models into products runs into five challenges, https://www.aisnakeoil.com/p/ai-companies-are-pivoting-from-creating
- David Gewirtz, June 18, 2024, 6 reasons why iOS 18 makes the iPhone 16 a must-upgrade for me, https://www.zdnet.com/article/6-reasons-why-ios-18-makes-the-iphone-16-a-must-upgrade-for-me/
- Jason Perlow, Aug. 27, 2024, Why you shouldn't buy the iPhone 16 for Apple Intelligence, https://www.zdnet.com/article/why-you-shouldnt-buy-the-iphone-16-for-apple-intelligence/
- Joe McKendrick, Aug. 29, 2024, What the mobile wave can teach us about the AI tsunami, https://www.zdnet.com/article/what-the-mobile-wave-can-teach-us-about-the-ai-tsunami/
- Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez, 25 Aug 2024, MobileQuant: Mobile-friendly Quantization for On-device Language Models, https://arxiv.org/abs/2408.13933 https://github.com/saic-fi/MobileQuant
- Alvaro Cintas, Aug 27, 2024, How to run Phi-3.5 in your phone, https://university.therundown.ai/c/daily-tutorials/how-to-run-phi-3-5-in-your-phone-4d5d917a-09b0-40c0-a0b4-fb63d9a65d9c
- Apple, Sep 2024, Apple Intelligence comes to iPhone, iPad, and Mac starting next month, https://www.apple.com/newsroom/2024/09/apple-intelligence-comes-to-iphone-ipad-and-mac-starting-next-month/
- Steve Kovach, Sep 5 2024, AI gadgets have been a bust so far. Apple aims to change that, https://www.cnbc.com/2024/09/05/ai-gadgets-have-been-a-bust-so-far-apple-aims-to-change-that.html
- Qinzhuo Wu, Weikai Xu, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Shuo Shang, 23 Sep 2024, MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding, https://arxiv.org/abs/2409.14818
- Clare Duffy, September 30, 2024, The iPhone 16 isn’t selling as well as Apple may have hoped, https://edition.cnn.com/2024/09/30/tech/iphone-16-presales-apple-intelligence/index.html
- Kif Leswing, Fri, Oct 4 2024, As Apple enters AI race, iPhone maker turns to its army of developers for an edge, https://www.cnbc.com/2024/10/04/apple-is-turning-to-its-army-of-developers-for-an-edge-in-the-ai-race.html
- Michael Nuñez, October 16, 2024, Mistral AI’s new language models bring AI power to your phone and laptop, https://venturebeat.com/business/mistral-ai-new-language-models-bring-ai-power-to-your-phone-and-laptop/
- Tuowei Wang, Ruwen Fan, Huang, Zixu Hao, Kun Li, Ting Cao, Youyou Lu, Yaoxue Zhang, Ju Ren, 29 Oct 2024 (v2), Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management, https://arxiv.org/abs/2410.19274
- Carl Franzen, October 31, 2024, Meta makes its MobileLLM open for researchers, posting full weights, https://venturebeat.com/ai/meta-makes-its-mobilellm-open-for-researchers-posting-full-weights/
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Sarah Perez, November 25, 2024, Apple again snubs AI in its ‘iPhone App of the Year’ finalists, https://techcrunch.com/2024/11/25/apple-again-snubs-ai-in-its-iphone-app-of-the-year-finalists/
On-Device Inference
For more about on-device inference on PCs and phones, see on-device inference research.
More AI Research
Read more about:
- AI on PCs (Desktops and Laptops)
- Inference Optimizations
- Loop Optimizations
- Code Optimizations
- « Research Home