Aussie AI
AI PC Research
-
Last Updated 11 December, 2024
-
by David Spuler, Ph.D.
AI models and applications are set to make PCs hot again (see also GenAI market research). The next generation of PCs will likely run some AI models natively, and there will also be hybrid architectures where AI workloads are sent to the cloud. It is early days for this trend, but it's surely going to be a major technology driver for years.
Our main research interest in relation to "AI PCs" is the optimization of inference algorithms, so that models can run fast enough. This includes execution of AI inference on CPU-only PCs and on PCs with low-end GPUs.
Fast LLMs on Your PC or Laptop
A desktop PC or laptop is more capable than a phone, so some of the issues with running AI inference on phones are less problematic on a PC. Most obviously, a PC can have a decent GPU, which AI engines can use. Concerns about CPU usage, over-heating, and battery depletion are also less pressing on a PC.
The first generation is likely to be "AI Developer PCs". Software developers typically have high-end PCs, and various AI models can already run on desktop machines. However, execution speed is still sluggish for large models, even on multi-thousand-dollar PCs with powerful GPUs, so there is much research still to be done on inference optimization. Large models are where the action is in terms of AI functionality, so software developers may well keep using cloud-based AI for some time to come. And certainly, training and fine-tuning workloads seem less likely to move down onto desktop PCs.
But "AI PCs" are already in the works for everyday users. For end-user applications, the model still has to run fast enough to give the user a decent response time, so there are significant obstacles before AI models become widespread on non-developer PCs. However, hybrid architectures, where some AI execution is offloaded to the cloud, will likely hide many of the limitations of native AI execution.
Fast AI PC Techniques
What optimization techniques will be needed to run an AI model natively on a GPU-less or low-end GPU system? This remains to be seen, since the state-of-the-art is not there yet.
One likely answer: multiple techniques. It's probably going to be a combination of multiple orthogonal inference optimization techniques. Models will need to be both smaller and faster.
To make the models smaller, some of the techniques for "model compression" include:
- Quantization (see the sketch after this list)
- Pruning
- Knowledge distillation
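As a concrete illustration of the first of these, here is a minimal sketch of 8-bit symmetric weight quantization in C++. The struct layout and per-tensor scaling scheme are illustrative assumptions, not any particular engine's format:

    // Minimal sketch of 8-bit symmetric weight quantization (one form of model compression):
    // store weights as int8 plus a per-tensor scale, reconstructing w ~= q * scale.
    #include <cstdint>
    #include <cmath>
    #include <vector>
    #include <algorithm>

    struct QuantizedTensor { std::vector<int8_t> q; float scale; };

    QuantizedTensor quantize_int8(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float v : w) maxabs = std::max(maxabs, std::fabs(v));
        QuantizedTensor t;
        t.scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;  // map [-maxabs, maxabs] onto [-127, 127]
        t.q.reserve(w.size());
        for (float v : w)
            t.q.push_back((int8_t)std::lround(v / t.scale)); // round to nearest int8
        return t;
    }

    float dequantize(const QuantizedTensor& t, size_t i) {
        return t.q[i] * t.scale;  // approximate original weight
    }

The memory win is the point: each weight shrinks from 4 bytes to 1, and integer arithmetic on the quantized values is typically faster on CPUs than floating-point.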
To make the inference algorithms run faster, there are various alternative strategies vying for attention in the research:
- Faster Transformer architectures
- Multi-axis dynamic pruning (e.g. combining depth pruning, width pruning, length pruning, etc.)
- Dynamic inference optimizations (e.g. loop optimizations, early-exit)
- Integer-only arithmetic models (e.g. integer-only quantization, approximation methods)
- Zero-multiplication algorithms (e.g. adder models, shift models, log models; see the shift-based sketch after this list)
- Faster attention algorithms (e.g. Flash attention, non-autoregression, and/or head pruning)
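To make the zero-multiplication idea concrete, below is a minimal C++ sketch of a "shift model" dot product, where weights are quantized to signed powers of two so that each multiplication becomes an integer bit-shift. The encoding (including using INT8_MIN to mark a zero weight) is an illustrative assumption, not a specific paper's scheme:

    // Sketch: dot product where weights are stored as signed power-of-2 exponents,
    // so each multiply becomes an integer bit-shift (a "shift model").
    #include <cstdint>
    #include <cmath>
    #include <vector>

    struct Pow2Weight { int8_t sign; int8_t exponent; };

    // Quantize a float weight to the nearest power of two.
    Pow2Weight quantize_pow2(float w) {
        Pow2Weight q;
        q.sign = (w < 0.0f) ? -1 : 1;
        float mag = std::fabs(w);
        q.exponent = (mag > 0.0f) ? (int8_t)std::lround(std::log2(mag))
                                  : INT8_MIN;  // INT8_MIN marks a zero weight
        return q;
    }

    // Integer dot product: shifts replace multiplications.
    int64_t shift_dot(const std::vector<int32_t>& x, const std::vector<Pow2Weight>& w) {
        int64_t acc = 0;
        for (size_t i = 0; i < x.size(); ++i) {
            if (w[i].exponent == INT8_MIN) continue;       // zero weight contributes nothing
            int64_t term = (w[i].exponent >= 0)
                ? ((int64_t)x[i] << w[i].exponent)         // multiply by 2^e
                : ((int64_t)x[i] >> -w[i].exponent);       // divide by 2^-e
            acc += (w[i].sign < 0) ? -term : term;
        }
        return acc;
    }

In a real engine the activations would also be quantized to integers; the point here is simply that the inner loop contains no multiply instructions at all.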
And orthogonal to these higher-level AI software methods, there will need to be underlying capabilities including:
- Hardware acceleration support (i.e. hardware-aware software optimizations; a vectorization sketch appears after this list)
- Deep learning compiler optimizations
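As an example of what "hardware-aware" means at the lowest level, here is a sketch of an AVX2/FMA vectorized dot product, the kind of kernel that underpins CPU-only inference (compare the SLIDE paper's use of AVX-512 below). This is a generic illustration, not any particular engine's kernel; compile with -mavx2 -mfma:

    // Hardware-aware optimization sketch: an AVX2/FMA vectorized dot product,
    // processing 8 floats per instruction instead of one.
    #include <immintrin.h>
    #include <cstddef>

    float dot_avx2(const float* a, const float* b, size_t n) {
        __m256 acc = _mm256_setzero_ps();
        size_t i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb (fused multiply-add)
        }
        float buf[8];
        _mm256_storeu_ps(buf, acc);
        float sum = buf[0] + buf[1] + buf[2] + buf[3]
                  + buf[4] + buf[5] + buf[6] + buf[7];
        for (; i < n; ++i) sum += a[i] * b[i];   // scalar tail for leftover elements
        return sum;
    }

A deep learning compiler would generate code like this automatically, fused across whole layers; writing it by hand is what "hardware-aware software optimization" looks like in practice.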
And floating above all that are some top-level performance considerations:
- Hybrid multi-AI synchronization methods (i.e., ensemble methods, big-little, swarm/multi-mini-model, etc.; a big-little sketch follows this list)
- AI-aware heuristic methods
- Use-case-specific optimizations (e.g. document summarization versus search versus chatbot question-and-answer)
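A hypothetical sketch of the "big-little" pattern mentioned above: try a small on-device model first, and escalate to a larger cloud model only when the small one isn't confident. The Reply struct, the callback types, and the 0.8 threshold are all illustrative assumptions:

    // Big-little hybrid sketch: a fast local model answers when confident;
    // otherwise the query is escalated to a bigger cloud-hosted model.
    #include <string>
    #include <functional>

    struct Reply { std::string text; float confidence; };

    Reply answer_hybrid(const std::string& prompt,
                        const std::function<Reply(const std::string&)>& little_local,
                        const std::function<Reply(const std::string&)>& big_cloud,
                        float threshold = 0.8f) {
        Reply r = little_local(prompt);          // cheap on-device attempt first
        if (r.confidence >= threshold) return r; // good enough: no network round-trip
        return big_cloud(prompt);                // fall back to the larger model
    }

The design choice is latency versus quality: easy queries never leave the PC, while hard ones pay the cloud round-trip for a better answer.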
Putting all of that together looks like some kind of fun. Nobody's there yet. It's far from clear which is the best combination of techniques.
Articles and Announcements for AI PCs
Various PR and press articles have started pushing "AI PCs" as a new segment.
- Michael Kan, July 2023, Intel CEO: Get Ready for the 'AI PC', PCMag UK, https://uk.pcmag.com/laptops/147984/intel-ceo-get-ready-for-the-ai-pc
- Intel, May 23, 2023, AI Coming to the PC at Scale, https://www.intel.com/content/www/us/en/newsroom/news/ai-coming-to-pc-at-scale.html
- Simon Sharwood, May 2023, Intel says AI is overwhelming CPUs, GPUs, even clouds – so all Meteor Lakes get a VPU, The Register, https://www.theregister.com/2023/05/29/vpus_all_meteork_lake_skus/
- David Meyer, August 31, 2023, A.I. and big market shifts are making PCs interesting again, Fortune, https://fortune.com/2023/08/30/ai-pc-idc-demand-growth-windows-10/
- Julie Coleman, May 30, 2023, HP Inc. CEO says A.I. will enable a new kind of PC, which could release in 2024, Mad Money with Jim Cramer, CNBC, https://www.cnbc.com/2023/05/30/hp-inc-ceo-ai-will-enable-a-new-kind-of-pc-could-launch-in-2024.html
- Mark Hachman, Jan 9th, 2023, Intel and AMD are building AI into PCs. It doesn’t matter yet—but it will, PC World, https://www.pcworld.com/article/1447856/ai-pcs-should-be-the-trend-that-begins-in-2023.html
- Mark Hachman, Sep 8th, 2022, Intel’s futuristic Meteor Lake CPUs will focus on ‘core AI capabilities’, PC World, https://www.pcworld.com/article/1076150/intel-confirms-ai-improvements-will-come-in-meteor-lake.html
- Jesse Clayton, May 23, 2023, NVIDIA and Microsoft Drive Innovation for Windows PCs in New Era of Generative AI, NVIDIA Blog, https://blogs.nvidia.com/blog/2023/05/23/microsoft-build-nvidia-ai-windows-rtx/
- Simon Sharwood, Sep 2023, Desktop AI isn’t happening, says AMD, and might not for quite a while, The Register, https://www.theregister.com/2023/09/19/amd_desktop_ai_futures/
- Darren Allan, Sep 27, 2023, If you wanted an Intel Meteor Lake CPU for your next desktop PC, we’ve got some bad news, TechRadar, https://www.msn.com/en-us/news/technology/if-you-wanted-an-intel-meteor-lake-cpu-for-your-next-desktop-pc-we-ve-got-some-bad-news/ar-AA1hkJMQ
- IDC, 28 Aug 2023, Global PC Shipments Expected to Return to Growth in 2024 Albeit Below 2019 Pre-Pandemic Levels, According to IDC, https://www.idc.com/getdoc.jsp?containerId=prUS51184723
- Gartner, July 11, 2023, Gartner Says Worldwide PC Shipments Declined 16.6% in Second Quarter of 2023, https://www.gartner.com/en/newsroom/press-releases/2023-07-11-gartner-says-worldwide--pc-shipments-declined-16-percent-in-second-quarter-of-2023
- Christian Guyton, John Loeffler, October 20, 2022, Intel Core i9-13900K review: the most powerful consumer processor ever, TechRadar, https://www.techradar.com/reviews/intel-core-i9-13900k (Intel Raptor Lake CPUs.)
- Anton Shilov, April 11, 2021, New Algorithm Makes CPUs 15 Times Faster Than GPUs in Some AI Work, Tom's Hardware, https://www.tomshardware.com/news/cpu-vs-gpu-ai-performance-uplift-with-optimizations
- Shabnam Daghaghi, Nicholas Meisburger, Mengnan Zhao, Yong Wu, Sameh Gobriel, Charlie Tai, Anshumali Shrivastava, Mar 2021, Accelerating SLIDE Deep Learning on Modern CPUs: Vectorization, Quantizations, Memory Optimizations, and More, https://arxiv.org/abs/2103.10891, Code: https://github.com/RUSH-LAB/SLIDE (Fast training on CPUs using AVX-512 and locality-sensitive hashing of vectors.)
- PyTorch Edge Team, October 17, 2023, PyTorch Edge: Enabling On-Device Inference Across Mobile and Edge Devices with ExecuTorch, https://pytorch.org/blog/pytorch-edge/
- Andy Patrizio, 12 Apr 2024, The desktop processor market is suddenly hot again, https://www.computerworld.com/article/2086948/desktop-processor-market-suddenly-hot-again.html
- David Linthicum, Jan 16, 2024, Do you need GPUs for generative AI systems? InfoWorld, https://www.infoworld.com/article/3712134/do-you-need-gpus-for-generative-ai-systems.html
Research on PC Execution of LLMs
Desktop PCs are considered to be "edge" platforms in the AI literature (along with phones and IoT devices). Research papers specifically on PC execution of AI models:
- Huma Abidi, Chandan Damannagari, "AI inference acceleration on CPUs", Intel/VentureBeat, December 9, 2021, https://venturebeat.com/ai/ai-inference-acceleration-on-cpus/.
- Simon Willison, "Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp", TILs, March 2023, https://github.com/simonw/til/blob/main/llms/llama-7b-m2.md
- Umang Sharan, Running Llama on M2 Macbook, March 15, 2023, https://www.umangsh.com/blog/running-llama-on-m2-macbook/
- Katyanna Quach, "Small custom AI models are cheap to train and can keep data private, says startup", The Register, 22 June 2023, https://www.theregister.com/2023/06/22/small_custom_ai_models/
- Julien Simon, "Smaller is better: Q8-Chat, an efficient generative AI experience on Xeon", May 16th 2023, https://huggingface.co/blog/generative-ai-models-on-intel-cpu
- Chellammal Surianarayanan, John Jeyasekaran Lawrence, Pethuru Raj Chelliah, Edmond Prakash, Chaminda Hewage, "A Survey on Optimization Techniques for Edge Artificial Intelligence (AI)", Sensors, vol. 23, no. 3, article 1279, January 2023, https://www.mdpi.com/1424-8220/23/3/1279
- Jarred Walton, "How to Run a ChatGPT Alternative on Your Local PC", March 19th, 2023, Tom's Hardware, https://www.tomshardware.com/news/running-your-own-chatbot-on-a-single-gpu
- V. Vanhoucke, A. Senior, and M. Z. Mao, Improving the speed of neural networks on CPUs, In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, volume 1, page 4, 2011, https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.308.2766
- Dave Dice, Alex Kogan, Optimizing Inference Performance of Transformers on CPUs, Feb 2021, https://arxiv.org/abs/2102.06621
- Carole-Jean Wu, David Brooks, Kevin Chen, Douglas Chen, Sy Choudhury, Marat Dukhan, Kim M. Hazelwood, Eldad Isaac, Yangqing Jia, Bill Jia, Tommer Leyvand, Hao Lu, Yang Lu, Lin Qiao, Brandon Reagen, Joe Spisak, Fei Sun, Andrew Tulloch, Peter Vajda, Xiaodong Wang, Yanghan Wang, Bram Wasti, Yiming Wu, Ran Xian, Sungjoo Yoo, and Peizhao Zhang. Machine Learning at Facebook: Understanding Inference at the Edge. In IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 331–344, 2019, https://research.facebook.com/publications/machine-learning-at-facebook-understanding-inference-at-the-edge/
- Morgan Funtowicz, Scaling up BERT-like model Inference on modern CPU - Part 1, April 2021, https://huggingface.co/blog/bert-cpu-scaling-part-1
- Shufan Wu, Tao Lv, Pengxin Yuan, Patric Zhao, Jason Ye, and Haibin Lin, Optimization for BERT Inference Performance on CPU, Sep 2019, https://medium.com/apache-mxnet/optimization-for-bert-inference-performance-on-cpu-3bb2413d376c
- Emma Ning, Nathan Yan, Jeffrey Zhu, and Jason Li. Microsoft open sources breakthrough optimizations for transformer inference on GPU and CPU, Jan 2020, https://cloudblogs.microsoft.com/opensource/2020/01/21/microsoft-onnx-open-source-optimizations-transformer-inference-gpu-cpu/
- Jiarui Fang, Yang Yu, Chengduo Zhao, and Jie Zhou, Turbotransformers: An efficient GPU serving system for transformer models, CoRR, abs/2010.05680, 2020, https://arxiv.org/abs/2010.05680
- Y. Wang, Q. Wang, and X. Chu, Energy-efficient Inference Service of Transformer-based Deep Learning Models on GPUs, In IEEE Conferences on Green Computing and Communications (GreenCom), pages 323–331, 2020, https://ieeexplore.ieee.org/document/9291633
- Yizhi Liu, Yao Wang, Ruofei Yu, Mu Li, Vin Sharma, and Yida Wang, Optimizing CNN Model Inference on CPUs, In Proc. of USENIX Annual Technical Conference (ATC), pages 1025–1040, 2019, https://arxiv.org/abs/1809.02697
- Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, Wei Lu, Sep 2023, TinyLlama, Code: https://github.com/jzhang38/TinyLlama (Apache-licensed 1.1B "tiny" Llama model trained on 3T tokens.)
- Lightning AI, 2023, Lit-GPT, https://github.com/Lightning-AI/lit-gpt (Apache licensed model for low-capacity requirements.)
- Md. Maruf Hossain Shuvo, Syed Kamrul Islam, Jianlin Cheng, Bashir I. Morshed, "Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review", Proceedings of the IEEE, vol.111, no.1, pp.42-91, 2023. https://ieeexplore.ieee.org/document/9985008, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9985008
- E Kristiani, CT Yang, KLP Nguyen, 2020, Optimization of deep learning inference on edge devices, 2020 International Conference on Pervasive Artificial Intelligence, https://ieeexplore.ieee.org/abstract/document/9302695
- Jonas Geiping, Tom Goldstein, Dec 2022, Cramming: Training a Language Model on a Single GPU in One Day, https://arxiv.org/abs/2212.14034 Code: https://github.com/JonasGeiping/cramming (Note: uses Pytorch nvFuser deep learning compiler, which seems to be deprecated now.)
- Benj Edwards, March 14, 2023, You can now run a GPT-3-level AI model on your laptop, phone, and Raspberry Pi, Ars Technica, https://arstechnica.com/information-technology/2023/03/you-can-now-run-a-gpt-3-level-ai-model-on-your-laptop-phone-and-raspberry-pi/
- Benj Edwards, Sep 28, 2023, Jony Ive and OpenAI’s Altman reportedly collaborating on mysterious AI device, Ars Technica, https://arstechnica.com/information-technology/2023/09/jony-ive-and-openais-altman-reportedly-collaborating-on-mysterious-ai-device/
- Oleksandr Kuvshynov, Oct 2023, Slow LLama, Code: https://github.com/okuvshynov/slowllama ("Fine-tune Llama2 and CodeLLama models, including 70B/35B on Apple M1/M2 devices")
- Benjamin Marie, Sep 29, 2023, Run Llama 2 70B on Your GPU with ExLlamaV2, Towards Data Science, https://towardsdatascience.com/run-llama-2-70b-on-your-gpu-with-exllamav2-588141a88598
- Computer World, 29 May 2024, In two years, 100% of enterprise PC purchases will be AI computers, https://www.computerworld.com/article/2130275/in-two-years-100-of-enterprise-pc-purchases-will-be-ai-computers.html
- Dell Technologies, May 20, 2024, Dell Technologies Expands Dell AI Factory with NVIDIA to Turbocharge AI Adoption, PR Newswire, https://www.prnewswire.com/news-releases/dell-technologies-expands-dell-ai-factory-with-nvidia-to-turbocharge-ai-adoption-302150245.html
- Djip007, May 2024, llamafile 0.8.6 CPU benchmark #450, https://github.com/Mozilla-Ocho/llamafile/discussions/450 (Running llamafile at 20 tokens per second on a non-GPU commodity CPU.)
- Ken Yeung, May 21, 2024, Microsoft introduces Phi-Silica, a 3.3B parameter model made for Copilot+ PC NPUs, https://venturebeat.com/ai/microsoft-introduces-phi-silica-a-3-3b-parameter-model-made-for-copilot-pc-npus/
- J Cañete, F Bravo-Marquez, 2024, Speedy Gonzales: A Collection of Fast Task-Specific Models for Spanish, https://felipebravom.com/publications/starsem2024.pdf (Optimizing small models on CPU and GPU for the Spanish language, mostly using distillation.)
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Martin Thissen, April 20, 2024, Llama 3 on Your Local Computer | Free GPT-4 Alternative, https://medium.com/@martin-thissen/llama-3-on-your-local-computer-free-gpt-4-alternative-1f533e9abff7 (Llama3-70B with 4-bit quantization using vLLM for inference on NVIDIA RTX 6000 Ada GPU.)
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Intel, April 2024, Intel® Compiler First to Achieve SYCL* 2020 Conformance, https://www.intel.com/content/www/us/en/developer/articles/technical/compiler-first-full-sycl2020-conformance.html
- Kif Leswing, April 9, 2024, Intel unveils latest AI chip as Nvidia competition heats up, CNBC, https://www.cnbc.com/2024/04/09/intel-unveils-gaudi-3-ai-chip-as-nvidia-competition-heats-up-.html (Intel Gaudi 3 chip for high-end datacenter usage, competing with NVIDIA H100.)
- Siddhant Sahu, May 30, 2024, Beyond the Cloud: Distributed AI and On-Device Intelligence: Transition of AI workflows from cloud to the edge with specialized chip infrastructure & models, multi-modality and ambience across devices, https://sidstage.substack.com/p/beyond-the-cloud-distributed-ai-and
- Qualcomm, May 2023, The future of AI is hybrid, Qualcomm White Paper, https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/Whitepaper-The-future-of-AI-is-hybrid-Part-1-Unlocking-the-generative-AI-future-with-on-device-and-hybrid-AI.pdf
- David Spuler, Mar 30, 2024, Generative AI in C++: Coding Transformers and LLMs, Yoryck AI, https://www.amazon.com/Generative-AI-Coding-Transformers-LLMs-ebook/dp/B0CXJKCWX9/
- Jaskirat Singh, Bram Adams, Ahmed E. Hassan, 25 Mar 2024, On the Impact of Black-box Deployment Strategies for Edge AI on Latency and Model Performance, https://arxiv.org/abs/2403.17154 (MLOps deployment for quantization, partitioning and early-exit across mobile, edge, and cloud platforms, including running early exit on mobile.)
- Sergio De Simone, Apple Extends Core ML, Create ML, and Vision Frameworks for iOS 17, JUL 03, 2023, https://www.infoq.com/news/2023/07/coreml-createml-vision-ios-17/
- Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen, Dec 2023, PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU, https://arxiv.org/abs/2312.12456 Code: https://github.com/SJTU-IPADS/PowerInfer (A GPU-CPU hybrid engine with the most "active" neurons run on the GPU and the less "hot" neurons on the CPU, which is akin to adaptive inference on the width dimension.)
- Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng, Dec 2023, Efficient LLM Inference on CPUs, Intel, NeurIPS 2023, https://arxiv.org/abs/2311.00502 Code: https://github.com/intel/intel-extension-for-transformers
- Tom Warren, April 9, 2024, Microsoft is confident Windows on Arm could finally beat Apple, The Verge, https://www.theverge.com/2024/4/8/24116587/microsoft-macbook-air-surface-arm-qualcomm-snapdragon-x-elite
- Steve Dent, Thu, Mar 28, 2024, Microsoft Copilot AI will soon run locally on PCs, https://www.engadget.com/microsoft-copilot-ai-will-soon-run-locally-on-pcs-130642514.html
- AMD AI Staff, How to run a Large Language Model (LLM) on your AMD Ryzen™ AI PC or Radeon Graphics Card, March 2024, AMD Blog, https://community.amd.com/t5/ai/how-to-run-a-large-language-model-llm-on-your-amd-ryzen-ai-pc-or/ba-p/670709
- Ramine Roane, 6 Dec, 2023, Enabling AI PCs with Ryzen AI Software, AMD Blog, https://community.amd.com/t5/ai/enabling-ai-pcs-with-ryzen-ai-software/ba-p/648665
- Lucas Mearian, 21 Mar 2024, Microsoft integrates its Copilot chatbot on new devices https://www.computerworld.com/article/2071480/microsoft-integrates-its-copilot-chatbot-across-entire-product-line.html (New Surface laptops with support for ChatGPT-based Copilot.)
- Sharon Machlis, March 28, 2024, 5 easy ways to run an LLM locally, InfoWorld, https://www.infoworld.com/article/3705035/5-easy-ways-to-run-an-llm-locally.html
- Venkatraman Iyer, Sungho Lee, Semun Lee, Juitem Joonwoo Kim, Hyunjun Kim, Youngjae Shin, 12 December 2023, Automated Backend Allocation for Multi-Model, On-Device AI Inference, Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 7, Issue 3, Article No.: 62, pp 1–33, https://doi.org/10.1145/3626793 https://dl.acm.org/doi/abs/10.1145/3626793
- Jeff Butts, Feb 16th, 2023, What Is the Apple Neural Engine and What Does It Do? https://www.macobserver.com/tips/deep-dive/what-is-apple-neural-engine/
- Semaphore, Dec 14, 2023, 6 Ways to Run LLMs Locally, https://semaphoreci.medium.com/6-ways-to-run-llms-locally-fa25be0797e5 (The six ways are HF Transformers, LangChain, Llama.cpp, Llamafile, Ollama, and GPT4All.)
- Benj Edwards, 2/22/2024, Google goes “open AI” with Gemma, a free, open-weights chatbot family, Gemma chatbots can run locally, and they reportedly outperform Meta's Llama 2. Ars Technica, https://arstechnica.com/information-technology/2024/02/google-goes-open-ai-with-gemma-a-free-open-weights-chatbot-family/
- Murray Kornelsen, April 2023, Low-Latency BERT Inference for Heterogeneous Multi-Processor Edge Devices, Department of Electrical & Computer Engineering, McGill University, Canada https://escholarship.mcgill.ca/downloads/m326m732p
- Dell is refreshing its popular XPS laptop line with all the AI features (and they still look good). https://www.zdnet.com/article/dell-is-refreshing-its-popular-xps-laptop-line-with-all-the-ai-features-and-they-still-look-good/
- Gavin Li, Nov 19, 2023, Unbelievable! Run 70B LLM Inference on a Single 4GB GPU with This NEW Technique, AI Advances https://ai.gopubby.com/unbelievable-run-70b-llm-inference-on-a-single-4gb-gpu-with-this-new-technique-93e2057c7eeb
- Paul Thurrott, October 5, 2023, HP: AI Will Transform the PC Into a Personal Companion, https://www.thurrott.com/hardware/290462/hp-ai-will-transform-the-pc-into-a-personal-companion
- Jesse Clayton, Kedar Potdar and Annamalai Chockalingam, Jun 02, 2024, Streamline Development of AI-Powered Apps with NVIDIA RTX AI Toolkit for Windows RTX PCs, NVIDIA Technical Blog, https://developer.nvidia.com/blog/streamline-ai-powered-app-development-with-nvidia-rtx-ai-toolkit-for-windows-rtx-pcs/
- MWU Rahman, MM Abrar, HG Copening, S Hariri, Oct 2023, Quantized Transformer Language Model Implementations on Edge Devices, https://arxiv.org/pdf/2310.03971.pdf (Uses a "FlatBuffer" format on TensorFlow-Lite.)
- H Dai, X Peng, X Shi, L He, Q Xiong, H Jin, 2022, Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment, Science China Information Sciences volume 65, Article number: 112103 (2022), https://link.springer.com/article/10.1007/s11432-020-3182-1 http://scis.scichina.com/en/2022/112103.pdf
- Jie Wu, Yufeng Zhu, Lei Shen, Xuqing Lu, 14 Jun 2024, GEB-1.3B: Open Lightweight Large Language Model, https://arxiv.org/abs/2406.09900 Code: https://huggingface.co/GEB-AGI/geb-1.3b
- David Spuler, March 2024, Chapter 4. AI on Your Desktop, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Intel, Apr 25, 2024, Deployment of Llama3 on Your AI PC with OpenVINO™, https://medium.com/openvino-toolkit/deployment-of-llama3-on-your-ai-pc-with-openvino-b58e961501d6
- Matthew Finnegan, 14 Jun 2024, Microsoft delays Recall launch amid privacy concerns, ComputerWorld, https://www.computerworld.com/article/2147736/microsoft-delays-recall-launch-amid-privacy-concerns.html
- Steve Kovach, June 19 2024, Microsoft AI PCs take aim at Apple: CNBC’s Steve Kovach reports on news from Microsoft, CNBC, https://www.cnbc.com/video/2024/06/18/microsoft-ai-pcs-aim-at-apple.html
- Aniket Hingane, May 23, 2024, A New AI Era in PC Begins : AI Agent Computers, https://ai.plainenglish.io/a-new-ai-era-in-pc-begins-ai-agent-computers-d6210a8f1b48
- Esther Shein Jul 9 2024, Anticipating the Year of the AI PC, https://cacm.acm.org/news/anticipating-the-year-of-the-ai-pc/
- Dmitriy Pastushenkov, Ria Cheruvu, Max Domeika, Paula Ramos, Apr 20, 2024, AI is coming to the PC — AI PC Essentials, https://medium.com/openvino-toolkit/ai-is-coming-to-the-pc-ai-pc-essentials-ba2aa8686a59
- Jason Perlow, Aug. 6, 2024, How to run dozens of AI models on your Mac or PC - no third-party cloud needed, https://www.zdnet.com/article/how-to-run-dozens-of-ai-models-on-your-mac-or-pc-no-third-party-cloud-needed/
- Gavin Li, August 3rd, 2024, Crazy Challenge: Run Llama 405B on a 8GB VRAM GPU, https://ai.gopubby.com/crazy-challenge-run-llama-405b-on-a-8gb-vram-gpu-ab5a280a3889 (Run Llama's 405B model on a low-end GPU via 4-bit quantization and layer-by-layer inference, both to save memory.)
- Vince Lam, Mar 12, 2024, 50+ Open-Source Options for Running LLMs Locally, https://medium.com/thedeephub/50-open-source-options-for-running-llms-locally-db1ec6f5a54f
- Sujeet Kumar, May 20, 2024, 14 Best Software for Running local LLM, https://scifilogic.com/interface-for-running-local-llm/
- Sean Hollister, Sep 4, 2024, Intel reveals first Lunar Lake laptop CPUs: everything you need to know, https://www.theverge.com/2024/9/3/24233957/intel-lunar-lake-core-ultra-200v-launch
- Michael Nuñez, September 13, 2024, Microsoft’s Windows Agent Arena: Teaching AI assistants to navigate your PC, https://venturebeat.com/ai/microsofts-windows-agent-arena-teaching-ai-assistants-to-navigate-your-pc/
- Steve Kovach, Sep 5 2024, AI gadgets have been a bust so far. Apple aims to change that, https://www.cnbc.com/2024/09/05/ai-gadgets-have-been-a-bust-so-far-apple-aims-to-change-that.html
- Amos Gyamfi, Aug 28, 2024, The 6 Best LLM Tools To Run Models Locally, https://medium.com/@amosgyamfi/the-6-best-llm-tools-to-run-models-locally-eedd0f7c2bbd
- Michael Nuñez, October 16, 2024, Mistral AI’s new language models bring AI power to your phone and laptop, https://venturebeat.com/business/mistral-ai-new-language-models-bring-ai-power-to-your-phone-and-laptop/
- OpenVINO™ toolkit, Oct 1, 2024, How to run Llama 3.2 locally with OpenVINO™, https://medium.com/openvino-toolkit/how-to-run-llama-3-2-locally-with-openvino-60a0f3674549
- Lucas Mearian, 24 Oct 2024, 2025: The year of the AI PC, Computer World, https://www.computerworld.com/article/3583355/2025-the-year-of-the-ai-pc.html
- Fali Wang, Zhiwei Zhang, Xianren Zhang, Zongyu Wu, Tzuhao Mo, Qiuhao Lu, Wanjing Wang, Rui Li, Junjie Xu, Xianfeng Tang, Qi He, Yao Ma, Ming Huang, Suhang Wang, 4 Nov 2024, A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness, https://arxiv.org/abs/2411.03350
- Chris Wellons, November 10, 2024, Everything I've learned so far about running local LLMs, https://nullprogram.com/blog/2024/11/10/
- Justine, Apr 2023, Edge AI Just Got Faster, https://justine.lol/mmap/ (Loading models using mmap.)
- Emilia David, November 14, 2024, OpenAI launches ChatGPT desktop integrations, rivaling Copilot, https://venturebeat.com/ai/openai-launches-chatgpt-desktop-integrations-rivaling-copilot/
- Simon Willison, Dec 2024, I can now run a GPT-4 class model on my laptop. Meta’s new Llama 3.3 70B is a genuinely GPT-4 class Large Language Model that runs on my laptop. https://simonwillison.net/2024/Dec/9/llama-33-70b/
On-Device Inference
For more about on-device inference on PCs and phones, see on-device inference research.
More AI Research
Read more about:
- GenAI market research
- AI on Phones
- Inference Optimizations
- Loop Optimizations
- Code Optimizations
- « Research Home