Aussie AI

Parameter-Efficient Fine-Tuning (PEFT)

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

Parameter-Efficient Fine-Tuning (PEFT) is fine-tuning that's efficient in the number of parameters it updates. Instead of updating all of the model's parameters, which is slow and expensive, only a small subset of parameters is updated. The rest of the model's parameters are "frozen" during the fine-tuning procedure.
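
In a framework like PyTorch, the "freezing" step is simply a matter of disabling gradient updates on most of the model's weights, so the optimizer only ever touches the small trainable subset. Here is a minimal sketch of the idea (the toy two-layer model is purely illustrative):

    import torch
    from torch import nn

    # Hypothetical toy model: a large "pre-trained" layer plus a small task head.
    model = nn.Sequential(
        nn.Linear(768, 768),   # stand-in for the big pre-trained backbone
        nn.Linear(768, 10),    # small task-specific part to fine-tune
    )

    # Freeze everything first...
    for p in model.parameters():
        p.requires_grad = False

    # ...then unfreeze only the small subset of parameters being fine-tuned.
    for p in model[1].parameters():
        p.requires_grad = True

    # The optimizer only sees the trainable (unfrozen) parameters.
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)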

Various types of PEFT have been examined, such as:

  • Low-Rank Adapters (LoRA)
  • Quantized Low-Rank Adapters (QLoRA)
  • Lengthwise PEFT — extending the vocabulary.

LoRA

The idea behind LoRA is to use "low-rank" matrices, which are much smaller in size, and thus much less costly to fine-tune. These low-rank matrices can be multiplied together to produce a weight update that can be combined with the original model's weights.
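
Concretely, for a frozen weight matrix W, LoRA trains two small matrices A and B whose rank r is much smaller than the dimensions of W, and adds the scaled product B·A to the original weights. Here is a rough PyTorch-style sketch of a LoRA linear layer (the initialization and scaling choices are illustrative, not any particular library's exact API):

    import torch
    from torch import nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a trainable low-rank update B @ A."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                # freeze the original weights
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x d_in
            self.B = nn.Parameter(torch.zeros(d_out, rank))        # d_out x r, zero init
            self.scale = alpha / rank

        def forward(self, x):
            # Original output plus the scaled low-rank correction.
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(768, 768), rank=8)
    y = layer(torch.randn(2, 768))   # only A and B receive gradients during training

Because B starts at zero, the adapter is initially a no-op, and the fine-tuned behavior is learned entirely inside the tiny A and B matrices, which can later be merged into the base weights for inference.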

Multi-LoRA

The use of multiple LoRA adapters got a boost when Apple chose this method for its Apple Intelligence platform. Several other platforms use multi-LoRA as an efficiency gain for both training and inference.
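
The core idea is that a single frozen base model is shared across many small task- or tenant-specific adapters, with the appropriate adapter selected per request. Below is a simplified sketch of that dispatch pattern (the task names and selection logic are hypothetical):

    import torch
    from torch import nn

    base = nn.Linear(768, 768)                 # shared, frozen base layer
    for p in base.parameters():
        p.requires_grad = False

    def make_adapter(rank=8, d=768):
        # One small (A, B) pair per task; tiny compared to the base weights.
        return (torch.randn(rank, d) * 0.01, torch.zeros(d, rank))

    adapters = {                               # hypothetical per-task adapters
        "summarize": make_adapter(),
        "translate": make_adapter(),
        "classify": make_adapter(),
    }

    def lora_forward(x, task):
        A, B = adapters[task]                  # pick the adapter for this request
        return base(x) + x @ A.T @ B.T         # same base weights for every task

    y = lora_forward(torch.randn(1, 768), task="translate")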

Research papers on multi-LoRA include:

LoRA Inference Optimizations

The popularity of LoRA as an efficient training method has also spawned research on maximizing its inference efficiency. The loading and unloading of LoRA adapters can be quite expensive, and methods of optimizing multi-LoRA serving platforms have been examined in various research papers.
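
One recurring optimization theme is to keep only a limited number of adapters resident in fast GPU memory and load or evict the rest on demand. Here is a toy sketch of that caching idea in plain Python (the LRU policy and loading function are purely illustrative, not taken from any of the systems below):

    from collections import OrderedDict

    class AdapterCache:
        """Keep only the most recently used LoRA adapters resident in memory."""
        def __init__(self, capacity=4):
            self.capacity = capacity
            self.cache = OrderedDict()          # adapter_id -> adapter weights

        def get(self, adapter_id, load_fn):
            if adapter_id in self.cache:
                self.cache.move_to_end(adapter_id)       # mark as recently used
            else:
                if len(self.cache) >= self.capacity:
                    self.cache.popitem(last=False)       # evict least recently used
                self.cache[adapter_id] = load_fn(adapter_id)  # the expensive load
            return self.cache[adapter_id]

    def load_adapter_weights(adapter_id):
        # Placeholder for an expensive load from disk or host memory into the GPU.
        return {"A": f"A weights for {adapter_id}", "B": f"B weights for {adapter_id}"}

    cache = AdapterCache(capacity=2)
    adapter = cache.get("customer-123", load_fn=load_adapter_weights)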

Research papers on LoRA and multi-LoRA inference optimization include:

  • Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy, 28 Oct 2023, Punica: Multi-Tenant LoRA Serving https://arxiv.org/abs/2310.18547 Code: https://github.com/punica-ai/punica
  • Jingwei Xu, Junyu Lai, Yunpeng Huang, 24 May 2024 (v2), MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models, https://arxiv.org/abs/2405.13053
  • Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag, 28 Mar 2024, CLoRA: A Contrastive Approach to Compose Multiple LoRA Models, https://arxiv.org/abs/2403.19776
  • Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang, 20 Jan 2024, CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference, https://arxiv.org/abs/2401.11240 (Multi-LoRA inference where it starts running prefill computations in the CPU while loading the LoRA weights into the GPU.)
  • Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, Yunxin Liu, 28 May 2024, LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design, https://arxiv.org/abs/2405.17741
  • Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, 5 Jun 2024 (v3), S-LoRA: Serving Thousands of Concurrent LoRA Adapters, https://arxiv.org/abs/2311.03285 Code: https://github.com/S-LoRA/S-LoRA
  • Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
  • Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, Xin Jin, 2024, dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving, https://www.usenix.org/conference/osdi24/presentation/wu-bingyang
  • Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
  • Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao, 12 Aug 2024 (v3), A Survey on LoRA of Large Language Models, https://arxiv.org/abs/2407.11046 Code: https://github.com/ZJU-LLMs/Awesome-LoRAs.git
  • Yuxuan Zhang, Ruizhe Li, 2 Oct 2024, DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models, https://arxiv.org/abs/2410.01497 https://github.com/MeCuping/DLP-LoRA (Merging multiple LoRA adapters for parallel inference.)
  • Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu, 1 Nov 2024, V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM, https://arxiv.org/abs/2411.00915
  • Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, Josep Torrellas, 24 Nov 2024, Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments, https://arxiv.org/abs/2411.17741

QLoRA

QLoRA is quantized LoRA. This is pretty standard nowadays, with most LoRA adapters using quantization. A lot of the research papers don't use the term "QLoRA" anymore. For example, Apple Intelligence uses QLoRA in its multi-LoRA architecture with 4-bit quantization.
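
The essence of QLoRA is that the frozen base weights are stored in a low-bit format and dequantized on the fly, while the small LoRA matrices remain in higher precision and are the only trainable part. Here is a rough sketch of that structure (using a toy int8 rounding scheme rather than a real 4-bit NF4 quantizer):

    import torch
    from torch import nn

    class QuantizedLoRALinear(nn.Module):
        """Frozen quantized base weights plus a higher-precision trainable LoRA update."""
        def __init__(self, weight: torch.Tensor, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            # Quantize the frozen base weights (toy int8 scheme, not real 4-bit NF4).
            scale = weight.abs().max() / 127.0
            self.register_buffer("w_q", torch.round(weight / scale).to(torch.int8))
            self.register_buffer("w_scale", scale)
            d_out, d_in = weight.shape
            self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # trainable
            self.B = nn.Parameter(torch.zeros(d_out, rank))         # trainable
            self.lora_scale = alpha / rank

        def forward(self, x):
            w = self.w_q.to(x.dtype) * self.w_scale       # dequantize on the fly
            return x @ w.T + (x @ self.A.T @ self.B.T) * self.lora_scale

    layer = QuantizedLoRALinear(torch.randn(768, 768))
    y = layer(torch.randn(2, 768))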

Research papers on QLoRA include:

Prompt Tuning (Extended Vocabulary PEFT)

Prompt tuning is a lengthwise PEFT method that creates new tokens to extend the vocabulary, rather than training the parameters of existing tokens. Since these tokens are new, they have no pre-trained values, and obviously cannot be frozen. The weights for the original tokens, which make up the vast majority of the model, are frozen, however. As an example, this type of PEFT can be useful when extending the LLM via fine-tuning on a specially curated data set, so as to have particular "trigger tokens" that launch integrated tools or perform other advanced capabilities. For example, new tokens that indicate a "tool launch" can be added to the vocabulary, with fine-tuning applied only to those tokens.
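
In practice this usually means learning a small number of new embedding vectors that are prepended to the input sequence, while the original embedding table and the rest of the model stay frozen. Here is a simplified PyTorch-style sketch of that idea (the dimensions and module names are illustrative):

    import torch
    from torch import nn

    class SoftPromptModel(nn.Module):
        """Prepend trainable prompt embeddings; keep the rest of the model frozen."""
        def __init__(self, embed: nn.Embedding, encoder: nn.Module, num_prompt_tokens: int = 10):
            super().__init__()
            self.embed, self.encoder = embed, encoder
            for p in list(embed.parameters()) + list(encoder.parameters()):
                p.requires_grad = False            # the original model is frozen
            d = embed.embedding_dim
            # The only trainable parameters: embeddings for the new "virtual" tokens.
            self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, d) * 0.01)

        def forward(self, token_ids):              # token_ids: (batch, seq_len)
            x = self.embed(token_ids)              # (batch, seq_len, d)
            prompt = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
            return self.encoder(torch.cat([prompt, x], dim=1))

    encoder = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
    model = SoftPromptModel(nn.Embedding(32000, 768), encoder)
    out = model(torch.randint(0, 32000, (2, 16)))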

  • IBM, 2024, What is prompt-tuning?, https://research.ibm.com/blog/what-is-ai-prompt-tuning
  • Abhinav Jain, Swarat Chaudhuri, Thomas Reps, Chris Jermaine, 24 May 2024, Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation, https://arxiv.org/abs/2405.15282
  • MohammadAli SadraeiJavaeri, Ehsaneddin Asgari, Alice Carolyn McHardy, Hamid Reza Rabiee, 7 Jun 2024, SuperPos-Prompt: Enhancing Soft Prompt Tuning of Language Models with Superposition of Multi Token Embeddings, https://arxiv.org/abs/2406.05279
  • Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella, 5 Jun 2024, Choice of PEFT Technique in Continual Learning: Prompt Tuning is Not All You Need, https://arxiv.org/abs/2406.03216
  • Xuyang Wu, Zhiyuan Peng, Sravanthi Rajanala, Hsin-Tai Wu, Yi Fang, 31 May 2024, Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models, https://arxiv.org/abs/2405.20654
  • Wei Zhu, Aaron Xuxiang Tian, Congrui Yin, Yuan Ni, Xiaoling Wang, Guotong Xie, 7 Jun 2024 (v2), IAPT: Instruction-Aware Prompt Tuning for Large Language Models, https://arxiv.org/abs/2405.18203
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418

Research Papers on PEFT

PEFT is a popular technique that receives a lot of research attention:

More AI Research

Read more about: