Aussie AI

Parameter-Efficient Fine-Tuning (PEFT)

  • Last Updated 7 December, 2024
  • by David Spuler, Ph.D.

Parameter-Efficient Fine-Tuning (PEFT) is fine-tuning that's efficient in the number of parameters it updates. Instead of updating all of the model's parameters, which is slow and expensive, only a small subset of parameters is updated. The rest of the model's parameters are "frozen" during the fine-tuning procedure.
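
In a framework like PyTorch, the "freezing" step is simply a matter of disabling gradient updates on most of the model's weights, so the optimizer only ever touches the small trainable subset. Here is a minimal sketch of the idea (the toy two-layer model is purely illustrative):

    import torch
    from torch import nn

    # Hypothetical toy model: a large "pre-trained" layer plus a small task head.
    model = nn.Sequential(
        nn.Linear(768, 768),   # stand-in for the big pre-trained backbone
        nn.Linear(768, 10),    # small task-specific part to fine-tune
    )

    # Freeze everything first...
    for p in model.parameters():
        p.requires_grad = False

    # ...then unfreeze only the small subset of parameters being fine-tuned.
    for p in model[1].parameters():
        p.requires_grad = True

    # The optimizer only sees the trainable (unfrozen) parameters.
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4)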

Various types of PEFT have been examined, such as:

  • Low-Rank Adapters (LoRA)
  • Quantized Low-Rank Adapters (QLoRA)
  • Lengthwise PEFT — extending the vocabulary.

LoRA

The idea behind LoRA is to use "low-rank" matrices, which are much smaller in size, and thus much less costly to fine-tune. These low-rank matrices can be multiplied together to produce a weight update that can be combined with the original model's weights.
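
Concretely, for a frozen weight matrix W, LoRA trains two small matrices A and B whose rank r is much smaller than the dimensions of W, and adds the scaled product B·A to the original weights. Here is a rough PyTorch-style sketch of a LoRA linear layer (the initialization and scaling choices are illustrative, not any particular library's exact API):

    import torch
    from torch import nn

    class LoRALinear(nn.Module):
        """A frozen linear layer plus a trainable low-rank update B @ A."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False                # freeze the original weights
            d_out, d_in = base.weight.shape
            self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # r x d_in
            self.B = nn.Parameter(torch.zeros(d_out, rank))        # d_out x r, zero init
            self.scale = alpha / rank

        def forward(self, x):
            # Original output plus the scaled low-rank correction.
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    layer = LoRALinear(nn.Linear(768, 768), rank=8)
    y = layer(torch.randn(2, 768))   # only A and B receive gradients during training

Because B starts at zero, the adapter is initially a no-op, and the fine-tuned behavior is learned entirely inside the tiny A and B matrices, which can later be merged into the base weights for inference.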

Multi-LoRA

The use of multiple LoRA adapters got a boost when Apple chose this method for its Apple Intelligence platform. Several other platforms use multi-LoRA as an efficiency gain for both training and inference.
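
The core idea is that a single frozen base model is shared across many small task- or tenant-specific adapters, with the appropriate adapter selected per request. Below is a simplified sketch of that dispatch pattern (the task names and selection logic are hypothetical):

    import torch
    from torch import nn

    base = nn.Linear(768, 768)                 # shared, frozen base layer
    for p in base.parameters():
        p.requires_grad = False

    def make_adapter(rank=8, d=768):
        # One small (A, B) pair per task; tiny compared to the base weights.
        return (torch.randn(rank, d) * 0.01, torch.zeros(d, rank))

    adapters = {                               # hypothetical per-task adapters
        "summarize": make_adapter(),
        "translate": make_adapter(),
        "classify": make_adapter(),
    }

    def lora_forward(x, task):
        A, B = adapters[task]                  # pick the adapter for this request
        return base(x) + x @ A.T @ B.T         # same base weights for every task

    y = lora_forward(torch.randn(1, 768), task="translate")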

Research papers on multi-LoRA include:

LoRA Inference Optimizations

The popularity of LoRA as an efficient training method has also spawned research on maximizing its inference efficiency. The loading and unloading of LoRA adapters can be quite expensive, and methods of optimizing multi-LoRA serving platforms have been examined in various research papers.
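
One recurring optimization theme is to keep only a limited number of adapters resident in fast GPU memory and load or evict the rest on demand. Here is a toy sketch of that caching idea in plain Python (the LRU policy and loading function are purely illustrative, not taken from any of the systems below):

    from collections import OrderedDict

    class AdapterCache:
        """Keep only the most recently used LoRA adapters resident in memory."""
        def __init__(self, capacity=4):
            self.capacity = capacity
            self.cache = OrderedDict()          # adapter_id -> adapter weights

        def get(self, adapter_id, load_fn):
            if adapter_id in self.cache:
                self.cache.move_to_end(adapter_id)       # mark as recently used
            else:
                if len(self.cache) >= self.capacity:
                    self.cache.popitem(last=False)       # evict least recently used
                self.cache[adapter_id] = load_fn(adapter_id)  # the expensive load
            return self.cache[adapter_id]

    def load_adapter_weights(adapter_id):
        # Placeholder for an expensive load from disk or host memory into the GPU.
        return {"A": f"A weights for {adapter_id}", "B": f"B weights for {adapter_id}"}

    cache = AdapterCache(capacity=2)
    adapter = cache.get("customer-123", load_fn=load_adapter_weights)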

Research papers on LoRA and multi-LoRA inference optimization include:

  • Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy, 28 Oct 2023, Punica: Multi-Tenant LoRA Serving https://arxiv.org/abs/2310.18547 Code: https://github.com/punica-ai/punica
  • Jingwei Xu, Junyu Lai, Yunpeng Huang, 24 May 2024 (v2), MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models, https://arxiv.org/abs/2405.13053
  • Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag, 28 Mar 2024, CLoRA: A Contrastive Approach to Compose Multiple LoRA Models, https://arxiv.org/abs/2403.19776
  • Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang, 20 Jan 2024, CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference, https://arxiv.org/abs/2401.11240 (Multi-LoRA inference where it starts running prefill computations in the CPU while loading the LoRA weights into the GPU.)
  • Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, Yunxin Liu, 28 May 2024, LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design, https://arxiv.org/abs/2405.17741
  • Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, 5 Jun 2024 (v3), S-LoRA: Serving Thousands of Concurrent LoRA Adapters, https://arxiv.org/abs/2311.03285 Code: https://github.com/S-LoRA/S-LoRA
  • Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
  • Bingyang Wu, Ruidong Zhu, Zili Zhang, Peng Sun, Xuanzhe Liu, Xin Jin, 2024, dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving, https://www.usenix.org/conference/osdi24/presentation/wu-bingyang
  • Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
  • Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao, 12 Aug 2024 (v3), A Survey on LoRA of Large Language Models, https://arxiv.org/abs/2407.11046 Code: https://github.com/ZJU-LLMs/Awesome-LoRAs.git
  • Yuxuan Zhang, Ruizhe Li, 2 Oct 2024, DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models, https://arxiv.org/abs/2410.01497 https://github.com/MeCuping/DLP-LoRA (Merging multiple LoRA adapters for parallel inference.)
  • Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu, 1 Nov 2024, V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM, https://arxiv.org/abs/2411.00915
  • Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, Josep Torrellas, 24 Nov 2024, Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments, https://arxiv.org/abs/2411.17741

QLoRA

QLoRA is quantized LoRA. This is pretty standard nowadays, with most LoRA adapters using quantization. A lot of the research papers don't use the term "QLoRA" anymore. For example, Apple Intelligence uses QLoRA in its multi-LoRA architecture with 4-bit quantization.
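
The essence of QLoRA is that the frozen base weights are stored in a low-bit format and dequantized on the fly, while the small LoRA matrices remain in higher precision and are the only trainable part. Here is a rough sketch of that structure (using a toy int8 rounding scheme rather than a real 4-bit NF4 quantizer):

    import torch
    from torch import nn

    class QuantizedLoRALinear(nn.Module):
        """Frozen quantized base weights plus a higher-precision trainable LoRA update."""
        def __init__(self, weight: torch.Tensor, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            # Quantize the frozen base weights (toy int8 scheme, not real 4-bit NF4).
            scale = weight.abs().max() / 127.0
            self.register_buffer("w_q", torch.round(weight / scale).to(torch.int8))
            self.register_buffer("w_scale", scale)
            d_out, d_in = weight.shape
            self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # trainable
            self.B = nn.Parameter(torch.zeros(d_out, rank))         # trainable
            self.lora_scale = alpha / rank

        def forward(self, x):
            w = self.w_q.to(x.dtype) * self.w_scale       # dequantize on the fly
            return x @ w.T + (x @ self.A.T @ self.B.T) * self.lora_scale

    layer = QuantizedLoRALinear(torch.randn(768, 768))
    y = layer(torch.randn(2, 768))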

Research papers on QLoRA include:

Prompt Tuning (Extended Vocabulary PEFT)

Prompt tuning is a lengthwise PEFT method that creates new tokens to extend the vocabulary, rather than training the parameters of existing tokens. Since these tokens are new, they have no pre-trained values, and obviously cannot be frozen. The weights for the original tokens, which make up the vast majority of the model, are frozen, however. As an example, this type of PEFT can be useful when extending the LLM via fine-tuning on a specially curated data set, so as to have particular "trigger tokens" that launch integrated tools or perform other advanced capabilities. For example, new tokens that indicate a "tool launch" can be added to the vocabulary, with fine-tuning applied only to those tokens.
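
In practice this usually means learning a small number of new embedding vectors that are prepended to the input sequence, while the original embedding table and the rest of the model stay frozen. Here is a simplified PyTorch-style sketch of that idea (the dimensions and module names are illustrative):

    import torch
    from torch import nn

    class SoftPromptModel(nn.Module):
        """Prepend trainable prompt embeddings; keep the rest of the model frozen."""
        def __init__(self, embed: nn.Embedding, encoder: nn.Module, num_prompt_tokens: int = 10):
            super().__init__()
            self.embed, self.encoder = embed, encoder
            for p in list(embed.parameters()) + list(encoder.parameters()):
                p.requires_grad = False            # the original model is frozen
            d = embed.embedding_dim
            # The only trainable parameters: embeddings for the new "virtual" tokens.
            self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, d) * 0.01)

        def forward(self, token_ids):              # token_ids: (batch, seq_len)
            x = self.embed(token_ids)              # (batch, seq_len, d)
            prompt = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
            return self.encoder(torch.cat([prompt, x], dim=1))

    encoder = nn.TransformerEncoderLayer(d_model=768, nhead=8, batch_first=True)
    model = SoftPromptModel(nn.Embedding(32000, 768), encoder)
    out = model(torch.randint(0, 32000, (2, 16)))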

  • IBM, 2024, What is prompt-tuning?, https://research.ibm.com/blog/what-is-ai-prompt-tuning
  • Abhinav Jain, Swarat Chaudhuri, Thomas Reps, Chris Jermaine, 24 May 2024, Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation, https://arxiv.org/abs/2405.15282
  • MohammadAli SadraeiJavaeri, Ehsaneddin Asgari, Alice Carolyn McHardy, Hamid Reza Rabiee, 7 Jun 2024, SuperPos-Prompt: Enhancing Soft Prompt Tuning of Language Models with Superposition of Multi Token Embeddings, https://arxiv.org/abs/2406.05279
  • Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella, 5 Jun 2024, Choice of PEFT Technique in Continual Learning: Prompt Tuning is Not All You Need, https://arxiv.org/abs/2406.03216
  • Xuyang Wu, Zhiyuan Peng, Sravanthi Rajanala, Hsin-Tai Wu, Yi Fang, 31 May 2024, Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models, https://arxiv.org/abs/2405.20654
  • Wei Zhu, Aaron Xuxiang Tian, Congrui Yin, Yuan Ni, Xiaoling Wang, Guotong Xie, 7 Jun 2024 (v2), IAPT: Instruction-Aware Prompt Tuning for Large Language Models, https://arxiv.org/abs/2405.18203
  • Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
  • 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
  • M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418

Research Papers on PEFT

PEFT is a popular technique that receives a lot of research attention:

More AI Research

Read more about: