Aussie AI
Parameter-Efficient Fine-Tuning (PEFT)
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
Parameter-Efficient Fine-Tuning (PEFT) refers to fine-tuning methods that update only a small fraction of a model's parameters. Instead of updating all of the model's weights, which is slow and memory-hungry, only a small subset of parameters (or a small set of newly added parameters) is trained. The rest of the model's parameters are "frozen" during the fine-tuning procedure.
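As a minimal sketch of this idea (using PyTorch with a toy stand-in model, not any particular PEFT library), all of the base model's parameters are frozen and only a small added module is handed to the optimizer:

```python
import torch
import torch.nn as nn

# Frozen base model (a small stand-in for a large pretrained LLM).
base_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)
for p in base_model.parameters():
    p.requires_grad = False              # freeze every original parameter

# Small trainable add-on module (illustrative only).
adapter = nn.Linear(512, 512)

# The optimizer only ever sees the adapter's parameters.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in base_model.parameters())
print(f"Trainable fraction: {trainable / total:.2%}")
```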
Various types of PEFT have been examined, such as:
- Low-Rank Adapters (LoRA)
- Quantized Low-Rank Adapters (QLoRA)
- Lengthwise PEFT (prompt tuning), which extends the vocabulary with new trainable tokens.
LoRA
The idea behind LoRA is to train a pair of "low-rank" matrices, which are much smaller than the original weight matrices and thus far cheaper to fine-tune. The two low-rank matrices can be multiplied together to form a weight update that is added to the frozen weights of the original model.
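A rough sketch of this structure in PyTorch (the shapes and zero-initialization of one factor follow the original LoRA paper, but this is illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Sketch of a LoRA-augmented linear layer: y = x W^T + scale * x (B A)^T."""
    def __init__(self, in_features, out_features, rank=8, alpha=16):
        super().__init__()
        # Frozen base weight W (out x in).
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        # Trainable low-rank factors: A (rank x in), B (out x rank).
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.normal_(self.lora_A, std=0.02)   # B stays zero, so the delta starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        frozen = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T * self.scale
        return frozen + update
```

Research papers on LoRA include: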
- Justin Zhao, Timothy Wang, Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Alex Sherstinsky, Piero Molino, Travis Addair, Devvret Rishi, 29 Apr 2024, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report, https://arxiv.org/abs/2405.00732 Code: https://huggingface.co/predibase
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Lequn Chen, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/bitstream/handle/1773/51337/Chen_washington_0250E_26603.pdf?sequence=1&isAllowed=y
- Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella, 5 Jun 2024, Choice of PEFT Technique in Continual Learning: Prompt Tuning is Not All You Need, https://arxiv.org/abs/2406.03216
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- Rahul Chand, Yashoteja Prabhu, Pratyush Kumar, 20 Dec 2023, DSFormer: Effective Compression of Text-Transformers by Dense-Sparse Weight Factorization, https://arxiv.org/abs/2312.13211
- Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao, 4 Jan 2024, Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models https://arxiv.org/abs/2401.00625 (A general survey paper with coverage of many techniques including this one.)
- Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno, 8 Feb 2024, Accurate LoRA-Finetuning Quantization of LLMs via Information Retention, https://arxiv.org/abs/2402.05445 Code: https://github.com/htqin/ir-qlora
- Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim, Nov 2023, LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning, https://arxiv.org/abs/2311.12023
- Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song, Dec 2023, Compressed Context Memory For Online Language Model Interaction, https://arxiv.org/abs/2312.03414 Code: https://github.com/snu-mllab/context-memory
- S Guo, J Xu, LL Zhang, M Yang, Oct 2023, Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models, arXiv preprint arXiv:2310.05015, https://arxiv.org/pdf/2310.05015.pdf Code: https://github.com/microsoft/Moonlit/tree/main/Compresso
- Shivansh Kaushik, Aug 1, 2023, Efficient Model Fine-Tuning for LLMs: Understanding PEFT by Implementation, https://medium.com/@shivansh.kaushik/efficient-model-fine-tuning-for-llms-understanding-peft-by-implementation-fc4d5e985389
- Shashank Verma, Neal Vaidya, Vinh Nguyen, Wei Du, Scot Junkin and BoYang Hsueh, Jun 07, 2024, Seamlessly Deploying a Swarm of LoRA Adapters with NVIDIA NIM, NVIDIA Technical Blog, https://developer.nvidia.com/blog/seamlessly-deploying-a-swarm-of-lora-adapters-with-nvidia-nim/
- 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, https://arxiv.org/abs/2404.14294
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- vLLM, 2024, Using LoRA adapters, https://docs.vllm.ai/en/v0.4.2/models/lora.html
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 16 Oct 2021 (v2), LoRA: Low-Rank Adaptation of Large Language Models, Microsoft Research, https://arxiv.org/abs/2106.09685 https://github.com/microsoft/LoRA (The original LoRA paper from 2021.)
- Adarsh Shrivastav, Aug 17, 2023, LoRA Parameter-Efficient Tuning with Understanding of Self-Attention, https://medium.com/datadreamers/exploring-lora-unveiling-parameter-efficient-tuning-and-self-attention-mechanisms-in-depth-58e4c3b5ce30
- Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 23 May 2023, QLoRA: Efficient Finetuning of Quantized LLMs, https://arxiv.org/abs/2305.14314 Code: https://github.com/artidoro/qlora Code: https://github.com/TimDettmers/bitsandbytes (The original QLoRA paper.)
- Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang, 13 Apr 2024 (v4), EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models, https://arxiv.org/abs/2310.03270 Code: https://github.com/ThisisBillhe/EfficientDM
- Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, 8 Mar 2024 (v3), LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, https://arxiv.org/abs/2309.12307 Code: https://github.com/dvlab-research/LongLoRA
- Pranav Patel, 2024, In-depth guide to fine-tuning LLMs with LoRA and QLoRA, https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
- Jingwei Xu, Junyu Lai, Yunpeng Huang, 24 May 2024 (v2), MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models, https://arxiv.org/abs/2405.13053
- Lamini, June 2024, Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations, https://www.lamini.ai/blog/lamini-memory-tuning PDF: https://github.com/lamini-ai/Lamini-Memory-Tuning/blob/main/research-paper.pdf (Deploy models with many LoRA adapters, selecting between them with MoE.)
- Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang, 2 Jul 2024, SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules, https://arxiv.org/abs/2407.02031 (Efficient diffusion models in systems with multi-LoRA, ControlNets, and other multi-module add-ons, including parallelizing execution of add-ons and more efficient loading of LoRA with faster updating or "patching" of model weights, including by performing some layers in parallel without LoRA weights, while loading the LoRA adapters.)
- Tianyi Tang, Yiwen Hu, Bingqian Li, Wenyang Luo, Zijing Qin, Haoxiang Sun, Jiapeng Wang, Shiyi Xu, Xiaoxue Cheng, Geyang Guo, Han Peng, Bowen Zheng, Yiru Tang, Yingqian Min, Yushuo Chen, Jie Chen, Yuanqian Zhao, Luran Ding, Yuhao Wang, Zican Dong, Chunxuan Xia, Junyi Li, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen, 8 Jul 2024, LLMBox: A Comprehensive Library for Large Language Models, https://arxiv.org/abs/2407.05563 Code: https://github.com/RUCAIBox/LLMBox
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li, 24 Jul 2024, Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance, https://arxiv.org/abs/2407.17029 Code: https://github.com/xiaocaigou/qbaraqahira (Combining quantization and LoRA.)
- Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy, 28 Oct 2023, Punica: Multi-Tenant LoRA Serving https://arxiv.org/abs/2310.18547 Code: https://github.com/punica-ai/punica
- Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag, 28 Mar 2024, CLoRA: A Contrastive Approach to Compose Multiple LoRA Models, https://arxiv.org/abs/2403.19776
- Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang, 20 Jan 2024, CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference, https://arxiv.org/abs/2401.11240 (Multi-LoRA inference where it starts running prefill computations in the CPU while loading the LoRA weights into the GPU.)
- Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, Yunxin Liu, 28 May 2024, LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design, https://arxiv.org/abs/2405.17741
- Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, 5 Jun 2024 (v3), S-LoRA: Serving Thousands of Concurrent LoRA Adapters, https://arxiv.org/abs/2311.03285 Code: https://github.com/S-LoRA/S-LoRA
- Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
- Bingyang Wu, Ruidong Zhu, and Zili Zhang, Peng Sun, Shanghai AI Lab; Xuanzhe Liu, Xin Jin, 2024, dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving, https://www.usenix.org/conference/osdi24/presentation/wu-bingyang
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Shikhar Bajpai, Sep 27, 2024, Shrinking Elephants: A Funny Guide to 4-bit and 8-bit Quantization for LLMs with LoRA, https://medium.com/@shikharstruck/shrinking-elephants-a-funny-guide-to-4-bit-and-8-bit-quantization-for-llms-with-lora-ddf9f1a62070
- Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Multi-LoRA
The use of multiple LoRA adapters on top of a single shared base model got a boost when Apple chose this method for its Apple Intelligence platform. Several other platforms also use multi-LoRA as an efficiency gain for both training and inference.
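One way to picture multi-LoRA is a single shared, frozen weight matrix plus a dictionary of small adapters, one per task or tenant, selected per request. A toy PyTorch sketch (the task names and sizes are made up for illustration):

```python
import torch
import torch.nn as nn

class MultiLoRALinear(nn.Module):
    """Sketch: one frozen base weight shared by several per-task LoRA adapters."""
    def __init__(self, dim, rank=8, tasks=("summarize", "translate")):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.02, requires_grad=False)
        self.adapters = nn.ModuleDict({
            task: nn.Sequential(
                nn.Linear(dim, rank, bias=False),   # the "A" projection
                nn.Linear(rank, dim, bias=False),   # the "B" projection
            )
            for task in tasks
        })
        for adapter in self.adapters.values():
            nn.init.zeros_(adapter[1].weight)       # start each adapter with a zero delta

    def forward(self, x, task):
        return x @ self.weight.T + self.adapters[task](x)

layer = MultiLoRALinear(dim=512)
y = layer(torch.randn(4, 512), task="translate")    # route a request to one adapter
```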
Research papers on multi-LoRA include:
- Justin Zhao, Timothy Wang, Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Alex Sherstinsky, Piero Molino, Travis Addair, Devvret Rishi, 29 Apr 2024, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report, https://arxiv.org/abs/2405.00732 Code: https://huggingface.co/predibase
- Jingwei Xu, Junyu Lai, Yunpeng Huang, 24 May 2024 (v2), MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models, https://arxiv.org/abs/2405.13053
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Lamini, June 2024, Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations, https://www.lamini.ai/blog/lamini-memory-tuning PDF: https://github.com/lamini-ai/Lamini-Memory-Tuning/blob/main/research-paper.pdf (Deploy models with many LoRA adapters, selecting between them with MoE.)
- Suyi Li, Lingyun Yang, Xiaoxiao Jiang, Hanfeng Lu, Zhipeng Di, Weiyi Lu, Jiawei Chen, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang, 2 Jul 2024, SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules, https://arxiv.org/abs/2407.02031 (Efficient diffusion models in systems with multi-LoRA, ControlNets, and other multi-module add-ons, including parallelizing execution of add-ons and more efficient loading of LoRA with faster updating or "patching" of model weights, including by performing some layers in parallel without LoRA weights, while loading the LoRA adapters.)
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy, 28 Oct 2023, Punica: Multi-Tenant LoRA Serving https://arxiv.org/abs/2310.18547 Code: https://github.com/punica-ai/punica
- Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag, 28 Mar 2024, CLoRA: A Contrastive Approach to Compose Multiple LoRA Models, https://arxiv.org/abs/2403.19776
- Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang, 20 Jan 2024, CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference, https://arxiv.org/abs/2401.11240 (Multi-LoRA inference where it starts running prefill computations in the CPU while loading the LoRA weights into the GPU.)
- Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, Yunxin Liu, 28 May 2024, LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design, https://arxiv.org/abs/2405.17741
- Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, 5 Jun 2024 (v3), S-LoRA: Serving Thousands of Concurrent LoRA Adapters, https://arxiv.org/abs/2311.03285 Code: https://github.com/S-LoRA/S-LoRA
- Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
- Bingyang Wu, Ruidong Zhu, and Zili Zhang, Peng Sun, Shanghai AI Lab; Xuanzhe Liu, Xin Jin, 2024, dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving, https://www.usenix.org/conference/osdi24/presentation/wu-bingyang
- David Spuler, 25th August, 2024, Hot Inference Optimization Techniques, https://www.aussieai.com/blog/hot-inference-research
- Yao Lu, Song Bian, Lequn Chen, Yongjun He, Yulong Hui, Matthew Lentz, Beibin Li, Fei Liu, Jialin Li, Qi Liu, Rui Liu, Xiaoxuan Liu, Lin Ma, Kexin Rong, Jianguo Wang, Yingjun Wu, Yongji Wu, Huanchen Zhang, Minjia Zhang, Qizhen Zhang, Tianyi Zhou, Danyang Zhuo, 17 Jan 2024, Computing in the Era of Large Generative Models: From Cloud-Native to AI-Native, https://arxiv.org/abs/2401.12230
- David Spuler, 26th August, 2024, State-of-the-Art LLM Backends, Aussie AI Blog, https://www.aussieai.com/blog/state-of-the-art-llm-backends
- David Spuler, 26th August, 2024, Inference Optimization Research Ideas, https://www.aussieai.com/blog/inference-optimization-ideas
- Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao, 23 Sep 2024 (v2), CITI: Enhancing Tool Utilizing Ability in Large Language Models without Sacrificing General Performance, https://arxiv.org/abs/2409.13202
- Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao 12 Aug 2024 (v3), A Survey on LoRA of Large Language Models, https://arxiv.org/abs/2407.11046 https://github.com/ZJU-LLMs/Awesome-LoRAs.git
- Yuxuan Zhang, Ruizhe Li, 2 Oct 2024, DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models, https://arxiv.org/abs/2410.01497 https://github.com/MeCuping/DLP-LoRA (Merging multiple LoRA adapters for parallel inference.)
- Enhancing LoRA Model Serving Capacity via Adaptive Operator Scheduling for Multi-Tenancy on GPU, Lingnan Xia, Hua Ma, https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10721583
- Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu, 1 Nov 2024, V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM, https://arxiv.org/abs/2411.00915
- Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, Josep Torrellas, 24 Nov 2024, Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments, https://arxiv.org/abs/2411.17741
LoRA Inference Optimizations
The popularity of LoRA as an efficient training method has also spawned research on maximizing its inference efficiency. Loading and unloading LoRA adapter weights at serving time can be quite expensive, and optimizing multi-LoRA serving platforms has been the subject of various research papers.
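One commonly discussed trick in this area is merging: the LoRA delta is folded directly into the base weights before serving (so inference needs no extra matrix multiplications), and subtracted again before hot-swapping in a different adapter. A minimal sketch of the arithmetic, not tied to any particular serving framework:

```python
import torch

def merge_lora(weight, lora_A, lora_B, scale):
    """Fold the LoRA delta into the base weight so serving needs no extra matmuls."""
    return weight + scale * (lora_B @ lora_A)

def unmerge_lora(merged_weight, lora_A, lora_B, scale):
    """Subtract the delta again, e.g. before swapping in a different adapter."""
    return merged_weight - scale * (lora_B @ lora_A)

W = torch.randn(512, 512)                 # frozen base weight (out x in)
A = torch.randn(8, 512) * 0.02            # rank x in
B = torch.randn(512, 8) * 0.02            # out x rank
W_serving = merge_lora(W, A, B, scale=2.0)
W_restored = unmerge_lora(W_serving, A, B, scale=2.0)   # restore before loading another adapter
```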
Research papers on LoRA and multi-LoRA inference optimization include:
- Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy, 28 Oct 2023, Punica: Multi-Tenant LoRA Serving https://arxiv.org/abs/2310.18547 Code: https://github.com/punica-ai/punica
- Jingwei Xu, Junyu Lai, Yunpeng Huang, 24 May 2024 (v2), MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models, https://arxiv.org/abs/2405.13053
- Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag, 28 Mar 2024, CLoRA: A Contrastive Approach to Compose Multiple LoRA Models, https://arxiv.org/abs/2403.19776
- Suyi Li, Hanfeng Lu, Tianyuan Wu, Minchen Yu, Qizhen Weng, Xusheng Chen, Yizhou Shan, Binhang Yuan, Wei Wang, 20 Jan 2024, CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference, https://arxiv.org/abs/2401.11240 (Multi-LoRA inference where it starts running prefill computations in the CPU while loading the LoRA weights into the GPU.)
- Rui Kong, Qiyang Li, Xinyu Fang, Qingtian Feng, Qingfeng He, Yazhu Dong, Weijun Wang, Yuanchun Li, Linghe Kong, Yunxin Liu, 28 May 2024, LoRA-Switch: Boosting the Efficiency of Dynamic LLM Adapters via System-Algorithm Co-design, https://arxiv.org/abs/2405.17741
- Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica, 5 Jun 2024 (v3), S-LoRA: Serving Thousands of Concurrent LoRA Adapters, https://arxiv.org/abs/2311.03285 Code: https://github.com/S-LoRA/S-LoRA
- Chen, Lequn, 2024, Multi-tenant Machine Learning Model Serving Systems on GPU Clusters, PhD Thesis, University of Washington, https://digital.lib.washington.edu/researchworks/items/13e14599-b4ee-4fbb-86bf-e58a4118d0f9
- Bingyang Wu, Ruidong Zhu, and Zili Zhang, Peng Sun, Shanghai AI Lab; Xuanzhe Liu, Xin Jin, 2024, dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving, https://www.usenix.org/conference/osdi24/presentation/wu-bingyang
- Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang, 13 Jun 2024, ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models, https://arxiv.org/abs/2406.09041 (How to load multiple experts for MoE in a memory-efficient way using mixed-precision quantization based on identifying the few salient channels that need higher precision, as an alternative to multi-LoRA.)
- Yuren Mao, Yuhang Ge, Yijiang Fan, Wenyi Xu, Yu Mi, Zhonghao Hu, Yunjun Gao 12 Aug 2024 (v3), A Survey on LoRA of Large Language Models, https://arxiv.org/abs/2407.11046 https://github.com/ZJU-LLMs/Awesome-LoRAs.git
- Yuxuan Zhang, Ruizhe Li, 2 Oct 2024, DLP-LoRA: Efficient Task-Specific LoRA Fusion with a Dynamic, Lightweight Plugin for Large Language Models, https://arxiv.org/abs/2410.01497 https://github.com/MeCuping/DLP-LoRA (Merging multiple LoRA adapters for parallel inference.)
- Liang Mi, Weijun Wang, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yunchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu, 1 Nov 2024, V-LoRA: An Efficient and Flexible System Boosts Vision Applications with LoRA LMM, https://arxiv.org/abs/2411.00915
- Nikoleta Iliakopoulou, Jovan Stojkovic, Chloe Alverti, Tianyin Xu, Hubertus Franke, Josep Torrellas, 24 Nov 2024, Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments, https://arxiv.org/abs/2411.17741
QLoRA
QLoRA is quantized LoRA: the frozen base model is stored in a quantized format while the small LoRA adapters are trained in higher precision. Combining quantization with LoRA is pretty standard nowadays, and many research papers no longer use the term "QLoRA" explicitly. For example, Apple Intelligence uses QLoRA in its multi-LoRA architecture with 4-bit quantization.
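A typical QLoRA-style setup combines the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below assumes recent versions of those libraries, and the checkpoint name and hyperparameters are only examples, so treat it as illustrative rather than prescriptive:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the frozen base model with 4-bit (NF4) quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # example checkpoint only
    quantization_config=bnb_config,
)
model = prepare_model_for_kbit_training(model)

# Attach small higher-precision LoRA adapters to the attention projections.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()       # only the adapter weights are trainable
```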
Research papers on QLoRA include:
- Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso, Daniel A. Roberts, 26 Mar 2024, The Unreasonable Ineffectiveness of the Deeper Layers, https://arxiv.org/abs/2403.17887 (Static layer pruning with some PEFT re-training after removing layers, with quantization and QLoRA.)
- Intel, 2024, https://github.com/intel/intel-extension-for-transformers
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models (Apple's on-device models feature optimizations including small models, grouped query attention, 2-bit/4-bit quantization including activation quantization, shared embedding/unembedding tensors, small-ish vocabulary size of 49k, an undisclosed efficient KV cache optimization for neural engines, and layer-specific 16-bit LoRA/QLoRA adapters of size "10s of megabytes" for fine-tuned specialized model versions, also sometimes in 2-bit/4-bit, claiming speed rates of 0.6ms/token in prefill, and 30 tokens per second in decoding.)
- Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 23 May 2023, QLoRA: Efficient Finetuning of Quantized LLMs, https://arxiv.org/abs/2305.14314 Code: https://github.com/artidoro/qlora Code: https://github.com/TimDettmers/bitsandbytes (The original QLoRA paper.)
- Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang, 13 Apr 2024 (v4), EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models, https://arxiv.org/abs/2310.03270 Code: https://github.com/ThisisBillhe/EfficientDM
- Pranav Patel, 2024, In-depth guide to fine-tuning LLMs with LoRA and QLoRA, https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li, 24 Jul 2024, Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance, https://arxiv.org/abs/2407.17029 Code: https://github.com/xiaocaigou/qbaraqahira (Combining quantization and LoRA.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Shikhar Bajpai, Sep 27, 2024, Shrinking Elephants: A Funny Guide to 4-bit and 8-bit Quantization for LLMs with LoRA, https://medium.com/@shikharstruck/shrinking-elephants-a-funny-guide-to-4-bit-and-8-bit-quantization-for-llms-with-lora-ddf9f1a62070
- Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu, 25 Sep 2024, A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms, https://arxiv.org/abs/2409.16694
- Neal Lawton, Aishwarya Padmakumar, Judith Gaspers, Jack FitzGerald, Anoop Kumar, Greg Ver Steeg, Aram Galstyan, 9 Oct 2024, QuAILoRA: Quantization-Aware Initialization for LoRA, https://arxiv.org/abs/2410.14713
- Meta, October 24, 2024, Introducing quantized Llama models with increased speed and a reduced memory footprint, https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/
Prompt Tuning (Extended Vocabulary PEFT)
Prompt tuning is a lengthwise PEFT method that creates new tokens to extend the vocabulary, rather than training the parameters of existing tokens. Since the tokens are new, they have no pre-trained values, so their embeddings are trainable rather than frozen. The weights for the original tokens are frozen, however, which covers the vast majority of the parameters. As an example, this type of PEFT can be useful when extending the LLM via fine-tuning on a special curated data set, so as to have particular "trigger tokens" that launch integrated tools or perform other advanced capabilities. For example, new tokens that indicate a "tool launch" can be added to the vocabulary, with fine-tuning applied only to those tokens.
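A minimal PyTorch sketch of the extended-vocabulary idea (the sizes and the lookup helper are illustrative assumptions, not a specific library's API): the original embedding table is frozen, a few new trainable rows are appended, and only those new rows receive gradients.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, n_new_tokens = 50_000, 512, 4

old_embeddings = nn.Embedding(vocab_size, embed_dim)
old_embeddings.weight.requires_grad = False             # original vocabulary frozen

# New trainable "trigger token" embeddings appended to the vocabulary.
new_embeddings = nn.Parameter(torch.randn(n_new_tokens, embed_dim) * 0.02)

def embed(token_ids):
    """Look up tokens; ids >= vocab_size fall into the new trainable rows."""
    full_table = torch.cat([old_embeddings.weight, new_embeddings], dim=0)
    return nn.functional.embedding(token_ids, full_table)

optimizer = torch.optim.AdamW([new_embeddings], lr=1e-3)  # only new tokens get updated

token_ids = torch.tensor([[10, 20, vocab_size]])          # last id is a new trigger token
vectors = embed(token_ids)                                # shape (1, 3, 512)
```

Research papers on prompt tuning include: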
- IBM, 2024, What is prompt-tuning?, https://research.ibm.com/blog/what-is-ai-prompt-tuning
- Abhinav Jain, Swarat Chaudhuri, Thomas Reps, Chris Jermaine, 24 May 2024, Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation, https://arxiv.org/abs/2405.15282
- MohammadAli SadraeiJavaeri, Ehsaneddin Asgari, Alice Carolyn McHardy, Hamid Reza Rabiee, 7 Jun 2024, SuperPos-Prompt: Enhancing Soft Prompt Tuning of Language Models with Superposition of Multi Token Embeddings, https://arxiv.org/abs/2406.05279
- Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella, 5 Jun 2024, Choice of PEFT Technique in Continual Learning: Prompt Tuning is Not All You Need, https://arxiv.org/abs/2406.03216
- Xuyang Wu, Zhiyuan Peng, Sravanthi Rajanala, Hsin-Tai Wu, Yi Fang, 31 May 2024, Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models, https://arxiv.org/abs/2405.20654
- Wei Zhu, Aaron Xuxiang Tian, Congrui Yin, Yuan Ni, Xiaoling Wang, Guotong Xie, 7 Jun 2024 (v2), IAPT: Instruction-Aware Prompt Tuning for Large Language Models, https://arxiv.org/abs/2405.18203
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- M Xu, D Cai, W Yin, S Wang, X Jin, X Liu - ACM Computing Surveys, 2024, Resource-efficient Algorithms and Systems of Foundation Models: A Survey, https://dl.acm.org/doi/pdf/10.1145/3706418
Research Papers on PEFT
PEFT is a popular technique that receives a lot of research attention:
- Libo Qin, Qiguang Chen, Xiachong Feng, Yang Wu, Yongheng Zhang, Yinghui Li, Min Li, Wanxiang Che, Philip S. Yu, 21 May 2024, Large Language Models Meet NLP: A Survey, https://arxiv.org/abs/2405.12819 (A survey of research into how LLMs, with and without fine-tuning, perform in various NLP use cases, such as mathematical reasoning, dialogue understanding, translation, and more.)
- Justin Zhao, Timothy Wang, Wael Abid, Geoffrey Angus, Arnav Garg, Jeffery Kinnison, Alex Sherstinsky, Piero Molino, Travis Addair, Devvret Rishi, 29 Apr 2024, LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report, https://arxiv.org/abs/2405.00732 Code: https://huggingface.co/predibase
- Runheng Liu, Xingchen Xiao, Heyan Huang, Zewen Chi, Zhijing Wu, 7 May 2024, FlashBack:Efficient Retrieval-Augmented Language Modeling for Long Context Inference, https://arxiv.org/abs/2405.04065 (Optimize RAG by appending rather than prepending documents, and modifying the attention for improvements in KV caching, by shimming or replacing some of the CUDA GPU low-level memory management APIs to avoid the need to rewrite kernels with extra higher-level memory management code.)
- Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari, 22 Apr 2024, OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework, Apple Research, https://arxiv.org/abs/2404.14619 Code: https://huggingface.co/apple/OpenELM
- Akash Takyar, 2024, Parameter-efficient Fine-tuning (PEFT): Overview, benefits, techniques and model training, https://www.leewayhertz.com/parameter-efficient-fine-tuning/
- Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang, 29 Apr 2024 (v5), Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey, https://arxiv.org/abs/2403.14608
- Guochang Li, Chen Zhi, Jialiang Chen, Junxiao Han, Shuiguang Deng, 9 Jun 2024, A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Automated Program Repair, https://arxiv.org/abs/2406.05639
- Xiongtao Zhou, Jie He, Yuhua Ke, Guangyao Zhu, Víctor Gutiérrez-Basulto, Jeff Z. Pan, 7 Jun 2024, An Empirical Study on Parameter-Efficient Fine-Tuning for MultiModal Large Language Models, https://arxiv.org/abs/2406.05130
- Jitai Hao, WeiWei Sun, Xin Xin, Qi Meng, Zhumin Chen, Pengjie Ren, Zhaochun Ren, 7 Jun 2024, MEFT: Memory-Efficient Fine-Tuning through Sparse Adapter, https://arxiv.org/abs/2406.04984 Code: https://github.com/CURRENTF/MEFT
- Naibin Gu, Peng Fu, Xiyu Liu, Bowen Shen, Zheng Lin, Weiping Wang, 6 Jun 2024, Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning, https://arxiv.org/abs/2406.03792
- Martin Wistuba, Prabhu Teja Sivaprasad, Lukas Balles, Giovanni Zappella, 5 Jun 2024, Choice of PEFT Technique in Continual Learning: Prompt Tuning is Not All You Need, https://arxiv.org/abs/2406.03216
- Canwen Xu, 2024, Efficient Natural Language Processing for Language Models, Ph.D. thesis, Computer Science, UNIVERSITY OF CALIFORNIA SAN DIEGO, PDF: https://escholarship.org/uc/item/9dv1k5xv PDF: https://escholarship.org/content/qt9dv1k5xv/qt9dv1k5xv.pdf?t=sc34ay (Evaluates several acceleration methods including early-exit, PEFT, and distillation.)
- Benjue Weng, 13 Apr 2024, Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning Strategies, https://arxiv.org/abs/2404.09022 (Reviewing fine-tuning of large models.)
- Wangbo Zhao, Jiasheng Tang, Yizeng Han, Yibing Song, Kai Wang, Gao Huang, Fan Wang, Yang You, 18 Mar 2024, Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation, https://arxiv.org/abs/2403.11808 (PEFT and adaptive inference and token pruning in Vision Transformers.)
- Yun Zhu, Yaoke Wang, Haizhou Shi, Siliang Tang, 28 Jan 2024, Efficient Tuning and Inference for Large Language Models on Textual Graphs, https://arxiv.org/abs/2401.15569 (Optimizing Graph Neural Networks on textual graphs using caching and early exit inference.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Chang, Xiangyu; Miraj Ahmed, Sk; Krishnamurthy, Srikanth V.; Guler, Basak; Swami, Ananthram; Oymak, Samet; Roy-Chowdhury, Amit K., Jan 2024, Plug-and-Play Transformer Modules for Test-Time Adaptation, https://arxiv.org/abs/2401.04130 https://ui.adsabs.harvard.edu/abs/2024arXiv240104130C/abstract
- Chengyu Wang, Junbing Yan, Wei Zhang, Jun Huang, Nov 2023, Towards Better Parameter-Efficient Fine-Tuning for Large Language Models: A Position Paper, https://arxiv.org/abs/2311.13126
- Shivansh Kaushik, Aug 1, 2023, Efficient Model Fine-Tuning for LLMs: Understanding PEFT by Implementation, https://medium.com/@shivansh.kaushik/efficient-model-fine-tuning-for-llms-understanding-peft-by-implementation-fc4d5e985389
- Abhinav Jain, Swarat Chaudhuri, Thomas Reps, Chris Jermaine, 24 May 2024, Prompt Tuning Strikes Back: Customizing Foundation Models with Low-Rank Prompt Adaptation, https://arxiv.org/abs/2405.15282
- Apple, June 2024, Introducing Apple’s On-Device and Server Foundation Models, https://machinelearning.apple.com/research/introducing-apple-foundation-models
- Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 16 Oct 2021 (v2), LoRA: Low-Rank Adaptation of Large Language Models, Microsoft Research, https://arxiv.org/abs/2106.09685 https://github.com/microsoft/LoRA (The original LoRA paper from 2021.)
- Adarsh Shrivastav, Aug 17, 2023, LoRA Parameter-Efficient Tuning with Understanding of Self-Attention, https://medium.com/datadreamers/exploring-lora-unveiling-parameter-efficient-tuning-and-self-attention-mechanisms-in-depth-58e4c3b5ce30
- Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 23 May 2023, QLoRA: Efficient Finetuning of Quantized LLMs, https://arxiv.org/abs/2305.14314 Code: https://github.com/artidoro/qlora Code: https://github.com/TimDettmers/bitsandbytes (The original QLoRA paper.)
- Pranav Patel, 2024, In-depth guide to fine-tuning LLMs with LoRA and QLoRA, https://www.mercity.ai/blog-post/guide-to-fine-tuning-llms-with-lora-and-qlora
- Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. 2022, Diffusers: State-of-the-art diffusion models, https://github.com/huggingface/diffusers
- Taolin Zhang, Jiawang Bai, Zhihe Lu, Dongze Lian, Genping Wang, Xinchao Wang, Shu-Tao Xia, 9 Jul 2024, Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach, https://arxiv.org/abs/2407.06964
- 18 Apr 2024 (v2), The Efficiency Spectrum of Large Language Models: An Algorithmic Survey, Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang, https://arxiv.org/abs/2312.00678
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, Kaibin Huang, July 2024, Mobile Edge Intelligence for Large Language Models: A Contemporary Survey, https://www.techrxiv.org/doi/pdf/10.36227/techrxiv.172115025.57884352
- H Wang, 2024, Minimalism Yields Maximum Results: Deep Learning with Limited Resource, Ph.D. Thesis, Purdue University, PDF: https://hammer.purdue.edu/articles/thesis/Minimalism_Yields_Maximum_Results_Deep_Learning_with_Limited_Resource/26349415/1/files/47855029.pdf
- Alessio Devoto, Federico Alvetreti, Jary Pomponi, Paolo Di Lorenzo, Pasquale Minervini, Simone Scardapane, 16 Aug 2024, Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning, https://arxiv.org/abs/2408.08670 (Faster fine-tuning by selecting layers, freezing layers, or slimming them to fewer fine-tuned parameters.)
- Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer, 1 May 2024 (v6), LLM Inference Unveiled: Survey and Roofline Model Insights, https://arxiv.org/abs/2402.16363 Code: https://github.com/hahnyuan/LLM-Viewer
- Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao, 4 Jun 2024 (v2), APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference, ICML 2024 Oral, https://arxiv.org/abs/2401.12200 https://github.com/ROIM1998/APT
- Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu, 8 May 2024 (v2), Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security, https://arxiv.org/abs/2401.05459 https://github.com/MobileLLM/Personal_LLM_Agents_Survey
- Yiheng Liu, Hao He, Tianle Han, Xu Zhang, Mengyuan Liu, Jiaming Tian, Yutong Zhang, Jiaqi Wang, Xiaohui Gao, Tianyang Zhong, Yi Pan, Shaochen Xu, Zihao Wu, Zhengliang Liu, Xin Zhang, Shu Zhang, Xintao Hu, Tuo Zhang, Ning Qiang, Tianming Liu, Bao Ge, 6 Jan 2024 (v2), Understanding LLMs: A Comprehensive Overview from Training to Inference, https://arxiv.org/abs/2401.02038
- Xia, Wenhan, Sep 2024, Methods for Efficient and Scalable Deep Learning, Ph.D. Thesis, Electrical and Computer Engineering Department, Princeton University, http://arks.princeton.edu/ark:/88435/dsp015q47rs12x (Covers PEFT/LoRA on training, and dual pruning with layer skipping and channel/width pruning for inference.)
- Langa, K., Wang, H., Okuboyejo, O. (2025). Parameter-Efficient Fine-Tuning of Pre-trained Large Language Models for Financial Text Analysis. In: Gerber, A., Maritz, J., Pillay, A.W. (eds) Artificial Intelligence Research. SACAIR 2024. Communications in Computer and Information Science, vol 2326. Springer, Cham. https://doi.org/10.1007/978-3-031-78255-8_1 https://link.springer.com/chapter/10.1007/978-3-031-78255-8_1
More AI Research
Read more about: