Aussie AI
Neural Architecture Search
-
Last Updated 7 December, 2024
-
by David Spuler, Ph.D.
Neural Architecture Search (NAS) is the fancy way that AI researchers ask questions like these: how big should I make the model? How many weights? How many layers? What vocabulary size?
Why NAS?
Choosing these numbers is actually a very hard problem. In the early days, these choices were made either randomly or by trial-and-error, which is expensive when you're talking about GPUs. If you go too large, the model is over-parameterized and unnecessarily expensive. Go too small, and the model won't be very accurate, or might not even work at all. Hence, a large body of research on "NAS" has developed around systematic ways to find the optimal size of a model along its various dimensions.
The biggest number is how many billions of weights the model should use, but this is actually dependent on a number of other numeric sizes. The weights are called "parameters" and the various other sizes are called "hyper-parameters" of the model, so NAS is also sometimes called "Hyper-Parameter Optimization" (HPO). The sizes and dimensions of models that NAS aims to determine include (see the simple search sketch after this list):
- Number of layers
- Embedding size
- Vocabulary size
- Number of attention heads
- Context size
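To make this concrete, here is a minimal sketch of the simplest possible NAS strategy: plain random search over a hand-specified search space of these hyperparameters. It is illustrative only; the candidate values, the estimate_params formula, and the proxy_score objective are assumptions made up for this example, and a real NAS run would replace the proxy with (partial) training runs or a zero-shot proxy metric.

    import random

    # Hypothetical search space over the hyperparameters listed above.
    SEARCH_SPACE = {
        "num_layers": [6, 12, 24, 32],
        "embedding_size": [512, 1024, 2048, 4096],
        "vocab_size": [32000, 50000, 100000],
        "num_heads": [8, 16, 32],
        "context_size": [2048, 4096, 8192],
    }

    def estimate_params(cfg):
        # Rough parameter count: embedding matrix plus roughly
        # 12 * d_model^2 weights per Transformer layer.
        d = cfg["embedding_size"]
        return cfg["vocab_size"] * d + cfg["num_layers"] * 12 * d * d

    def proxy_score(cfg):
        # Placeholder objective; a real run would train the candidate or
        # use a zero-shot proxy. Here we simply prefer smaller models.
        return -estimate_params(cfg)

    def random_search(trials=20, seed=0):
        rng = random.Random(seed)
        best_cfg, best_score = None, float("-inf")
        for _ in range(trials):
            cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
            score = proxy_score(cfg)
            if score > best_score:
                best_cfg, best_score = cfg, score
        return best_cfg

    if __name__ == "__main__":
        print(random_search())

Real NAS methods are far more sophisticated (evolutionary search, reinforcement learning, differentiable NAS, Bayesian optimization), but they share this basic shape: define a search space, evaluate candidates as cheaply as possible, and keep the best one.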
NAS versus Model Compression
There are some parallels between neural architecture search and model compression, especially structural pruning. NAS aims to select the model hyperparameters before or during training, whereas model compression comes in afterwards and changes the model. Some types of pruning are very similar to NAS outcomes, such as:
- Depth pruning (e.g. layer pruning)
- Width pruning (e.g. head pruning)
- Length pruning (e.g. token pruning, embedding pruning)
As an example, any type of layer pruning is very similar to NAS choosing the number of layers. If you train your model with a layer count chosen via NAS, and then subsequently prune away some of those layers, the end result is much the same as if NAS had chosen a smaller number of layers, as the sketch below illustrates. Of course, that's only true for static layer pruning, whereas dynamic layer pruning such as early exiting has other runtime effects.
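Here is a minimal PyTorch sketch of that equivalence, assuming for illustration that the "model" is just a stack of identical Transformer encoder blocks (ignoring embeddings, the output head, and the fact that the pruned model keeps its trained weights while the NAS-chosen model would be trained from scratch):

    import torch.nn as nn

    def build_model(num_layers, d_model=512):
        # Stand-in for a Transformer: a stack of identical encoder blocks.
        return nn.Sequential(*[
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(num_layers)
        ])

    def prune_layers(model, keep):
        # Static depth pruning: keep only the first `keep` layers.
        return nn.Sequential(*list(model.children())[:keep])

    # A 12-layer model chosen by NAS, later statically pruned to 9 layers,
    # has the same architecture as if NAS had chosen 9 layers to begin with.
    model_12 = build_model(12)
    pruned_9 = prune_layers(model_12, 9)
    nas_9 = build_model(9)
    assert len(list(pruned_9.children())) == len(list(nas_9.children()))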
Survey Papers on NAS
Some of the review and survey papers on NAS:
- Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, Xin Wang, 2022, A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions. ACM Computing Surveys 54(4):76:1–76:34, https://arxiv.org/abs/2006.02903
- Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smail Niar, Martin Wistuba, Naigang Wang, 2021, Hardware-Aware Neural Architecture Search: Survey and Taxonomy. In: International Joint Conference on Artificial Intelligence (IJCAI), https://arxiv.org/abs/2101.09336
- Dilyara Baymurzina, Eugene Golikov, Mikhail Burtsev, 2022, A review of neural architecture search, Neurocomputing, Volume 474, 14 February 2022, Pages 82-93, https://www.sciencedirect.com/science/article/abs/pii/S0925231221018439
- Thomas Elsken, Jan Hendrik Metzen, Frank Hutter, 2019, Neural architecture search: a survey, The Journal of Machine Learning Research, Volume 20, Issue 1, pp. 1997–2017, https://dl.acm.org/doi/10.5555/3322706.3361996, https://arxiv.org/abs/1808.05377
- Martin Wistuba, Ambrish Rawat, Tejaswini Pedapati, 2019, A Survey on Neural Architecture Search, https://arxiv.org/abs/1905.01392
- Shiqing Liu, Haoyu Zhang, Yaochu Jin, Oct 2022, A Survey on Computationally Efficient Neural Architecture Search, https://arxiv.org/abs/2206.01520
- Colin White, Mahmoud Safari, Rhea Sukthanker, Binxin Ru, Thomas Elsken, Arber Zela, Debadeepta Dey, Frank Hutter, Jan 2023, Neural Architecture Search: Insights from 1000 Papers, https://arxiv.org/abs/2301.08727
- Bernd Bischl, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, Theresa Ullmann, Marc Becker, Anne-Laure Boulesteix, Difan Deng, Marius Lindauer, Nov 2021, Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges, https://arxiv.org/abs/2107.05847
- Inas Bachiri, September 2024, A Literature Review on Combining Neural Architecture Search and Compiler Optimizations for Neural Network Acceleration, DOI:10.13140/RG.2.2.10612.16009, Thesis for: Master in Computer Science, https://www.researchgate.net/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration https://www.researchgate.net/profile/Inas-Bachiri/publication/384190836_A_Literature_Review_on_Combining_Neural_Architecture_Search_and_Compiler_Optimizations_for_Neural_Network_Acceleration/links/66ed912c6b101f6fa4f3d6ce/A-Literature-Review-on-Combining-Neural-Architecture-Search-and-Compiler-Optimizations-for-Neural-Network-Acceleration.pdf
General Research Papers on NAS
Some of the research papers on NAS:
- Odema, M., Rashid, N., Demirel, B. U., and Faruque, M. A. A. (2021). Lens: Layer distribution enabled neural architecture search in edge-cloud hierarchies. In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 403–408, https://arxiv.org/abs/2107.09309
- A. Wong, M. Famuori, M. J. Shafiee, F. Li, B. Chwyl, and J. Chung, “YOLO nano: A highly compact you only look once convolutional neural network for object detection,” 2019, arXiv:1910.01271. https://arxiv.org/abs/1910.01271
- David R So, Chen Liang, and Quoc V Le. 2019. The evolved transformer. arXiv preprint arXiv:1901.11117. https://arxiv.org/abs/1901.11117
- Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, https://arxiv.org/abs/1905.11946, Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
- Guihong Li, Duc Hoang, Kartikeya Bhardwaj, Ming Lin, Zhangyang Wang, Radu Marculescu, July 2023, Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities https://arxiv.org/abs/2307.01998
- C Fu, 2023, Machine Learning Algorithm and System Co-design for Hardware Efficiency, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/content/qt52q368p3/qt52q368p3.pdf
- Aaron Klein, Jacek Golebiowski, Xingchen Ma, Valerio Perrone, Cedric Archambeau, 3 May 2024, Structural Pruning of Pre-trained Language Models via Neural Architecture Search, https://arxiv.org/abs/2405.02267 (Post-training structured pruning of sub-networks based on NAS, also with weight sharing and several different focus areas of pruning including attention heads, FFNs, and layers.)
- Ganesh Jawahar, April 2024, Methods for design of efficient on-device natural language processing architectures, Ph.D. thesis, Computer Science, The University of British Columbia (Vancouver) https://open.library.ubc.ca/media/download/pdf/24/1.0441384/4
- Hou-I Liu, Marco Galindo, Hongxia Xie, Lai-Kuan Wong, Hong-Han Shuai, Yung-Yui Li, Wen-Huang Cheng, 8 Apr 2024, Lightweight Deep Learning for Resource-Constrained Environments: A Survey, https://arxiv.org/abs/2404.07236 (A survey of various optimizations, with a lot of focus on image and vision models, including CNNs, RNNs, and Transformers.)
- Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu, 16 Jan 2024, A Survey of Resource-efficient LLM and Multimodal Foundation Models, https://arxiv.org/abs/2401.08092 Project: https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey (Broad survey with many optimizations including this topic.)
- You Zhou, Xiujing Lin, Xiang Zhang, Maolin Wang, Gangwei Jiang, Huakang Lu, Yupeng Wu, Kai Zhang, Zhe Yang, Kehang Wang, Yongduo Sui, Fengwei Jia, Zuoli Tang, Yao Zhao, Hongxuan Zhang, Tiannuo Yang, Weibo Chen, Yunong Mao, Yi Li, De Bao, Yu Li, Hongrui Liao, Ting Liu, Jingwen Liu, Jinchi Guo, Xiangyu Zhao, Ying WEI, Hong Qian, Qi Liu, Xiang Wang, Wai Kin (Victor)Chan, Chenliang Li, Yusen Li, Shiyu Yang, Jining Yan, Chao Mou, Shuai Han, Wuxia Jin, Guannan Zhang, Xiaodong Zeng, Nov 2023, On the Opportunities of Green Computing: A Survey, https://arxiv.org/abs/2311.00447 (Extensive survey of environmental and green AI issues, along with a survey of various optimization methods to reduce AI resource requirements in training and inference.)
- Chu, X.; Zhang, B.; Xu, R. FairNAS: Rethinking Evaluation Fairness of Weight Sharing Neural Architecture Search. In Proceedings of the IEEE International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12219–12228. http://dx.doi.org/10.1109/ICCV48922.2021.01202 https://arxiv.org/abs/1907.01845 (NAS in the context of weight sharing architectures.)
- Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani, 2023, A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
- Tan, M., Chen, B., Pang, R., Vasudevan, V., Sandler, M., Howard, A., and Le, Q. V. (2019). Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). https://openaccess.thecvf.com/content_CVPR_2019/html/Tan_MnasNet_Platform_Aware_Neural_Architecture_Search_for_Mobile_CVPR_2019_paper.html.
- David Spuler, March 2024, Chapter 56. Neural Architecture Search, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang, 8 Jun 2024 (v2), A Survey on Efficient Inference for Large Language Models, https://arxiv.org/abs/2404.14294
- kipply's blog, 2023-03-30, Transformer Taxonomy (the last lit review), https://kipp.ly/transformer-taxonomy/ (Papers for all the Transformer architectures and milestone papers for the major optimization improvements on them.)
- Zechun Liu, Changsheng Zhao, Forrest Iandola, Chen Lai, Yuandong Tian, Igor Fedorov, Yunyang Xiong, Ernie Chang, Yangyang Shi, Raghuraman Krishnamoorthi, Liangzhen Lai, Vikas Chandra, 27 Jun 2024 (v2), MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases, Meta Research, https://arxiv.org/abs/2402.14905 Code: https://github.com/facebookresearch/MobileLLM
- Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He, 15 Feb 2024, Model Compression and Efficient Inference for Large Language Models: A Survey, https://arxiv.org/abs/2402.09748
- Matias Martinez, 2 Aug 2024, The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines, https://arxiv.org/abs/2408.01050
- Szabolcs Cséfalvay, James Imber, 31 Jan 2023 (v2), Self-Compressing Neural Networks, https://arxiv.org/abs/2301.13142
- Ye Qiao, Haocheng Xu, Yifan Zhang, Sitao Huang, 26 Aug 2024, MONAS: Efficient Zero-Shot Neural Architecture Search for MCUs, https://arxiv.org/abs/2408.15034
- Fred Hohman, Chaoqun Wang, Jinmook Lee, Jochen Görtler, Dominik Moritz, Jeffrey P Bigham, Zhile Ren, Cecile Foret, Qi Shan, and Xiaoyi Zhang. 2024. Talaria: Interactively Optimizing Machine Learning Models for Efficient Inference. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 648, 1–19. https://doi.org/10.1145/3613904.3642628 https://dl.acm.org/doi/full/10.1145/3613904.3642628
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Jiantong Jiang, Ajmal Mian, 1 Sep 2024, FastBO: Fast HPO and NAS with Adaptive Fidelity Identification, https://arxiv.org/abs/2409.00584
- Hung-Yueh Chiang, Diana Marculescu, 27 Aug 2024, SCAN-Edge: Finding MobileNet-speed Hybrid Networks for Diverse Edge Devices via Hardware-Aware Evolutionary Search, https://arxiv.org/abs/2408.15395
- Junfeng Gong, Cheng Liu, Long Cheng, Huawei Li, Xiaowei Li, 17 Jul 2024, MCU-MixQ: A HW/SW Co-optimized Mixed-precision Neural Network Design Framework for MCUs, https://arxiv.org/abs/2407.18267
- Xuan Shen, Pu Zhao, Yifan Gong, Zhenglun Kong, Zheng Zhan, Yushu Wu, Ming Lin, Chao Wu, Xue Lin, Yanzhi Wang, 25 Sep 2024, Search for Efficient Large Language Models, https://arxiv.org/abs/2409.17372 (Looking for subnets inside models as an alternative to NAS.)
- Rhea Sanjay Sukthanker, Benedikt Staffler, Frank Hutter, Aaron Klein, 9 Oct 2024, LLM Compression with Neural Architecture Search, https://arxiv.org/abs/2410.06479 (NAS with width/attention head and layer pruning.)
- Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong, 1 Nov 2024 (v3), Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies, https://arxiv.org/abs/2407.13623 https://github.com/sail-sg/scaling-with-vocab https://hf.co/spaces/sail/scaling-with-vocab-demo
This is not the full list of papers, I can say with reasonable certainty, given that one survey paper stated there have been over 1,000 papers written on NAS since 2021. If this is your chosen dissertation topic, you'd better start writing that lit review section early!
NAS and Dynamic Inference Optimization
Dynamic NAS is not yet a mainstream application of NAS. Traditionally, NAS has been applied to finding static model architectures, without regard to dynamic inference approaches. An emerging area of research is to include the hyperparameters of dynamic inference optimizations (such as early-exit thresholds) as part of searching the problem space for an optimal model.
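The flavor of this is shown in the toy sketch below, where a dynamic inference hyperparameter (an early-exit confidence threshold) is searched jointly with static architecture choices under an inference-cost budget. The search space, cost model, accuracy proxy, and budget are all illustrative assumptions, not taken from any of the papers listed below.

    import random

    # Hypothetical joint search space: static hyperparameters plus a
    # dynamic-inference hyperparameter (early-exit confidence threshold).
    SEARCH_SPACE = {
        "num_layers": [12, 24, 32],
        "embedding_size": [1024, 2048],
        "exit_threshold": [0.7, 0.8, 0.9, 0.95],
    }

    def expected_cost(cfg):
        # Toy cost model: a higher exit threshold means fewer early exits,
        # so more layers are executed on average.
        avg_layers = cfg["num_layers"] * cfg["exit_threshold"]
        return avg_layers * cfg["embedding_size"] ** 2

    def accuracy_proxy(cfg):
        # Placeholder for a real evaluation (partial training or a
        # zero-shot proxy metric).
        return cfg["num_layers"] * cfg["embedding_size"] * cfg["exit_threshold"]

    def search(trials=50, cost_budget=5e7, seed=0):
        rng = random.Random(seed)
        best_cfg, best_acc = None, float("-inf")
        for _ in range(trials):
            cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
            if expected_cost(cfg) > cost_budget:
                continue  # reject candidates over the inference-cost budget
            acc = accuracy_proxy(cfg)
            if acc > best_acc:
                best_cfg, best_acc = cfg, acc
        return best_cfg

    if __name__ == "__main__":
        print(search())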
Research papers on "dynamic NAS" include:
- Matteo Gambella, Manuel Roveri, "EDANAS: Adaptive Neural Architecture Search for Early Exit Neural Networks", 2023 International Joint Conference on Neural Networks (IJCNN), pp.1-8, 2023. https://ieeexplore.ieee.org/document/10191876 (NAS applied to early-exit dynamic inference.)
- Chakkrit Termritthikun, Yeshi Jamtsho, Jirarat Ieamsaard, Paisarn Muneesawang, Ivan Lee, 2021, EEEA-Net: An Early Exit Evolutionary Neural Architecture Search, Engineering Applications of Artificial Intelligence Volume 104, September 2021, 104397, https://www.sciencedirect.com/science/article/abs/pii/S0952197621002451, https://arxiv.org/abs/2108.06156, Code: https://github.com/chakkritte/EEEA-Net (A 2021 paper on NAS applied to early-exit.)
- KT Chitty-Venkata, Y Bian, M Emani, V Vishwanath, Jan 2023, Differentiable Neural Architecture, Mixed Precision and Accelerator Co-search, IEEE Access, DOI:10.1109/ACCESS.2023.3320133, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10266308
- Linnan Wang, Chenhan Yu, Satish Salian, Slawomir Kierat, Szymon Migacz, Alex Fit Florea, 2022, GPUNet: Searching the Deployable Convolution Neural Networks for GPUs, https://arxiv.org/abs/2205.00841 (A general NAS system that could be applied statically or dynamically.)
- Matteo Gambella, Jary Pomponi, Simone Scardapane, Manuel Roveri, 24 Jan 2024, NACHOS: Neural Architecture Search for Hardware Constrained Early Exit Neural Networks, https://arxiv.org/abs/2401.13330
- David Spuler, March 2024, Chapter 56. Neural Architecture Search, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Angie Boggust, Venkatesh Sivaraman, Yannick Assogba, Donghao Ren, Dominik Moritz, Fred Hohman, 6 Aug 2024, Compress and Compare: Interactively Evaluating Efficiency and Behavior Across ML Model Compression Experiments, https://arxiv.org/abs/2408.03274
- Hongrong Cheng, Miao Zhang, Javen Qinfeng Shi, 9 Aug 2024 (v2), A Survey on Deep Neural Network Pruning-Taxonomy, Comparison, Analysis, and Recommendations, IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2024.3447085, https://arxiv.org/abs/2308.06767 https://ieeexplore.ieee.org/abstract/document/10643325
- Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, Mohammad Dabbah, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Netanel Haber, Ehud Karpas, Itay Levy, Shahar Mor, Zach Moshe, Najeeb Nabwani, Omri Puny, Ran Rubin, Itamar Schen, Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv, 28 Nov 2024, Puzzle: Distillation-Based NAS for Inference-Optimized LLMs, NVIDIA Research, https://arxiv.org/abs/2411.19146 (This is dynamic NAS on a vast scale in a search space of size 10^138, because the optimization is applied with low granularity to each block in attention and FFN subcomponents of each layer.)
- Armin W. Thomas, Rom Parnichkun, Alexander Amini, Stefano Massaroli and Michael Poli, Liquid AI, December 2, 2024, Automated Architecture Synthesis via Targeted Evolution, https://arxiv.org/abs/2411.17800 https://www.liquid.ai/research/automated-architecture-synthesis-via-targeted-evolution