Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
56. Neural Architecture Search
“A journey of a thousand miles begins with a single step.”
— Chinese Proverb.
What is NAS?
Neural Architecture Search (NAS) is the very fancy way that AI researchers ask questions like these: how big should I make the model? How many weights? How many layers? What vocabulary size?
The biggest number is how many billions of weights the model should use, but this actually depends on a number of other numeric sizes. These weights are called “parameters” and the various other sizes are called “hyper-parameters” of the model, so NAS is also sometimes called “Hyper-Parameter Optimization” (HPO). The sizes and dimensions of models that NAS aims to determine include (see the sketch after this list):
- Number of layers
- Embedding size
- Vocabulary size
- Number of attention heads
- Context size
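To make this concrete, here is a minimal C++ sketch of a hyper-parameter configuration with a very rough parameter-count estimate. The struct fields and the counting formula are illustrative assumptions for a generic Transformer (4x FFN expansion, no biases or layer norms, no weight tying), not the exact layout of any particular model.

    // Illustrative sketch only: a generic Transformer hyper-parameter set and a
    // rough parameter-count estimate (assumes a 4x FFN expansion and ignores
    // biases, layer norms, and weight tying).
    #include <cstdint>
    #include <cstdio>

    struct ModelConfig {
        int num_layers;      // depth: number of Transformer layers
        int embedding_size;  // width: model/hidden dimension
        int vocab_size;      // tokenizer vocabulary size
        int num_heads;       // attention heads per layer
        int context_size;    // maximum sequence length
    };

    std::int64_t approx_params(const ModelConfig& c) {
        const std::int64_t d = c.embedding_size;
        const std::int64_t embeddings = std::int64_t(c.vocab_size) * d;
        const std::int64_t per_layer = 4 * d * d        // Q, K, V, output projections
                                     + 2 * (4 * d) * d; // FFN up and down projections
        return embeddings + std::int64_t(c.num_layers) * per_layer;
    }

    int main() {
        const ModelConfig cfg{32, 4096, 32000, 32, 4096};  // roughly a 6-7B model
        std::printf("Approximate parameters: %lld\n",
                    static_cast<long long>(approx_params(cfg)));
        return 0;
    }

NAS is essentially the search over values like these, scored by accuracy against training and inference cost.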
Choosing these numbers is actually a very hard problem. In the early days, these choices were made either randomly or by trial-and-error, which is expensive when you're talking about GPUs. If you go too large, then the model is over-parameterized and unnecessarily expensive. Go too small, and the model won't be very accurate, or might not even work at all. Hence, a large body of “NAS” research has developed around systematic ways to find the optimal sizes of models along their various dimensions.
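As a baseline, here is a minimal sketch of that trial-and-error approach written as a random search in C++. The evaluate_candidate() function is a hypothetical, toy stand-in for the expensive step of (partially) training a candidate model and scoring it on a validation set; real NAS methods replace this brute-force loop with something smarter.

    // Naive random-search baseline over hyper-parameters. The scoring function
    // is a toy stand-in; in a real NAS run it would train (or partially train)
    // the candidate model and measure validation accuracy, which is what makes
    // naive search so expensive on GPUs.
    #include <cstdio>
    #include <random>
    #include <vector>

    struct Candidate { int num_layers; int embedding_size; int num_heads; };

    // Hypothetical stand-in: pretend bigger is slightly better, penalize cost.
    double evaluate_candidate(const Candidate& c) {
        double quality = c.num_layers * 0.5 + c.embedding_size * 0.001 + c.num_heads * 0.2;
        double cost_penalty = (double)c.num_layers * c.embedding_size * 1e-5;
        return quality - cost_penalty;
    }

    Candidate random_search(int trials, unsigned seed = 42) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<int> layers(4, 48);
        std::uniform_int_distribution<int> heads(4, 32);
        const std::vector<int> widths = {512, 1024, 2048, 4096};
        std::uniform_int_distribution<int> width_idx(0, (int)widths.size() - 1);

        Candidate best{};
        double best_score = -1e30;
        for (int i = 0; i < trials; ++i) {
            Candidate c{layers(rng), widths[width_idx(rng)], heads(rng)};
            double score = evaluate_candidate(c);  // the expensive part in practice
            if (score > best_score) { best_score = score; best = c; }
        }
        return best;
    }

    int main() {
        Candidate best = random_search(100);
        std::printf("Best: %d layers, %d wide, %d heads\n",
                    best.num_layers, best.embedding_size, best.num_heads);
        return 0;
    }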
NAS is not a type of model compression and isn't just something you do “offline” before inference. Rather, it's the first thing you do in an AI project. It's before training, and before you even start designing the C++ code for your engine that's tuned to the model.
NAS versus Model Compression
There are some parallels between neural architecture search and model compression, especially structured pruning. NAS aims to select the model hyperparameters before or during training, whereas model compression comes in afterwards and changes the model. Some types of structured pruning are very similar to NAS outcomes, such as:
- Depth pruning (e.g. layer pruning)
- Width pruning (e.g. head pruning)
- Length pruning (e.g. token pruning, embedding pruning)
As an example, any type of layer pruning is very similar to NAS choosing the number of layers. If you train your model with a layer count chosen via NAS, and then subsequently prune away some of those layers, the end result is the same as if NAS had chosen a smaller number of layers. Of course, that's only true for static layer pruning, whereas dynamic layer pruning such as early exiting has other runtime effects.
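To illustrate that equivalence for static depth pruning, here is a small C++ sketch, assuming a toy engine that stores its layers in a std::vector. The Layer type and run_layer() call are illustrative placeholders, not a real inference API.

    // Sketch of why static layer pruning mirrors NAS choosing fewer layers:
    // dropping trained layers from the stack leaves an engine with exactly the
    // shape it would have had if NAS had picked the smaller layer count up front.
    #include <cstddef>
    #include <vector>

    struct Layer { /* weights for one Transformer layer */ };

    std::vector<float> run_layer(const Layer& layer, std::vector<float> x) {
        // Placeholder: a real engine would do attention + FFN here.
        (void)layer;
        return x;
    }

    // Static depth pruning: keep only the first `keep` layers.
    void prune_layers(std::vector<Layer>& layers, std::size_t keep) {
        if (keep < layers.size()) {
            layers.resize(keep);  // same shape as if NAS had chosen `keep` layers
        }
    }

    std::vector<float> forward(const std::vector<Layer>& layers, std::vector<float> x) {
        for (const Layer& layer : layers) {
            x = run_layer(layer, x);
        }
        return x;
    }

    int main() {
        std::vector<Layer> layers(32);   // as if NAS chose 32 layers
        prune_layers(layers, 24);        // statically prune down to 24 layers
        std::vector<float> out = forward(layers, std::vector<float>(16, 1.0f));
        return (int)out.size() == 16 ? 0 : 1;
    }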
NAS Research Papers
This is not the full list of papers, I can say with reasonable certainty, given that one survey paper reported that over 1,000 papers have been written on NAS since 2021. If this is your chosen dissertation topic, better start writing that lit review section early!
Survey papers on NAS include:
- Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Xiaojiang Chen, Xin Wang, 2022, A Comprehensive Survey of Neural Architecture Search: Challenges and Solutions, ACM Computing Surveys 54(4):76:1–76:34, https://arxiv.org/abs/2006.02903
- Hadjer Benmeziane, Kaoutar El Maghraoui, Hamza Ouarnoughi, Smail Niar, Martin Wistuba, Naigang Wang, 2021, Hardware-Aware Neural Architecture Search: Survey and Taxonomy. In: International Joint Conference on Artificial Intelligence (IJCAI), https://arxiv.org/abs/2101.09336
- Dilyara Baymurzina, Eugene Golikov, Mikhail Burtsev, 2022, A review of neural architecture search, Neurocomputing, Volume 474, 14 February 2022, Pages 82-93, https://www.sciencedirect.com/science/article/abs/pii/S0925231221018439
- Thomas Elsken, Jan Hendrik Metzen, Frank Hutter, 2019, Neural architecture search: a survey, The Journal of Machine Learning Research, Volume 20, Issue 1, pp. 1997–2017, https://dl.acm.org/doi/10.5555/3322706.3361996, https://arxiv.org/abs/1808.05377
- Martin Wistuba, Ambrish Rawat, Tejaswini Pedapati, 2019, A Survey on Neural Architecture Search, https://arxiv.org/abs/1905.01392
- Shiqing Liu, Haoyu Zhang, Yaochu Jin, Oct 2022, A Survey on Computationally Efficient Neural Architecture Search, https://arxiv.org/abs/2206.01520
- Colin White, Mahmoud Safari, Rhea Sukthanker, Binxin Ru, Thomas Elsken, Arber Zela, Debadeepta Dey, Frank Hutter, Jan 2023, Neural Architecture Search: Insights from 1000 Papers, https://arxiv.org/abs/2301.08727
- Bernd Bischl, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, Theresa Ullmann, Marc Becker, Anne-Laure Boulesteix, Difan Deng, Marius Lindauer, Nov 2021, Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges, https://arxiv.org/abs/2107.05847
General research papers on NAS:
- Odema, M., Rashid, N., Demirel, B. U., and Faruque, M. A. A. (2021). Lens: Layer distribution enabled neural architecture search in edge-cloud hierarchies, In 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 403–408, https://arxiv.org/abs/2107.09309
- A. Wong, M. Famuori, M. J. Shafiee, F. Li, B. Chwyl, and J. Chung, 2019, YOLO nano: A highly compact you only look once convolutional neural network for object detection, arXiv:1910.01271. https://arxiv.org/abs/1910.01271
- David R So, Chen Liang, and Quoc V Le. 2019. The evolved transformer, arXiv preprint arXiv:1901.11117. https://arxiv.org/abs/1901.11117
- Mingxing Tan and Quoc V Le. 2019, EfficientNet: Rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946, https://arxiv.org/abs/1905.11946, Code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
- Guihong Li, Duc Hoang, Kartikeya Bhardwaj, Ming Lin, Zhangyang Wang, Radu Marculescu, July 2023, Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities, https://arxiv.org/abs/2307.01998
- C Fu, 2023, Machine Learning Algorithm and System Co-design for Hardware Efficiency, Ph.D. thesis, Computer Science, University of California San Diego, https://escholarship.org/content/qt52q368p3/qt52q368p3.pdf
For more research on NAS, see also https://www.yoryck.com/research/nas.
Dynamic NAS
Dynamic NAS is the use of NAS-like approaches to find optimal hyper-parameters for the various dynamic inference options. Every type of adaptive inference or dynamic pruning has meta-parameters such as numeric threshold values or a choice of multiple decision metrics. Deciding on the best option from all of that shemozzle is the idea of Dynamic NAS research.
Dynamic NAS is not yet a mainstream use of NAS searching, but there are some research papers starting to appear on this extension. NAS has traditionally been applied to finding optimal meta-parameters for models without regard to dynamic approaches. This emerging area of research aims to consider the hyperparameters of dynamic inference optimizations as part of searching the problem space for an optimal model.
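For a flavor of what a dynamic NAS search might look like, here is a minimal C++ sketch that searches over a single dynamic-inference meta-parameter, an early-exit confidence threshold, under a latency budget. The evaluate_with_threshold() function is a hypothetical, toy stand-in for running the model with early exiting enabled on a validation set and measuring accuracy and latency.

    // Sketch of the "dynamic NAS" idea: treating a dynamic-inference
    // meta-parameter (an early-exit confidence threshold) as part of the
    // search space. The evaluation function is a toy stand-in; a real search
    // would measure accuracy and latency on a validation set.
    #include <cstdio>
    #include <vector>

    struct TradeOff { double accuracy; double latency_ms; };

    // Hypothetical stand-in: higher thresholds exit later, so they cost more
    // latency but recover more accuracy.
    TradeOff evaluate_with_threshold(double exit_threshold) {
        return { 0.70 + 0.25 * exit_threshold, 20.0 + 80.0 * exit_threshold };
    }

    // Grid-search the exit threshold for best accuracy within a latency budget.
    double search_exit_threshold(const std::vector<double>& candidates,
                                 double latency_budget_ms) {
        double best_threshold = candidates.front();
        double best_accuracy = -1.0;
        for (double t : candidates) {
            TradeOff result = evaluate_with_threshold(t);  // expensive in practice
            if (result.latency_ms <= latency_budget_ms &&
                result.accuracy > best_accuracy) {
                best_accuracy = result.accuracy;
                best_threshold = t;
            }
        }
        return best_threshold;
    }

    int main() {
        const std::vector<double> candidates = {0.5, 0.6, 0.7, 0.8, 0.9};
        double best = search_exit_threshold(candidates, 80.0);
        std::printf("Best exit threshold under budget: %.2f\n", best);
        return 0;
    }

The same pattern extends to other dynamic-inference meta-parameters, such as token pruning ratios or the choice of exit-decision metric.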
Research papers on dynamic NAS:
- Matteo Gambella, Manuel Roveri, 2023, EDANAS: Adaptive Neural Architecture Search for Early Exit Neural Networks, 2023 International Joint Conference on Neural Networks (IJCNN), pp.1-8, 2023. https://ieeexplore.ieee.org/document/10191876 (NAS applied to early-exit dynamic inference.)
- Chakkrit Termritthikun, Yeshi Jamtsho, Jirarat Ieamsaard, Paisarn Muneesawang, Ivan Lee, 2021, EEEA-Net: An Early Exit Evolutionary Neural Architecture Search, Engineering Applications of Artificial Intelligence Volume 104, September 2021, 104397, https://www.sciencedirect.com/science/article/abs/pii/S0952197621002451, https://arxiv.org/abs/2108.06156, Code: https://github.com/chakkritte/EEEA-Net (A 2021 paper on NAS applied to early-exit.)
- KT Chitty-Venkata, Y Bian, M Emani, V Vishwanath, Jan 2023, Differentiable Neural Architecture, Mixed Precision and Accelerator Co-search, IEEE Access, DOI: 10.1109/ACCESS.2023.3320133, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10266308
- Linnan Wang, Chenhan Yu, Satish Salian, Slawomir Kierat, Szymon Migacz, Alex Fit Florea, 2022, GPUNet: Searching the Deployable Convolution Neural Networks for GPUs, https://arxiv.org/abs/2205.00841 (A general NAS system that could be applied statically or dynamically.)
For more research on dynamic NAS, see also https://www.yoryck.com/research/nas#dynamic.