Aussie AI
Layer Reordering
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Layer Reordering
An interesting technique that generalizes the use of layers is “layer reordering.” The idea is motivated by the observation that Transformer layers are building blocks whose output has the same format as their input. Hence, not only can you remove a layer (early exit or layer pruning), skip a layer (layer skipping), or run the same layer twice (layer fusion), but the idea generalizes much further. You can pick and choose which layers to run, in what order, and how often. You could run every layer twice, run all the layers in reverse, or try any other sequence you like.
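As a concrete illustration, here is a minimal C++ sketch of the idea: because every layer consumes and produces activations of the same shape, the forward pass can be driven by an arbitrary "execution plan" of layer indices. The TransformerLayer type, the run_with_plan function, and the example plans are hypothetical placeholders for illustration only, not code from any particular inference engine.

    // Minimal sketch of layer reordering (hypothetical names and types).
    // A Transformer layer maps activations to activations of the same size,
    // so any sequence of layer indices is a valid execution plan.
    #include <cstddef>
    #include <vector>

    using Activations = std::vector<float>;

    struct TransformerLayer {
        // Placeholder: a real layer would run attention + FFN with its own weights.
        Activations forward(const Activations& x) const { return x; }
    };

    // Run the layers in whatever order the plan specifies: skip, repeat, or reverse.
    Activations run_with_plan(const std::vector<TransformerLayer>& layers,
                              const std::vector<std::size_t>& plan,
                              Activations x)
    {
        for (std::size_t idx : plan) {
            x = layers[idx].forward(x);  // same-shape input/output makes this legal
        }
        return x;
    }

    // Example plans for a 6-layer model:
    //   {0,1,2,3,4,5}  -- normal order
    //   {0,1,2,3}      -- early exit / layer pruning
    //   {0,0,1,1,2,2}  -- run each layer twice
    //   {5,4,3,2,1,0}  -- all layers in reverse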
Layer reordering usually refers to entire Transformer layers. For other types of merging or reordering of separate sub-layer structures within Transformer layers, see kernel operator fusion. For discussion of the order of layer normalization subcomponents, see normalization reordering.
Layer reordering seems like it shouldn't work. After all, didn't we expend all those GPU cycles to carefully work out the correct weights for each layer? Isn't it true that the first layers do the broad analysis and the upper layers do the finessing? So, early exiting makes some kind of sense, because it just skips the finer details at the end, but randomly reordering things seems weird. Nevertheless, there are some research papers that explore layer reordering and its generalizations.
Research papers on layer reordering:
- Ofir Press, Noah A. Smith, Omer Levy, 2019, Improving Transformer Models by Reordering their Sublayers, arXiv preprint arXiv:1911.03864, https://arxiv.org/abs/1911.03864 (Layer reordering! Includes analysis of multiple layers, and also reordering of the self-attention and feed-forward sub-components in a “sandwich” architecture; see the sketch after this list.)
- Jinhua Zhu, Lijun Wu, Yingce Xia, Shufang Xie, Tao Qin, Wengang Zhou, Houqiang Li, Tie-Yan Liu, Mar 2021, IOT: Instance-wise Layer Reordering for Transformer Structures, https://arxiv.org/abs/2103.03457
- Elicia Ye, March 2023, Greedy Ordering of Layer Weight Matrices in Transformers Improves Translation, https://arxiv.org/abs/2302.02123
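The sublayer reordering explored by Press et al. can be sketched in the same style: treat each self-attention and feed-forward sublayer as an interchangeable block, and drive the forward pass from a pattern string such as "sfsfsf" (standard interleaving) or a "sandwich" arrangement with attention concentrated early and feed-forward late. The Sublayer type, the run_pattern function, and the patterns shown are hypothetical illustrations of the general idea, not the paper's exact architecture.

    // Hypothetical sketch of sublayer reordering ("sandwich" style), assuming
    // self-attention ('s') and feed-forward ('f') sublayers are interchangeable blocks.
    #include <cstddef>
    #include <string>
    #include <vector>

    using Activations = std::vector<float>;

    struct Sublayer {
        // Placeholder for either a self-attention or feed-forward sublayer.
        Activations forward(const Activations& x) const { return x; }
    };

    // Run sublayers in the order given by the pattern string, e.g.:
    //   "sfsfsfsfsfsf"  -- the standard interleaved 6-layer Transformer
    //   "ssssffsfsfff"  -- a reordered "sandwich"-style arrangement
    // Each sublayer keeps its own weights; only the execution order changes.
    Activations run_pattern(const std::string& pattern,
                            const std::vector<Sublayer>& attn,
                            const std::vector<Sublayer>& ffn,
                            Activations x)
    {
        std::size_t ai = 0, fi = 0;
        for (char c : pattern) {
            x = (c == 's') ? attn.at(ai++).forward(x) : ffn.at(fi++).forward(x);
        }
        return x;
    }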
For more research on layer reordering, refer to https://www.aussieai.com/research/layer-pruning#reordering.