Aussie AI
Early Exit of Inference Layers
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Early exit means quitting the main inference loop at one of the layers in the encoder and/or decoder, using the results computed up to that point, when there is a high degree of confidence. Implementing this method is surprisingly simple, because a model layer's input and output have the same dimensions. You can simply exit the layer loop early and use the currently computed logits as inputs to the final Softmax layer.
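A minimal sketch of this layer loop in C++, assuming a hypothetical apply_layer() and a max-probability confidence test (both are illustrative stand-ins, not the book's code):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Vec = std::vector<float>;

// Stand-in for a real Transformer layer; input and output dimensions match.
Vec apply_layer(const Vec& x, int /*layer*/) {
    Vec y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = 0.9f * x[i] + 0.1f;  // dummy transform
    return y;
}

Vec softmax(const Vec& logits) {
    float maxv = *std::max_element(logits.begin(), logits.end());
    Vec p(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp(logits[i] - maxv);  // subtract max for stability
        sum += p[i];
    }
    for (float& v : p) v /= sum;
    return p;
}

// Run the layers, but break out early once the interim distribution is
// confident enough; the state computed so far feeds the final Softmax as usual.
Vec infer_with_early_exit(Vec state, int num_layers, float threshold,
                          int* layers_run = nullptr) {
    int layer = 0;
    for (; layer < num_layers; ++layer) {
        state = apply_layer(state, layer);
        Vec probs = softmax(state);
        float top = *std::max_element(probs.begin(), probs.end());
        if (top >= threshold) { ++layer; break; }  // early exit: confident enough
    }
    if (layers_run) *layers_run = layer;
    return softmax(state);  // final Softmax over the current state
}
```

The threshold trades accuracy for speed: a low threshold exits after very few layers, while a threshold near 1.0 degenerates to running every layer.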
The speedup from early exit is obvious: by skipping all layers after the exit point, it avoids a significant amount of computation. Early exit is possible in a significant fraction of inference evaluations (e.g. reportedly 40% in DeeBERT [Xin et al. 2020]), but not always.
“Early exit” is terminology that refers specifically to an inference optimization. The general idea of stopping early was applied to training many years earlier. The terms “dropout” and “early stopping” have also occasionally been used to mean inference early exit, but they usually refer to training optimizations with the analogous goal of reducing training time.
Early exit is effectively dynamic pruning of multiple layers of the model at runtime during inference. Exiting early means avoiding the calculations for every layer after the just-finished one, whereas “layer skipping” can skip a single layer and continue on the following one, and “layer reordering” is a stranger generalization where layers may be executed or skipped in any order.
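The distinction between early exit and layer skipping maps neatly onto break versus continue in the layer loop. Here is an illustrative contrast, where the two predicates are hypothetical stand-ins, not an API from the book:

```cpp
#include <functional>

// Counts how many layers actually execute under the two policies.
int run_layers(int num_layers,
               const std::function<bool(int)>& should_exit,
               const std::function<bool(int)>& should_skip) {
    int executed = 0;
    for (int layer = 0; layer < num_layers; ++layer) {
        if (should_exit(layer)) break;     // early exit: abandon ALL remaining layers
        if (should_skip(layer)) continue;  // layer skipping: bypass this one layer only
        ++executed;                        // stand-in for state = apply_layer(state, layer)
    }
    return executed;
}
```

In a real engine, should_exit would test confidence in the interim results, while should_skip might be a per-layer decision learned or profiled offline.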
Why does this even work? Mechanically, it works because the layer inputs and outputs have the same format. But why is it accurate? Early exit relies on the assumption that each successive layer refines the results, with diminishing changes. Some research supports this idea, showing that the early layers do most of the work of prioritizing output tokens, while the last few layers tend to “finesse” the choice between multiple reasonable options. After a few layers, the results are often “good enough” to decide on the output token, or at least a reasonably good choice, without finishing all the layers.
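One common way to quantify “good enough” is the entropy of the interim probability distribution: a concentrated distribution has low entropy, so further layers are unlikely to change the chosen token. This heuristic is an assumption here, not one prescribed by the text:

```cpp
#include <cmath>
#include <vector>

// Shannon entropy of a probability distribution, in nats (natural log).
float entropy(const std::vector<float>& probs) {
    float h = 0.0f;
    for (float p : probs)
        if (p > 0.0f) h -= p * std::log(p);
    return h;
}

// Exit criterion: the distribution is concentrated enough on one token.
bool confident_enough(const std::vector<float>& probs, float max_entropy) {
    return entropy(probs) <= max_entropy;
}
```

A one-hot distribution has entropy 0, while a uniform distribution over N tokens has the maximum entropy of ln(N), so the threshold is naturally bounded by the vocabulary size.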