Aussie AI
Pre-Norm vs Post-Norm
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Pre-Norm vs Post-Norm
The original 2017 vanilla Transformer architecture (Vaswani et al., 2017) used a “post-norm” architecture, where LayerNorm was applied to the output of each sublayer, after the residual connection. Subsequently, researchers found that switching to a “pre-norm” architecture, which normalizes the inputs to each sublayer instead, could fix one of the major problems with the original Transformer: it was unstable early in training and required a learning-rate “warmup” phase. Pre-norm was found to stabilize early training and remove the need for any special warmup handling.
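As a rough illustration, the two placements differ only in where the normalization call sits relative to the sublayer computation and the residual connection. The sketch below uses simplified stand-in functions (layer_norm, sublayer, add are hypothetical helpers, not any particular library's API) just to show the ordering:

// Minimal sketch (not production code) contrasting pre-norm and post-norm.
#include <cmath>
#include <vector>

using Vec = std::vector<float>;

// Simplified LayerNorm: zero mean, unit variance (no learned gain/bias).
Vec layer_norm(const Vec& x) {
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= x.size();
    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= x.size();
    const float eps = 1e-5f;
    Vec out(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        out[i] = (x[i] - mean) / std::sqrt(var + eps);
    return out;
}

// Placeholder for the real attention or FFN sublayer computation.
Vec sublayer(const Vec& x) { return x; }  // identity stand-in

// Element-wise residual addition.
Vec add(const Vec& a, const Vec& b) {
    Vec out(a.size());
    for (size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Post-norm (original 2017 Transformer): normalize AFTER the residual add.
Vec post_norm_block(const Vec& x) {
    return layer_norm(add(x, sublayer(x)));
}

// Pre-norm: normalize the sublayer INPUT; the residual bypasses the norm.
Vec pre_norm_block(const Vec& x) {
    return add(x, sublayer(layer_norm(x)));
}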
Since then, several researchers have explored where to place the layer normalization submodule. The general consensus seems to be that placing it before the computations (“pre-norm”) is better than after them (“post-norm”). However, research papers still go both ways, so there is room for more definitive results.
The pre-norm versus post-norm architectural decision is more of an issue for model accuracy and training warmup than a computation cost decision. There is little execution-speed benefit from simply relocating the normalization; for speed, we would rather use fewer normalization blocks overall than just rearrange them. In fact, pre-norm may be slower than post-norm, because it moves the normalization from the outputs of each sublayer to the inputs, but then also requires an extra normalization at the end of the layer stack.
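To see where the extra normalization comes from, here is a sketch of stacking the blocks from the previous example (reusing those hypothetical helpers): the last post-norm block already emits a normalized output, whereas a pre-norm stack hands un-normalized residual-stream values to the output head, so one more normalization is typically applied at the end.

// Sketch of a stack of layers, reusing the block functions above.
// With pre-norm, a final normalization is usually applied before the
// output head, so the norm count is N+1 rather than N.
Vec run_post_norm(Vec x, int num_layers) {
    for (int i = 0; i < num_layers; ++i)
        x = post_norm_block(x);        // N normalizations total
    return x;                          // already normalized by the last block
}

Vec run_pre_norm(Vec x, int num_layers) {
    for (int i = 0; i < num_layers; ++i)
        x = pre_norm_block(x);         // N normalizations so far
    return layer_norm(x);              // extra final norm: N+1 total
}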
Note that we cannot mix these approaches: using pre-norm for more stable training and then switching to post-norm for inference just won't work. Like many other Transformer architectural decisions, the same normalization architecture must be retained across training and inference, unless you'd rather have gibberish for output.