Aussie AI
Norm Pruning
-
Book Excerpt from "Generative AI in C++"
-
by David Spuler, Ph.D.
Norm Pruning
Some research has suggested that the normalization layer in Transformers can be removed without a major loss in model accuracy, which has been called “norm pruning.” If possible, nothing could be faster than this!
The concern with this approach is two-fold. Firstly, whether the absence of normalization will give rise to outliers that distort the accuracy of the output. Secondly, the practical matter of ensuring the computations don't cause a floating-point overflow into +Inf, -Inf, or NaN.
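These overflow hazards are concrete and easy to trigger. The sketch below (the function names are mine, not code from the book) shows how unchecked float arithmetic overflows to +Inf and then degrades to NaN, which poisons all subsequent arithmetic:

```cpp
#include <cmath>

// Detect the floating-point oddities that unnormalized
// activations can produce: infinities and NaNs.
inline bool is_bad_float(float x) {
    return std::isinf(x) || std::isnan(x);
}

// IEEE 754 float overflows past FLT_MAX (~3.4e38) to +Inf,
// and Inf - Inf produces NaN.
inline bool overflow_demo() {
    float big = 1e30f;
    float prod = big * big;      // 1e60 > FLT_MAX, so prod is +Inf
    float poison = prod - prod;  // Inf - Inf yields NaN
    return is_bad_float(prod) && is_bad_float(poison);
}
```

Once a NaN appears in an activation vector, it propagates through every later layer, which is why unnormalized pipelines need some kind of guard.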
We could add some operations to ensure we don't have outliers and avoid those nasty floating-point oddities, but, umm, that's actually normalization, so we're just adding it back in again! Maybe it's faster to do a type of "mini-normalization" that fixes up some of these issues without fully normalizing every value.
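One hypothetical form of such a "mini-normalization" is a single clamping pass: scrub NaNs and bound each activation, with none of the reductions (mean, variance) or divisions of a full LayerNorm. This is a sketch of the idea only; the function name and the clamp threshold are my assumptions, not the book's code:

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical "mini-normalization": one cheap pass that replaces
// NaNs and clamps outliers into a safe range. Unlike full LayerNorm,
// there are no reductions or divisions, so it is much cheaper, but
// it does not re-center or re-scale the distribution.
inline void mini_normalize(float* vec, int n, float max_abs = 1e4f) {
    for (int i = 0; i < n; ++i) {
        float x = vec[i];
        if (std::isnan(x)) x = 0.0f;                // scrub NaN to zero
        vec[i] = std::clamp(x, -max_abs, max_abs);  // bound outliers
    }
}
```

The trade-off is clear: one linear pass with no divisions versus a full LayerNorm's two reduction passes, but without any statistical re-scaling of the values.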
It's a little unclear, since these “norm pruning” approaches are only seen in research papers so far.
Even if we cannot remove them all, how often we need to normalize is an important design decision. It's not a cheap operation, so we shouldn't re-normalize after every Transformer component. However, the typical Transformer architectures tend to use the normalization blocks quite heavily, in one way or another.
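For context, a common pre-norm Transformer layer applies normalization twice per layer, once before attention and once before the FFN, rather than after every component. The sketch below only illustrates that placement; the sub-blocks are placeholder callables, not real attention or FFN code:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<float>;
using Block = std::function<Vec(const Vec&)>;

// Structural sketch of a pre-norm Transformer layer: two norm calls
// per layer, each feeding a sub-block whose output is added back
// via a residual connection. Not a real implementation.
Vec pre_norm_layer(const Vec& x, Block norm, Block attention, Block ffn) {
    Vec h = x;
    Vec a = attention(norm(h));                          // norm #1: before attention
    for (std::size_t i = 0; i < h.size(); ++i) h[i] += a[i];  // residual add
    Vec f = ffn(norm(h));                                // norm #2: before FFN
    for (std::size_t i = 0; i < h.size(); ++i) h[i] += f[i];  // residual add
    return h;
}
```

Counting the norm calls this way makes the cost explicit: an N-layer model pays for 2N normalization passes, which is exactly the budget that norm pruning tries to cut down.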