Aussie AI
Transformer Architecture Choices
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Transformer Architecture Choices
Various architectural decisions are made in the model design phase that aren't really optimizations of a model, but that can significantly impact its efficiency. Using a more advanced engine architecture is also effectively an optimization that “retains” accuracy, because these changes allow the model to be fully trained in a better engine. Some important decisions include:
- Decoder-only versus encoder-decoder architectures
- Alternative floating-point representations (e.g., brain float; see the conversion sketch after this list)
- Pre-norm versus post-norm
- Positional encoding algorithms (embeddings)
- Context length optimizations
- Neural Architecture Search (NAS)
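As an illustration of the floating-point representation choice, here is a minimal sketch of converting between 32-bit float and brain float (bfloat16) by keeping only the upper 16 bits, which preserves float32's 8-bit exponent range at reduced mantissa precision. This assumes a plain uint16_t holds the bfloat16 bit pattern, and the function names are illustrative, not a library API; simple truncation is shown, whereas production code usually rounds.

    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    // Illustrative bfloat16 storage type (not a standard library type).
    typedef uint16_t bf16_t;

    bf16_t float_to_bf16(float f) {
        uint32_t bits;
        std::memcpy(&bits, &f, sizeof(bits));  // type-pun safely via memcpy
        return (bf16_t)(bits >> 16);           // keep sign, exponent, top 7 mantissa bits
    }

    float bf16_to_float(bf16_t h) {
        uint32_t bits = ((uint32_t)h) << 16;   // re-expand with zeroed low mantissa bits
        float f;
        std::memcpy(&f, &bits, sizeof(f));
        return f;
    }

    int main() {
        float x = 3.14159f;
        bf16_t b = float_to_bf16(x);
        printf("original=%f  bf16 round-trip=%f\n", x, bf16_to_float(b));
        return 0;
    }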
Data doesn't just magically end up in the GPU. Software has to be written to send the data there, and there are many possible optimizations used in writing such software. This software is often called the “kernel” of the AI engine, and the sub-components of the engine are often named accordingly: the MatMul kernel, the Softmax kernel, the normalization kernel, and so on. Software techniques that optimize parallelization, primarily by increasing throughput and reducing latency, include:
- Vectorization
- Multi-threading
- Kernel fusion (see the fused-loop sketch after this list)
- Kernel fission
- Pipelining
- Scheduling algorithms
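A minimal sketch of kernel fusion: a bias-add pass and a ReLU pass are merged into one loop, so the vector is read and written once instead of twice. The function names are illustrative, not from any particular engine.

    #include <vector>

    // Unfused version: two separate passes (two "kernels") over the data.
    void add_bias(std::vector<float>& v, const std::vector<float>& bias) {
        for (size_t i = 0; i < v.size(); ++i) v[i] += bias[i];
    }
    void relu(std::vector<float>& v) {
        for (size_t i = 0; i < v.size(); ++i) if (v[i] < 0.0f) v[i] = 0.0f;
    }

    // Fused version: one pass does both operations, halving the memory
    // traffic and loop overhead of the two-pass version.
    void add_bias_relu_fused(std::vector<float>& v, const std::vector<float>& bias) {
        for (size_t i = 0; i < v.size(); ++i) {
            float x = v[i] + bias[i];
            v[i] = (x < 0.0f) ? 0.0f : x;
        }
    }

The same idea extends to fusing larger kernels, such as a MatMul with the activation or normalization kernel that follows it.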
Memory usage optimizations: Software optimizations that aim to improve memory usage, and thereby gain further parallelism by lowering memory access overhead, include:
- Tiling (see the tiled matrix multiply sketch after this list)
- Data locality optimizations
- Dataflow optimizations
- Memory management optimizations
- Cache management
- Prefetching
- Offloading
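A minimal sketch of loop tiling for a square matrix multiply in row-major order, which also illustrates data locality: each TILE-by-TILE block of the operands stays hot in cache while it is reused. The tile size of 32 is an illustrative assumption to be tuned per platform, and a real kernel would also vectorize the inner loop.

    #include <algorithm>
    #include <cstddef>

    // Tile size is an illustrative choice; tune per cache size.
    const size_t TILE = 32;

    // Computes C = A * B for n x n row-major matrices, processing
    // TILE x TILE blocks to keep the working set in cache.
    void matmul_tiled(const float* A, const float* B, float* C, size_t n) {
        for (size_t i = 0; i < n * n; ++i) C[i] = 0.0f;
        for (size_t ii = 0; ii < n; ii += TILE) {
            for (size_t kk = 0; kk < n; kk += TILE) {
                for (size_t jj = 0; jj < n; jj += TILE) {
                    size_t imax = std::min(ii + TILE, n);
                    size_t kmax = std::min(kk + TILE, n);
                    size_t jmax = std::min(jj + TILE, n);
                    for (size_t i = ii; i < imax; ++i) {
                        for (size_t k = kk; k < kmax; ++k) {
                            float a = A[i * n + k];
                            for (size_t j = jj; j < jmax; ++j) {
                                C[i * n + j] += a * B[k * n + j];
                            }
                        }
                    }
                }
            }
        }
    }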