Aussie AI
Table of Contents for Generative AI in C++
by David Spuler, Ph.D.
Table of Contents
Here is the Table of Contents for Generative AI in C++ by David Spuler.
Generative AI in C++: Coding Transformers and LLMs
Part I: AI Projects in C++
Chapter 1. Introduction to AI in C++ (Full Chapter)
- Everything's Bigger in AI
- What is AI?
- The State of AI
- The Market for AI
- AI Technology Trends
- Why AI and C++?
Chapter 2. Transformers & LLMs (Full Chapter)
- AI Engines and Models
- Training and Fine-tuning
- Inference
- Context and Conversations
- Extended Transformers
- Other Types of Neural Networks
Chapter 3. AI Phones (Full Chapter)
- Native Smartphone AI
- Obstacles to Smartphone AI
- Near-Term Technology Trends
- Speeding Up Smartphone AI
- AI Phone Apps
- Research on AI Phones
Chapter 4. AI on Your Desktop (Full Chapter)
- Your Desktop AI Engine
- Open Source C++ Transformer Engines
- Open Source Models
- AI PCs
- New C++ Language Features
- C++ Coding Strategy for AI
- Elements of AI in C++
- Advanced AI in C++
- Downsides of AI in C++
Chapter 5. Design Choices & Architectures (Full Chapter)
- Choosing Your AI Project
- Planning and Requirements
- Top 10 Really Big Optimizations
- Build versus Buy
- Foundation Model Choices
- Open Source Models
- Commercial-Usage Open Source Models
- Model Size
- Software Architecture
- AI Tech Stack
- Financial Optimizations
Chapter 6. Training, Fine-Tuning & RAG (Full Chapter)
- Training Options
- Fine-Tuning
- Training Data for Fine-tuning
- Retrieval-Augmented Generation (RAG)
- RAG Project Design
- RAG Detailed Algorithm
- RAG Data Management
- Fine-Tuning vs RAG
- Prompt Engineering and RAG
- Hybrid RAG + Fine-tuning Methods
- Use Cases for FT vs RAG
- Training FAQs
- Vector Databases
Chapter 7. Deployment Architecture (Full Chapter)
- Backend Server Architecture
- AI Server Hosting Options
- GPU Specs
- Online Architecture Optimization
- API Wrapper Architecture Optimizations
- Hosting Server Specs
- Request Queue Architecture
- Load Balancing
- Prompt History and Context
Part II: Basic C++ Optimizations
Chapter 8. Bitwise Operations (Full Chapter)
- C++ Bitwise Operators
- Bit Flag Basics
- Bit Sets
- Bitwise Intrinsic Functions
- Example: Integer Popcount
- Example: Bitwise Log2 on Integers
- Example: Highest Integer Power-of-Two
- Integer Overflow and Underflow
- Missing Bitwise Operators: NAND, NOR, XNOR
- Bitwise AI Applications
Chapter 9. Floating Point Arithmetic (Full Chapter)
- Floating Point Introduction
- Bit Representations of Floating-Point Numbers
- Standardized Bit Representations
- Representing Zero
- Underflow and Overflow
- Representing Special Numbers
- Getting to the Bits in C++
- FP16 Problems in C++
- FTZ and DAZ CPU Modes
- Negative Zero
- Floating-Point Builtin Functions
- Pitfalls and Portability
- Floating-Point Bit Tricks for AI
- Example: Add-as-int Approximate Multiply
- Example: Float Bitshift via Integer Addition
- Example: Log2 of Floating-Point is the Exponent
Chapter 10. Arithmetic Optimizations (Full Chapter)
- Arithmetic Optimizations
- Operator Strength Reduction
- Avoid Remainder Operations
- Reciprocal Multiplication
- Integer Arithmetic
- Expression Transformations
- Common Subexpression Elimination
- Algebraic Identities
- Float Family Loyalty
Chapter 11. Compile-Time Optimizations (Full Chapter)
- AI Models are Static
- C++ Compile-time Techniques
- C++ Optimizers
- Floating-Point Optimizer Options
- Compiler-Automated Optimizations
- People Helping Parsers
- Inline Functions
- inline function limitations
- Non-inlined functions
- Inline Variables
- Constant Specifiers
- Constant Expressions Specifier
- consteval functions
- constexpr functions
- constexpr functions vs inline functions
- constinit variables
- if constexpr statements
- Templates
- Next level templating
Chapter 12. Pointer Arithmetic (Full Chapter)
- What is Pointer Arithmetic?
- Pointers and Arrays
- Pointer Arithmetic Loop Optimizations
- Smart Pointers
- Pointers vs References
Chapter 13. Algorithm Speedups (Full Chapter)
- Algorithm Optimization Techniques
- Lookup Table Precomputation
- Lazy Evaluation
- Source Code Precomputation
- Augmenting Data Structures
- Approximate Tests
- Common Case First
- Simple Case First
- Special Solution of Simple cases
- Incremental Algorithms
Chapter 14. Memory Optimizations (Full Chapter)
- Memory Reduction in C++
- Memory Reductions
- Compact Data Representation
- Reducing Data Size
- Reducing Static Storage
- Measuring Code Size and Static Storage
- Code Bloat
- Stack Usage
- Reducing Heap Usage
Part III: Parallel C++ Optimizations
Chapter 15. Loop Vectorization (Full Chapter)
- Sequential vs Parallel Loop Optimizations
- Loop Fusion
- Loop Perforation
- Loop Unrolling
- Duff's Device for Loop Unrolling
- Loop Tiling or Blocking
- Loop Fission
- Loop Reversal
- Loop Code Motion
- Loop Distribution
- Loop Reordering
- Loop Iterator Strength Reduction
- Loop Coalescing
- Loop Collapsing
- Loop Peeling
- Loop Splitting
- Loop Interchange
- Loop Sentinel
- Loop Strip Mining (Loop Sectioning)
- Loop Spreading
- Loop Normalization
- Loop Skewing
Chapter 16. Hardware Acceleration (Full Chapter)
- Why Hardware Acceleration?
- Types of Hardware Acceleration
- CPU Hardware Acceleration
- Detecting CPU Acceleration in C++
- GPU Hardware Acceleration
- Detecting GPU Support in C++
- Assembly Language versus Intrinsics
- Inline Assembly Language
Chapter 17. AVX Intrinsics (Full Chapter)
- What are AVX Intrinsics?
- AVX Operations
- AVX Horizontal Intrinsics
- Portability Checking of AVX Versions
- Example: Basic AVX SIMD Multiply
- AVX Memory Alignment Issues
- AVX-2 SIMD Multiplication
- AVX-512 SIMD Multiplication
- Example: AVX 128-Bit Dot Product
- Example: AVX-2 256-Bit Dot Product
Chapter 18. Parallel Data Structures (Full Chapter)
- Data Structures in AI Engines
- Bit Vectors
- Permutation Arrays
- Vector Hashing
- Perfect Hashing
- Bloom Filters
Part IV: Transformer Components in C++
Chapter 19. Encoders & Decoders (Full Chapter)
- What are Encoders and Decoders?
- Transformer Layers and Components
- FAQs on Transformer Architecture
- Advances in Transformer Architectures
- Model Loader
Chapter 20. Attention (Full Chapter)
- What is Attention?
- What are Q, K, and V?
- What is Cross Attention?
- Masking and Lookahead
- Multi-Head Attention
- Positional Encoding
- Softmax Normalization
- Efficient Attention Algorithms
- Alternatives to Attention
- Attention Head Approximation
- Attention Head Pruning
- Long Context Research
- Length Generalization
Chapter 21. Activation Functions (Full Chapter)
- What is an Activation Function?
- Common Activation Functions
- Optimization of Activation Functions
- Learned Activation Parameters
- Inputs, Outputs and Dimensions
- RELU Activation Function
- RELU AVX SIMD Vectorization
- GELU Activation Function
- GELU AVX SIMD Vectorization
- SiLU/Sigmoid Activation Function
- SiLU AVX SIMD Vectorization
- SwiGLU/Swish Activation Function
- ELU Activation Function
- Precomputed Lookup Tables
- Load-Time Precompilation
- Approximating Activation Functions
- Activation Function Research
Chapter 22. Vector Algorithms (Full Chapter)
Chapter 23. Tensors (Full Chapter)
- What are Tensors?
- AI Tensor Computations
- Neural Network Theory and Tensors
- Tensor Arithmetic
- Unary Tensor Operations
- Binary Elementwise Tensor Operations
- Sparse Tensors
Chapter 24. Normalization (Full Chapter)
- What is Normalization?
- Why is Normalization Needed?
- Inputs, Outputs and Dimensions
- Optimizing Normalization
- Norm Pruning
- Pre-Norm vs Post-Norm
- Basic Min-Max Scaled Normalization
- Root Mean Square Normalization
- Z-score Normalization
- Batch Normalization
- Example: BatchNorm Optimization
- Layer Normalization
Chapter 25. Softmax (Full Chapter)
- What is Softmax?
- Inputs, Outputs and Dimensions
- Softmax and Temperature
- Softmax C++ Optimizations
- Vectorized Softmax
- Vectorized Softmax with AVX
- Vectorized & Fused Loop Softmax
- Softmax Benchmarking Results
- Softmax Overflow and Underflow
- Softmax Optimization Research
Chapter 26. Decoding Algorithms (Full Chapter)
- What is Decoding?
- Greedy Decoding
- Top-k Decoding
- Top-k Vector Algorithm
- Optimizing Top-k Decoding
- Top-p Decoding
- Advanced Decoding Algorithms
- Tokens and Non-Autoregression
Chapter 27. Tokenizer and Vocabulary (Full Chapter)
- What is Tokenization?
- Tokenization and Inference Latency
- Tokenizer Design Issues
- Tokenizer Algorithms
- Tokenizer Optimizations
- Untokenization
- What are Embeddings?
- Embedding Optimizations
- Positional Encoding
Part V: Optimizing Transformers in C++
Chapter 28. Deslugging AI Engines (Full Chapter)
- Everything's Slower in AI
- Accuracy-Degrading Optimizations
- Accuracy-Retaining Optimizations
- Transformer Architecture Choices
- Hybrid AI Engine Optimizations
- Model Compression
- ML Compilers
- AI Memory Reduction
Chapter 29. Caching Optimizations (Full Chapter)
- What is Caching?
- KV Caching
- Global KV Prefill/Encoder Caching
- Inference Cache
- Semantic Caching and Vector Databases
- Cached or Precomputed Transpose
- Vector Dot Product Computation Reuse
- Input Similarity-Based Caching
Chapter 30. Vectorization (Full Chapter)
- What is Vectorization?
- Vectorization with AVX Intrinsics
- Example: AVX Vectorized Dot Product
- Example: AVX Vector Sum Reduction
- AVX Vector Max and Min Reductions
- Vectorized Sum-of-Squares Reduction
- Vectorized Multiply Vector by Scalar
- Vectorized Add Scalar
- Vectorized RELU with Max Intrinsics
- Vectorization of Exponentiation
- Vectorization of Lookup Tables
- Auto-Vectorization and Restricted Pointers
Chapter 31. Kernel Fusion (Full Chapter)
- What is Kernel Fusion?
- Faster Together
- Kernel Fission
- Transformer Component Fusion
- Example: Fused VMM-add-bias
- Fused Activation Functions
- Kernel Fusion of Vector Min and Max
Chapter 32. Quantization (Full Chapter)
- What is Quantization?
- Types of Quantization
- Floating-Point Quantization
- Integer Quantization
- Integer-Only-Arithmetic Quantization
- Uncommon Quantization Types
Chapter 33. Pruning (Full Chapter)
- What is Model Pruning?
- Pros and Cons of Pruning
- Unstructured and Structured Pruning
- Types of Unstructured Pruning
- Magnitude Pruning
- Movement Pruning
- First-Order and Second-Order Pruning
- Making Magnitude Pruning Effective
- Static vs Dynamic Pruning
Chapter 34. MatMul/GEMM (Full Chapter)
- What is MatMul?
- Matrix-Vector Multiplication
- Optimizing Matrix-Vector Multiplication
- Spot the Buggy MatMul
- Rectangular Matrix-Vector Multiplication
- Tiled Matrix-Vector Multiplication
Chapter 35. Lookup Tables & Precomputation (Full Chapter)
- Precomputation with Lookup Tables
- Example: LUT Precomputation for sqrt
- Float-to-Float Precomputation
- Precalculating C++ Source Files
Chapter 36. AI Memory Optimizations (Full Chapter)
- Why Optimize Memory?
- Elements of Memory Optimization
- Contiguous Memory Blocks
- Fast Memory Block Operations
- Fast memory block copying with memcpy
- Initialize memory blocks with memset
- memcmp byte comparisons
- Linearized Multi-Dimensional Arrays
- Model Compression
- GPU Memory Management
- Transformer Component Memory Optimization
Part VI: Enterprise AI in C++
Chapter 37. Tuning, Profiling & Benchmarking (Full Chapter)
- Tuning an AI Engine
- Performance Tuning Practices
- Tuning Trade-offs
- Profiling and Benchmarking
- Timing C++ Code
- Benchmarking Methods
- Linux C++ Profilers
- The pixie utility
- The prof utility
- Examining Assembly Output
- Examining Object Files
- Reducing Build Time
Chapter 38. Platform Portability (Full Chapter)
- AI Engine Portability
- Basics of Portable Coding
- GPU Portability
- Putting Portability into Supportability
- Testing C++ Code Portability
- Compilation Problems
- Runtime Portability Glitches
- Code Portability Pitfalls
- Data Type Sizes
- Pointers versus Integer Sizes
Chapter 39. Quality (Full Chapter)
- AI Quality
- What is Software Quality?
- Advanced Software Quality
- Sellability
- Software Engineering Methodologies
- Software Engineering Process Group
- Coding Standards
- Project Estimation
- Code Quality
- Extensibility
- Supportability
- Scalability
- Reusability
Chapter 40. Reliability (Full Chapter)
- AI Engine Reliability
- Code Reliability
- Static Analysis Tools (Linters)
- Building More Bugs
- Warning-Free Compilation
- Refactoring versus Rewriting
- Defensive Programming
- Maintainability
- Technical Debt
Chapter 41. Self-Testing Code (Full Chapter)
- Self Testing Code Introduction
- AI Engine Automated Testing
- Test Coverage
- GPROF Test Coverage Script
- Assertions
- Assertion Failure Extra Message
- Assertions for Function Parameter Validation
- Assertless Production Code
- Assert Parameter and Return
- Assertion Return Value Usage
- BYO assertion macros
- static_assert
- Unreachable code assertion
- Once-only execution assertion
- Detecting Spinning Loops
- Variadic Macro Assertions
- Function Call Counting
- Generalized Assertions
- Generalized Variable-Value Assertions
- Laziness and Assertion Macros
- Next-Level Assertion Extensions
- Debug Wrapper Functions
- Compile-time self-testing macro wrappers
- Example: memset Wrapper Self-Checks
- Generalized Self-Testing Debug Wrappers
- Self-Testing Code Block
- Dynamic Self-Testing Code
- Self-test Code Block Macro
- Self-Test Block Macro with Debug Flags
Chapter 42. Debugging (Full Chapter)
- AI Engine Debugging
- Debugging Techniques
- Interactive Debuggers
- Random Number Seeds
- Don’t Blame the Compiler
- Debug Stacktrace
- Error Logging
- Debug Tracing Messages
- Dynamic Debug Tracing Flag
- Multiple Levels of Debug Tracing
- Advanced Debug Tracing
- Multi-Statement Debug Trace Macro
- Variable-Argument Debug Macros
- Valgrind Limitation Workarounds
- Making the Correction
Part VII: Research on AI Optimization
Chapter 43. Overview of AI Research (Full Chapter)
- Smarter AI Research
- Safer AI Research
- Faster AI Research
- Commercialized SOTA Research
- Inference Optimization
- Model Compression
- Dynamic Inference Optimizations
- Uncommon Optimization Techniques
- Beyond Transformers
- Research Topic Ideas
Chapter 44. Advanced Quantization (Full Chapter)
- Binary Quantization
- Ternary Quantization
- 2-Bit Quantization (INT2)
- 3-Bit Quantization (INT3)
- 4-Bit Quantization (INT4)
- 5-Bit Quantization (INT5)
- 6-Bit Quantization (INT6)
- 7-Bit Quantization (INT7)
- 8-Bit Integer Quantization (INT8)
- 9-Bit Quantization (INT9)
- 10-Bit Quantization (INT10)
- 11-Bit Quantization (INT11)
- 12-Bit Quantization (INT12)
- Mixed-Precision Quantization
- Bitshift Quantization (Power-of-Two)
- Sum of Two Bitshifts Quantization
- Arbitrary Base Logarithmic Quantization
- Integer Division Quantization
- Dyadic Quantization
- Stochastic Quantization
- Weight Clustering
Chapter 45. Knowledge Distillation (Full Chapter)
- What is Knowledge Distillation?
- Research on Knowledge Distillation
- Multi-Teacher Knowledge Distillation
- Dataset Distillation
Chapter 46. Structured Pruning (Full Chapter)
- What is Structured Pruning?
- Why Structured Pruning?
- Types of Structured Pruning
- Dynamic Structured Pruning
- Triple Axis Pruning
- Vector-Level Pruning
- Parameter and Weight Sharing
Chapter 47. Early Exit and Layer Pruning (Full Chapter)
- What is Depth Pruning?
- Layer Pruning
- Static Layer Pruning
- Early Exit of Inference Layers
- Types of Early Exit
- Early Exit Research
- Layer Skipping
- Layer Fusion
- Layer Reordering
- Shallow Decoder Transformer Architecture
Chapter 48. Width Pruning (Full Chapter)
- What is Width Pruning?
- Attention Head Pruning
- Slimmable Neural Networks
- Filter Pruning
- Channel Pruning
Chapter 49. Length Pruning (Full Chapter)
- What is Length Pruning?
- Research on Length Pruning
- Token Pruning
- Dynamic Token Pruning
- Embeddings Matrix Pruning
- Embedding Size Optimization with NAS
Chapter 50. Adaptive Inference (Full Chapter)
- What is Adaptive Inference?
- Types of Adaptive Inference
- Easy vs Hard Queries
- Zero Skipping
- Negative Skipping
- Zero Padding Removal
- Weight Precomputations
Chapter 51. Zero-Multiplication Models (Full Chapter)
- What are Zero-Multiplication Models?
- Low Bit Quantization
- Adder Neural Networks
- Approximate Multiplication
- Shift-Add Networks
- Add-as-Integer Networks
- Max-Plus and Tropical Algebra
- Morphological Networks
- Other Addition Networks
- Table Lookups Replace Multiplication
- Difference-Squared Networks
- Bitwise Operators for Inference
Chapter 52. Logarithmic Models (Full Chapter)
- What is a Logarithmic Model?
- Introduction to Logarithmic Models
- End-to-End Logarithmic Models
- Obstacles to Stardom
- LNS Applications
- LNS Addition
- LNS Hardware Acceleration
- LNS Mathematical and Algorithmic Theory
- Logarithmic Algebra
- LNS Extensions
Chapter 53. Arithmetic Optimization Research (Full Chapter)
- Overview of Arithmetic Optimizations
- Multiplication Optimizations
- Approximate Multiplication
- Logarithmic Approximate Multiplication
- Addition Optimizations
- Division
- End-to-End Integer Arithmetic
Chapter 54. Ensemble Multi-Model Architectures (Full Chapter)
- What are Ensemble Architectures?
- Types of Ensemble Algorithms
- Model Selection Algorithms
- Mixture of Experts (MoE)
- Big-Little Transformer Models
- Cascades
- Collaborative Inference
- Consensus Decoding
- Multi-Model Deployment
Chapter 55. Advanced Number Systems (Full Chapter)
- Advanced Number Systems Introduction
- Advanced Numeric Bit Representations
- Dyadic Numbers
- Residue Number System
- Posit Number System
- Tropical Algebra (Max-Plus)
- Log-Sum-Exp Networks
- MiniMax Algebra
- Trigonometric Approximations
- Double-Base Number System (DBNS)
- Hybrid Number Systems
Chapter 56. Neural Architecture Search (Full Chapter)
- What is NAS?
- Neural Architecture Search
- NAS Versus Model Compression
- Dynamic NAS
- NAS Research Papers
Appendices
Appendix 1: C++ Slug Catalog (Full Chapter)
- Slug Hunting Advice
- C++ Class Slugs
- Bypass interfaces with friend functions
- Avoid Function Pointers
- Avoid unnecessary virtual function calls
- Assignment Operator Return Type
- Singleton Classes
- Temporary Objects and Destruction
- Overloaded Postfix Increment Operator
- Standard Vector Object Resizing
- Skipping Destructor Cleanup
- Specialize inherited member functions
- Initializer lists for member objects
- Initializer lists for base objects
- Avoid temporary objects
- Avoid temporaries via extra member functions
- Declare objects close to use
- Declare Objects with Full Initialization
- Data Member Optimizations
- Function Slugs
- Medium-Sized Slugs
- More Slug Repellent
Bonus Appendix: C++ Bug Catalog
- Lexical bugs
- Expression bugs
- Switch statement bugs
- Control flow bugs
- Variables bugs
- Type bugs
- Preprocessor bugs
- Function call bugs
- Recursion bugs
- C++ class bugs
- C++ library bugs
- Pointer memory bugs
Bonus Appendix: C++ Bug Symptom Diagnosis
Bonus Appendix: C++ Portability Bug Catalog
More about the Book
For general information about Generative AI in C++ see also:
The new AI programming book by Aussie AI co-founders:
Get your copy from Amazon: Generative AI in C++