Aussie AI
Research Survey for Generative AI in C++
Last Updated 3rd August, 2024
by David Spuler, Ph.D.
The Aussie AI project includes a full literature survey of AI optimization.
Chapter-by-Chapter Research Citations
Here is a detailed list of the related research coverage for each chapter of Generative AI in C++.
The new AI programming book by the Aussie AI co-founders:
Get your copy from Amazon: Generative AI in C++
Book Information
For general information about Generative AI in C++ see also:
New Research Areas: new topics with recent research papers:
- On-device inference (native phone and PC AI)
- Generalized speculative decoding
- Consensus decoding
- KV Cache Compression/Quantization
- Prefill optimizations
- Fixed-point quantization (integer)
- Fixed-point arithmetic
- Block floating-point arithmetic
Hot Research Topics: Longstanding research areas with many recent additions:
Part I: AI Projects in C++
- Long list of AI optimizations
- AI Research Literature Survey
- Market Research
- GenAI market evolution
- AI phones
- AI PCs (desktops/laptops)
1. Introduction to AI in C++
- Market Research
- Transformer architectures (overview)
- AI phones
- AI PCs (desktops/laptops)
2. Transformers & LLMs
- Market Research
- Transformer architectures (overview)
- AI phones
- AI PCs (desktops/laptops)
3. AI Phones
4. AI on Your Desktop
- AI PCs (desktops/laptops)
- Market Research
- Edge device inference (mobile/PC)
- GenAI market evolution
5. Design Choices & Architectures
- Transformer architectures (overview)
- Market Research
- AI phones
- AI PCs (desktops/laptops)
6. Training, Fine-Tuning & RAG
7. Deployment Architecture
Part II: Basic C++ Optimizations
8. Bitwise Operations
9. Floating Point Arithmetic
10. Arithmetic Optimizations
- Reciprocal multiplication
- Constant folding
- Common subexpression elimination
- Strength reduction
- Bitwise operator inference
- Bitserial operations
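Two of the techniques listed above, reciprocal multiplication and strength reduction, can be shown in a minimal C++ sketch. These function names are illustrative, not from the book, and note that float reciprocal multiplication can differ from true division in the last bit.

```cpp
#include <cassert>
#include <vector>

// Reciprocal multiplication: replace N repeated float divisions
// with one division (the reciprocal) plus N multiplications.
void scale_divide(std::vector<float>& v, float divisor) {
    const float recip = 1.0f / divisor;   // one division
    for (float& x : v) x *= recip;        // multiplications instead of divisions
}

// Strength reduction: replace a multiply by a power of two
// with a cheaper bitshift (for unsigned integers).
unsigned times8(unsigned x) {
    return x << 3;   // equivalent to x * 8
}
```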
11. Compile-Time Optimizations
- Constant folding
- Common subexpression elimination
- Strength reduction
- Type consistency
- Reciprocal multiplication
12. Pointer Arithmetic
13. Algorithm Speedups
14. Memory Optimizations
Part III: Parallel C++ Optimizations
15. Loop Vectorization
- Loop optimizations (overview)
- Loop fusion (merging loops)
- Loop unrolling
- Loop perforation
- Loop reordering
- Loop tiling
- Loop reversal
- Loop fission (splitting a loop)
- Loop interchange
- Loop coalescing
- Loop-invariant code motion ("hoisting")
- Loop distribution
- Pointer arithmetic
- Loop peeling (unrolling first iterations)
- Loop splitting
- Loop sentinel
- Loop collapsing
- Loop normalization
- Loop strip mining (Loop sectioning)
- Loop skewing
- Loop spreading
- Parallelization
- Vectorization
- Kernel operator fusion (merging two operations)
- Kernel fission (splitting)
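As a small illustration of loop fusion from the list above: two separate passes over the same vector (a hypothetical add-bias loop and a RELU loop) merged into a single pass, improving cache locality. The example is a sketch, not code from the book.

```cpp
#include <cassert>
#include <vector>

// Loop fusion: the add-bias loop and the RELU clamping loop
// are merged into one pass over the data.
void bias_relu_fused(std::vector<float>& v, float bias) {
    for (float& x : v) {
        x += bias;                  // body of the first (bias) loop
        if (x < 0.0f) x = 0.0f;     // body of the second (RELU) loop, fused in
    }
}
```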
16. Hardware Acceleration
- Hardware acceleration
- Hardware-software co-design
- Parallelization
- Vectorization
- Pipelining
- Offloading
- AI phones
- AI PCs (desktops/laptops)
17. AVX Intrinsics
18. Parallel Data Structures
- Hashing
- Locality-Sensitive Hashing (LSH)
- Look-up tables (LUTs)
- Bloom filters
- Vector dot product caching
Part IV: Transformer Components in C++
- Transformer architectures (overview)
- Transformer low-level optimizations (overview)
- AI phones
- AI PCs (desktops/laptops)
19. Encoders & Decoders
20. Attention
- Attention optimizations (overview)
- Attention head pruning (width pruning)
- Approximate attention heads
- Attention alternatives/replacements
- Fused Multi-Head Attention (MHA)
- Non-autoregressive methods
21. Activation Functions
- Activation function optimizations
- Activation function approximation
- Integer-only activation functions
- Fused activation functions
- Fused RELU
- Fused GELU
- Fused SwiGLU
- Activation alternatives/replacements
- Activation function pruning/removal (bilinear layers)
- Activation function reordering
22. Vector Algorithms
23. Tensors
- Tensor decomposition
- Faster matrix multiplication (e.g. Winograd, Strassen)
- Approximate matrix multiplication
24. Normalization
- Norm optimizations
- Approximate normalization
- Norm reordering (pre-norm/post-norm)
- Integer-only normalization
- Normalization alternatives/replacements
- Fused normalization (e.g. "fused LayerNorm")
- Normalization pruning
25. Softmax
26. Decoding Algorithms
27. Tokenizer and Vocabulary
- Positional embeddings pruning
- Positional encoding optimization
- Pruning positional encoding (removal)
Part V: Optimizing Transformers in C++
- Long list of AI optimizations
- Model compression research
- Transformer architectures (overview)
- Transformer low-level optimizations (overview)
- End-to-End integer inference
- AI phones
- AI PCs (desktops/laptops)
28. Deslugging AI Engines
29. Caching Optimizations
- Inference Cache
- KV Caching
- Caching / memoization
- Computation reuse
- Precomputation
- Conditional computation
- Weight precomputations
- Approximate caching
- Vector dot product caching
- Look-up tables (LUTs)
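One caching idea from the list above, vector dot product caching, can be sketched as memoization keyed on (hypothetical) vector ids; the ids, struct name, and key packing are illustrative assumptions, not the book's design.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Memoization sketch: reuse a previously computed dot product when the
// same pair of vector ids is requested again.
struct DotCache {
    std::unordered_map<long long, float> cache;
    int hits = 0;   // count of cache reuses, for demonstration

    float dot(int id_a, const std::vector<float>& a,
              int id_b, const std::vector<float>& b) {
        long long key = (static_cast<long long>(id_a) << 32)
                      | static_cast<unsigned>(id_b);   // pack the id pair
        auto it = cache.find(key);
        if (it != cache.end()) { ++hits; return it->second; }
        float sum = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
        cache[key] = sum;
        return sum;
    }
};
```
A real engine would invalidate entries when a vector changes; this sketch assumes immutable vectors.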
30. Vectorization
- Vectorization
- Parallelization
- Pipelining
- Kernel operator fusion (merging two operations)
- Kernel fission (splitting)
31. Kernel Fusion
- Kernel operator fusion (merging two operations)
- Kernel fission (splitting)
- Loop fusion (merging loops)
- Loop fission (splitting a loop)
- Fused Multi-Head Attention (MHA)
- Fused activation functions
- Fused RELU
- Fused GELU
- Fused SwiGLU
- Fused normalization (e.g. "fused LayerNorm")
- Fused Softmax
- Fused multiply-add (FMA)
- Fused transpose
- Negative skipping
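To illustrate kernel fusion from the list above: a matrix-vector product with the RELU activation fused into the same kernel, so the output gets no second pass, using the fused multiply-add also listed. A minimal sketch with illustrative names, not the book's kernel.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Kernel fusion sketch: matrix-vector multiply with RELU applied
// inside the same loop nest, instead of a separate activation pass.
std::vector<float> matvec_relu(const std::vector<std::vector<float>>& W,
                               const std::vector<float>& x) {
    std::vector<float> y(W.size(), 0.0f);
    for (std::size_t i = 0; i < W.size(); ++i) {
        float sum = 0.0f;
        for (std::size_t j = 0; j < x.size(); ++j)
            sum = std::fma(W[i][j], x[j], sum);  // fused multiply-add (FMA)
        y[i] = sum > 0.0f ? sum : 0.0f;          // RELU fused into the kernel
    }
    return y;
}
```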
32. Quantization
33. Pruning
- Pruning research
- Model compression research
- Magnitude pruning
- Movement pruning
- Dynamic pruning
- Zero-skipping
- Sparsification techniques
34. MatMul/GEMM
- Faster matrix multiplication (e.g. Winograd, Strassen)
- Approximate matrix multiplication
- Transpose cache
- Fused multiply-add (FMA)
- Fused transpose
- Vector dot product optimization
- FFN pruning
- Fused add-bias
- Bias vector pruning
- Low-rank matrices
- Matrix Algebra (factorization)
- Butterfly matrices
- Monarch matrices
- Sparse matrices (sparsification)
35. Lookup Tables & Precomputation
- Look-up tables (LUTs)
- Caching / memoization
- Computation reuse
- Precomputation
- Conditional computation
- Weight precomputations
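A common form of the look-up-table precomputation listed above: build a 256-entry table once so that an activation over 8-bit quantized inputs becomes a single array read. The dequantization mapping here is an assumption for the sketch, not the book's layout.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Precomputation sketch: a 256-entry sigmoid LUT indexed by an
// unsigned 8-bit quantized input.
std::array<float, 256> build_sigmoid_lut() {
    std::array<float, 256> lut{};
    for (int i = 0; i < 256; ++i) {
        float x = (i - 128) / 16.0f;             // assumed dequantization map
        lut[i] = 1.0f / (1.0f + std::exp(-x));   // computed once, at build time
    }
    return lut;
}
```
At inference time, `lut[q]` replaces the per-element `exp` call entirely.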
36. AI Memory Optimizations
- Memory optimization techniques
- Parameter sharing
- Model compression
- Quantization
- Binary quantization
- Ternary quantization
- Layer fusion
- Sparsification techniques
Part VI: Enterprise AI in C++
37. Tuning, Profiling & Benchmarking
38. Platform Portability
39. Quality
40. Reliability
41. Self-Testing Code
42. Debugging
Part VII: Research on AI Optimization
43. Overview of AI Research
- Long list of AI optimizations
- Model compression research
- AI Research Literature Survey
- Market Research
44. Advanced Quantization
- Quantization research
- Model compression research
- Binary quantization
- Ternary quantization
- 2-bit quantization (INT2)
- 3-bit quantization (INT3)
- 4-bit quantization (INT4)
- 5-bit quantization (INT5)
- 6-bit quantization (INT6)
- 7-bit quantization (INT7)
- 8-bit quantization (INT8)
- Integer quantization
- Integer-only arithmetic quantization
- FP8 quantization
- Logarithmic power-of-two quantization (bitshift quantization)
- Double bitshift power-of-two quantization
- Division quantization
- Cluster-based quantization (Weight clustering)
- Dyadic quantization
- Fake quantization
- Simulated quantization
- Stochastic quantization (probabilistic)
- Weight clustering
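Of the many quantization variants above, symmetric 8-bit integer quantization is the simplest to sketch: map floats in [-max|w|, +max|w|] onto [-127, 127] with a single scale. The struct and function names are illustrative, not from the book.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric INT8 quantization sketch: dequantized value = q * scale.
struct Quantized {
    std::vector<int8_t> q;
    float scale;
};

Quantized quantize_int8(const std::vector<float>& w) {
    float maxabs = 0.0f;
    for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
    float scale = maxabs > 0.0f ? maxabs / 127.0f : 1.0f;  // avoid divide-by-zero
    Quantized out{ {}, scale };
    for (float x : w)
        out.q.push_back(static_cast<int8_t>(std::lround(x / scale)));
    return out;
}
```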
45. Knowledge Distillation
- Knowledge distillation research
- Ensemble Distillation
- Unnatural instructions (data sets)
- Dataset Distillation
- Model compression research
46. Structured Pruning
- Model compression research
- Pruning research
- Parameter sharing research
- Depth pruning (overview)
- Length pruning (overview)
- Width pruning (overview)
- Dual pruning
- Triple pruning
- Negative skipping
- Sparsification techniques
47. Early Exit and Layer Pruning
- Early exit (dynamic layer pruning)
- Layer pruning
- Depth pruning (overview)
- Layer skipping
- Shallow decoder architecture (layer pruning)
- Layer fusion
- Layer reordering
48. Width Pruning
- Width pruning (overview)
- Channel pruning
- Filter pruning
- Attention head pruning
- Slimmable networks
- Attention optimizations (overview)
49. Length Pruning
- Length pruning (overview)
- Token pruning (input pruning)
- Embeddings pruning
- Zero padding removal
- Context length optimizations
- Zero-padding avoidance
- Negative skipping
50. Adaptive Inference
- End-to-End integer inference
- Dynamic inference (adaptive inference)
- Skipping optimizations
51. Zero-Multiplication Models
- Zero-Multiplication Models (overview)
- Integer-only Transformers
- Binary quantization
- Ternary quantization
- 2-bit quantization (INT2)
- Adder networks
- Bitshift-add networks
- Bitshift power-of-2 quantization
- Double bitshift quantization
- Add-as-integer networks
- Logarithmic Models
- Bitwise neural networks
- Diff-squared networks
- Log-sum-exp (LSE) networks
- Max-Plus networks
- Min-Max-Plus networks
- Morphological networks
- Trigonometric approximate inference
- Weightless Neural Networks (WNNs)
- XNOR networks
- End-to-End integer inference
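The core trick behind several of the zero-multiplication models above (bitshift power-of-2 quantization, bitshift-add networks): if a weight is quantized to a power of two 2^k, multiplying an integer activation by it reduces to a left shift, with no hardware multiply. A minimal sketch under that assumption:

```cpp
#include <cassert>
#include <cstdint>

// Zero-multiplication sketch: activation * 2^k computed as a bitshift,
// as used with logarithmic (power-of-two) weight quantization.
int32_t shift_mul(int32_t activation, uint8_t k) {
    return activation << k;   // equals activation * 2^k (no multiply)
}
```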
52. Logarithmic Models
53. Arithmetic Optimization Research
- Advanced AI Mathematics
- Integer-only Transformers
- Integer-only arithmetic quantization
- End-to-End integer inference
- Reciprocal multiplication
- Constant folding
- Common subexpression elimination
- Strength reduction
- Floating point bitwise arithmetic
- Addition optimizations
- Approximate addition
- Multiplication algorithms
- Approximate multiplication
- Logarithmic approximate multiplication
- Division optimizations
- Approximate division
- Bitwise operator inference
- Bitserial operations
54. Ensemble Multi-Model Architectures
- Ensemble inference (multi-model AI engines)
- Model selection algorithms
- Big-little architectures
- Cascades
- Cloud inference servers
- Collaborative inference
- Mixture of Experts (MoE)
- Speculative decoding
55. Advanced Number Systems
- Advanced AI Mathematics
- Integer-only Transformers
- End-to-End integer inference
- Floating point bitwise arithmetic
- Posit number system (PNS)
- Residue number system (RNS)
- Logarithmic number system (LNS)
- Dyadic numbers
- Double-base number system (DBNS)
- Dynamic number systems
- Hybrid number systems
- Tropical algebra (max-plus)
- MiniMax algebra
- Multi-dimensional logarithmic number system (MDLNS)
- Multiple-Base Number System (MBNS)
- Matrix Algebra (factorization)
- Approximate matrix multiplication
- Butterfly matrices
- Monarch matrices
56. Neural Architecture Search
Appendix 1: C++ Slug Catalog
More AI Research
Read more about:
- GenAI market research
- AI on Phones
- Inference Optimizations
- Loop Optimizations
- Code Optimizations