Aussie AI

Overview of AI Research

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Well begun is half done.”

Mary Poppins, 1964.

It's hard to keep track of all the AI research that's happening right now. Some areas already have literally thousands of research papers, and the flow is still going strong. To make sense of it all, I try to categorize the papers into a few main categories:

  • Smarter AI
  • Safer AI
  • Faster AI

My main literature focus to date has been on inference speed, so I have plenty of papers to share on faster AI. There's also much happening in the area of making the AI engines even more capable, as if they weren't amazing enough already.

Smarter AI Research

The goal of smarter AI is to make the models more capable, so as to offer more benefits to individuals and businesses. Some of the newer focus areas for increasing the intelligence of AI engines include extensions to existing capabilities and fixes for known weaknesses:

  • Long document understanding
  • Data source usage (i.e. beyond RAG)
  • Tool usage
  • Mathematical reasoning capabilities
  • Artificial General Intelligence (AGI)

Some of the main architectures getting research attention towards achieving these types of breakthroughs include:

  • Mixture-of-experts
  • Ensemble architectures (many types)
  • Multimodal models
  • Intelligent agent architectures

Some of the specific algorithms being researched for increased silicon braininess include:

  • Temporal reasoning
  • Spatial reasoning (e.g. 3D scene understanding)
  • Tool interfaces
  • Information retrieval integrations
  • Length generalization
  • Prompt augmentation
  • Chain-of-thought reasoning
  • Long context window optimizations
  • Forecasting and predictive algorithms

These are just some of the research areas, and there is so much happening that the totality defies categorization. For example, the above list omits vertical-specific research areas (e.g. medical diagnosis), input modalities (e.g. computer vision, video), and fundamental human-computer interfaces, such as voice interactions, gesture interfaces, and extensions to natural language.

Safer AI Research

The concerns about safety and “responsible AI” are well-known, and are spawning a steady stream of research papers. There is much research on addressing the various known social weaknesses of AI engines. Research areas with a focus on AI in society include:

  • Safety, bias, and other ethical issues
  • Policies, best practices, and regulation
  • Legal issues (e.g. copyright, privacy)
  • Green AI (environmentally friendly AI)

For the programmers, some of the specific technical issues being researched in these areas include:

  • Alignment training
  • Explainability (transparency)
  • Hallucinations
  • Model evaluation techniques
  • Cost quantification (environmental impact)

It is worth noting that the broad areas of efficient AI research are also relevant to green AI, because making it faster also makes it greener.

Faster AI Research

Much of the focus on speeding up AI engines is about weaning the lazy bloated monsters off their happy juice from high-end multi-GPU platforms. People want AI applications on their phone and laptop, so there's much happening towards “AI Phones” and “AI PCs.” And even for the high-end platforms, the expense of those GPUs is so high that simply optimizing by a few percent can save millions.

Another reason for needing faster AI: multi-model engines. These are called “ensemble architectures” in the research literature, and there are already many papers in this area. Two models can be much smarter than one, so we want faster execution in order to run more engines at once.

What's hot in fast AI? Some of the newer areas of AI efficiency research include:

  • Phone and PC On-Device Inference (i.e. “AI PCs” and “AI Phones”)
  • Mixture-of-Experts multi-model ensemble architecture (e.g. GPT-4 and Gemini architectures)
  • Flash Attention (linearized attention)
  • Long context windows and length generalization (e.g. RoPE positional encoding)
  • Ensemble Multi-Model Architectures (various sub-areas)
  • Dynamic NAS

Still bubbling away on the cooker. Some areas of AI optimization have thousands of papers, and yet I still see new ones in my feeds every week:

  • Hardware optimizations (GPUs, CPUs, faster memory, AI-specific chips, etc.)
  • Quantization (always)
  • Pruning (always) (esp. types of dynamic structural pruning)
  • Distillation (always)
  • Early exit (still hot)
  • Dynamic inference (adaptive inference)

Trail gone cold. These are the areas where there has been so much successful research in prior decades that the number of papers has subsided, presumably because it's become harder to find a novelty. There are still ways to make a contribution by combining the older techniques with newer research, and some of these areas are so important they're probably one idea away from breakthrough status.

  • MatMul optimizations
  • Arithmetic optimizations (e.g. faster addition/multiplication algorithms)
  • Approximate arithmetic optimizations
  • Logarithmic number system models
  • Code transformations and loop optimizations
  • Compiler auto-optimizations

Deep AI research areas. There are several interesting areas of AI research that nevertheless have only a few papers each year, simply because they are technically demanding, so fewer researchers attempt them. However, most of these areas offer the promise of performant AI if the problems can be cracked.

  • Advanced number system models (e.g. dyadic, posit, residue)
  • Zero-multiplication models (e.g. bit-shift, logarithmic, adder models, etc.)
  • Floating-point numeric representations (i.e. going beyond brain float)
  • Matrix algebra (e.g. factorization/decomposition, inverse matrices)

Further details about research on almost all of these topics are in subsequent chapters.

Commercialized SOTA Research

The whole of the AI industry is based on commercialization of State-of-the-Art (SOTA) research, from the basic Transformer architecture to the often-used optimizations of quantization and pruning. The areas where recent research from the last year or two is starting to appear in industry models and open source frameworks, with kudos paid to the innovative researchers, include:

  • Flash Attention (linearized attention algorithm)
  • Flash Decoding
  • RoPE (positional encoding for long contexts)
  • Long context window research

Just my opinion, but the areas that seem ripe for greater inclusion in commercial and open source industry AI work include:

  • Early exit (dynamic layer pruning)
  • Integer-only arithmetic quantization models (end-to-end integers)

Inference Optimization

Optimization of the inference algorithms for AI models is the primary mechanism to provide fast response times and scalable throughput of AI requests from users. There is an extensive volume of literature on various techniques to optimize model inference. Some of the main techniques include:

  • Hardware acceleration (GPU and non-GPU)
  • Parallelization (vectorization)
  • Software acceleration (e.g. pipelining, marshaling, kernel fusion)
  • Memory optimizations (reducing dataflow and memory-to-cache transfers)
  • Model compilation (graph compilers / deep learning compilers)
  • Transformer-specific optimization techniques (e.g. shallow decoder architectures)
  • Model compression (e.g. quantization, pruning, distillation)
  • Advanced mathematical algorithms and matrix algebra
  • General code optimizations including caching, precomputation, approximations, and inference loop optimizations
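
As one small example of the precomputation item above, here is a minimal C++ sketch (the table size, input range, and class name are my own illustrative choices, not taken from any particular engine): the GELU activation function is tabulated once at startup, so the inference loop replaces a relatively expensive erf() call with a clamped table lookup.

    #include <array>
    #include <cmath>

    // Precomputation sketch: tabulate the GELU activation once at startup,
    // then replace the expensive erf() call inside the inference loop with a
    // clamped table lookup. Table size and input range are illustrative.
    class GeluTable {
    public:
        GeluTable() {
            for (int i = 0; i < N; ++i) {
                float x = LO + (HI - LO) * (float)i / (float)(N - 1);
                table_[i] = 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
            }
        }
        float operator()(float x) const {
            if (x <= LO) return 0.0f;          // GELU(x) is ~0 for very negative x
            if (x >= HI) return x;             // GELU(x) is ~x for large positive x
            float pos = (x - LO) / (HI - LO) * (float)(N - 1);
            return table_[(int)(pos + 0.5f)];  // nearest-entry lookup, no interpolation
        }
    private:
        static constexpr int N = 4096;
        static constexpr float LO = -8.0f;
        static constexpr float HI = 8.0f;
        std::array<float, N> table_{};
    };

The same trade of memory for arithmetic applies to other element-wise activation functions, and the accuracy of the approximation is controlled by the table size.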

For more techniques, see the remaining chapters of this section and our “Long list of Transformer Optimization Methods” at URL: https://www.aussieai.com/research/list

Survey research papers. Research papers that survey the many types of inference optimizations include:

  1. Sehoon Kim, Coleman Hooper, Thanakul Wattanawong, Minwoo Kang, Ruohan Yan, Hasan Genc, Grace Dinh, Qijing Huang, Kurt Keutzer, Michael W. Mahoney, Yakun Sophia Shao, Amir Gholami (2023), Full stack optimization of transformer inference: a survey, Feb 2023, arXiv:2302.14017, https://arxiv.org/abs/2302.14017
  2. Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer (2021), A Survey of Quantization Methods for Efficient Neural Network Inference, June 2021, arXiv:2103.13630 [cs], https://arxiv.org/abs/2103.13630
  3. Nebuly (2023), Full Stack Optimization of Transformer Inference: a Survey, Part 2 on Transformer Optimization, A Paper Overview, https://www.nebuly.com/blog/full-stack-optimization-of-transformer-inference-a-survey-part-2
  4. Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, Lei Li (2021), A Survey on Green Deep Learning, Nov 2021, https://arxiv.org/abs/2111.05193 (Extensive survey paper.)
  5. Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani (2023), A Survey of Techniques for Optimizing Transformer Inference, 2023, arxiv.org July 2023, https://arxiv.org/abs/2307.07982
  6. Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler (2022), Efficient transformers: A survey (v2), arXiv preprint arXiv:2009.06732, https://arxiv.org/abs/2009.06732
  7. Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang (2023), A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023, https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches.)
  8. Meriam Dhouibi, Ahmed Karim Ben Salem, Afef Saidi, Slim Ben Saoud (2021), Accelerating deep neural networks implementation: A survey, March 2021, https://doi.org/10.1049/cdt2.12016, PDF: https://ietresearch.onlinelibrary.wiley.com/doi/pdfdirect/10.1049/cdt2.12016 (Survey of various techniques including hardware acceleration and pruning.)
  9. Md. Maruf Hossain Shuvo, Syed Kamrul Islam, Jianlin Cheng, Bashir I. Morshed (2023), Efficient Acceleration of Deep Learning Inference on Resource-Constrained Edge Devices: A Review, Proceedings of the IEEE (Volume 111, Issue 1, January 2023), https://ieeexplore.ieee.org/document/9985008, PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9985008 (Extensive 2023 survey of inference optimization in general and specifically on edge platforms.)
  10. Y Wang, Y Han, C Wang, S Song, Q Tian, G Huang (2023), Computation-efficient Deep Learning for Computer Vision: A Survey, arXiv preprint arXiv:2308.13998, https://arxiv.org/abs/2308.13998
  11. Giorgos Armeniakos, Georgios Zervakis, Dimitrios Soudris, Jörg Henkel (2022), Hardware Approximate Techniques for Deep Neural Network Accelerators: A Survey, ACM Computing Surveys, Volume 55, Issue 4, No. 83, pp. 1–36, https://doi.org/10.1145/3527156, https://dl.acm.org/doi/10.1145/3527156, https://arxiv.org/abs/2203.08737
  12. Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, Lichao Sun (2023), A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT, May 2023, https://arxiv.org/abs/2302.09419
  13. Q Fournier, GM Caron, D Aloise (2023), A practical survey on faster and lighter transformers, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3586074, https://arxiv.org/abs/2103.14636
  14. V. Sze, Y. Chen, T. Yang, and J. S. Emer (2017), Efficient Processing of Deep Neural Networks: A Tutorial and Survey, Proc. IEEE 105, 12 (2017), 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740 (Good paper from 2017.)
  15. Xipeng Qiu, TianXiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang (2020), Pre-trained Models for Natural Language Processing: A Survey, SCIENCE CHINA Technological Sciences 63, 10 (2020), 1872–1897. https://doi.org/10.1007/s11431-020-1647-3, https://arxiv.org/abs/2003.08271 (Good survey of Transformer architectures in 2020.)
  16. Kah Phooi Seng, Li-Minn Ang (2022), Embedded Intelligence: State-of-the-Art and Research Challenges, IEEE Access, vol.10, pp.59236-59258, 2022. https://ieeexplore.ieee.org/document/9775683, PDF: https://research.usc.edu.au/esploro/outputs/99640278002621
  17. Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz (2023), Efficient methods for natural language processing: A survey, 2023, https://arxiv.org/abs/2209.00099, https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00577/116725 (Extensive survey from 2023 covering many optimization techniques.)
  18. Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani (2023), A Survey of Techniques for Optimizing Transformer Inference, https://arxiv.org/abs/2307.07982
  19. G Alsuhli, V Sakellariou, H Saleh, M Al-Qutayri (2023), Number Systems for Deep Neural Network Architectures: A Survey, 2023, https://arxiv.org/abs/2307.05035 (Good survey, but specific to number systems.)
  20. Canwen Xu, Julian McAuley (2022), A Survey on Model Compression and Acceleration for Pretrained Language Models, Nov 2022, https://arxiv.org/abs/2202.07105
  21. G Menghani (2023), Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
  22. Anouar Nechi, Lukas Groth, Saleh Mulhem, Farhad Merchant, Rainer Buchty, Mladen Berekovic (2023), FPGA-based Deep Learning Inference Accelerators: Where Are We Standing? ACM Transactions on Reconfigurable Technology and Systems, July 2023 https://doi.org/10.1145/3613963, https://dl.acm.org/doi/10.1145/3613963 PDF: https://dl.acm.org/doi/pdf/10.1145/3613963
  23. L Papa, P Russo, I Amerini, L Zhou (2023), A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking, Sep 2023, arXiv preprint arXiv:2309.02031, 2023, https://arxiv.org/abs/2309.02031
  24. Han, Y., Huang, G., Song, S., Yang, L., Wang, H., and Wang, Y. (2022), Dynamic neural networks: A survey, Volume 44, pages 7436–7456, Los Alamitos, CA, USA. IEEE Computer Society. https://arxiv.org/abs/2102.04906, https://doi.ieeecomputersociety.org/10.1109/TPAMI.2021.3117837 (Survey of dynamic inference techniques, where the engine is adaptive to the input.)
  25. Li, Z., Liu, F., Yang, W., Peng, S., and Zhou, J. (2022), A survey of convolutional neural networks: Analysis, applications, and prospects, IEEE Transactions on Neural Networks and Learning Systems, 33(12):6999–7019. https://arxiv.org/abs/2004.02806 (2020 version), https://ieeexplore.ieee.org/document/9451544 (A survey of CNNs.)
  26. Praveen Joshi, Mohammed Hasanuzzaman, Chandra Thapa, Haithem Afli, Ted Scully (2023), Enabling All In-Edge Deep Learning: A Literature Review, IEEE Access, vol.11, pp.3431-3460, 2023, https://ieeexplore.ieee.org/document/10007810, https://arxiv.org/abs/2204.03326 (Extensive survey of edge computing, including deployment architectures and optimizations.)
  27. Lipton, Z. C., Berkowitz, J., and Elkan, C. (2015), A critical review of recurrent neural networks for sequence learning, https://arxiv.org/abs/1506 (This 2015 survey of RNNs and LSTMs has some interesting perspectives.)
  28. W Li, H Hacid, E Almazrouei, M Debbah (2023), A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (Extensive survey related to optimizing inference for running on edge servers; also training on edge servers.)
  29. JA Chen, W Niu, B Ren, Y Wang, X Shen (2023), Survey: Exploiting data redundancy for optimization of deep learning, ACM Computing Surveys, 2023, https://dl.acm.org/doi/abs/10.1145/3564663, https://arxiv.org/pdf/2208.13363 (Various optimizations to skip or reuse computations or similar data.)
  30. M Capra, B Bussolino, A Marchisio, M Shafique (2020), An updated survey of efficient hardware architectures for accelerating deep convolutional neural networks, Future Internet, 2020, https://www.mdpi.com/1999-5903/12/7/113/pdf
  31. J Zhong, Z Liu, X Chen (2023), Transformer-based models and hardware acceleration analysis in autonomous driving: A survey, Apr 2023, https://arxiv.org/abs/2304.10891

For more general research on inference optimizations, refer to https://www.aussieai.com/research/inference-optimization.

Model Compression

Model compression is the general class of AI optimizations that reduce the size of the model. These methods are generally considered to be “static” optimizations, because the model is reduced during or after training, and does not change at runtime during inference. The goal of model compression is two-fold:

    (a) model size reduction, and

    (b) latency optimization.

Model size is reduced either by having fewer weights (e.g. pruning and sparsity) or by using smaller data types (e.g. quantization). The AI engine runs faster inference on the more compact model by (a) requiring fewer calculations overall, (b) using integer calculations (with some techniques), and (c) improving memory-bound inference, because fewer memory-to-cache data transfers are needed.
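
To make this concrete, here is a minimal C++ sketch of symmetric per-tensor INT8 quantization, assuming the simplest scheme of one FP32 scale factor per tensor (the struct and function names are illustrative, not from any particular framework):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Minimal sketch: symmetric per-tensor INT8 quantization of FP32 weights.
    // Each weight w is stored as round(w / scale), where scale = max|w| / 127.
    struct QuantizedTensor {
        std::vector<int8_t> q;   // quantized weights: 1 byte each instead of 4
        float scale;             // one FP32 scale factor for the whole tensor
    };

    QuantizedTensor quantize_int8(const std::vector<float>& w) {
        float maxabs = 0.0f;
        for (float x : w) maxabs = std::max(maxabs, std::fabs(x));
        QuantizedTensor t;
        t.scale = (maxabs > 0.0f) ? (maxabs / 127.0f) : 1.0f;
        t.q.reserve(w.size());
        for (float x : w) {
            int v = (int)std::lround(x / t.scale);
            v = std::clamp(v, -127, 127);      // keep within the int8 range used
            t.q.push_back((int8_t)v);
        }
        return t;
    }

    // Dequantize one weight on demand, approximately recovering the original.
    inline float dequantize(const QuantizedTensor& t, std::size_t i) {
        return t.q[i] * t.scale;
    }

Storing 8-bit integers instead of 32-bit floats gives the 4x size reduction directly, and the same factor applies to memory-to-cache traffic during memory-bound inference.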

Model compression techniques have been highly successful and are widely used, second only to hardware acceleration in their impact on the AI industry. The three main model compression techniques with widespread industry usage are:

  • Quantization
  • Model pruning
  • Knowledge Distillation

There are various lesser-known types of model compression methods in the research papers:

  • Low-rank factorization of matrices (tensor decomposition)
  • Weight sharing
  • Layer fusion
  • Weight clustering

Some other research techniques with similar goals of a smaller, simpler model include:

  • Big-little architectures
  • Speculative decoding
  • Logarithmic models
  • Zero-multiplication models (e.g. adder networks)

Survey papers. Various research survey papers on model compression techniques include:

  1. Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang (2023), A Survey on Model Compression for Large Language Models, arXiv preprint arXiv:2308.07633, Aug 2023, https://arxiv.org/abs/2308.07633 (Recent 2023 survey paper on various model compression approaches.)
  2. Canwen Xu, Julian McAuley (2022), A Survey on Model Compression and Acceleration for Pretrained Language Models, Nov 2022, https://arxiv.org/abs/2202.07105
  3. T Choudhary, V Mishra, A Goswami (2020), A comprehensive survey on model compression and acceleration, Artificial Intelligence Review, 2020, https://doi.org/10.1007/s10462-020-09816-7, https://link.springer.com/article/10.1007/s10462-020-09816-7
  4. Y Cheng, D Wang, P Zhou, T Zhang (2020), A survey of model compression and acceleration for deep neural networks, arXiv preprint arXiv:1710.09282, June 2020 (revised), https://arxiv.org/abs/1710.09282
  5. K Nan, S Liu, J Du, H Liu (2019) Deep model compression for mobile platforms: A survey, Tsinghua Science and Technology (Volume 24, Issue 6, December 2019), https://ieeexplore.ieee.org/abstract/document/8727762, PDF: https://ieeexplore.ieee.org/iel7/5971803/8727756/08727762.pdf
  6. Yu Cheng; Duo Wang; Pan Zhou; Tao Zhang (2018), Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges, IEEE Signal Processing Magazine (Volume 35, Issue 1, January 2018), https://ieeexplore.ieee.org/document/8253600
  7. G Menghani (2023), Efficient deep learning: A survey on making deep learning models smaller, faster, and better, ACM Computing Surveys, https://dl.acm.org/doi/abs/10.1145/3578938, https://arxiv.org/abs/2106.08962
  8. L Deng, G Li, S Han, L Shi, Y Xie (2020), Model compression and hardware acceleration for neural networks: A comprehensive survey, Proceedings of the IEEE (Volume 108, Issue 4, April 2020), https://ieeexplore.ieee.org/abstract/document/9043731
  9. K Ramesh, A Chavan, S Pandit (2023), A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models, Microsoft Research, https://aclanthology.org/2023.acl-long.878.pdf, https://www.microsoft.com/en-us/research/uploads/prod/2023/07/3687_Paper.pdf
  10. W Li, H Hacid, E Almazrouei, M Debbah (2023), A Comprehensive Review and a Taxonomy of Edge Machine Learning: Requirements, Paradigms, and Techniques, AI 2023, 4(3), 729-786, https://www.mdpi.com/2673-2688/4/3/39 (Extensive survey related to optimizing on edge devices, including model compression.)
  11. A Jaiswal, Z Gan, X Du, B Zhang, Z Wang, Y Yang (2023), Compressing LLMs: The Truth is Rarely Pure and Never Simple, arXiv preprint arXiv:2310.01382, Oct 2023, https://browse.arxiv.org/pdf/2310.01382.pdf

For more general research on model compression, refer to https://www.aussieai.com/research/model-compression. All of the individual model compression strategies (e.g. quantization, pruning) are discussed in detail in the following chapters.

Dynamic Inference Optimizations

Whereas model compression makes static changes to the model, there are numerous “dynamic” optimizations to the runtime inference code. Each AI model architecture has slightly different features in its inference loop, but the underlying code is highly iterative: it loops across multiple layers, which in turn loop across many matrices or tensors of weights.

Numerous methods have been considered to reduce the number of computations or use a simpler type of arithmetic calculation. A short list of such optimizations includes:

  • Early exits of loops (dynamically skipping layers; see the sketch after this list)
  • Dynamic pruning (depth pruning, width pruning, length pruning)
  • Zero-multiplication models
  • Integer-only arithmetic quantization
  • Loop vectorization optimizations (e.g. hardware acceleration, loop tiling, etc.)
  • Sparsification
  • Matrix factorization (low-rank) and matrix algebra
  • Non-autoregression (parallelized output of multiple tokens per iteration)
  • General programming loop optimizations (e.g. loop unrolling, parallelization, etc.)
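
The sketch below fleshes out the early exit idea flagged in the list above. It is a minimal C++ illustration, assuming hypothetical layer_forward() and confidence() stand-ins (not a real engine's API): the loop over layers stops once an inexpensive confidence estimate passes a threshold.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Hypothetical stand-ins for illustration only.
    struct LayerWeights { float dummy = 1.0f; };   // stands in for attention/FFN weights
    using Activations = std::vector<float>;

    // Stand-in for one Transformer layer's forward pass.
    Activations layer_forward(const LayerWeights& layer, const Activations& x) {
        Activations y(x);
        for (float& v : y) v *= layer.dummy;       // real code: attention + FFN
        return y;
    }

    // Cheap confidence estimate (a real engine might use the top-logit margin).
    float confidence(const Activations& x) {
        float maxv = 0.0f;
        for (float v : x) maxv = std::max(maxv, std::fabs(v));
        return maxv;
    }

    // Early exit: run layers in order, but stop once the intermediate result is
    // confident enough (dynamic layer pruning). The two thresholds are exactly
    // the kind of hyper-parameters discussed later under "dynamic NAS."
    Activations infer_with_early_exit(const std::vector<LayerWeights>& layers,
                                      Activations x,
                                      std::size_t min_layers,
                                      float threshold) {
        for (std::size_t i = 0; i < layers.size(); ++i) {
            x = layer_forward(layers[i], x);
            if (i + 1 >= min_layers && confidence(x) >= threshold)
                break;                             // skip the remaining layers
        }
        return x;
    }

Note that the confidence check itself must be cheap, or the saving from the skipped layers is lost; choosing the threshold and the minimum layer count is a tuning problem in its own right.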

The other major optimization strategy is to use multiple models. Some of the research ideas include:

  • Model selection algorithms
  • Mixture of experts
  • Big-little architectures
  • Speculative decoding
  • Consensus decoding

See the following chapters for more detailed information on many of these research topics.

Uncommon Optimization Techniques

It's hard to list all the possible research areas. However, as a kind of “showcase” of interesting ideas, here are some of the more theoretical and lesser-known techniques:

  • Weight clustering. This technique involves merging weights that have similar magnitude into “clusters” that use exactly the same weight instead. It is similar to a combination of quantization and pruning, and can augment either technique.
  • Approximate matrix multiplication. There are various algorithms for performing mathematical multiplication of matrices without actually using numeric multiplication. This is an area of active research that involves a crossover between high-end mathematics and computer algorithms. Several techniques show promise of fast calculations with an acceptable loss of accuracy.
  • Logarithmic bitshift quantization (power-of-two). The coding optimization of using bit-shift operators to replace integer multiplication is well-known. Conceptually, the idea is a second-order model quantization method involving two main steps: (a) convert floating-point model weights to integers (for integer multiplications), and (b) further convert those integer weights logarithmically to the nearest power-of-2. This allows the integer multiplications in vector and matrix operations to be replaced with integer bit-shift operations (see the sketch after this list). However, the trade-off is a greater loss of model accuracy than basic integer quantization.
  • Additive and zero-multiplication neural networks. Various approaches to remove the multiplication bottleneck by replacing the dreaded star with other arithmetic operators, including “adder” networks, bit shifting, and other low-level optimizations.
  • Low-rank optimization. Optimize high-degree tensors to be lower-degree tensors using matrix factorization/decomposition. This is conceptually similar to a type of large-scale pruning based on matrix algebra.
  • Faster multiplication algorithms. There are ways to do arithmetic multiplication faster, although this research is mainly used by chip designers nowadays.
  • Approximate multiplication. This is a low-level optimization of the multiplication operation itself, usually a more mathematical method than using bitshifts. Some of these methods have been used in quantization.
  • Matrix multiplication algorithms. A lot of research has been performed on faster matrix algorithms to speed up your MatMul/GEMM kernel. It's a very advanced area, but also rather fun.
  • Advanced Numeric Representations. Various non-standard alternative methods to store numeric weights, making use of all the bits in a byte, going beyond the standard floating-point or integer bit patterns.
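
Here is the power-of-two bit-shift sketch promised above, as minimal C++. The encoding (a sign plus a signed shift count per weight) and the function names are illustrative assumptions; real implementations add scale factors and take more care over accuracy and overflow.

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    // Power-of-two ("logarithmic") quantization sketch. Each weight is stored
    // as a sign plus a shift count k, approximating |w| by 2^k.
    struct Pow2Weight {
        int8_t sign;    // +1 or -1, with 0 meaning a pruned/zero weight
        int8_t shift;   // k, so that |w| is approximately 2^k (k may be negative)
    };

    // Quantize one FP32 weight to the nearest power of two.
    Pow2Weight quantize_pow2(float w) {
        if (w == 0.0f) return {0, 0};
        Pow2Weight p;
        p.sign = (w < 0.0f) ? int8_t(-1) : int8_t(+1);
        p.shift = (int8_t)std::lround(std::log2(std::fabs(w)));
        return p;
    }

    // Dot product of power-of-two weights against integer activations:
    // every multiplication becomes a bit-shift on the activation's magnitude.
    int64_t dot_pow2(const std::vector<Pow2Weight>& w, const std::vector<int32_t>& x) {
        int64_t acc = 0;
        for (std::size_t i = 0; i < w.size(); ++i) {
            if (w[i].sign == 0) continue;                    // zero weight: skip
            int64_t mag = std::llabs((int64_t)x[i]);         // shift the magnitude only
            int64_t term = (w[i].shift >= 0) ? (mag << w[i].shift)
                                             : (mag >> -w[i].shift);
            int sgn = (x[i] < 0) ? -w[i].sign : w[i].sign;   // combine the two signs
            acc += (sgn > 0) ? term : -term;
        }
        return acc;
    }

The accuracy loss comes from the coarse spacing of the powers of two, which is why variants such as double-bit (two shifts per weight) power-of-two quantization have been explored.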

Want more? Just pick two random methods and combine them. Hybrid optimization strategies are possible in many ways. For example, a model can be quantized to lower precision, and then the inference engine could employ various dynamic pruning strategies. And some strategies apply across both training and inference phases, thereby combining approaches, such as using different approximation algorithms or advanced matrix decompositions. The co-design of hardware and software architectures also typically crosses both training and inference execution.

Beyond Transformers

To consider what future breakthroughs might go beyond Transformers, let's examine their limitations:

  • Quadratic cost complexity in the input sequence length.
  • Static weights that don't change during inference (compare this to “incremental learning”).
  • Mathematical reasoning limitations.
  • Attribution and transparency issues.
  • Lack of general common sense.
  • No real “world model” (superficial understanding).

Other research directions arise from alternative architectures that have advantages over Transformers for some types of computation:

  • Hybrid RNN-Transformers. The sequence-processing methods of RNNs have some advantages, although Transformers are fairly good at sequences, too.
  • Hybrid CNN-Transformers. Combine the CNN's innate image processing abilities with Transformers.

But here's my prediction for what comes after Transformers: more Transformers, by which I mean ensemble architectures that combine multiple models.

Research Topic Ideas

Need a topic for a research paper or a thesis dissertation? Here are some thoughts within the “inference optimization” subarea of AI research, which does not include smarter AI or training research areas.

Our research focus is on optimizing these algorithms so that AI models respond quickly to users (low “latency”) and have high overall throughput so as to scale efficiently. They need to be much faster not only to reduce data center GPU costs, but also to run efficiently on your smartphone or AI laptop.

Below are some suggestions for research topics, based on observations about areas of AI research that seem to be under-researched. These topics are primarily in inference optimization for neural networks and large language models. Original research topics include:

  • Phone AI. Smartphone-level inference efficiency remains elusive. The general research area of running AI models on low-resource platforms is called “edge computing.” This whole area has lots of research subareas, but is still waiting on a major breakthrough.
  • Integer-only-arithmetic quantization. Much of the quantization research focuses on model size reduction but still uses floating-point multiplications for inference, although they can be avoided. This area seems under-researched when you consider its potential.
  • Non-shift bitwise zero-multiplication models. Bitwise-or/bitwise-and with zero-multiplication inference. Papers in this area have used addition, but the bitwise operations might work, too, and they each have slightly different characteristics versus addition.
  • Matrix algebra. Advanced matrix algebra has promise to reduce FLOPs. Sparse matrices, butterfly matrices, Monarch matrices, low-rank matrix factorization, tensor decomposition, etc.
  • Hybrid inference optimizations. There are many research papers on using two or more model compression optimizations together (e.g. quantization and pruning), but this is a combinatorially large space and many combinations remain unresearched. A thorough overview of all the possible combinations, with citations to related research, also seems to be missing (because it'd be a huge paper).
  • Layer skipping. Skipping of individual layers during inference, rather than just “early exit” of all layers (dynamic layer pruning). There are various papers, but there's room for more research.
  • Layer reordering. This is a technique that seems like it shouldn't work well. What happens if you run each layer twice? Is it more accurate? (Obviously, it's slower.) What happens if you run all the layers in reverse? Is it true that the initial layers do the general, broader understanding and the final layers do the finessing?
  • Approximate multipliers. Use of approximate arithmetic multiplication algorithms in software for edge platforms with no hardware acceleration, or with different types of limited hardware acceleration.
  • Tokenizer and vocabulary theory. Tokenizer word-to-token ratios and their impact on inference latency via model size (vocabulary size) and input sequence length. Tokenization with larger tokens, such as multi-word tokens, would mitigate the auto-regression latency issue, but increase vocabulary (thereby massively increasing model size). Is there a worthwhile trade-off?
  • Multi-AI. Multi-model algorithms are an interesting area (often called “ensemble algorithms”). Two might be better than one for AI engines. There's a lot of research already (e.g. big-little architectures, collaborative inference, consensus-based decoding, speculative decoding, etc.), but there is much room for advancement here. What new advanced capabilities are possible by leveraging two or more AI engines?
  • Logarithmic quantization. Power-of-two quantization with bitwise shift inference seems to be a fast inference method with many papers, but model accuracy remains problematic.
  • Double-bit power-of-two quantization. Two shifts might be better than one (in terms of model accuracy). Some papers were found, but there's room for innovation.
  • Granular quantization. Fine-granularity quantization, such as per-channel or per-group quantization. This is hard to implement efficiently, so it might have to be done in deep learning compilers.
  • Lookup tables (LUTs). Zero-multiplication inference is possible using table lookups. A simple idea that trades space for lower latency, looks effective, and has received relatively little research attention (see the sketch after this list).
  • Hashing and vector databases. There are quite a lot of papers, but there seems to be room for more. It's an O(1) lookup operation if it succeeds, but hashing a vector of numbers, or comparing two vectors for similarity via hashing, is quite tricky. Probably more to come in this area.
  • Bloom filters. A data structure that is an extension of hashing and bit vectors with O(k) complexity, and has been occasionally used in neural network papers.
  • Data structures. As already mentioned for hashing and Bloom filters, an overall review and comprehensive theoretical basis for the various (non-hardware) data structures used in AI is needed, both for each data structure individually and across all of them.
  • Dyadic quantization. It's interesting and might have some promise because it replaces multiplication with addition and bitshifts, but isn't as constrained in terms of unique weights as power-of-two quantization, so it's probably more accurate.
  • Odd quantization bit sizes. There's plenty of papers on 4-bit and 8-bit integer quantization (and binary or ternary), a few papers on 2-bit quantization, but hardly any papers focused on 3-bit or 5-bit quantization (or 6-bit or 7-bit). Why not? They seem to work and offer good trade-offs in space versus model accuracy.
  • Double-byte quantization (9 to 15 bits). There are not many papers on 9-bit to 15-bit integer quantization, with more focus on 8-bit or lower bitwidth, and full 16-bit integer quantization (for understandable reasons). Obviously, there's less benefit to memory utilization from more bits, and consequently extra CPU/GPU cost of data processing, but the models should be more accurate than 8-bit quantization.
  • Non-ternary 2-bit quantization. This method has four weight values (-1, 0, +1, +2), thereby using zero, one, or two additions/subtractions rather than multiplication. It hasn't received as much attention as binary or ternary quantization, but 2-bit quantization might be more accurate than ternary quantization with the same space usage, and it still allows a zero-multiplication model.
  • Streaming of model inference. Is it possible to start model inference before the entire model has been downloaded? Initial thoughts are that you can't run half a model, but you actually can, and there are two main ways. I have yet to see a paper on this idea; there are papers on “streaming” of models, but they're about using a model on a stream, not streaming the model itself.
  • Model-specific file compression algorithms. Whether the standard Zip and Bzip algorithms for file compression can be improved upon for the byte-wise compression of model files. What are the compression characteristics of model files, including the various quantized file formats with different bit sizes? Specific applications are for (a) transmission over the internet, and/or (b) efficient file storage on devices such as smartphones (with the need to quickly uncompress the file to load it fully into RAM).
  • NAS for Dynamic Inference Hyper-Parameters. There are a lot of dynamic (adaptive) inference optimization strategies, and they all have hyper-parameters. For example, early exit has hyper-parameters such as the minimum number of layers executed before an exit is considered, and the configuration of the decision algorithm that chooses whether to exit at a given layer. Searching for an optimal set of such dynamic optimization hyper-parameters is an extension of NAS that I call “dynamic NAS.”
  • Quadruple axis pruning. Multi-dimensional pruning is not yet fully researched. There are several papers on dual pruning (depth and width) and only a handful of research papers on triple pruning (adding length pruning), but apparently there's not yet a fourth dimension of pruning.
  • Pruning Positional Embedding. Some papers have found that positional embedding (also called positional encoding) can be completely removed (abbreviated as “NoPE”). Somehow, the AI engine learns inter-token positional information without needing a positional encoding module. This is poorly understood and a new area with few papers.
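
As the sketch promised under the lookup tables (LUTs) item above, here is a minimal C++ illustration, assuming a simple signed 4-bit encoding (values -8 to +7 stored as codes 0 to 15); the class and function names are mine, not from any published engine. All 256 possible products are precomputed once, so the dot product uses only table lookups and additions.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Lookup-table (LUT) multiplication for 4-bit quantized values. With only
    // 16 x 16 possible operand pairs, all 256 products are precomputed once
    // into a tiny table, so the inference loop needs no multiply instruction.
    struct MulLUT {
        std::array<int16_t, 256> table{};
        MulLUT() {
            for (int a = 0; a < 16; ++a)
                for (int b = 0; b < 16; ++b)
                    table[(a << 4) | b] = (int16_t)((a - 8) * (b - 8)); // codes map to -8..+7
        }
        int16_t mul(uint8_t a4, uint8_t b4) const {     // a4, b4 are 4-bit codes (0..15)
            return table[(a4 << 4) | b4];
        }
    };

    // Dot product over 4-bit codes using only table lookups and additions.
    int32_t dot_lut(const MulLUT& lut,
                    const std::vector<uint8_t>& w4,
                    const std::vector<uint8_t>& x4) {
        int32_t acc = 0;
        for (std::size_t i = 0; i < w4.size(); ++i)
            acc += lut.mul(w4[i], x4[i]);
        return acc;
    }

The space-versus-time trade-off scales poorly with bit width: at 8 bits the table already has 65,536 entries, which is one reason lookup tables pair naturally with low-bit quantization.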


Next: Chapter 44. Advanced Quantization

Up: Table of Contents

Buy: Generative AI in C++: Coding Transformers and LLMs
