Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
Model Size
Choosing a model size is an important part of any AI project. For starters, the size of a model correlates directly with the cost of both training and inference in terms of GPU juice. Making an astute choice about the type of model you need for your exact use case can have a large impact on the initial and ongoing cost of an AI project.
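As a rough illustration of why parameter count drives cost, the memory needed just to hold the weights scales linearly with model size. The short calculation below is a back-of-envelope sketch; the 7B model size and the bytes-per-weight figures are illustrative assumptions, not figures from this chapter.

// Back-of-envelope weight memory estimate (illustrative sketch only).
// Rule of thumb: weight memory is roughly parameter count x bytes per weight,
// ignoring activation memory and the KV cache.
#include <cstdio>

int main() {
    const double billions_of_params = 7.0;  // assumed example: a 7B model
    const double bytes_fp16 = 2.0;          // 16-bit weights
    const double bytes_int8 = 1.0;          // 8-bit quantized weights

    // Billions of parameters times bytes per weight gives (decimal) gigabytes.
    printf("FP16 weights: ~%.0f GB\n", billions_of_params * bytes_fp16);  // ~14 GB
    printf("INT8 weights: ~%.0f GB\n", billions_of_params * bytes_int8);  // ~7 GB
    return 0;
}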
There's no doubt that bigger models are enticing. The general rule seems to be that bigger models are more capable, and a multi-billion parameter model seems to be table stakes for a major AI model these days. And the top commercial models are starting to exceed a trillion parameters.
However, some research is starting to cast doubt on this, at least in the sense that ever-larger models may not always result in increased intelligence. For example, GPT-4 is rumored to be eight models merged together in a Mixture-of-Experts (MoE) architecture, each of about 220B parameters, rather than one massive model of 1.76T parameters.
Quality matters, not just quantity. The quality of the data set used for training, and the quality of the various training techniques, are both important. That quality is important for intelligence shouldn't be surprising. In fact, what should be surprising is that quantity has been so successful at raising AI capabilities.
Model optimizations. How can you have a model that's smarter and faster and cheaper? Firstly, the open source models have improved quickly and continue to do so. Some now offer quite good functionality at very fast speeds. There are models that have been compressed (e.g. quantization, pruning, etc.), and there are open source C++ engines that offer various newer AI optimization features (e.g. Flash Attention). You can download both models and engine source code, and run the open source models yourself (admittedly, with hosting costs for renting your own GPUs, or using a commercial GPU hosting service). Furthermore, this book has numerous chapters on improving the performance of an AI engine written in C++, which applies to most of the open source engines.
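To make the compression idea concrete, here is a minimal sketch of symmetric 8-bit weight quantization, one of the compression techniques mentioned above. It is an illustrative toy, assuming per-tensor scaling, and is not the implementation used by any particular open source engine.

// Toy example: symmetric per-tensor INT8 quantization of a weight vector.
// Maps floats in [-max_abs, +max_abs] onto the int8 range [-127, 127].
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<int8_t> quantize_int8(const std::vector<float>& weights, float& scale_out) {
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    std::vector<int8_t> q(weights.size());
    for (size_t i = 0; i < weights.size(); ++i)
        q[i] = static_cast<int8_t>(std::lround(weights[i] / scale));
    scale_out = scale;
    return q;
}

int main() {
    std::vector<float> w = { 0.12f, -0.5f, 0.33f, -0.07f };
    float scale = 0.0f;
    std::vector<int8_t> q = quantize_int8(w, scale);
    for (size_t i = 0; i < q.size(); ++i)   // dequantized value = q[i] * scale (lossy)
        printf("w=% .3f  q=%4d  dequant=% .3f\n", w[i], (int)q[i], q[i] * scale);
    return 0;
}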
With a commercial API, you can't change the provider's engine, at least not until you apply for a job there. However, you can reduce the number of queries being sent to a commercial API, mainly by putting a cache in front of the calls. This cuts costs and speeds up replies for common prompts (or similar ones), with the trade-off that non-cached queries have a slightly slower response time from the additional failed cache lookup. Chapter 29 examines using an “inference cache” or a “semantic cache” via a vector database. An inference cache is a cache of the responses to identical queries, whereas a semantic cache finds “close-enough” matches in prior queries using nearest-neighbor vector database lookups.
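As a rough sketch of the exact-match case, an inference cache can be as simple as a map from prompt text to a stored response, checked before each API call. The class and function names below are hypothetical, not taken from Chapter 29; a semantic cache would replace the exact lookup with a nearest-neighbor search in a vector database.

// Illustrative exact-match inference cache (hypothetical names, not Chapter 29's code).
#include <optional>
#include <string>
#include <unordered_map>

class InferenceCache {
public:
    // Return the cached response for an identical prior prompt, if any.
    std::optional<std::string> lookup(const std::string& prompt) const {
        auto it = cache_.find(prompt);
        if (it == cache_.end()) return std::nullopt;  // miss: caller pays for an API call
        return it->second;                            // hit: skip the API call entirely
    }
    void store(const std::string& prompt, const std::string& response) {
        cache_[prompt] = response;
    }
private:
    std::unordered_map<std::string, std::string> cache_;
};

// Typical usage pattern around a (hypothetical) commercial API wrapper:
//   if (auto hit = cache.lookup(prompt)) return *hit;
//   std::string response = call_commercial_api(prompt);
//   cache.store(prompt, response);
//   return response;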