
Model Loader

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.


The first step in Transformer execution is loading the model. What does a model look like? At a basic level, it's simply a very large binary file containing mostly numeric data. The main things you'll find in a model file include:

  • Header data with settings and hyper-parameters
  • String data for tokens
  • Lots and lots and lots of numbers.

Model Header. The start of the model file contains some header data and hyper-parameter values that define the “shape” of the model. For example, it will have the number of “layers” in the model (depth), the size of the “hidden dimension” (width), and various other settings.
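
As a rough illustration, here's a minimal C++ sketch of reading such a header, assuming a hypothetical fixed-size binary layout (the struct fields and the magic number are made up for this example; real file formats differ in their details):

    #include <cstdint>
    #include <cstdio>

    struct ModelHeader {      // hypothetical fixed-size header
        uint32_t magic;       // file type identifier (assumed)
        uint32_t version;     // format version
        uint32_t n_layers;    // model depth
        uint32_t n_embd;      // hidden dimension (width)
        uint32_t n_heads;     // number of attention heads
        uint32_t vocab_size;  // number of token strings
    };

    bool load_header(FILE* fp, ModelHeader& h) {
        if (fread(&h, sizeof(h), 1, fp) != 1) return false;
        return h.magic == 0x4C444F4D;  // "MODL" little-endian (assumed magic)
    }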

Billions of Numbers. Almost all of the model's size is taken up by floating-point numbers, because a 7B model literally has 7 billion numbers, usually in 32-bit format (i.e. assuming it's an FP32 model). These are further organized into sub-structures representing “layers,” which contain “tensors” and other fun stuff.
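
To see why the weights dominate the file size, here's the back-of-envelope arithmetic in C++ (the bytes-per-weight figures are just the standard sizes for each numeric precision):

    #include <cstdio>

    int main() {
        const double params = 7e9;  // 7 billion weights
        printf("FP32: %.0f GB\n", params * 4 / 1e9);  // ~28 GB
        printf("FP16: %.0f GB\n", params * 2 / 1e9);  // ~14 GB
        printf("INT8: %.0f GB\n", params * 1 / 1e9);  // ~7 GB
    }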

Numbers are Static. The first point about these numbers is that they don't change. Model data is static data for any pre-trained model. These are read-only numbers that have been pre-computed “offline” during training or fine-tuning. When you run a model doing “inference,” these numbers don't change. The whole big bang of a full cycle of the entire model, when it spits out one word, actually runs on static numbers. Only if you're fine-tuning a model will the numbers change again. The exceptions to this are the various dynamic inference methods, which are mainly at the research level, whereas the default inference of a model is static.
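
One practical consequence of read-only weights is that a loader can memory-map the model file instead of copying it into buffers. Here's a POSIX sketch of that idea (mmap with a read-only mapping is one common approach; a real loader adds error reporting and alignment checks):

    #include <cstddef>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map the whole model file read-only; inference never writes to it.
    const float* map_weights(const char* path, size_t& nbytes) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return nullptr;
        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
        nbytes = (size_t)st.st_size;
        void* p = mmap(nullptr, nbytes, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after closing the descriptor
        return (p == MAP_FAILED) ? nullptr : (const float*)p;
    }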

All Numbers Are Processed. Another point is that all of these numbers get used. You might read that model files have lots of “redundancy,” but that only means that many of these numbers are less important; they will all still be used for arithmetic, because it's hard to figure out which ones to discard. (It's like what they used to say about your advertising budget before, you know, cookies: half the money was wasted, but you didn't know which half.) An inference cycle will perform a floating-point operation on every single one of these billions of numbers. This is usually a multiply, but there are also additions, and these are all called “floating-point operations” or FLOPs. Since inference uses every single number in a model file, and there are billions of numbers, there are GigaFLOPs of calculations just for the decoder to spit out one word (or a part-word or punctuation, or whatever token). And then the Transformer repeats all of that for the next word. Again, there are exceptions to this in advanced algorithms, such as “model pruning,” where some of the numbers are skipped.
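
At the bottom of all this is the humble multiply-accumulate. Here's the basic vector dot product that burns through the weights, at two FLOPs (one multiply, one add) per weight:

    // Each weight participates in one multiply-accumulate (2 FLOPs).
    float dot_product(const float* w, const float* x, int n) {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
            sum += w[i] * x[i];  // 1 multiply + 1 add per weight
        return sum;
    }

So a 7B model costs on the order of 2 × 7e9 = 14 GigaFLOPs per generated token.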

String Data. The second type of model data is strings, and there's much less of it. Model files contain some string data for a few different purposes. There are a few descriptive strings that give names to things, and these strings are effectively overhead, since they're not part of the computation (e.g. they might appear in reports or be useful during debugging). The main string data in a model file is the “vocabulary” for the tokenizer. Typically, the model will have about 50,000 different strings that represent words, part-words, punctuation, and any fancy stuff (e.g. UTF-8 codes for love heart emojis).

String Data is Static. Again, this string data is fixed at runtime. Actually, the string data is chosen right at the start when designing a model and can't even change during training! Hence, the strings that make up the tokenizer do not change during runtime inference of a model. The AI engine cannot learn a new word and add it to the vocabulary. So, whatever token set was set up in the vocabulary of the model before and during training has to remain the same during inference.

Load Order Matters. The order of the string tokens also matters in the model file. The inference engine treats tokens as numbers, using the offset of each string in the vocabulary array. If you mess up loading the model file's vocabulary so that it's out-of-order or missing a few words, then the AI engine is going to get very confused and output gibberish.
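
Here's a sketch of vocabulary loading that preserves file order, assuming a hypothetical length-prefixed string format (the key point is that vocab[id] is token number id):

    #include <cstdint>
    #include <cstdio>
    #include <string>
    #include <vector>

    std::vector<std::string> load_vocab(FILE* fp, uint32_t vocab_size) {
        std::vector<std::string> vocab;
        vocab.reserve(vocab_size);
        for (uint32_t id = 0; id < vocab_size; id++) {
            uint32_t len = 0;
            if (fread(&len, sizeof(len), 1, fp) != 1) break;
            std::string s(len, '\0');
            if (len && fread(&s[0], 1, len, fp) != len) break;
            vocab.push_back(std::move(s));  // vocab[id] is token number 'id'
        }
        return vocab;
    }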

Handling Unknown Words. A fixed vocabulary doesn't mean the AI engine falls over on unrecognized text. Instead, the tokenizer uses some default token strategies to handle unknown words or symbols. Individual digits are used for numbers, because it's difficult to encode every single number up to infinity as a separate token. Words that are not recognized are tokenized using part-words or, in the worst case, with individual letters. Unusual symbols, like emoji codes, are also tokenized as single-byte UTF-8 tokens.
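
The worst-case byte fallback is simple to sketch: emit one token per raw byte when no word or part-word matches. This assumes the vocabulary reserves 256 single-byte tokens at some known base offset (the offset used here is made up):

    #include <string>
    #include <vector>

    const int BYTE_TOKEN_BASE = 3;  // assumed offset of the 256 byte tokens

    // Fallback: one token per UTF-8 byte of unrecognized text.
    void tokenize_as_bytes(const std::string& text, std::vector<int>& out) {
        for (unsigned char c : text)
            out.push_back(BYTE_TOKEN_BASE + c);
    }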

Engine Initialization. Since the numbers and strings are static, the model loader doesn't need to do anything to this data, other than to store the strings into a tokenizer module, and organize the numbers into tensors and layers. But that is kind of a lot of coding work anyway!
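
Here's a sketch of that organizing step: carving the flat weight blob into per-layer tensor views without copying anything (the two-tensors-per-layer layout and the size parameters are assumed purely for illustration):

    #include <cstddef>
    #include <vector>

    struct LayerWeights {
        const float* attn;  // attention weights for this layer
        const float* ffn;   // feed-forward weights for this layer
    };

    // Slice the loaded weight buffer into per-layer views (no copies).
    std::vector<LayerWeights> build_layers(const float* blob, int n_layers,
                                           size_t attn_size, size_t ffn_size) {
        std::vector<LayerWeights> layers(n_layers);
        const float* p = blob;
        for (int i = 0; i < n_layers; i++) {
            layers[i].attn = p; p += attn_size;
            layers[i].ffn  = p; p += ffn_size;
        }
        return layers;
    }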

Other parts of the initialization involve getting ready to run a fast inference or training procedure. For example, the lookup tables to optimize the various non-linear (expensive) activation functions could be computed at program startup, although these really should be precomputed offline for a production model.
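For example, here's a sketch of precomputing the common tanh approximation of GELU into a table at startup, so inference does a cheap lookup instead of calling tanhf every time (the table size and clamping range are arbitrary choices for this example):

    #include <cmath>

    const int TABLE_SIZE = 1 << 16;
    const float X_MIN = -8.0f, X_MAX = 8.0f;
    static float g_gelu_table[TABLE_SIZE];

    // Run once at startup (or precompute offline and load from disk).
    void init_gelu_table() {
        for (int i = 0; i < TABLE_SIZE; i++) {
            float x = X_MIN + (X_MAX - X_MIN) * i / (TABLE_SIZE - 1);
            g_gelu_table[i] = 0.5f * x * (1.0f + tanhf(0.7978845608f *
                              (x + 0.044715f * x * x * x)));
        }
    }

    float gelu_lookup(float x) {
        if (x <= X_MIN) return 0.0f;  // GELU(x) ~ 0 for very negative x
        if (x >= X_MAX) return x;     // GELU(x) ~ x for large positive x
        int i = (int)((x - X_MIN) * (TABLE_SIZE - 1) / (X_MAX - X_MIN));
        return g_gelu_table[i];
    }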

Note that in production deployment of an AI engine, this initialization cost isn't very important. A server should handle lots of queries, whereas this initialization occurs once, so any initialization time cost is amortized over many server queries. Even so, as any longtime Windows user knows, it's annoying if anything starts up slow.

Where is the magic? If all of the numbers are static, and the shape of the model is fixed and finite, how is it so smart? More to the point, how is it creative? I mean, it sounds like a robotic piece of number-crunching code. Yes, indeedy. It can neither feel bad nor taste dessert. The first part of the explanation of an LLM's abilities is that hyperscale brute-force simply works. Having a huge enough model of billions of weights mapping word probabilities is amazingly good at predicting which are the top 50 words that I should end this sentence with. That part is deterministic and also smart enough not to choose a preposition. The second part is “intentional randomness” introduced into this algorithm, mostly in the final “decoding algorithm” that chooses which of the highest-probability 50 words to pick. Or select. Or choose. Or culminate.
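
That last step is easy to sketch as top-k sampling: keep the k most probable tokens, then pick one at random, weighted by probability (k=50 matches the example above; the function name and interface are made up for illustration):

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // Sample one token ID from the k highest-probability tokens.
    int sample_top_k(const std::vector<float>& probs, int k,
                     std::mt19937& rng) {
        if (k > (int)probs.size()) k = (int)probs.size();
        std::vector<int> idx(probs.size());
        std::iota(idx.begin(), idx.end(), 0);
        // Move the k most probable token IDs to the front.
        std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
            [&](int a, int b) { return probs[a] > probs[b]; });
        std::vector<float> top(k);
        for (int i = 0; i < k; i++) top[i] = probs[idx[i]];
        std::discrete_distribution<int> dist(top.begin(), top.end());
        return idx[dist(rng)];  // the "intentional randomness"
    }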

 
