Aussie AI

Training FAQs

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

What is pre-training? It's not a specific training algorithm; it just means that the model has already been trained, usually on a large general-purpose corpus. The term is most familiar from the name Generative Pre-trained Transformer (GPT), which you may have heard of.

It's common for a commercial service to offer access to a pre-trained model. For example, the OpenAI API allows you to send queries to the pre-trained GPT models, which have a broad level of trained capabilities. Similarly, there are numerous open source pre-trained models available, such as the Meta Llama2 model and various smaller ones.

What is re-training? Again, this isn't really a technical term. It usually means the same thing as fine-tuning: further training of a model that has already been trained.

What is knowledge distillation? Knowledge distillation (KD) is an optimization technique that creates a smaller model from a large model by having the large model train the small one. Hence, it's a type of “auto-training” that uses a bigger teacher model to train a smaller student model. The efficiency gain is that once training is complete, only the smaller student model is used to process user queries; the bigger teacher model is not used for inference at all.

Distillation is a well-known and widely used approach to saving cost while retaining accuracy. For example, a large foundation model usually has numerous capabilities that you don't care about, so distillation can have the large model teach the smaller model only a subset of its capabilities. Techniques range from simply sharing inference results, where the teacher's outputs become training signals for the student, to more advanced internal weight-transfer strategies. See Chapter 45 for more on distillation.

What is model initialization? That's where you use malloc to allocate a memory block that exceeds the capacity of your machine. Umm, no. Model initialization is an important part of the training algorithm, and as you have probably already guessed, it refers to how the model's parameters are set at the start of training.

Since training creates a smart model by updating parameters incrementally in small steps, it works better if the parameters start out close to where they need to be. So, you don't just start training with a model full of zeros. Instead, you try to “jumpstart” the process with better initial values. However, it's far from clear what the best choice of initialization values is, and there are many research papers on this topic.

