Training Data for Fine-tuning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
One of the biggest obstacles to a fine-tuning project is getting enough data. Many projects where fine-tuning seems like a good idea are scuttled when there is no critical mass of data to train on. Fine-tuning usually requires more data than RAG.
For fine-tuning data to be viable, it usually needs to have:
(a) Several cases of every concept you want to teach, each with both an input and the expected output. Depending on the neural network architecture, each case may also need a score indicating how good the output is.
(b) Corner cases and extra data to capture subtle details or complexities.
(c) Held-out extra cases that are not used for training, but are reserved for evaluating the progress of the training (a sketch of such a data record follows this list).
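As a rough illustration of what such a record might look like in code, here is a minimal C++ sketch; the struct and field names are illustrative assumptions, not a format defined by the book or by any particular framework.

#include <optional>
#include <string>
#include <vector>

// One fine-tuning example: an input, the expected output, an optional
// quality score, and a flag marking held-out evaluation cases.
struct TrainingExample {
    std::string input;            // the prompt or source text
    std::string expected_output;  // the desired completion
    std::optional<float> score;   // quality rating, if the architecture needs one
    bool held_out = false;        // reserved for evaluation, not training
};

// Split the full dataset into training and held-out evaluation sets.
void split_dataset(const std::vector<TrainingExample>& all,
                   std::vector<TrainingExample>& train,
                   std::vector<TrainingExample>& eval) {
    for (const auto& ex : all) {
        if (ex.held_out) eval.push_back(ex);
        else train.push_back(ex);
    }
}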
Gathering the data is likely the hardest part of training, and the more training iterations you need to run, the more data you need. Training data management is mostly a non-coding task, involving processing of the data files: chunking, organizing, indexing, and generating embeddings. It is arduous to some extent, but not high in technical difficulty.
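For example, the chunking step can be done in a few lines. Here is a rough C++ sketch of fixed-size chunking with overlap; the chunk size and overlap values are arbitrary assumptions, and real pipelines often split on sentence or token boundaries instead.

#include <string>
#include <vector>

// Split a document into fixed-size character chunks with some overlap,
// so that information spanning a chunk boundary is not lost.
std::vector<std::string> chunk_text(const std::string& doc,
                                    size_t chunk_size = 1000,
                                    size_t overlap = 100) {
    std::vector<std::string> chunks;
    if (doc.empty() || chunk_size == 0) return chunks;
    size_t step = (overlap < chunk_size) ? chunk_size - overlap : chunk_size;
    for (size_t pos = 0; pos < doc.size(); pos += step) {
        chunks.push_back(doc.substr(pos, chunk_size));
        if (pos + chunk_size >= doc.size()) break;  // final chunk emitted
    }
    return chunks;
}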
Model-Based Training
Another way to train a model is to have it learn from another, previously trained system. Knowledge distillation is one such technique; it is already available in the major AI frameworks and has reached a high level of sophistication. A simpler method is to train a new model on the prompt-answer pairs produced by another, larger model.
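To make the distillation idea concrete, here is a minimal C++ sketch of the standard temperature-softened distillation loss (the common Hinton-style formulation); it illustrates the technique in general and is not the API of any particular framework.

#include <cmath>
#include <vector>

// Softmax over raw logits, softened by temperature T.
std::vector<float> softmax_temp(const std::vector<float>& logits, float T) {
    std::vector<float> probs(logits.size());
    if (logits.empty()) return probs;
    float maxv = logits[0];
    for (float v : logits) if (v > maxv) maxv = v;
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - maxv) / T);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    return probs;
}

// Cross-entropy of the student's softened distribution against the teacher's,
// scaled by T^2 as in the usual distillation objective.
float distillation_loss(const std::vector<float>& teacher_logits,
                        const std::vector<float>& student_logits,
                        float T = 2.0f) {
    std::vector<float> pt = softmax_temp(teacher_logits, T);
    std::vector<float> ps = softmax_temp(student_logits, T);
    float loss = 0.0f;
    for (size_t i = 0; i < pt.size(); ++i) {
        loss -= pt[i] * std::log(ps[i] + 1e-9f);
    }
    return loss * T * T;
}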