Training Data for Fine-tuning
Book Excerpt from "Generative AI in C++"
by David Spuler, Ph.D.
One of the biggest obstacles to a fine-tuning project is getting enough data. Many projects where fine-tuning seems like a good idea are scuttled when there is no critical mass of data to train on. Fine-tuning usually requires more data than RAG.
For fine-tuning data to be viable, it usually needs to have:
(a) Several cases of every concept you want to teach, each with both an input and the expected output. Depending on the neural network architecture, each case may also need a score indicating how good the output is.
(b) Corner cases and extra data to capture subtle details or complexities.
(c) Held-out extra cases that are not used for training, but are reserved for evaluating the progress of the training (a sketch of such a data record follows this list).
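As a rough illustration of what such a record might look like in code, here is a minimal C++ sketch; the struct and field names are illustrative assumptions, not a format defined by the book or by any particular framework.

#include <optional>
#include <string>
#include <vector>

// One fine-tuning example: an input, the expected output, an optional
// quality score, and a flag marking held-out evaluation cases.
struct TrainingExample {
    std::string input;            // the prompt or source text
    std::string expected_output;  // the desired completion
    std::optional<float> score;   // quality rating, if the architecture needs one
    bool held_out = false;        // reserved for evaluation, not training
};

// Split the full dataset into training and held-out evaluation sets.
void split_dataset(const std::vector<TrainingExample>& all,
                   std::vector<TrainingExample>& train,
                   std::vector<TrainingExample>& eval) {
    for (const auto& ex : all) {
        if (ex.held_out) eval.push_back(ex);
        else train.push_back(ex);
    }
}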
Gathering the data is likely the hardest part of training, and the more training iterations you need to run, the more data you need. Training data management is mostly a non-coding task, involving processing of the data files: chunking, organizing, indexing, and generating embeddings. It is arduous to some extent, but not high in technical difficulty.
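For example, the chunking step can be done in a few lines. Here is a rough C++ sketch of fixed-size chunking with overlap; the chunk size and overlap values are arbitrary assumptions, and real pipelines often split on sentence or token boundaries instead.

#include <string>
#include <vector>

// Split a document into fixed-size character chunks with some overlap,
// so that information spanning a chunk boundary is not lost.
std::vector<std::string> chunk_text(const std::string& doc,
                                    size_t chunk_size = 1000,
                                    size_t overlap = 100) {
    std::vector<std::string> chunks;
    if (doc.empty() || chunk_size == 0) return chunks;
    size_t step = (overlap < chunk_size) ? chunk_size - overlap : chunk_size;
    for (size_t pos = 0; pos < doc.size(); pos += step) {
        chunks.push_back(doc.substr(pos, chunk_size));
        if (pos + chunk_size >= doc.size()) break;  // final chunk emitted
    }
    return chunks;
}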
Model-Based Training
Another way to train a model is to have it learn from another, previously trained system. Knowledge distillation is one such technique; it is already available in the major AI frameworks and has reached a high level of sophistication. A simpler method is to train a new model on the prompt-answer pairs produced by another, larger model.
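To make the distillation idea concrete, here is a minimal C++ sketch of the standard temperature-softened distillation loss (the common Hinton-style formulation); it illustrates the technique in general and is not the API of any particular framework.

#include <cmath>
#include <vector>

// Softmax over raw logits, softened by temperature T.
std::vector<float> softmax_temp(const std::vector<float>& logits, float T) {
    std::vector<float> probs(logits.size());
    if (logits.empty()) return probs;
    float maxv = logits[0];
    for (float v : logits) if (v > maxv) maxv = v;
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - maxv) / T);
        sum += probs[i];
    }
    for (float& p : probs) p /= sum;
    return probs;
}

// Cross-entropy of the student's softened distribution against the teacher's,
// scaled by T^2 as in the usual distillation objective.
float distillation_loss(const std::vector<float>& teacher_logits,
                        const std::vector<float>& student_logits,
                        float T = 2.0f) {
    std::vector<float> pt = softmax_temp(teacher_logits, T);
    std::vector<float> ps = softmax_temp(student_logits, T);
    float loss = 0.0f;
    for (size_t i = 0; i < pt.size(); ++i) {
        loss -= pt[i] * std::log(ps[i] + 1e-9f);
    }
    return loss * T * T;
}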