Dataset Distillation

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The technique of “dataset distillation” borrows similar terminology, but is a different technique from knowledge distillation. The term refers to methods that reduce a training dataset to a smaller, derived set of synthetic training data, for example to (theoretically) sidestep privacy or copyright concerns over the original data. The distilled dataset is much smaller, yet can theoretically be used to train a similarly capable model.
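
As a rough illustration of the idea, the sketch below compresses a toy two-class, two-dimensional dataset into a single synthetic sample per class, then checks that a trivial nearest-prototype classifier built from those two samples still labels the original data correctly. Everything in it is illustrative (the data, the one-prototype-per-class "distillation", and the nearest-prototype check); published methods such as Wang et al. (cited below) instead optimize the synthetic samples directly, e.g. by gradient matching, but the goal is the same: a much smaller synthetic set that can train a comparably capable model.

    // Toy sketch of the dataset distillation idea: compress each class
    // to one synthetic prototype (here, simply the class mean). This is a
    // deliberately simplified stand-in for real methods, which optimize
    // the synthetic samples against model gradients or training trajectories.
    #include <cstdio>
    #include <vector>

    struct Sample { float x, y; int label; };

    int main() {
        // Full (original) training set: two 2-D classes.
        std::vector<Sample> full = {
            {0.9f, 1.1f, 0}, {1.2f, 0.8f, 0}, {1.0f, 1.0f, 0}, {0.8f, 1.2f, 0},
            {3.1f, 2.9f, 1}, {2.8f, 3.2f, 1}, {3.0f, 3.0f, 1}, {3.2f, 3.1f, 1},
        };

        // "Distill" to one synthetic sample per class (the class mean).
        const int num_classes = 2;
        std::vector<Sample> distilled(num_classes);
        std::vector<int> counts(num_classes, 0);
        for (const Sample& s : full) {
            distilled[s.label].x += s.x;
            distilled[s.label].y += s.y;
            distilled[s.label].label = s.label;
            counts[s.label]++;
        }
        for (int c = 0; c < num_classes; c++) {
            distilled[c].x /= counts[c];
            distilled[c].y /= counts[c];
        }

        // Check that the tiny distilled set still "trains" a usable model:
        // a nearest-prototype classifier over the two synthetic samples.
        int correct = 0;
        for (const Sample& s : full) {
            int best = 0;
            float best_dist = 1e30f;
            for (int c = 0; c < num_classes; c++) {
                float dx = s.x - distilled[c].x;
                float dy = s.y - distilled[c].y;
                float d = dx * dx + dy * dy;
                if (d < best_dist) { best_dist = d; best = c; }
            }
            if (best == s.label) correct++;
        }
        printf("Distilled set: %d samples (from %d originals)\n",
               (int)distilled.size(), (int)full.size());
        printf("Nearest-prototype accuracy on the full set: %d/%d\n",
               correct, (int)full.size());
        return 0;
    }

On this toy data the two synthetic prototypes classify all eight original points correctly, which captures the spirit of dataset distillation: the distilled set is a fraction of the size, yet retains the information needed for training.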

Research papers on dataset distillation:

  1. T. Wang, J.-Y. Zhu, A. Torralba, A. A. Efros, 2018, Dataset Distillation, arXiv:1811.10959, https://arxiv.org/abs/1811.10959
  2. R. Yu, S. Liu, X. Wang, 2023, Dataset Distillation: A Comprehensive Review, https://arxiv.org/abs/2301.07014
  3. O. Honovich, T. Scialom, O. Levy, T. Schick, Dec 2022, Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor, https://arxiv.org/abs/2212.09689, https://github.com/orhonovich/unnatural-instructions (Using a model to automatically create a training dataset, including automatically generating both the instructions and the responses.)

For additional research papers on dataset distillation, see https://www.aussieai.com/research/knowledge-distillation#dataset-distillation.

