Aussie AI

Inference

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

What is inference? The term “inference” is the AI way of saying “running” or “executing” the AI model. Inference and training are different phases. When you're training or fine-tuning, that's not inference. But when you're done and deploy your model live for a nickel a query, that's inference. When you ask ChatGPT a question, you're sending a “query” or “prompt” to its “inference engine,” and when it politely refuses to do what you ask, that's the output of its “inference” code.

What are latency and throughput? Latency is how long your inference engine takes to answer a single query, so lower is better. It's similar to the idea of “response time” for a single user. Throughput is a measure of how many queries your engine can handle per unit of time (e.g. queries per second), which relates more to how well your engine copes with a group of users submitting many queries.

Types of inference. There's more than one type of inference, and the exact algorithm depends on what you're trying to do. Some of the types include:

  • Completions. This means extending the prompt into a longer answer. Common use cases include auto-writing text or answering questions.
  • Translation. Convert your Python code comments into Klingon.
  • Summarization. Taking the input prompt, such as a paragraph or document, and creating a brief summary.
  • Grammatical Error Correction (GEC). Also known to non-researchers as “editing.”
  • Transformation. Changing the tone or style of a text input, or changing the formatting and presentation.
  • Categorization. Sorting the inputs into a set of different categories, which is effectively a subset of summarization.

Inference Settings. In addition to choosing the overarching type of inference algorithm, there are some common parameters to control an inference request.

  • Temperature. A higher temperature setting gives your engine a fever, and makes it output silly words. This is known as “creativity.”
  • Token limit. This is the simple idea of limiting the number of tokens (roughly, words or pieces of words) that the engine is allowed to output in its response.
  • Formatting. Do you want the engine to output plain text, a table, or some other format?

This is only a sample list, and API providers typically have many more options. There are also usually various other parameters related to security and tracking of requests. For example, you probably have to submit your security credentials (e.g. an API key or password) along with a unique ID for the request. The credentials let the API validate your request, and the ID helps you keep track of which end user submitted the request so you can send the answer back to them.
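Two of the settings above, temperature and the token limit, are simple enough to sketch in C++. This is an illustration under assumptions rather than any particular engine's code: the function names are hypothetical, temperature is applied in the common way of dividing the model's raw scores (logits) by it before the softmax, and `next_token` stands in for whatever the engine does to pick the next token.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Turn logits into probabilities, scaled by temperature (must be > 0).
// Temperature below 1 sharpens the distribution toward the top token;
// temperature above 1 flattens it, so unlikely tokens get picked more often.
std::vector<double> softmax_with_temperature(const std::vector<double>& logits,
                                             double temperature) {
    std::vector<double> probs(logits.size());
    if (logits.empty()) return probs;
    // Subtract the maximum logit first for numerical stability.
    double max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temperature);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;
    return probs;
}

// Hypothetical decoding loop showing the token limit. next_token is any
// callable returning a token id; eos_token is the end-of-output marker.
template <typename NextToken>
std::vector<int> generate(NextToken&& next_token, int eos_token,
                          std::size_t max_tokens) {
    std::vector<int> output;
    while (output.size() < max_tokens) {   // stop at the token limit
        int tok = next_token();
        if (tok == eos_token) break;       // or when the model finishes early
        output.push_back(tok);
    }
    return output;
}
```

The token limit is a hard stop, so a response that hits it is simply cut off, whereas temperature only changes how likely each token is to be chosen.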

 
