
5. Design Choices & Architectures

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

“Design is the fundamental soul of a human-made creation
that ends up expressing itself in successive outer layers
of the product or service.”

— Steve Jobs.

Choosing Your AI Project

What's the project? Here are some examples of common projects for business usage of AI:

  • Support chatbot for your website that directly answers customer questions about your products.
  • Q&A internal service for support staff to help answer questions and offer “scripts” to follow.
  • In-house Q&A service to answer staff questions about your products (with more in-depth answers possible than with a public chatbot).
  • In-house HR chatbot to answer staff questions about policies and internal company matters.

Another common type of AI project is to get certain groups of staff trained up with AI tools to improve their productivity. These are the various “copilot” styles of AI tools, and they can be used by many different company teams, even programmers. An important distinction here is that such copilot tools may not require training on any specific in-house data.

Planning and Requirements

Making a plan might not be a bad idea, considering the potential cost outlay involved in an AI project. You're probably familiar with the general issues of project planning with regard to staffing and resourcing a project, so I'll focus mainly on the AI-specific issues.

Researching your AI project will involve issues such as:

  • The specific AI use case.
  • In-house proprietary data that could be used for training.
  • Existing staff AI expertise levels.
  • Capacity of existing hardware for training or inference workloads.
  • Vendors and costs of AI-specific hosting versus in-house capabilities.

Some of the specific decisions in moving ahead with a project plan include:

  • Use case specific requirements
  • Proprietary training data cleansing
  • Choice of foundational model
  • Commercial versus open source models
  • Training or fine-tuning versus RAG

It's not all about AI. General tech project requirements also apply:

  • User interface platform
  • Backend hosting and deployment issues
  • Development processes
  • Security risk mitigations
  • Backup and recovery procedures

In addition to technology issues, there are also broader legal and regulatory issues to consider, such as:

  • Responsible AI (safety issues)
  • Governmental AI regulatory compliance
  • Internet regulatory compliance (non-AI)
  • Organizational legal compliance (e.g. HIPAA, SOC 2)
  • Copyright law
  • Privacy law

Top 10 Really Big Optimizations

Most of this book is about optimizing your AI engine, including its C++ code and model structure. But first, let's take a step back and consider the massive optimizations for your entire project. Here are some ways to save megabucks:

    1. Buy an off-the-shelf commercial AI-based solution instead.

    2. Wrap a commercial model rather than training your own foundation model (e.g. OpenAI API).

    3. Test multiple commercial foundational model API providers and compare pricing.

    4. Use an open source pre-trained model and engine (e.g. Meta's Llama models).

    5. Avoid fine-tuning completely via Retrieval-Augmented Generation (RAG).

    6. Choose smaller model dimensions when designing your model.

    7. Choose a compressed open source pre-trained pre-quantized model (e.g. quantized Llama).

    8. Cost-compare GPU hosting options for running your model.

    9. Use cheaper commercial API providers for early development and testing.

    10. Use smaller open-source models for early development and testing.

If ten isn't enough for you, don't worry, I've got more! Roll up your sleeves and look at all the research on optimizations in Part VII.

Build versus Buy

Before we dive into the mechanics of building your own AI thingummy in a huge project, it's worth considering the various existing products and tools. You might not need a development project at all, but simply a DevOps project to integrate a new third-party commercial product into your company's infrastructure.

For example, if your project goal is to have staff writers being more productive in creating drafts of various documents or marketing copy, there's this product called ChatGPT from OpenAI. Maybe you've heard of it?

Actually, there are any number of other tools for writer productivity using AI capabilities, some of which use ChatGPT underneath, and some of which are independent. Similarly, there are already a number of “AI coding copilot” type products, which might make your C++ programmers even more amazingly, astoundingly, RSU-worthily useful than they already are. Across the whole spectrum of creative endeavors, there are also numerous AI products that create images, animations, 3D models, and videos.

More generally, there are starting to be AI products for almost every use case that you can think of, and in all of the major industry verticals (e.g. medicine, law, finance, etc.) so it's worth a little research as to what's currently available that might suit your needs. I'm reluctant to offer lists, because it's changing daily. Anyway, it's not my job to review them; it's yours!

Overall, it's fun to build anything with AI technology, but it's faster to use something that's already been built. And these new AI tools are actually so amazing that it's also fun to test them.

Foundation Model Choices

What model are you going to use as the Foundation Model? There are really three major options:

  • Commercial models
  • Open source models
  • Build Your Own (BYO)

Of course, there's that fourth option of not using AI, which, as anyone in the AI industry will tell you, leads to analysts shunning your stock, instant bankruptcy, and your toenails catching on fire.

Building your own model is a viable option for small to medium models that you want to train on your own data set. However, only the major tech companies have been successful at training massive LLM foundation models, given the expertise and expense required for training.

The alternative is to choose an existing foundation model that is pre-trained on lots of general data. Then you fine-tune that model on whatever proprietary data you want to use.

If you have no specific extra data for fine-tuning, then you're basically using a commercial or open source model underneath. You can still achieve significant customization of an existing model without fine-tuning, using techniques such as prompt engineering, Retrieval-Augmented Generation (RAG), and the simple idea of mixing heuristics with AI inference results.
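
For example, one simple way to mix heuristics with AI inference is to answer the easy, high-frequency questions from a hand-built lookup table, and only pay for a model call on everything else. Here's a minimal C++ sketch of the idea, where llm_chat is a hypothetical stand-in for your real model or API call:

    #include <map>
    #include <string>

    // Hypothetical stand-in for a real model or commercial API call.
    std::string llm_chat(const std::string& prompt) {
        return "(model-generated answer to: " + prompt + ")";
    }

    std::string answer(const std::string& question) {
        // Hand-built heuristic table for high-frequency questions.
        static const std::map<std::string, std::string> faq = {
            { "opening hours", "We're open 9am-5pm, Monday to Friday." },
            { "refund",        "Refunds are available within 30 days of purchase." },
        };
        for (const auto& [keyword, reply] : faq) {
            if (question.find(keyword) != std::string::npos)
                return reply;  // heuristic hit: zero inference cost
        }
        return llm_chat(question);  // otherwise, pay for a model call
    }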

Open Source Models

When ChatGPT burst into public consciousness in early 2023, there were already lots of open source models. However, they were mostly smaller models and nowhere near as capable. Nevertheless, you could get a lot of value from them at no cost.

Open source models took a huge jump forward when Meta released its Llama model to the open source world. It was licensed only for non-commercial and research purposes, but it was immediately used in numerous ways by the open source community. This model was also used in various ways to create other models that were, theoretically at least, freed of the non-commercial limitations of the original Llama license. That legal issue was never tested and became moot shortly afterwards when Llama version 2 came out.

Meta open-sourced its Llama2 model for both commercial and non-commercial usage in July 2023. The license was non-standard, but for most users (those who were not already large companies), it was largely free of restrictions. You should review the details of the Llama2 license yourself, along with any future Meta model releases, but it has already been widely used in the open source community.

Commercial-Usage Open Source Models

Although Llama2 probably tops the list, there are several other major models that have been open-sourced under permissive licenses. Again, you should check these license details yourself, as even the permissive licenses impose some level of restrictions or obligations. Here is my list of some of the better models that I think can be used commercially:

  • Llama2 from Meta (Facebook Research), under a specific license called the Llama 2 Community License Agreement.
  • Mistral 7B and Mixtral 8x7B from Mistral AI (both under an Apache 2.0 license).
  • MPT-7B from MosaicML (Databricks) (Apache 2.0 license).
  • Falcon 7B/40B from the Technology Innovation Institute (TII) (Apache 2.0 license).
  • FastChat-T5 from LMSYS (Apache 2.0 license).
  • Cerebras-GPT models (Apache 2.0 license).
  • GPT4All models (various); some under an MIT License.
  • H2O GPT AI model (Apache 2.0 license).
  • Orca Mini 13B (MIT License).
  • Zephyr 7B Alpha (MIT License).

This list is already out-of-date as you read this, I'm sure. There are new models coming out regularly, and there are also various new models being created from other models, such as quantized versions and other re-trained derivative models.

Model Size

Choosing a model size is an important part of the project. For starters, the size of a model has a direct correlation to the cost of both training and inference in terms of GPU juice. Making an astute choice about the type of model you need for your exact use case can have a large impact on the initial and ongoing cost of an AI project.
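
As a rough back-of-envelope calculation, weight memory is approximately the parameter count multiplied by the bytes per weight, so a 7B model needs about 14 Gigabytes at FP16, or around 3.5 Gigabytes when quantized to 4 bits. Here's a small C++ snippet that prints this arithmetic for a few illustrative model sizes:

    #include <cstdio>

    int main() {
        // Back-of-envelope weight memory: parameters x bytes per weight.
        // (1 billion parameters at 1 byte each is roughly 1 Gigabyte.)
        const double params_billion[] = { 7.0, 13.0, 70.0 };
        const struct { const char* name; double bytes; } fmt[] = {
            { "FP32", 4.0 }, { "FP16", 2.0 }, { "INT8", 1.0 }, { "4-bit", 0.5 },
        };
        for (double p : params_billion)
            for (const auto& f : fmt)
                printf("%3.0fB params @ %-5s ~ %6.1f GB\n", p, f.name, p * f.bytes);
        return 0;
    }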

There's no doubt that bigger models are enticing. The general rule seems to be that bigger models are more capable, and a multi-billion parameter model seems to be table stakes for a major AI model these days. And the top commercial models are starting to exceed a trillion parameters.

However, some research is starting to cast doubt on this trend, at least in that ever-larger models may not always result in increased intelligence. For example, GPT-4 is rumored to be eight models merged together in a Mixture-of-Experts (MoE) architecture, each of about 220B parameters, rather than one massive model of 1.76T parameters.

Quality matters, not just quantity. The quality of the data set used for training, and the quality of the various training techniques, are both important. That quality is important for intelligence shouldn't be surprising. In fact, what should be surprising is that quantity has been so successful at raising AI capabilities.

Model optimizations. How can you have a model that's smarter and faster and cheaper? Firstly, the open source models have improved quickly and continue to do so. Some are starting to offer quite good functionality at very fast speeds. There are models that have been compressed (e.g. quantization, pruning, etc.), and there are open source C++ engines that offer various newer AI optimization features (e.g. Flash Attention). You can download both the models and the engine source code, and run the open source models yourself (admittedly, with hosting costs for renting your own GPUs, or for using a commercial GPU hosting service). Furthermore, this book has numerous chapters on improving the performance of an AI engine written in C++, which applies to most of the open source engines.
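
To make the quantization idea concrete, here's a minimal C++ sketch of symmetric linear INT8 quantization, mapping each weight to an 8-bit integer with a single scale factor per tensor. This is a simplified illustration, not the exact scheme used by any particular engine:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct QuantizedTensor {
        std::vector<int8_t> data;
        float scale;  // multiply by this to approximately recover a weight
    };

    // Symmetric linear quantization: floats in [-max, +max] map to [-127, +127].
    QuantizedTensor quantize_int8(const std::vector<float>& weights) {
        float maxabs = 0.0f;
        for (float w : weights) maxabs = std::max(maxabs, std::fabs(w));
        QuantizedTensor q;
        q.scale = (maxabs > 0.0f) ? maxabs / 127.0f : 1.0f;
        q.data.reserve(weights.size());
        for (float w : weights)
            q.data.push_back(static_cast<int8_t>(std::lround(w / q.scale)));
        return q;  // 4x smaller than FP32, at some loss of precision
    }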

For a commercial API, you can't change their engines until you apply for a job there. However, you can reduce the number of queries being sent to a commercial API, mainly by putting a cache in front of the calls. This cuts costs and speeds up replies for common prompts (or similar ones), with the trade-off that non-cached queries have a slightly slower response time from the additional failed cache lookup. Chapter 29 examines using an “inference cache” or a “semantic cache” via a vector database. An inference cache is a cache of the responses to identical queries, whereas a semantic cache finds “close-enough” matches in prior queries using nearest-neighbor vector database lookups.
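
As a preview, here's a minimal C++ sketch of an exact-match inference cache: identical prompts return the stored response and skip the model call entirely. A real implementation would need a size cap and an eviction policy, and a semantic cache would replace the exact-match lookup with a nearest-neighbor vector database query. The call_commercial_api function is a hypothetical wrapper around whatever API you're using:

    #include <string>
    #include <unordered_map>

    // Hypothetical wrapper around a commercial API request.
    std::string call_commercial_api(const std::string& prompt) {
        return "(response from the commercial API for: " + prompt + ")";
    }

    class InferenceCache {
    public:
        // Returns true and fills response on a hit for an identical prompt.
        bool lookup(const std::string& prompt, std::string& response) const {
            auto it = cache_.find(prompt);
            if (it == cache_.end()) return false;  // the one extra failed lookup
            response = it->second;
            return true;
        }
        void store(const std::string& prompt, const std::string& response) {
            cache_[prompt] = response;  // real code would cap size and evict
        }
    private:
        std::unordered_map<std::string, std::string> cache_;
    };

    std::string cached_chat(InferenceCache& cache, const std::string& prompt) {
        std::string response;
        if (cache.lookup(prompt, response)) return response;  // free and fast
        response = call_commercial_api(prompt);  // slow and costly
        cache.store(prompt, response);
        return response;
    }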

Software Architecture

If you're building a significant portion of the software architecture for your AI project, then remember that nothing is more important to the speed of a program than its architecture. I mean, look at AI. The whole architecture is a massive fail, endlessly bloated with far too many weights and a brute-force algorithm. Sadly, that's the best we've come up with so far, but there's a lot of research on these architectural issues that may eventually solve them.

Anyway, as a professional C++ programmer, it's not difficult to choose a better architecture for your AI project. Fortunately, the best software architecture in the world is well-known to everyone, and is clearly this one:

  • Object-oriented objects (OOO)
  • Client-server
  • Server-client
  • Message passing
  • Thin client
  • Gamification
  • Virtualization with Vectorization
  • Model-View-Controller (MVC)
  • UI-Application-Database (3-level)
  • Event-Driven Architecture (EDA)
  • #include "beer.h"
  • Clouds
  • Fog computing
  • Postel's law
  • RTFM
  • Microservices architecture
  • Service Oriented Architecture (SOA)
  • RESTful API architecture
  • Intelligent Autonomous Agent (IAA)
  • Intentional virality
  • Goto considered helpful

Actually, sorry, that wasn't the best architecture in the world; it was just a tribute.

AI Tech Stack

The tech stack for an AI project is similar to a non-AI project, with a few extra components. The choice of underlying hardware (i.e. GPUs) also matters much more than in many other types of projects. The tech stack looks something like this:

  • User interface (client)
  • Web server (e.g. Apache or Nginx)
  • Application server
  • Load balancer (e.g. HAProxy) or message queue (e.g. Apache Kafka)
  • AI request manager
  • AI Inference Engine (and model)
  • Operating system (e.g. Linux vs Windows)
  • CPU hardware (e.g. Intel vs AMD)
  • GPU hardware (e.g. NVIDIA V100 vs A100)

Some of these layers are optional or could be merged into a single component. Also, if you're using a remote hosted AI engine, whether open source hosting or wrapping a commercial engine through their API, then the bottom layers are not always your responsibility.

AI engine choices. How much of your AI tech stack will you control? If you have full control over the hardware and software, it makes sense to make symbiotic choices that allow maximum optimization of the combined system. For example, if you've decided to run the system on a particular GPU version, then your AI engine can assume this hardware acceleration is available, and doesn't need to waste resources on ensuring that the engine's C++ software runs on any other hardware platform.
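
As an illustration, here's a C++ sketch of committing to one hardware target at compile time. USE_CUDA is a hypothetical project-specific build macro: if you control the whole stack, the GPU build can simply fail fast rather than carrying portable fallback code.

    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>

    #if defined(USE_CUDA)
    #include <cuda_runtime.h>
    // GPU-only build: assume the accelerator is present, no fallback path.
    void* alloc_weights(std::size_t bytes) {
        void* ptr = nullptr;
        if (cudaMalloc(&ptr, bytes) != cudaSuccess) {
            std::fprintf(stderr, "GPU required but unavailable\n");
            std::exit(1);  // we own the hardware choice, so this is fatal
        }
        return ptr;
    }
    #else
    // Portable CPU path for builds on other platforms.
    void* alloc_weights(std::size_t bytes) {
        return std::malloc(bytes);
    }
    #endif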

Financial Optimizations

An AI project is expensive in terms of the hardware, the software, and the people you need. There are some considerations that can reduce the cost somewhat.

Use existing assets. What internal data assets do you possess? Can you re-purpose any of your company's existing hardware assets? And can you “re-purpose” any of your staff, too?

Buy vs rent. If it's floating, flying, or foundational modeling: rent, don't buy! Similarly, do you need to buy your own servers and GPUs? The decision may be different for the different phases of a project:

  • Development and testing
  • Training the model
  • Inference (live execution)

For example, you might want to buy for training phases and rent for the inference phase. This depends on how much training you need, the size of your model, and whether you plan to avoid fine-tuning on proprietary data by using RAG instead. The cost of inference depends on user volumes, which are significantly different for an internal employee project versus a live public user application.

Idle VMs and GPUs. Watch out for virtual machines and rented GPUs being idle early in the project. You're paying money for nothing in such cases. This can occur in the development phases and in the early live deployment when user levels are low.

Scrimp on developer models. During the development and testing phases, there's no need for gold-plated AI models. The cost of development and testing of your AI application can be reduced by using low-end models for simple testing. Many of the components needed are not dependent on whether the AI engine returns stellar results. Initial development, prototyping, and ongoing regression testing of these parts of the system can proceed with small models.

There is also vendor support for testing on lower-end models. Various other AI platforms offer interfaces that mimic OpenAI's API, but at a lower cost, so you can develop and test on these platforms, and then do final testing on the live commercial platform.
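
One simple way to wire this up is to select the endpoint and model name from the environment, so development builds point at a cheap OpenAI-compatible test server while production points at the live platform, with no code changes. The environment variable names here are hypothetical:

    #include <cstdlib>
    #include <string>

    struct LlmConfig {
        std::string base_url;
        std::string model;
    };

    // Hypothetical environment variables choose the platform and model,
    // so dev/test versus production is purely a configuration change.
    LlmConfig config_from_env() {
        const char* url   = std::getenv("LLM_BASE_URL");
        const char* model = std::getenv("LLM_MODEL");
        return {
            url   ? url   : "http://localhost:8080/v1",  // e.g. a local test server
            model ? model : "small-test-model",          // low-end development model
        };
    }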

 
