Aussie AI
Portability
Last Updated 3 September, 2024
by David Spuler, Ph.D.
Portability in AI C++ programming is correctly tolerating the underlying tech stack, including the operating system, CPU, and GPU capabilities. The first level of portability is "toleration" where the program must at least work correctly on whatever platform it finds itself. The second level is "exploiting" the specific features of a particular tech stack, such as making the most of whatever GPU hardware is available.
In one sense, portability is an issue that can be ignored in some cases. If you have control over your hardware and software tech stack, you only need one platform to work, and you can optimize for exactly that platform. So feel free to skip this entire discussion in such situations!
Modern Portability
Ah, yes, I remember portability. Early portability was whether it was a ZX81 or an 8086. Then it was whether it was SunOS, Solaris, SGI, Ultrix, or Irix (I missed a few). And then it was Windows 95 versus Windows NT. And then it was detecting Windows versus Linux. And then it was iOS or Android.
Which brings us up to date. And now portability for AI C++ is detecting things like:
- OS configuration settings
- Software package versions
- Virtual machine settings
- Hardware acceleration GPU capabilities
Why does upgrading Python versions take hours, but updating GCC doesn't? I'm just saying... that I like C++. I don't mean anything by it.
Basics of Portability
The basic approach to writing portable code is:
- 1. Write generic code portably, and
- 2. Write platform-specific code where needed.
Write portable code: Most of your AI C++ application should be written in portable C++. The majority of the C++ programming language is well-standardized, and a lot of code can be written that simply compiles on every platform you target, and has the same functionality on each. You just have to avoid the portability pitfalls.
Platform-specific coding: Small sections of C++ code will need to be written differently on each platform, especially when using interfaces to hardware acceleration and other intrinsic functions. Most C++ programmers are familiar with using #if or #ifdef preprocessor directives to handle different platforms. And the various flavors of this are discussed further below.
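As a rough illustration of the #if approach, here is a minimal sketch of compile-time platform detection. The predefined macros (_WIN32, __APPLE__, __ANDROID__, __linux__) are the commonly used compiler-provided ones, but the YAPI_PLATFORM_NAME macro and the yapi_platform_name function are hypothetical names invented for this example.

```cpp
#include <cstdio>

// Compile-time platform detection using common compiler-predefined macros.
// YAPI_PLATFORM_NAME is a hypothetical name used only in this sketch.
// Android is tested before Linux because Android compilers define both.
#if defined(_WIN32)
#define YAPI_PLATFORM_NAME "Windows"
#elif defined(__APPLE__)
#define YAPI_PLATFORM_NAME "Apple"
#elif defined(__ANDROID__)
#define YAPI_PLATFORM_NAME "Android"
#elif defined(__linux__)
#define YAPI_PLATFORM_NAME "Linux"
#else
#define YAPI_PLATFORM_NAME "Unknown"
#endif

// Returns the platform name that was selected at compile-time.
const char* yapi_platform_name()
{
    return YAPI_PLATFORM_NAME;
}
```

Keeping the #if chain in one place like this means the rest of the code can call a normal function rather than repeating the preprocessor tests everywhere.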
Advanced Portability Practices
The basic best practice is to write portable code until you can't. Here are some suggestions to further finesse your portability coding practices:
- 1. Self-test portability issues at startup.
- 2. Print out platform settings into logs.
A good idea is to self-test that certain portability settings meet the minimum requirements of your application. And you probably should do that even in the production versions that users run, not just in the debugging versions. It's only a handful of lines of code that can save you a lot of headaches later.
Also, you should detect and print out the current portability settings as part of the program's output (or report), or at least to the logs. Ideally, you would actually summarize these settings in the user's output display, which helps the poor phone jockeys trying to answer callers offering very useful problem summaries: "My AI doesn't work."
If it's not a PEBKAC, then having the ability to get these platform settings to put into the incident log is very helpful in resolving production-level support issues. This is especially true if you have users running your software on different user interfaces, and, honestly, if you don't support multiple user interfaces, then what are you doing here?
You should also output backend portability settings for API or other backend software products. The idea works the same even if your "users" are programmers who are running your code on different hardware platforms or virtual machines, except that their issue summaries will be like: "My kernel fission optimizations of batch normalization core dump from a SIGILL whenever I pass it a Mersenne prime."
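Here is one possible sketch of a settings summary for the logs. It assumes the usual compiler version macros (_MSC_VER, __clang_major__, __GNUC__) and the standard __cplusplus macro; the yapi_platform_summary helper name is hypothetical, and a real report would probably include many more settings.

```cpp
#include <cstdio>
#include <string>

// Builds a one-line summary of compiler and platform settings for the log.
// yapi_platform_summary is a hypothetical helper name for this sketch.
std::string yapi_platform_summary()
{
#if defined(_MSC_VER)
    const char* compiler = "MSVC";
    long version = _MSC_VER;
#elif defined(__clang__)
    const char* compiler = "Clang";
    long version = __clang_major__;
#elif defined(__GNUC__)
    const char* compiler = "GCC";
    long version = __GNUC__;
#else
    const char* compiler = "unknown";
    long version = 0;
#endif
    char buf[256];
    std::snprintf(buf, sizeof(buf),
        "Config: compiler=%s version=%ld cplusplus=%ld pointer_bits=%d",
        compiler, version, (long)__cplusplus, 8 * (int)sizeof(void*));
    return std::string(buf);
}
```

Calling this once at startup and writing the string to the log gives support staff something concrete to ask for.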
Compilation Problems
C++ has been standardized for decades, or it seems like that. So I feel like it should be easier to get C++ code to compile. And yet, I find myself sometimes spending an hour or two getting past a few darn compiler errors.
Some of the main issues that cause a C++ program to compile on one C++ compiler (e.g. MSVS) but not on another (e.g. GCC) include:
- Const correctness
- Permissive versus non-permissive modes
- Pointer type casting
Const correctness refers to the careful use of "const" to mark not just named constants, but also all unchanging read-only data types. If it's "const" then it cannot be changed; if it's non-const, then it's writable. People have different levels of feelings about whether this is a good idea. There are the fastidious Vogon-relative rule-followers who want it, and the normal reasonable pragmatic people who don't. Can you see which side I'm on?
Anyway, to get non-const-correct code (i.e. mine) to compile on GCC or MSVS, you need to turn off the fussy modes. On MSVS, the fussy mode is the "Conformance mode" setting in Project Settings (the /permissive- compiler option), which you have to turn off to get permissive behavior. On GCC, the -fpermissive option downgrades some of these errors to warnings.
Pointer type casting is another issue. C++ for AI has a lot of problems with pointer types, mainly because C++ standardizers back in the 1990s neglected to create a "short float" 16-bit floating point type. Theoretically, you're not supposed to cast between different pointer types, like "int*" and "char*". And theoretically, you're supposed to use "void*" for generic addresses, rather than "char*" or "unsigned char*". But, you know, this is AI, so them rules is made to be broken, and the C++ standardizer committees finally admitted as much when they created the various special types of casts about 20 years later.
Anyway, the strategies for getting a non-compiling pointer cast to work include:
- Just casting it to whatever you want.
- Turning on permissive mode
- Casting it to void* and back again (i.e. "x=*(int*)(void*)(char*)&c")
- Using "reinterpret_cast" like a Goody Two-Shoes.
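For comparison, here is a small sketch of two ways to look at the raw bytes of a float without fighting the compiler. The memcpy approach is not one of the strategies listed above; it is swapped in here as a commonly recommended alternative. The function names are hypothetical, and the code assumes a 32-bit float.

```cpp
#include <cstdint>
#include <cstring>

// Reads the raw 32 bits of a float using memcpy, which avoids casting
// between unrelated pointer types (assumes float is 32 bits).
std::uint32_t yapi_float_bits_memcpy(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t u = 0;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}

// Reads the first byte of a float through an unsigned char pointer.
// Viewing an object as unsigned char bytes is the one pointer cast
// that the standard explicitly allows for any object type.
unsigned char yapi_float_first_byte(const float& f)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&f);
    return p[0];
}
```

Modern compilers generally optimize the small fixed-size memcpy into a plain register move, so this need not cost anything at runtime.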
User Interface Portability
Most of the discussion here focuses on the portability of C++ coding on the backend, where the AI engine is running. But the user doesn't give a hoot about that stuff, and only cares about their user interface. Which brings us back to iOS versus Android.
Yeah, I know, you're a professional C++ programmer sitting there with two screens as big as a mammoth's ears. But your users are on these tiny little things that fit in their purse.
Most of the user interface issues are the same for AI applications as they are for non-AI applications. The methods to detect the type of the end user's device are the same in AI programs as they are for all types of programs.
Runtime Portability Pitfalls
Most of the low-level arithmetic code for AI algorithms looks quite standardized. Well, not so much. The general areas where C++ code that looks standard is actually non-portable include trappy issues such as:
- Data type byte sizes (e.g. how many bytes is an "int")
- Arithmetic overflow of integers or floats
- Integer operators and negatives (e.g. % and >> operators)
- Floating point oddities (e.g. negative zero and NaN)
- Divide-by-zero doesn't always crash
- Pointer versus integer sizes (e.g. do void pointers fit inside an int?)
- Endian-ness of integer byte storage (i.e. do you prefer "big endian" or "little endian"?)
- Zero bytes versus zero integers
- Order of evaluation of expression operands (e.g. with side-effects)
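Two of the items in this list can be probed at startup with very little code. The following is a sketch only, with hypothetical function names: one check for endian-ness, and one for whether the >> operator keeps the sign on negative integers (which was implementation-defined before C++20).

```cpp
#include <cstdint>
#include <cstring>

// Returns true if the lowest-addressed byte of an integer holds its
// least significant bits (i.e. the platform is little endian).
bool yapi_is_little_endian()
{
    std::uint32_t value = 1;
    unsigned char first = 0;
    std::memcpy(&first, &value, 1);  // Copy only the first byte
    return first == 1;
}

// Returns true if >> on a negative int preserves the sign (arithmetic
// shift). The volatile stops the compiler folding this at compile-time.
bool yapi_shift_is_arithmetic()
{
    volatile int x = -8;
    return (x >> 1) == -4;
}
```

Both results are good candidates for the startup log, and for an assertion if your code depends on one particular answer.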
And there are various other portability issues arising at a higher-level than the AI arithmetic data processing, such as the inputs and outputs of the program. Problematic areas include:
- Text files (e.g. '\n' on Linux versus '\r\n' on Windows)
- UTF8 versus Latin1 encodings (e.g. for tokenization)
- Unicode special characters
- EBCDIC versus ASCII (character-level problems in tokens)
- Operating system accesses (e.g. processes and file permissions)
- Signal handling (low-level)
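The text file line-ending problem is one of the easier ones to tolerate. Here is a minimal sketch, assuming lines have already been read into a std::string (for example with std::getline); the yapi_strip_cr helper name is hypothetical.

```cpp
#include <string>

// Removes one trailing '\r' that is left behind when a Windows-style
// "\r\n" text file is read on Linux or in binary mode.
// yapi_strip_cr is a hypothetical helper name for this sketch.
std::string yapi_strip_cr(std::string line)
{
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
    return line;
}
```

Applying this to every input line before tokenization avoids stray carriage-return characters sneaking into the token stream.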
Data Type Sizes
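The typical AI C++ engines work with 32-bit floats ("float" type) or 32-bit integers ("int"). If you assume that "short" is 16-bit, "int" is 32-bit, and "long" is 64-bit, well, you'd be incorrect. The C++ standard only guarantees minimum sizes (at least 16 bits for "short" and "int", at least 32 bits for "long") and that "long" is at least as big as "int". For example, "long" is 32 bits on 64-bit Windows, but 64 bits on 64-bit Linux.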
Your startup portability check should check that sizes are what you want:
    // Test basic numeric sizes
    yassert(sizeof(int) == 4);
    yassert(sizeof(float) == 4);
    yassert(sizeof(short) == 2);
And you should print them out in a report, or to a log file. Here's a useful way with a macro that uses the "#" stringize preprocessor operator and also the standard adjacent string concatenation feature of C++.
    #define PRINT_TYPE_SIZE(type) \
        printf("Config: sizeof " #type " = %d bytes (%d bits)\n", \
            (int)sizeof(type), 8*(int)sizeof(type));
You can print out whatever types you need:
    PRINT_TYPE_SIZE(int);
    PRINT_TYPE_SIZE(float);
    PRINT_TYPE_SIZE(short);
Here's the output on my Windows laptop with MSVS:
    Config: sizeof int = 4 bytes (32 bits)
    Config: sizeof float = 4 bytes (32 bits)
    Config: sizeof short = 2 bytes (16 bits)
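The yassert macro used in these checks is not defined in this section. As a guess at what such a macro might look like, here is a minimal sketch that reports the failing expression with its file and line, and counts failures rather than aborting. The yassert_fail function and g_yassert_failures counter are hypothetical names, and the real macro may well behave differently.

```cpp
#include <cstdio>

// Count of failed checks, so startup code can decide how to react.
static int g_yassert_failures = 0;

// Reports a failed condition with its source location (does not abort).
static void yassert_fail(const char* expr, const char* file, int line)
{
    ++g_yassert_failures;
    std::fprintf(stderr, "Assertion failed: %s (%s:%d)\n", expr, file, line);
}

// Hypothetical sketch of a yassert macro: evaluates the condition once,
// and uses the "#" stringize operator to print the failing expression.
#define yassert(cond) \
    ((cond) ? (void)0 : yassert_fail(#cond, __FILE__, __LINE__))
```

Because this version keeps running after a failure, a production build can print every failed portability check in one go before deciding whether to exit.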
16-Bit Data: For quantization to 16 bits, you might use a 16-bit integer ("short"). For 16-bit floats (FP16), maybe you can find a platform-specific way to do 16-bit float types in C++, since there's no standard way at the time of writing.
Standard Library Types: Other data types to consider are the builtin ones in the standards. I'm looking at you, size_t and time_t, and a few others that are on Santa's naughty list. People often assume that size_t is the same as "unsigned int" but on 64-bit platforms it's usually a 64-bit type, such as "unsigned long" on Linux or "unsigned long long" on Windows.
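The same reporting idea extends to the standard library types. This is a sketch with a hypothetical helper name, yapi_std_type_sizes, that formats the sizes of a few of these types into one log line; the fixed-width types from <cstdint> are included for comparison.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <ctime>
#include <string>

// Formats the byte sizes of some standard library types for the log.
// yapi_std_type_sizes is a hypothetical helper name for this sketch.
std::string yapi_std_type_sizes()
{
    char buf[200];
    std::snprintf(buf, sizeof(buf),
        "Config: size_t=%d time_t=%d ptrdiff_t=%d int32_t=%d int64_t=%d bytes",
        (int)sizeof(std::size_t), (int)sizeof(std::time_t),
        (int)sizeof(std::ptrdiff_t), (int)sizeof(std::int32_t),
        (int)sizeof(std::int64_t));
    return std::string(buf);
}
```

Where the exact width matters (e.g. in quantized weights or file formats), the fixed-width types such as std::int32_t are usually a safer choice than "int" or "long".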
16-bit Float Types
The main C++ compilers at the time of writing (Oct 2023) do not have any great builtin support of 16-bit floating point types. There's no "short float" type, for example, on GCC or Microsoft Visual Studio C++. There is some standard support written into the C++23 standard (the optional std::float16_t and std::bfloat16_t types in the <stdfloat> header), but not many compilers are there yet.
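One way to tolerate this patchy support is to detect the C++23 type at compile-time and fall back to raw 16-bit storage otherwise. The sketch below assumes the C++23 feature macro __STDCPP_FLOAT16_T__ and the <stdfloat> header; the names half_t, kHasFloat16, and YAPI_HAVE_STDFLOAT are hypothetical. Note that the fallback type only stores 16 bits and cannot do any floating point arithmetic.

```cpp
#include <cstdint>

// Include <stdfloat> only if the compiler's library actually ships it.
#if defined(__has_include)
#if __has_include(<stdfloat>)
#include <stdfloat>
#define YAPI_HAVE_STDFLOAT 1
#endif
#endif

#if defined(YAPI_HAVE_STDFLOAT) && defined(__STDCPP_FLOAT16_T__)
// C++23 path: a real 16-bit floating point type is available.
using half_t = std::float16_t;
constexpr bool kHasFloat16 = true;
#else
// Fallback path: raw 16-bit storage only, with no arithmetic meaning.
using half_t = std::uint16_t;
constexpr bool kHasFloat16 = false;
#endif

static_assert(sizeof(half_t) == 2, "half_t must be exactly 16 bits");
```

The kHasFloat16 flag is another setting worth printing in the startup report, since it changes which code paths the engine can use.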
Zero Is Not Always Zero?
You probably assume that a 4-byte integer containing "0" has all four individual bytes equal to zero. It seems completely reasonable, and is correct on many platforms. But not all. There's a theoretical portability problem on a few obscure platforms. There are computers where integer zero is not four zero bytes.
Really? Well, actually, I just went scouring the internet for information on a real platform where this is the case, and couldn't find anything. Maybe it's some obscure old platforms from the 1980s, when the ANSI C standards were first being created? In any case, it's only a few lines of code that you can add to the startup initialization (or not).
If you want to check, here's a few lines of code for your platform portability self-check code at startup:
    // Test zero integer portability
    int i = 0;
    unsigned char* cptr = (unsigned char*)&i;
    yassert(cptr[0] == 0);
    yassert(cptr[1] == 0);
    yassert(cptr[2] == 0);
    yassert(cptr[3] == 0);
Actually, that code isn't very portable! It's assuming 32-bit int size. Here's some more general code:
    int i2 = 0;
    unsigned char* cptr2 = (unsigned char*)&i2;
    for (size_t i = 0; i < sizeof(int); i++) {  // size_t avoids a signed/unsigned comparison
        yassert(cptr2[i] == 0);
    }
Null Pointer is Zero: The NULL pointer is probably all-bits-zero on all platforms. But you might as well be sure, so here's the code to check NULL in a "char*" type:
    // Test pointer NULL portability
    char *ptr1 = NULL;
    unsigned char* cptr3 = (unsigned char*)&ptr1;
    for (size_t i = 0; i < sizeof(char*); i++) {  // size_t avoids a signed/unsigned comparison
        yassert(cptr3[i] == 0);
    }
If you have a very low risk tolerance, you can also duplicate this code to check "void*" and "float*" pointer types set to NULL are all zero-bits.
Initialization to Zero: If you have a big object, or a long array, it's very slow to initialize every object field, or every array element, explicitly to zero. The faster method is to use memset to set every byte to zero, or alternatively, to use calloc to allocate memory that is already full of zero bytes. These optimizations rely on integer zero and floating point zero and pointer NULL all being a sequence of zero bytes.
The fast code is typically something like this:
    const int ARRSIZE = 512;
    float arr[ARRSIZE];
    memset(&arr, 0, ARRSIZE * sizeof(float));
Or you can do other variants:
    memset(&arr, 0, sizeof(arr));  // Option #2 (a bit risky)
    memset(&arr, 0, ARRSIZE * sizeof(*arr));  // Option #3
This works just fine, provided your platform is a normal one in its handling of zero for int, float, and pointers.
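Bug Alert! Be careful with the second sizeof option for arrays that are function parameters, because C++ converts array parameters to pointers (arrays are effectively passed by address, never copied by value).
    void initmyarray(float arr[512])
    {
        memset(arr, 0, sizeof(arr));  // Option #2 is BROKEN!
        ...
    }
The problem is that "arr" is converted to a pointer type, so sizeof(arr) is only 8 bytes on a typical 64-bit platform, which is the size of the "float*" pointer type. So you didn't really initialize the whole array of size 512. (Even worse, writing "&arr" inside the function would give the address of the pointer parameter itself, not the array.) Some compilers do warn about this (e.g. GCC and Clang have a -Wsizeof-array-argument warning), but you can't rely on it everywhere, and there's no run-time error for this insidious bug.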
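A different way to sidestep the array parameter problem, not covered in the text above, is to pass the array by reference in a template so that the compiler deduces its length. This is only a sketch with hypothetical function names, and it only helps for true arrays whose size is known at compile-time.

```cpp
#include <cstddef>
#include <cstring>

// Zeroes a whole array passed by reference. N is deduced by the compiler,
// so sizeof(arr) here is the full array size, not a pointer size.
template <std::size_t N>
void yapi_zero_array(float (&arr)[N])
{
    std::memset(arr, 0, sizeof(arr));
}

// Returns what sizeof reports once an array has decayed to a pointer.
// This exists only to illustrate the pitfall.
std::size_t yapi_decayed_size(float* arr)
{
    return sizeof(arr);  // Size of the pointer, not the array
}
```

With this version, passing a plain "float*" to yapi_zero_array is a compile-time error, which is arguably the ideal outcome.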
Memset Wrapper Trick: One way to catch the bug with sizeof and array parameters at runtime is to use your own wrapper around memset calls, and add an explicit runtime test. Here's an example memset wrapper:
    void yapi_memset_wrapper(void* addr, int c, size_t sz)
    {
        if (sz == sizeof(float*)) {
            yassert(sz != sizeof(float*));  // Probable error! (always fails here, to report it)
        }
        if (sz == 0) {
            yassert(sz != 0);  // Wrongly reversed parameters?
        }
        memset(addr, c, sz);  // Call the real deal
    }
And if you want to be sure, you can force memset to use the wrapper:
    #define memset yapi_memset_wrapper
Another less imposing way to force yourself to always use the memset wrapper is the kind reminder method:
#define memset please_use_memset_wrapper_instead
You'll need to add an "#undef" before the real call to memset in the wrapper code (recursion, anyone?). And you probably can't safely redefine memset before including the standard libraries, so don't do it in a "-D" option or MSVS project property. Instead, put it in your own header files, which should be included after the standard library headers. And it's up to you whether or not to leave these debugging wrapper tests in production code.
Floating Point Zero: Similarly to integer zero, you probably assume that a 4-byte float type with value "0.0f" also has 4 zero bytes (i.e. all 32 bits are zero). You're correct provided your platform is following the IEEE 754 standard for 32-bit floats, which it should be. You can test it explicitly with portability self-testing code:
    // Test float zero portability
    float f1 = 0.0f;
    unsigned char* cptr4 = (unsigned char*)&f1;
    for (size_t i = 0; i < sizeof(float); i++) {  // size_t avoids a signed/unsigned comparison
        yassert(cptr4[i] == 0);
    }
Zero is a special case floating point number. Because the mantissa has an implicit extra prefix 1 digit, putting all zeros in the mantissa would normally mean "1.0" (times two raised to the power of the exponent), not "0.0". And for an 8-bit exponent with a 127 offset, the all-bits-zero value of the exponent bits is not a zero exponent value, but actually "-127". So all-bits-zero would naively be read as "1.0 x 2^-127", which is tiny but not zero. The IEEE 754 standard gets around this by treating an all-zero exponent field as a special case with no implicit leading 1. When all the bits are zero, by definition and signed in triplicate by the IEEE committees, it means "0.0", in both the 32-bit float and the 64-bit double.
Negative Zero: Another problem is that floating point has two zeros. There's a "negative zero" in the standard IEEE 754 representation of floating point numbers. This has all 0 bits for both the exponent and mantissa, like the normal zero, but the sign bit is set, as if to indicate a negative number. This is negative zero (i.e. "-0.0"), and its arithmetic value is the same as zero (i.e. "0.0"), but not all the bits are zero. Hence, if you assume that float type comparisons with zero (e.g. "x==0.0f") are a simple integer comparison with an all-zero-bits 4-byte integer, actually the compiler has to consider two different values. (Maybe x86 assembler has a single opcode that does this in one cycle?)
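To make the two zeros concrete, here is a small sketch that compares them arithmetically and at the bit level. It assumes IEEE 754 32-bit floats, and the function names are hypothetical; std::signbit from <cmath> is the standard way to tell the two zeros apart.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Returns true only for negative zero: equal to zero, but sign bit set.
bool yapi_is_negative_zero(float f)
{
    return f == 0.0f && std::signbit(f);
}

// Returns the raw bit pattern of a float (assumes a 32-bit float).
std::uint32_t yapi_float_bits(float f)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t u = 0;
    std::memcpy(&u, &f, sizeof(u));
    return u;
}
```

The practical consequence is that a memset-to-zero buffer and a buffer full of computed "-0.0f" values compare equal element by element, but differ under memcmp.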
Pointers versus Integer Sizes
You didn't hear this from me, but apparently you can store pointers in integers, and vice-versa, in C++ code. But it only works if the byte sizes are big enough, and it's best to self-test this portability risk during program startup. What exactly you want to test depends on what you're (not) doing, but here's one example:
    // Test LONGs can be stored in pointers
    yassert(sizeof(char*) >= sizeof(long));
    yassert(sizeof(void*) >= sizeof(long));
    yassert(sizeof(int*) >= sizeof(long));
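Note that you can't portably test these at compile-time with #if and the preprocessor, unfortunately, because sizeof isn't evaluated by the preprocessor, even though it seems like you should be able to use sizeof for builtin types in constant expressions. Since C++11, static_assert does accept sizeof, so that's the compile-time alternative. Maybe there's also a tricky way to do it in #if with INT_MAX or something.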
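As a sketch of the compile-time options, the following uses static_assert for the sizeof test, and the limit macros from <climits> and <cstdint> for a preprocessor test. UINTPTR_MAX is an optional part of the standard headers, so the code treats its absence as "unknown"; YAPI_POINTER_FITS_IN_INT and yapi_pointer_roundtrip_ok are hypothetical names.

```cpp
#include <climits>
#include <cstdint>

// Compile-time version of the runtime check: sizeof works in static_assert.
static_assert(sizeof(void*) >= sizeof(long),
              "pointers are too small to hold a long");

// Preprocessor version using limit macros rather than sizeof.
// If UINTPTR_MAX is missing, conservatively assume pointers do not fit.
#if defined(UINTPTR_MAX) && (UINTPTR_MAX <= UINT_MAX)
#define YAPI_POINTER_FITS_IN_INT 1
#else
#define YAPI_POINTER_FITS_IN_INT 0
#endif

// Round-trips a pointer through std::uintptr_t, the integer type that is
// intended for holding pointer values.
bool yapi_pointer_roundtrip_ok()
{
    int dummy = 0;
    std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(&dummy);
    int* back = reinterpret_cast<int*>(bits);
    return back == &dummy;
}
```

If you really must stash a pointer in an integer, std::uintptr_t is a much safer container than "int" or "long", because its size tracks the pointer size on each platform.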
Divide-By-Zero Doesn't Crash
You want your code to crash? Well, maybe, and it usually does for an integer divide-by-zero, which is undefined behavior in C++ and typically raises a hardware exception (e.g. SIGFPE on x86). The same is true for the integer modulo (%) operator when given a zero second operand. Floating point is different: on IEEE 754 platforms, dividing by zero normally doesn't crash at all, and quietly produces Inf or NaN instead.
First point, though, smack yourself on the wrist for even using division in the first place. It's slower than a toad race in Alice Springs. Instead, floating point division should be changed to multiplication by the reciprocal, and integer division by a power-of-two can probably be a right bitshift (>>) operator, at least for unsigned or non-negative values. Integer modulo (%) by a power-of-two (e.g. "x % 512") can be replaced with an unsigned bitwise-and (&) operator with the operand one less (i.e. "(unsigned)x & 511u"), although the two give different answers when x is negative.
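These replacements can be sketched as follows, using hypothetical function names. The integer versions take unsigned arguments on purpose, because the shift and mask tricks only match division and modulo for non-negative values.

```cpp
// Integer modulo by 512 using bitwise-and (valid for unsigned values).
unsigned yapi_mod512(unsigned x)
{
    return x & 511u;
}

// Integer division by 8 using a right bitshift (valid for unsigned values).
unsigned yapi_div8(unsigned x)
{
    return x >> 3;
}

// Divides every element by the same divisor using one division to get
// the reciprocal, and then n multiplications.
void yapi_scale(float* v, int n, float divisor)
{
    const float recip = 1.0f / divisor;
    for (int i = 0; i < n; i++) {
        v[i] *= recip;
    }
}
```

Note that multiplying by a reciprocal can differ from true division in the last bit of precision for some divisors, which is usually acceptable for AI inference but worth knowing about.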
Anyway, there are some platforms where dividing by zero doesn't core dump or BSOD or whatever. So sometimes you can have an insidious bug where it's dividing by zero, but there's no indication of an error. And that would probably make your AI engine do some squirrelly illogical stuff.
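Since you can't rely on a crash, one defensive option is to check explicitly. The sketch below uses hypothetical helper names: a guarded integer division that refuses a zero divisor (and the one overflowing case), and a test for floating point results that have silently become Inf or NaN.

```cpp
#include <cmath>
#include <limits>

// Divides two ints, returning false instead of invoking undefined
// behavior when the divisor is zero or the result would overflow.
bool yapi_safe_div(int num, int den, int* result)
{
    if (den == 0) {
        return false;  // Divide-by-zero
    }
    if (num == std::numeric_limits<int>::min() && den == -1) {
        return false;  // INT_MIN / -1 overflows
    }
    *result = num / den;
    return true;
}

// Returns true if a float result has silently become Inf or NaN,
// which is what IEEE 754 division by zero typically produces.
bool yapi_is_bad_float(float f)
{
    return std::isnan(f) || std::isinf(f);
}
```

Checks like these are probably too slow for the innermost kernels, but they can be useful in debug builds or at the boundaries between major components.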
Unicode Special Characters
There are more Unicode characters than there are black holes in the Universe, which I didn't hear from Stephen Hawking. Whoever wrote the Unicode standard didn't get a lot of sleep time. There are different encodings for every language known to humankind, and probably a dozen Star Trek languages, too. And then there are love hearts and stars, and all sorts of emojis and icons. It's enough to confuse a poor little AI!
What to do about all these Unicode specials is a tricky question. Which ones are in the training data set? And does the AI engine tolerate them in the tokenizer anyway? The issue goes way beyond UTF8 versus Latin1 encoding issues, because there are simply so many different ones. I wish you the best of luck!
ASCII versus EBCDIC
AI uses character processing in its tokenization phase, and also possibly in data structures such as hash tables or tries, if string data is used as the key (e.g. KV caching or the inference cache). There are issues with UTF8 versus Unicode versus Latin1 encoding on all platforms. Platforms with EBCDIC present additional problems.
ASCII is used on most platforms, including Windows and Linux, and the 7-bit values are mostly portable. EBCDIC is an older encoding that is used mostly on IBM mainframes, and uses 8-bit encoding with completely different values. Although many standard functions will run correctly in C++, such as isdigit or isalpha, at least for the common characters, any code that directly processes the integer value of a character can be problematic.
Character Conversion Pitfalls: Programmers get quite used to ASCII, and there are various tricks that are efficient for ASCII, but break on EBCDIC platforms.
    // Convert 1..26 to a letter
    char c = x - 1 + 'A';  // ASCII works, fails on EBCDIC
Here's the portable way that works on both EBCDIC and ASCII:
char c = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"[x - 1];
Similar issues should not arise for digits because '0'..'9' are sequential in both EBCDIC and ASCII.
    // Convert 0..9 to '0'..'9'
    char c = x + '0';  // Works on ASCII & EBCDIC
Preprocessor Macros: Tolerating EBCDIC in a widespread way will introduce extra inefficiency. An alternative is to use #if preprocessor tests to detect EBCDIC efficiently at compile-time:
    #if ('A' != 65)
    #define YAPI_IS_EBCDIC 1
    #else
    #define YAPI_IS_EBCDIC 0
    #endif
And if your code requires ASCII, you can put a #error statement into the #if test (i.e. to trigger a compilation error), or at runtime put an assertion as part of your portability check at startup:
yassert(!YAPI_IS_EBCDIC); // ASCII, please!
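The compile-time variant might look something like the sketch below, which stops the build on a non-ASCII character set and also includes a bounds-checked version of the portable letter lookup. The yapi_letter_from_index name and the '?' fallback character are assumptions made for this example.

```cpp
// Stop the build early if the character set is not ASCII-compatible.
#if ('A' != 65)
#error "This code requires an ASCII-compatible character set"
#endif

// Converts 1..26 to 'A'..'Z' using a lookup string, which works on both
// ASCII and EBCDIC. Returns '?' for out-of-range input (an arbitrary
// choice made for this sketch).
char yapi_letter_from_index(int x)
{
    static const char letters[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    if (x < 1 || x > 26) {
        return '?';
    }
    return letters[x - 1];
}
```

If you actually need to support EBCDIC, leave out the #error and keep only the lookup-based conversions.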
EBCDIC Sorting Order: Another problem that's much harder to fix is that ASCII sorts the upper-case letters before the lower-case letters, because 'A' is 65, and 'a' is 97. EBCDIC is the reverse, with lower-case letters having smaller integer values than upper-case letters.
References on Portability
- Ka Hei Martin Kwok, Matti Kortelainen, Giuseppe Cerati, Alexei Strelchenko, Oliver Gutsche, Allison Reinsvold Hall, Steve Lantz, Michael Reid, Daniel Riley, Sophie Berkman, Seyong Lee, Hammad Ather, Boyana Norris, Cong Wang, 25 Jan 2024, Application of performance portability solutions for GPUs and many-core CPUs to track reconstruction kernels, https://arxiv.org/abs/2401.14221
- Stijn Heldens, Ben van Werkhoven, 22 Mar 2023, Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications, https://arxiv.org/abs/2303.12374
- Gregor Daiß, Patrick Diehl, Dominic Marcello, Alireza Kheirkhahan, Hartmut Kaiser, Dirk Pflüger, 4 Mar 2023 (v2), From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels, https://arxiv.org/abs/2210.06438
- Sivasankaran Rajamanickam, Seher Acer, Luc Berger-Vergiat, Vinh Dang, Nathan Ellingwood, Evan Harvey, Brian Kelley, Christian R. Trott, Jeremiah Wilke, Ichitaro Yamazaki, 22 Mar 2021, Kokkos Kernels: Performance Portable Sparse/Dense Linear Algebra and Graph Kernels, https://arxiv.org/abs/2103.11991
- John Lawson, 30 Aug 2020, Performance portability through machine learning guided kernel selection in SYCL libraries, https://arxiv.org/abs/2008.13145
- Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, Holger Fröning, 30 Sep 2020 (v3), A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels, https://arxiv.org/abs/2001.07104
- John Lawson, Mehdi Goli, Duncan McBain, Daniel Soutar, Louis Sugy, 10 Apr 2019, Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels, https://arxiv.org/abs/1904.05347
- Karl Rupp, Philippe Tillet, Florian Rudolf, Josef Weinbub, Tibor Grasser, Ansgar Jüngel, 2 Sep 2014, Performance Portability Study of Linear Algebra Kernels in OpenCL, https://arxiv.org/abs/1409.0669
- David Spuler, March 2024, Chapter 38. Platform Portability, Generative AI in C++: Coding Transformers and LLMs, https://www.amazon.com/dp/B0CXJKCWX9
- Fraser Mince, Dzung Dinh, Jonas Kgomo, Neil Thompson, Sara Hooker, 2023, The Grand Illusion: The Myth of Software Portability and Implications for ML Progress, Part of Advances in Neural Information Processing Systems 36 (NeurIPS 2023), https://proceedings.neurips.cc/paper_files/paper/2023/hash/42c40aff7814e9796266e12053b1c610-Abstract-Conference.html
- Ajay Bati, Spencer H. Bryngelson, 2024, RoseNNa: A performant, portable library for neural network inference with application to computational fluid dynamics, Computer Physics Communications, Volume 296, 109052, ISSN 0010-4655, https://doi.org/10.1016/j.cpc.2023.109052 https://www.sciencedirect.com/science/article/abs/pii/S0010465523003971
- David Spuler, March 2024, C++ Portability Bug Catalog, in Generative AI in C++, https://www.aussieai.com/book/appendix-portability-bug-catalog