AVX Memory Alignment Issues

  • Book Excerpt from "Generative AI in C++"
  • by David Spuler, Ph.D.

The above example glosses over the issue of managing the “alignment” of memory addresses on byte boundaries with the “alignas” specifier. Some of the AVX SIMD intrinsic calls require that addresses are 16-byte aligned (i.e., effectively 128-bit alignment), which is not guaranteed by the C++ compiler. However, we've tolerated non-aligned addresses by using the “_mm_storeu_ps” intrinsic, which works with either aligned or non-aligned addresses.

Note that the alignment requirements of AVX are somewhat in flux. Not all AVX intrinsics require alignment, and the restrictions have been “relaxed” in many cases. There have also been some compiler bugs in tolerating non-aligned addresses in C++ intrinsics. Where required, the alignment needs are:

  • AVX-1 — 16-byte alignment (128-bit).
  • AVX-2 — 32-byte alignment (256-bit).
  • AVX-512 — 64-byte alignment (512-bit).

Since we can sort out alignment at compile-time using the C++ “alignas” specifier and “aligned” type attributes, there is no performance penalty (except in terms of space) for ensuring greater compatibility across CPU platforms and compiler versions by preferring aligned addresses.

You can create your own macros to easily test pointer addresses for alignment. The obvious approach checks the remainder with the % operator, but these examples use a bitwise-and instead, which is faster. The cast goes via “uintptr_t” (from <cstdint>) rather than “unsigned long”, because “unsigned long” is only 32 bits on 64-bit Windows:

    #define aussie_is_aligned_16(ptr)  ((((uintptr_t)(ptr)) & 15u) == 0)
    #define aussie_is_aligned_32(ptr)  ((((uintptr_t)(ptr)) & 31u) == 0)

Although our code to multiply 4 float values tolerates non-alignment, there is a minor performance cost. The “_mm_storeu_ps” AVX intrinsic is slower if the addresses are not aligned, so we should fix the alignment for performance reasons. There's also another “store” intrinsic that converts from 128 bits to 4 floats, called “_mm_store_ps” (without the “u”), which runs faster but does not tolerate non-aligned float arrays. Actually, “_mm_storeu_ps” is supposed to be just as fast as “_mm_store_ps” when the address is correctly aligned, so we can still use that intrinsic if we prefer safety, but we need to change the variables to be aligned on 16-byte boundaries to get the speedup.

To ensure alignment in C++, there is an “alignas” specifier for variable declarations. We can use “alignas(16)” to force C++ to create the variables with 16-byte alignment of the address where they are stored. For example, our unit test harness code could have ensured 16-byte alignment of all memory addresses via:

    // Test with 16-byte alignment
    alignas(16) float arr1[4] = { 1.0f, 2.5f, 3.14f, 0.0f };
    alignas(16) float arr2[4] = { 1.0f, 2.5f, 3.14f, 0.0f };
    alignas(16) float resultarr[4];

There are various non-standard alternatives to “alignas” in the various compilers. For example, MSVS has “__declspec(align(16))” with two prefix underscores, and GCC supports “__attribute__((aligned(16)))”.

The AVX code for an alignment-requiring version is not much different, with minor changes to the names of the C++ intrinsics:

    void aussie_avx_multiply_4_floats_aligned(float v1[4], float v2[4], float vresult[4])
    {
        // Use 128-bit AVX registers to multiply 4x32-bit floats...
        __m128 r1 = _mm_load_ps(v1);   // Aligned load of floats into 128 bits
        __m128 r2 = _mm_load_ps(v2);
        __m128 dst = _mm_mul_ps(r1, r2);   // Multiply
        _mm_store_ps(vresult, dst);  // Aligned version converts to floats
    }

Ideally, we'd like to ensure at compile-time that the function is only called with aligned addresses. The first attempt is to declare the “vresult” parameter above with “alignas(16)” for compile-time checking of alignment, but C++ does not allow “alignas” on function parameters. Fortunately, there's another way in GCC and Clang using type attributes:

    __attribute__((aligned(16)))

Another method is to define our own assertion that uses bitwise tests on the address instead:

    #define is_aligned_16(ptr)  ((((uintptr_t)(ptr)) & 15u) == 0)

This tests that the address is a multiple of 16 using a bitwise-and with 15, but the check occurs at runtime and costs extra cycles.

 
