Aussie AI Blog

False Sharing Multithreading Optimization

  • March 27th, 2025
  • by David Spuler, Ph.D.

False Sharing

False sharing is a performance slug in C++ multithreaded code that prevents two threads from running as fast as they should. The idea of "false sharing" is that two threads can interfere with each other's memory caching. The sharing is "false" because it can occur with data that isn't actually being intentionally shared between the threads; the threads are impeded simply because the memory addresses they use are too close together.

Why does it occur? The CPU's L1 and L2 caches don't just cache single bytes, 16-bit words, or even 32-bit integers. Instead, they cache data in "chunks" at the hardware level, which are called "cache lines" (also "cache sectors" or "cache blocks" or "bananas in pyjamas" if you prefer).

How big? Some examples of common sizes of these cache lines include:

  • Intel CPUs — 64 bytes.
  • Apple M2 — 128 bytes.
  • Some AMD and other CPUs — 256 bytes.
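
If you want to check your own machine, one quick way on Linux is sysconf (a sketch; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, and some systems report 0 or -1 when the value is unknown):

    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        // Query the L1 data cache line size at runtime (glibc extension).
        long linesize = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        printf("L1 cache line size: %ld bytes\n", linesize);
        return 0;
    }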

What this means is that, on an Intel CPU, the caches are updated 64 bytes at a time, because one "cache line" is read or written as the minimum size. This is good because:

  • Cache loads are 64 bytes in parallel (in hardware).
  • Cache writes (updates) store 64 bytes in parallel.

But this is bad because:

  • Invalidating one byte also invalidates all 64 bytes of its cache line.

This is where we have a slowdown from false sharing. If one thread sets any value in a 64-byte cache line, then all of the other 63 bytes are also invalidated in the cache. If a second thread needs to use any of those other 63 bytes, then it needs a cache line refresh. Slowness ensues.

Example of False Sharing

A common example would be two integers, each 4 bytes in size, but close enough together that they sit inside the same 64-byte cache line. The most common problems arise with atomics or mutexes declared close together, but false sharing can affect any global variables.

Hence, first a simple example without any atomics, mutexes, or other thread synchronization. Let's just look at two threads that are updating their own global variable, with no overlap between the threads. In theory, these two threads should not affect each other at all. In reality, there are CPU cache lines.

Here are our two global counter variables:

   int g_counter1 = 0;
   int g_counter2 = 0;

In practice, false sharing is more likely to occur with two atomics declared close together. However, in this example we're just testing with two completely unrelated threads, with absolutely zero synchronization happening between them. They really shouldn't impact each other, if not for false sharing.

Here is the sequential code, which sets two global variables:

   void runtest1_no_threads(int n)
   {
      for (int i = 0; i < n; i++) {
         g_counter1++;
      }
      for (int i = 0; i < n; i++) {
         g_counter2++;
      }
   }

Here are the two threads that aim to set those two global variables in parallel. Note that each thread only accesses one variable, without any "sharing" going on.

   void thread1(int n)
   {
      for (int i = 0; i < n; i++) {
         g_counter1++;
      }
   }

   void thread2(int n)
   {
      for (int i = 0; i < n; i++) {
         g_counter2++;
      }
   }

And here's the basic thread launching code:

   #include <thread>

   void runtest1_threads(int n)
   {
      std::thread t1(thread1, n);
      std::thread t2(thread2, n);
      t1.join();  // wait for both threads to finish
      t2.join();
   }

Finally, here is the timing code using <chrono>:

   g_counter1 = g_counter2 = 0;
   auto before = std::chrono::high_resolution_clock::now();
   runtest1_no_threads(n);
   auto now = std::chrono::high_resolution_clock::now();
   auto diff = std::chrono::duration_cast<std::chrono::microseconds>(now - before).count();
   std::cout << "Time (no threads): " << diff << " microseconds" << std::endl;
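
The threaded version is timed the same way (a sketch mirroring the code above, with new variable names so it can sit in the same function):

   g_counter1 = g_counter2 = 0;
   auto before2 = std::chrono::high_resolution_clock::now();
   runtest1_threads(n);
   auto now2 = std::chrono::high_resolution_clock::now();
   auto diff2 = std::chrono::duration_cast<std::chrono::microseconds>(now2 - before2).count();
   std::cout << "Time (2 threads): " << diff2 << " microseconds" << std::endl;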

Here are the speed results from executing the sequential and threaded code for 100 million iterations using g++ on Linux.

   Time (no threads): 256079 microseconds
   Time (2 threads): 209341 microseconds

Note that the threaded code does not actually run twice as fast as the sequential code, despite having two threads that should run in parallel. In fact, it only improves on the sequential code by about 18%, rather than the 50% we'd hope for. Why?

It's the magic of false sharing, whereby one thread writing to its own variable invalidates the cache line that also holds the other thread's unrelated variable. Each thread is constantly writing to its own variable, which keeps knocking the other thread's global out of the cache. It's kind of like entanglement in quantum physics, if you like that kind of thing.

Solutions for False Sharing

There are a few coding solutions to prevent false sharing. The basic idea is to ensure that the addresses of unrelated globals used by different threads are not too close together. Options include:

  • Putting global variables in random spots throughout your C++ code.
  • Using alignas to enforce address spacing on alignment boundaries.

The first one is kind of a joke, although it would probably work in most cases. However, there's no technical guarantee about where the linker will place unrelated global variables in the address space.

A more elegant solution is to put variables, especially atomics, on address alignment boundaries. The idea is to ensure that each important global variable is alone in its 64-byte block. The global variables in our declarations become:

   alignas(64) int g_counter1 = 0;
   alignas(64) int g_counter2 = 0;

Declaring them both alignas(64) guarantees two things:

  • The variables start on a 64-byte alignment boundary (we don't care about this here), and
  • They are the only variable in that 64 bytes (this fixes false sharing).
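
You can see both effects by printing the two addresses (a quick sketch, reusing the counters above):

    std::cout << "g_counter1 at " << (void *)&g_counter1 << std::endl;
    std::cout << "g_counter2 at " << (void *)&g_counter2 << std::endl;
    // With alignas(64), both addresses are multiples of 64,
    // so the two counters can never share a cache line.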

The downside is that each 4-byte integer is stored in 64 bytes, so there's 60 bytes of unused padding added to global memory usage. But it's better to pad memory than to waste CPU cycles! (On the other hand, the CPU cache lines are also loading and storing 60 unused bytes, so we've somewhat undermined the efficiency advantages of the L1/L2 cache lines for this 64-byte block.)

Anyway, who cares, it works! Here are the faster speed measurements just from adding alignas statements:

   Time (no threads): 260277 microseconds
   Time (2 threads): 133947 microseconds

Wow! It's almost exactly half the time! The performance gain is about 49%, which is much better than the earlier 18% (due to false sharing slowdowns), and is close to the 50% gain we were aiming for with two threads. Maybe there's something to this multithreading stuff, after all.

Some Final Tweaks

As a finesse, you can ensure that the addresses are far enough apart by simply checking in code. One possible method to make sure that some junior code jockey hasn't deleted your alignas statements (using std::abs from <cstdlib>, since the linker might place the two variables in either order):

    assert( std::abs((char*)&g_counter2 - (char*)&g_counter1) >= 64 );

Unfortunately, you can't do this check at compile time, since addresses of global variables are not "constant" enough for the compiler:

    static_assert( (char*)&g_counter2 - (char*)&g_counter1 >= 64 );  // Fails
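
One workaround is to group the counters into a struct, because member offsets are compile-time constants (a sketch; this assumes you're free to restructure the globals this way):

    #include <cstddef>

    struct Counters {
        alignas(64) int counter1;
        alignas(64) int counter2;
    };
    static_assert(offsetof(Counters, counter2) - offsetof(Counters, counter1) >= 64,
                  "counters may share a cache line");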

Note that some CPUs have cache line sizes up to 256 bytes. Hence, you might need alignas(128) or alignas(256) on those platforms.
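
Rather than hard-coding the size, C++17 added std::hardware_destructive_interference_size in <new>, which reports a spacing intended to avoid false sharing on the target platform (a sketch; compiler support for this constant arrived late, so check your toolchain):

    #include <new>

    alignas(std::hardware_destructive_interference_size) int g_counter1 = 0;
    alignas(std::hardware_destructive_interference_size) int g_counter2 = 0;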

Note also that there are various other non-standard ways to achieve alignment, most of which existed on various platforms before the alignas specifier was standardized in C++11. For example, GCC has a whole set of old attributes and builtins. Feel free to use those old things and charge extra because you're writing antique C++ code.
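
For instance, the traditional GCC/Clang spelling looks like this (a sketch using the non-standard aligned attribute):

    // GCC/Clang extension: same effect as alignas(64)
    int g_counter1 __attribute__((aligned(64))) = 0;
    int g_counter2 __attribute__((aligned(64))) = 0;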

Another point is that false sharing slowdowns can arise for non-global variables, such as dynamically allocated memory or stack addresses. It's not very likely for two threads to see contention over stack addresses inside their respective call frames, but it can occur with allocated memory blocks that are shared. There are various ways to get aligned addresses from dynamic memory allocation, including aligned allocation primitives, so the same ideas solve the problem there.
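
For example, here's a minimal sketch using C++17's std::aligned_alloc (not available on every platform, e.g. MSVC) to give each counter its own 64-byte block inside one heap allocation; the variable names are illustrative:

    #include <cstdlib>

    // std::aligned_alloc requires the total size to be a multiple of the alignment.
    int *block = static_cast<int *>(std::aligned_alloc(64, 2 * 64));
    int *counter1 = block;                     // first cache line
    int *counter2 = block + 64 / sizeof(int);  // second cache line
    *counter1 = *counter2 = 0;
    // ... run the two threads on counter1 and counter2 ...
    std::free(block);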

Nevertheless, atomics declared as global variables are probably the most likely place for false sharing to occur. This suggests a general rule: all global atomics should be declared with alignas. I'm not sure I agree, and it does sound a bit drastic: it avoids the performance slug of false sharing, but it also wastes significant memory in padding bytes.
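
As a sketch, that rule would look like this for a pair of global atomics (hypothetical names):

    #include <atomic>

    // Each atomic alone in its own 64-byte cache line.
    alignas(64) std::atomic<int> g_atomic1{0};
    alignas(64) std::atomic<int> g_atomic2{0};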

Low Latency C++ Book



The new Low Latency C++ coding book, Low Latency C++: Multithreading and Hotpath Optimizations, covers:
  • Low Latency for AI and HFT
  • C++ Multithreading Optimizations
  • Efficient C++ Coding

Get your copy from Amazon: Low Latency C++
