Compilers, CPUs, Memory, Cache Coherency, Atomicity, Synchronization, and Ordering Are Not Black Magic, but the Mix Is Close Enough
Introduction: The Illusion of Sequential Execution
When you write code, you naturally think in terms of sequential execution: instruction A happens, then instruction B, then instruction C. This mental model works perfectly—until you start writing concurrent code or working with hardware. Then you discover that modern CPUs, compilers, and memory systems conspire to execute your code in ways you never imagined.
The truth is that sequential consistency is largely an illusion maintained by your compiler and CPU to make programming tractable. But when multiple cores or threads are involved, that illusion breaks down in spectacular and subtle ways.
This post explores the deep rabbit hole of memory ordering, synchronization, and cache coherency. Each concept individually makes sense, but their interaction creates a complexity that can feel like black magic. Let’s demystify it.
The Fundamental Problem: Nobody Agrees on Reality
In a single-threaded program running on a single core, you can pretend memory operations happen in program order. But the moment you have multiple cores accessing shared memory, you face a fundamental problem: there is no global notion of time.
Each CPU core has its own cache hierarchy. When core 1 writes a value to memory, core 2 might not see that write for thousands of cycles. Worse, two cores might observe writes happening in different orders. This isn’t a bug—it’s a fundamental consequence of the speed of light and the physics of modern hardware.
Compiler Reordering: The Optimizer That Ruins Your Day
Before your code even reaches the CPU, the compiler gets to “optimize” it. And one of its favorite optimizations is reordering memory operations.
Why Compilers Reorder
Consider this seemingly simple code:
int data = 0;
int ready = 0;
// Thread 1
data = 42;
ready = 1;
// Thread 2
while (ready == 0) { }
assert(data == 42);
You expect thread 2 to see data == 42 once ready is set. But the compiler might reorder thread 1’s writes because, from a single-threaded perspective, the order doesn’t matter:
// What the compiler might generate:
ready = 1; // Moved up!
data = 42;
Now thread 2 can observe ready == 1 while data is still 0. Your assertion fails.
Register Caching
Another classic compiler “optimization”:
// Thread 1
while (flag == 0) {
// spin wait
}
// Thread 2
flag = 1;
The compiler might optimize thread 1’s loop to:
int cached_flag = flag;
while (cached_flag == 0) {
// spin forever!
}
It loaded flag into a register once and never reads from memory again. Thread 1 never sees the update from thread 2.
Solutions to Compiler Reordering
1. Volatile keyword:
volatile int flag = 0;
while (flag == 0) { } // Compiler must read from memory every time
But volatile is weak: it forces the compiler to perform every access and keeps volatile accesses ordered relative to each other, but it does not stop the compiler from moving non-volatile accesses around them, and it does nothing about CPU reordering.
2. Compiler barriers:
#define barrier() asm volatile("" ::: "memory")
data = 42;
barrier(); // Compiler won't reorder across this
ready = 1;
3. Atomic operations (more on this later), which the compiler cannot optimize away and which, at non-relaxed orderings, also act as compiler barriers.
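C11 also offers a portable compiler-only fence, atomic_signal_fence from <stdatomic.h>, which constrains the compiler without emitting any CPU fence instruction. A minimal sketch of the earlier example:
data = 42;
atomic_signal_fence(memory_order_seq_cst); // compiler may not reorder across this; no CPU fence is emitted
ready = 1;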
CPU Reordering: When Hardware Betrays You
Even if you prevent compiler reordering, modern CPUs reorder memory operations for performance. Different architectures have different memory models.
x86/x86-64: Relatively Strong Ordering
Intel and AMD CPUs provide relatively strong memory ordering guarantees:
- Loads are not reordered with other loads
- Stores are not reordered with other stores
- Stores are not reordered with prior loads
- Loads MAY be reordered with prior stores (store-load reordering)
So on x86, this code is safe from CPU reordering without extra hardware barriers (assuming you have already stopped the compiler from reordering or caching the accesses, as shown above):
// Thread 1
data = 42;
ready = 1;
// Thread 2
while (ready == 0) { }
assert(data == 42); // Safe on x86!
The store to data won’t be reordered after the store to ready. But this is x86-specific!
ARM/RISC-V: Weak Ordering
ARM and RISC-V have weak memory models where almost anything can be reordered:
- Loads can be reordered with loads
- Stores can be reordered with stores
- Loads can be reordered with stores
- Stores can be reordered with loads
On ARM, our example code can fail without proper memory barriers.
Store Buffers and Out-of-Order Execution
Modern CPUs have store buffers that hold pending writes. A store instruction might “complete” from the CPU’s perspective but not yet be visible to other cores. The CPU continues executing subsequent instructions while the store buffer drains.
Additionally, out-of-order execution means the CPU can execute instructions in any order as long as it respects data dependencies—but only within a single core’s view of the world.
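The classic store-buffer litmus test makes this concrete: under sequential consistency at least one thread must observe the other's store, yet real hardware (including x86, via store-load reordering) can let both threads read 0. Below is a sketch using pthreads and relaxed atomics; the harness and names (t1, t2) are illustrative, and since thread startup is slow relative to the race window, you may need many trials to catch a reorder:
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;
int r1, r2;

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    int reorders = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            reorders++; // impossible under sequential consistency
    }
    printf("both threads read 0 in %d of 100000 runs\n", reorders);
    return 0;
}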
Cache Coherency: The MESI Protocol and Its Discontents
Each CPU core has multiple levels of cache (L1, L2, sometimes L3). When multiple cores access the same memory location, the cache coherency protocol ensures they eventually see consistent values.
The MESI Protocol
The most common cache coherency protocol is MESI (Modified, Exclusive, Shared, Invalid):
- Modified (M): This cache has the only copy, and it’s been modified
- Exclusive (E): This cache has the only copy, but it’s clean
- Shared (S): Multiple caches may have this copy (read-only)
- Invalid (I): This cache line is invalid
When one core writes to a cache line, it must invalidate that line in all other cores’ caches. This involves sending messages over the cache coherency bus.
Cache Line Ping-Pong
A performance killer in concurrent code:
struct {
int counter1; // Used by core 1
int counter2; // Used by core 2
} data;
If counter1 and counter2 are on the same cache line (typically 64 bytes), writes from core 1 will invalidate core 2’s cache, and vice versa. The cache line bounces between cores—this is cache line ping-pong or false sharing.
Solution: Pad structures to ensure hot variables are on different cache lines:
struct {
int counter1;
char padding[60]; // Ensure different cache line
int counter2;
} data;
Or use __attribute__((aligned(64))) or alignas(64) in C11/C++11.
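A sketch of the C11 alignment approach, assuming 64-byte cache lines (the struct name is illustrative):
#include <stdalign.h>

struct counters {
    alignas(64) int counter1; // starts its own cache line
    alignas(64) int counter2; // forced onto a different line
};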
True Sharing vs False Sharing
- True sharing: Multiple cores legitimately need to access the same data (unavoidable coherency traffic)
- False sharing: Multiple cores access different data that happens to be on the same cache line (avoidable!)
Memory Barriers and Fences: The Synchronization Hammer
Memory barriers (also called memory fences) are instructions that constrain memory operation reordering.
Types of Memory Barriers
1. Full barrier (mfence on x86, dmb sy on ARM):
- No loads/stores before the barrier can be reordered with loads/stores after
- Heavy but simple
2. Store barrier (sfence on x86, dmb st on ARM):
- All stores before the barrier complete before any stores after
- Useful for ensuring writes are visible
3. Load barrier (lfence on x86, dmb ld on ARM):
- All loads before the barrier complete before any loads after
- Less common in practice
4. Acquire barrier:
- Prevents loads/stores after the barrier from moving before it
- Used after loading a lock or flag
5. Release barrier:
- Prevents loads/stores before the barrier from moving after it
- Used before storing to a lock or flag
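In portable C11 code, these barriers are expressed with atomic_thread_fence from <stdatomic.h> (a sketch of the correspondence):
atomic_thread_fence(memory_order_seq_cst); // full barrier
atomic_thread_fence(memory_order_release); // release fence: prior loads/stores stay before it
atomic_thread_fence(memory_order_acquire); // acquire fence: later loads/stores stay after it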
When to Use Barriers
Explicit barriers are rare in application code because you typically use higher-level primitives. But in kernel code or lock-free data structures, you need them:
// Publisher (producer)
data = 42;
smp_wmb(); // Store barrier
WRITE_ONCE(ready, 1);
// Consumer
while (READ_ONCE(ready) == 0) { }
smp_rmb(); // Load barrier
assert(data == 42);
Architecture-Specific Considerations
On x86, some barriers are essentially no-ops because of the strong memory model:
// Linux kernel definitions for x86 (simplified)
#define smp_rmb() barrier() // Just a compiler barrier
#define smp_wmb() barrier() // Just a compiler barrier
#define smp_mb() asm volatile("mfence" ::: "memory")
On ARM, all of them map to actual hardware instructions:
// Linux kernel definitions for ARM (simplified)
#define smp_rmb() asm volatile("dmb ld" ::: "memory")
#define smp_wmb() asm volatile("dmb st" ::: "memory")
#define smp_mb() asm volatile("dmb sy" ::: "memory")
Atomic Operations: Read-Modify-Write Without Tears
Atomic operations provide indivisible read-modify-write operations that are essential for lock-free programming.
Common Atomic Operations
1. Compare-and-Swap (CAS):
// Atomically: if (*ptr == old) { *ptr = new; return true; }
bool compare_and_swap(int *ptr, int old, int new);
The foundation of most lock-free algorithms:
// Lock-free increment
int old, new;
do {
old = atomic_load(counter);
new = old + 1;
} while (!compare_and_swap(counter, old, new));
2. Fetch-and-Add:
int old = atomic_fetch_add(counter, 1);
Simpler and often faster than CAS for counters.
3. Exchange:
int old = atomic_exchange(ptr, new_value);
4. Test-and-Set:
bool was_set = atomic_test_and_set(flag);
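For reference, here is how these operations look with C11 <stdatomic.h>; the generic functions are standard, while atomics_demo is just an illustrative wrapper:
#include <stdatomic.h>
#include <stdbool.h>

atomic_int counter;
atomic_flag flag = ATOMIC_FLAG_INIT;

void atomics_demo(void) { // illustrative wrapper
    int expected = atomic_load(&counter);
    // CAS: on failure, expected is reloaded with the current value
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1))
        ; // retry with the freshly loaded expected
    int old = atomic_fetch_add(&counter, 1);        // fetch-and-add
    old = atomic_exchange(&counter, 0);             // exchange
    bool was_set = atomic_flag_test_and_set(&flag); // test-and-set
    (void)old; (void)was_set;
}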
ABA Problem
A classic issue with CAS-based algorithms:
// Thread 1 reads A
int old = atomic_load(ptr); // Gets value A
// Thread 2 changes A → B → A
atomic_store(ptr, B);
atomic_store(ptr, A);
// Thread 1's CAS succeeds even though the world changed!
if (compare_and_swap(ptr, old, new)) {
// Succeeds, but intermediate state was missed
}
Solutions:
- Use versioned pointers (pack a counter with the pointer)
- Use hazard pointers or epoch-based reclamation
- Use double-word CAS (DCAS) where available
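A sketch of the versioned-pointer idea, packing a 32-bit index and a 32-bit generation counter into one 64-bit atomic (the layout and the update_head helper are illustrative):
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t head; // low 32 bits: index, high 32 bits: generation

void update_head(uint32_t new_index) { // illustrative helper
    uint64_t old = atomic_load(&head);
    uint64_t fresh;
    do {
        uint32_t gen = (uint32_t)(old >> 32);
        // bump the generation on every update so A -> B -> A is detectable
        fresh = ((uint64_t)(gen + 1) << 32) | new_index;
    } while (!atomic_compare_exchange_weak(&head, &old, fresh));
}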
Memory Ordering Semantics: The C11/C++11 Model
Modern C/C++ provide a memory model with explicit ordering semantics. Every atomic operation specifies its memory ordering:
Memory Order Types
1. memory_order_relaxed: No ordering guarantees, just atomicity
atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
Good for counters where you only care about the final value, not intermediate states.
2. memory_order_acquire: Used with loads
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
// All subsequent reads/writes cannot move before this
3. memory_order_release: Used with stores
atomic_store_explicit(&ready, 1, memory_order_release);
// All prior reads/writes cannot move after this
4. memory_order_acq_rel: Both acquire and release (for RMW operations)
atomic_fetch_add_explicit(counter, 1, memory_order_acq_rel);
5. memory_order_seq_cst: Sequentially consistent (default, strongest)
atomic_fetch_add(counter, 1); // Implies seq_cst
Establishes a global total order of all seq_cst operations across all threads.
6. memory_order_consume: Like acquire but weaker (rarely used, tricky to use correctly)
Acquire-Release Synchronization
The most common pattern in lock-free code:
// Producer
data = 42;
atomic_store_explicit(&ready, 1, memory_order_release);
// Consumer
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
assert(data == 42); // Guaranteed to see data
The release store “synchronizes with” the acquire load. Everything before the release is visible after the acquire.
Performance Implications
Different orderings have different costs:
- Relaxed: Cheapest, often just a regular load/store
- Acquire/Release: Medium cost, may require barriers on weak architectures
- Seq_cst: Most expensive, requires full barriers on most architectures
Don’t use seq_cst everywhere just because it’s the default. Use the weakest ordering that’s correct.
READ_ONCE/WRITE_ONCE: The Linux Kernel Approach
The Linux kernel uses the READ_ONCE and WRITE_ONCE macros extensively. They prevent problematic compiler optimizations on shared variables; as we'll see below, they do not by themselves order operations across CPUs.
Definition
#define READ_ONCE(x) \
(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) \
(*(volatile typeof(x) *)&(x) = (val))
Why They’re Needed
1. Prevent register caching:
// Without READ_ONCE, compiler might optimize to infinite loop
while (READ_ONCE(flag) == 0) { }
2. Prevent store tearing:
// Without WRITE_ONCE, compiler might generate multiple stores
WRITE_ONCE(status, 0x12345678); // Single store instruction
3. Prevent load tearing:
uint64_t value = READ_ONCE(big_value); // Single load instruction
4. Signal intent:
// "This variable is shared and might change unexpectedly"
if (READ_ONCE(stop_flag)) {
break;
}
When to Use Them
- Accessing shared variables without locks
- Reading/writing data structures accessed by interrupt handlers
- Implementing lock-free algorithms (though usually need stronger atomics)
- Any time a variable might be concurrently modified
Limitations
READ_ONCE/WRITE_ONCE don’t provide ordering guarantees beyond preventing compiler reordering. For ordering across cores, you need memory barriers or atomic operations.
Example of what goes wrong:
int data = 0;
int ready = 0;
// Writer thread
data = 42;
WRITE_ONCE(ready, 1);
// Reader thread
while (READ_ONCE(ready) == 0) { }
int value = READ_ONCE(data); // WRONG! Might still see data == 0
Even though we use READ_ONCE and WRITE_ONCE, on weakly-ordered architectures (ARM, RISC-V, PowerPC), the reader might observe ready == 1 but still see the old value of data. Why?
- The writer’s stores might be reordered by the CPU’s store buffer
- The reader’s loads might be reordered: it could speculatively read data before checking ready
- Cache coherency delays: the write to data might not have propagated to the reader’s cache yet
READ_ONCE/WRITE_ONCE only prevent the compiler from reordering or optimizing away the accesses. They don’t prevent CPU-level reordering.
Correct solution using acquire-release:
int data = 0;
atomic_int ready = 0;
// Writer thread
data = 42;
atomic_store_explicit(&ready, 1, memory_order_release);
// Reader thread
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
int value = data; // Guaranteed to see data == 42
assert(value == 42); // Will never fail
The memory_order_release store ensures all prior writes (including to data) are visible before the store to ready. The memory_order_acquire load ensures that once we observe ready == 1, all subsequent reads (including data) will see the updated values.
Alternative solution with explicit barriers:
// Writer thread
data = 42;
smp_wmb(); // Store barrier
WRITE_ONCE(ready, 1);
// Reader thread
while (READ_ONCE(ready) == 0) { }
smp_rmb(); // Load barrier
int value = data;
assert(value == 42); // Now safe
The key lesson: READ_ONCE/WRITE_ONCE are necessary but not sufficient for synchronization. They prevent compiler mischief but don’t order memory operations across cores. For that, you need atomic operations with proper memory ordering or explicit barriers.
Data Dependency Ordering: A Special Case
A few architectures (most famously the old DEC Alpha) can reorder even dependent loads; ARM and POWER, by contrast, do respect address dependencies:
// Thread 1
data->value = 42;
smp_wmb();
ptr = data;
// Thread 2
struct data *p = ptr;
// On Alpha, might see p != NULL but p->value != 42!
int val = p->value;
On Alpha, the dereference can return stale data even though p itself is the new pointer: its banked caches can process invalidations out of order, so the load of p->value effectively completes before the load of p.
Solution: Use smp_read_barrier_depends() (folded into READ_ONCE() in modern kernels) or consume ordering:
struct data *p = READ_ONCE(ptr);
smp_read_barrier_depends();
int val = p->value;
Or use RCU (Read-Copy-Update) which handles this automatically.
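In kernel code, the read side of this idiom looks roughly like the following (rcu_read_lock, rcu_dereference, and rcu_read_unlock are the real kernel API; do_something is a stand-in):
rcu_read_lock();
struct data *p = rcu_dereference(ptr); // dependency-ordered load
if (p)
    do_something(p->value); // hypothetical consumer
rcu_read_unlock();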
Practical Synchronization Problems and Solutions
Problem 1: Double-Checked Locking
The naive implementation is broken:
// BROKEN CODE
if (ptr == NULL) {
lock();
if (ptr == NULL) {
ptr = malloc(sizeof(*ptr));
init(ptr);
}
unlock();
}
use(ptr);
Between malloc and init, another thread might see a non-NULL but uninitialized ptr.
Solution: Use atomic operations with appropriate ordering:
struct obj *p = atomic_load_explicit(&ptr, memory_order_acquire);
if (p == NULL) {
lock();
p = atomic_load_explicit(&ptr, memory_order_relaxed);
if (p == NULL) {
p = malloc(sizeof(*p));
init(p);
atomic_store_explicit(&ptr, p, memory_order_release);
}
unlock();
}
use(p);
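If all you need is lazy initialization in user space, C11's call_once (where <threads.h> is available) packages this pattern safely; get_obj and init_ptr here are illustrative:
#include <threads.h>
#include <stdlib.h>

static once_flag once = ONCE_FLAG_INIT;
static struct obj *ptr;

static void init_ptr(void) { // illustrative
    ptr = malloc(sizeof(*ptr));
    init(ptr); // init() as in the example above
}

struct obj *get_obj(void) { // illustrative
    call_once(&once, init_ptr);
    return ptr; // call_once synchronizes with every caller
}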
Problem 2: Spinlock Implementation
A broken spinlock:
// BROKEN
void spin_lock(int *lock) {
while (*lock == 1) { }
*lock = 1;
}
Multiple threads can “acquire” the lock simultaneously.
Correct implementation with atomic operations:
void spin_lock(atomic_int *lock) {
while (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 1) {
while (atomic_load_explicit(lock, memory_order_relaxed) == 1) {
cpu_relax(); // Hint to CPU (pause instruction)
}
}
}
void spin_unlock(atomic_int *lock) {
atomic_store_explicit(lock, 0, memory_order_release);
}
Problem 3: Peterson’s Algorithm
Peterson’s algorithm for mutual exclusion is a classic trap: release/acquire is not enough. Correctness depends on the store to flag[id] being ordered before the later load of flag[other], and that store-load ordering is exactly what acquire/release does not provide (even x86 reorders it). Use sequentially consistent operations, or a full fence between the stores and the loads:
atomic_int flag[2];
atomic_int turn;
void lock(int id) {
int other = 1 - id;
atomic_store_explicit(&flag[id], 1, memory_order_seq_cst);
atomic_store_explicit(&turn, other, memory_order_seq_cst);
while (atomic_load_explicit(&flag[other], memory_order_seq_cst) == 1 &&
atomic_load_explicit(&turn, memory_order_seq_cst) == other) {
cpu_relax();
}
}
void unlock(int id) {
atomic_store_explicit(&flag[id], 0, memory_order_release);
}
Problem 4: Message Passing
Producer-consumer with a ring buffer:
struct ring {
int data[SIZE];
atomic_int head; // Producer writes here
atomic_int tail; // Consumer reads here
};
// Producer
void produce(struct ring *r, int value) {
int h = atomic_load_explicit(&r->head, memory_order_relaxed);
int next = (h + 1) % SIZE;
while (next == atomic_load_explicit(&r->tail, memory_order_acquire)) {
// Ring full, wait
}
r->data[h] = value;
atomic_store_explicit(&r->head, next, memory_order_release);
}
// Consumer
int consume(struct ring *r) {
int t = atomic_load_explicit(&r->tail, memory_order_relaxed);
while (t == atomic_load_explicit(&r->head, memory_order_acquire)) {
// Ring empty, wait
}
int value = r->data[t];
atomic_store_explicit(&r->tail, (t + 1) % SIZE, memory_order_release);
return value;
}
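Note that this ring is single-producer/single-consumer: the relaxed self-loads of head (in produce) and tail (in consume) are safe only because each index is written by exactly one thread. Multiple producers or consumers would need CAS loops or a lock.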
Debugging Techniques
1. Thread Sanitizer (TSan)
Compile with -fsanitize=thread to detect data races:
gcc -fsanitize=thread -g program.c -o program
./program
TSan will report races with stack traces.
2. Stress Testing
Race conditions are probabilistic. Run tests under stress:
for (int i = 0; i < 1000000; i++) {
test_concurrent_operation();
}
3. Memory Model Simulators
Tools like cppmem enumerate all the executions the C/C++11 memory model allows for a small snippet, so you can check whether a bad outcome is actually possible.
4. Assertions and Invariants
Liberally sprinkle assertions:
assert(atomic_load(&refcount) > 0);
5. Architecture-Specific Testing
Test on weak memory order architectures (ARM, RISC-V). Code might work on x86 but fail on ARM.
Architecture-Specific Considerations
x86/x86-64
- Strong memory model helps (but don’t rely on it)
- lock prefix for atomic operations
- mfence, lfence, sfence for barriers
- pause instruction for spin-loop optimization
ARM
- Weak memory model requires careful use of barriers
- dmb (data memory barrier): dmb sy, dmb ld, dmb st
- dsb (data synchronization barrier): stronger than dmb
- isb (instruction synchronization barrier): for instruction cache
- ldrex/strex for atomic operations (older)
- Load-acquire/store-release instructions (ARMv8+): ldar, stlr
RISC-V
- Weak memory model like ARM
- fence instruction with ordering bits
- lr/sc (load-reserved/store-conditional) for atomics
- Atomic extension provides amo operations
Power/PowerPC
- Very weak memory model
- Requires careful barrier placement
- sync, lwsync, isync barriers
- lwarx/stwcx for atomic operations
Best Practices and Rules of Thumb
- Use existing synchronization primitives (mutexes, atomics) instead of rolling your own
- Start with sequential consistency and optimize only if profiling shows it’s necessary
- Document your memory ordering assumptions explicitly in comments
- Test on weak memory order architectures even if your production target is x86
- Use Thread Sanitizer religiously during development
- Avoid data races - if multiple threads access data and at least one writes, use synchronization
- Understand the difference between volatile and atomic - volatile doesn’t provide synchronization
- Pad hot structures to avoid false sharing
- Use READ_ONCE/WRITE_ONCE for shared variables in kernel-style code
- Learn your platform’s memory model - x86, ARM, and RISC-V are all different
Conclusion
Memory ordering, cache coherency, and synchronization are complex topics individually, but their interaction creates emergent complexity that can be genuinely mind-bending. The key insights:
- Compilers reorder for optimization
- CPUs reorder for performance
- Caches create coherency problems that require protocols and barriers
- Different architectures have different memory models - x86 is strong, ARM/RISC-V are weak
- Atomic operations and memory barriers are your tools for controlling this chaos
- Acquire-release semantics are usually what you want for synchronization
The good news is that you rarely need to think about this directly. Use mutexes, condition variables, atomic operations with appropriate ordering, and existing lock-free data structures. But when you do need to dig deep—for kernel code, high-performance systems, or lock-free algorithms—understanding these concepts is essential.
And yes, it’s still close enough to black magic that even experts get it wrong. The difference is that experts know when to be paranoid and which tools to reach for.
Further Reading
- “Memory Barriers: a Hardware View for Software Hackers” by Paul McKenney
- “Linux Kernel Memory Barriers” - Documentation/memory-barriers.txt in Linux source
- “C++ Concurrency in Action” by Anthony Williams
- “Is Parallel Programming Hard, And, If So, What Can You Do About It?” by Paul McKenney
- “The Arm Memory Model” - ARM Architecture Reference Manual
- “A Primer on Memory Consistency” by Sorin, Hill, and Wood
- Jeff Preshing’s blog on lock-free programming
- The RISC-V Memory Model specification