Compilers, CPUs, Memory, Cache Coherency, Atomicity, Synchronization, and Ordering Are Not Black Magic, but the Mix Is Close Enough
Introduction: The Illusion of Sequential Execution
When you write code, you naturally think in terms of sequential execution: instruction A happens, then instruction B, then instruction C. This mental model works perfectly—until you start writing concurrent code or working with hardware. Then you discover that modern CPUs, compilers, and memory systems conspire to execute your code in ways you never imagined.
The truth is that sequential consistency is largely an illusion maintained by your compiler and CPU to make programming tractable. But when multiple cores or threads are involved, that illusion breaks down in spectacular and subtle ways.
This post explores the deep rabbit hole of memory ordering, synchronization, and cache coherency. Each concept individually makes sense, but their interaction creates a complexity that can feel like black magic. Let’s demystify it.
The Fundamental Problem: Nobody Agrees on Reality
In a single-threaded program running on a single core, you can pretend memory operations happen in program order. But the moment you have multiple cores accessing shared memory, you face a fundamental problem: there is no global notion of time.
Each CPU core has its own cache hierarchy. When core 1 writes a value to memory, core 2 might not see that write for thousands of cycles. Worse, two cores might observe writes happening in different orders. This isn’t a bug—it’s a fundamental consequence of the speed of light and the physics of modern hardware.
Compiler Reordering: The Optimizer That Ruins Your Day
Before your code even reaches the CPU, the compiler gets to “optimize” it. And one of its favorite optimizations is reordering memory operations.
Why Compilers Reorder
Consider this seemingly simple code:
int data = 0;
int ready = 0;
// Thread 1
data = 42;
ready = 1;
// Thread 2
while (ready == 0) { }
assert(data == 42);
You expect thread 2 to see data == 42 once ready is set. But the compiler might reorder thread 1’s writes because, from a single-threaded perspective, the order doesn’t matter:
// What the compiler might generate:
ready = 1; // Moved up!
data = 42;
Now thread 2 can observe ready == 1 while data is still 0. Your assertion fails.
Register Caching
Another classic compiler “optimization”:
// Thread 1
while (flag == 0) {
// spin wait
}
// Thread 2
flag = 1;
The compiler might optimize thread 1’s loop to:
int cached_flag = flag;
while (cached_flag == 0) {
// spin forever!
}
It loaded flag into a register once and never reads from memory again. Thread 1 never sees the update from thread 2.
Solutions to Compiler Reordering
1. Volatile keyword:
volatile int flag = 0;
while (flag == 0) { } // Compiler must read from memory every time
But volatile is weak: it forces the compiler to perform every access and keeps volatile accesses ordered relative to each other, but it does not stop the compiler from moving non-volatile accesses around them, and it does nothing about CPU reordering.
2. Compiler barriers:
#define barrier() asm volatile("" ::: "memory")
data = 42;
barrier(); // Compiler won't reorder across this
ready = 1;
3. Atomic operations (more on this later), which the compiler cannot optimize away and which, at non-relaxed orderings, also act as compiler barriers.
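C11 also offers a portable compiler-only fence, atomic_signal_fence from <stdatomic.h>, which constrains the compiler without emitting any CPU fence instruction. A minimal sketch of the earlier example:
data = 42;
atomic_signal_fence(memory_order_seq_cst); // compiler may not reorder across this; no CPU fence is emitted
ready = 1;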
CPU Reordering: When Hardware Betrays You
Even if you prevent compiler reordering, modern CPUs reorder memory operations for performance. Different architectures have different memory models.
x86/x86-64: Relatively Strong Ordering
Intel and AMD CPUs provide relatively strong memory ordering guarantees:
- Loads are not reordered with other loads
- Stores are not reordered with other stores
- Stores are not reordered with prior loads
- Loads MAY be reordered with prior stores (store-load reordering)
So on x86, this code is safe from CPU reordering without extra hardware barriers (assuming you have already stopped the compiler from reordering or caching the accesses, as shown above):
// Thread 1
data = 42;
ready = 1;
// Thread 2
while (ready == 0) { }
assert(data == 42); // Safe on x86!
The store to data won’t be reordered after the store to ready. But this is x86-specific!
ARM/RISC-V: Weak Ordering
ARM and RISC-V have weak memory models where almost anything can be reordered:
- Loads can be reordered with loads
- Stores can be reordered with stores
- Loads can be reordered with stores
- Stores can be reordered with loads
On ARM, our example code can fail without proper memory barriers.
Store Buffers and Out-of-Order Execution
Modern CPUs have store buffers that hold pending writes. A store instruction might “complete” from the CPU’s perspective but not yet be visible to other cores. The CPU continues executing subsequent instructions while the store buffer drains.
Additionally, out-of-order execution means the CPU can execute instructions in any order as long as it respects data dependencies—but only within a single core’s view of the world.
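The classic store-buffer litmus test makes this concrete: under sequential consistency at least one thread must observe the other's store, yet real hardware (including x86, via store-load reordering) can let both threads read 0. Below is a sketch using pthreads and relaxed atomics; the harness and names (t1, t2) are illustrative, and since thread startup is slow relative to the race window, you may need many trials to catch a reorder:
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;
int r1, r2;

void *t1(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

void *t2(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    int reorders = 0;
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t a, b;
        pthread_create(&a, NULL, t1, NULL);
        pthread_create(&b, NULL, t2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            reorders++; // impossible under sequential consistency
    }
    printf("both threads read 0 in %d of 100000 runs\n", reorders);
    return 0;
}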
Cache Coherency: The MESI Protocol and Its Discontents
Each CPU core has multiple levels of cache (L1, L2, sometimes L3). When multiple cores access the same memory location, the cache coherency protocol ensures they eventually see consistent values.
The MESI Protocol
The most common cache coherency protocol is MESI (Modified, Exclusive, Shared, Invalid):
- Modified (M): This cache has the only copy, and it’s been modified
- Exclusive (E): This cache has the only copy, but it’s clean
- Shared (S): Multiple caches may have this copy (read-only)
- Invalid (I): This cache line is invalid
When one core writes to a cache line, it must invalidate that line in all other cores’ caches. This involves sending messages over the cache coherency bus.
Cache Line Ping-Pong
A performance killer in concurrent code:
struct {
int counter1; // Used by core 1
int counter2; // Used by core 2
} data;
If counter1 and counter2 are on the same cache line (typically 64 bytes), writes from core 1 will invalidate core 2’s cache, and vice versa. The cache line bounces between cores—this is cache line ping-pong or false sharing.
Solution: Pad structures to ensure hot variables are on different cache lines:
struct {
int counter1;
char padding[60]; // Ensure different cache line
int counter2;
} data;
Or use __attribute__((aligned(64))) or alignas(64) in C11/C++11.
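A sketch of the C11 alignment approach, assuming 64-byte cache lines (the struct name is illustrative):
#include <stdalign.h>

struct counters {
    alignas(64) int counter1; // starts its own cache line
    alignas(64) int counter2; // forced onto a different line
};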
True Sharing vs False Sharing
- True sharing: Multiple cores legitimately need to access the same data (unavoidable coherency traffic)
- False sharing: Multiple cores access different data that happens to be on the same cache line (avoidable!)
Memory Barriers and Fences: The Synchronization Hammer
Memory barriers (also called memory fences) are instructions that constrain memory operation reordering.
Types of Memory Barriers
1. Full barrier (mfence on x86, dmb sy on ARM):
- No loads/stores before the barrier can be reordered with loads/stores after
- Heavy but simple
2. Store barrier (sfence on x86, dmb st on ARM):
- All stores before the barrier complete before any stores after
- Useful for ensuring writes are visible
3. Load barrier (lfence on x86, dmb ld on ARM):
- All loads before the barrier complete before any loads after
- Less common in practice
4. Acquire barrier:
- Prevents loads/stores after the barrier from moving before it
- Used after loading a lock or flag
5. Release barrier:
- Prevents loads/stores before the barrier from moving after it
- Used before storing to a lock or flag
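In portable C11 code, these barriers are expressed with atomic_thread_fence from <stdatomic.h> (a sketch of the correspondence):
atomic_thread_fence(memory_order_seq_cst); // full barrier
atomic_thread_fence(memory_order_release); // release fence: prior loads/stores stay before it
atomic_thread_fence(memory_order_acquire); // acquire fence: later loads/stores stay after it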
When to Use Barriers
Explicit barriers are rare in application code because you typically use higher-level primitives. But in kernel code or lock-free data structures, you need them:
// Publisher (producer)
data = 42;
smp_wmb(); // Store barrier
WRITE_ONCE(ready, 1);
// Consumer
while (READ_ONCE(ready) == 0) { }
smp_rmb(); // Load barrier
assert(data == 42);
Architecture-Specific Considerations
On x86, some barriers are essentially no-ops because of the strong memory model:
// Linux kernel definitions for x86 (simplified)
#define smp_rmb() barrier() // Just a compiler barrier
#define smp_wmb() barrier() // Just a compiler barrier
#define smp_mb() asm volatile("mfence" ::: "memory")
On ARM, all of them map to actual hardware instructions:
// Linux kernel definitions for ARM (simplified)
#define smp_rmb() asm volatile("dmb ld" ::: "memory")
#define smp_wmb() asm volatile("dmb st" ::: "memory")
#define smp_mb() asm volatile("dmb sy" ::: "memory")
Atomic Operations: Read-Modify-Write Without Tears
Atomic operations provide indivisible read-modify-write operations that are essential for lock-free programming.
Common Atomic Operations
1. Compare-and-Swap (CAS):
// Atomically: if (*ptr == old) { *ptr = new; return true; }
bool compare_and_swap(int *ptr, int old, int new);
The foundation of most lock-free algorithms:
// Lock-free increment
int old, new;
do {
old = atomic_load(counter);
new = old + 1;
} while (!compare_and_swap(counter, old, new));
2. Fetch-and-Add:
int old = atomic_fetch_add(counter, 1);
Simpler and often faster than CAS for counters.
3. Exchange:
int old = atomic_exchange(ptr, new_value);
4. Test-and-Set:
bool was_set = atomic_test_and_set(flag);
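For reference, here is how these operations look with C11 <stdatomic.h>; the generic functions are standard, while atomics_demo is just an illustrative wrapper:
#include <stdatomic.h>
#include <stdbool.h>

atomic_int counter;
atomic_flag flag = ATOMIC_FLAG_INIT;

void atomics_demo(void) { // illustrative wrapper
    int expected = atomic_load(&counter);
    // CAS: on failure, expected is reloaded with the current value
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1))
        ; // retry with the freshly loaded expected
    int old = atomic_fetch_add(&counter, 1);        // fetch-and-add
    old = atomic_exchange(&counter, 0);             // exchange
    bool was_set = atomic_flag_test_and_set(&flag); // test-and-set
    (void)old; (void)was_set;
}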
ABA Problem
A classic issue with CAS-based algorithms:
// Thread 1 reads A
int old = atomic_load(ptr); // Gets value A
// Thread 2 changes A → B → A
atomic_store(ptr, B);
atomic_store(ptr, A);
// Thread 1's CAS succeeds even though the world changed!
if (compare_and_swap(ptr, old, new)) {
// Succeeds, but intermediate state was missed
}
Solutions:
- Use versioned pointers (pack a counter with the pointer)
- Use hazard pointers or epoch-based reclamation
- Use double-word CAS (DCAS) where available
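A sketch of the versioned-pointer idea, packing a 32-bit index and a 32-bit generation counter into one 64-bit atomic (the layout and the update_head helper are illustrative):
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t head; // low 32 bits: index, high 32 bits: generation

void update_head(uint32_t new_index) { // illustrative helper
    uint64_t old = atomic_load(&head);
    uint64_t fresh;
    do {
        uint32_t gen = (uint32_t)(old >> 32);
        // bump the generation on every update so A -> B -> A is detectable
        fresh = ((uint64_t)(gen + 1) << 32) | new_index;
    } while (!atomic_compare_exchange_weak(&head, &old, fresh));
}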
Memory Ordering Semantics: The C11/C++11 Model
Modern C/C++ provide a memory model with explicit ordering semantics. Every atomic operation specifies its memory ordering:
Memory Order Types
1. memory_order_relaxed: No ordering guarantees, just atomicity
atomic_fetch_add_explicit(counter, 1, memory_order_relaxed);
Good for counters where you only care about the final value, not intermediate states.
2. memory_order_acquire: Used with loads
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
// All subsequent reads/writes cannot move before this
3. memory_order_release: Used with stores
atomic_store_explicit(&ready, 1, memory_order_release);
// All prior reads/writes cannot move after this
4. memory_order_acq_rel: Both acquire and release (for RMW operations)
atomic_fetch_add_explicit(counter, 1, memory_order_acq_rel);
5. memory_order_seq_cst: Sequentially consistent (default, strongest)
atomic_fetch_add(counter, 1); // Implies seq_cst
Establishes a global total order of all seq_cst operations across all threads.
6. memory_order_consume: Like acquire but weaker (rarely used, tricky to use correctly)
Acquire-Release Synchronization
The most common pattern in lock-free code:
// Producer
data = 42;
atomic_store_explicit(&ready, 1, memory_order_release);
// Consumer
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
assert(data == 42); // Guaranteed to see data
The release store “synchronizes with” the acquire load. Everything before the release is visible after the acquire.
Performance Implications
Different orderings have different costs:
- Relaxed: Cheapest, often just a regular load/store
- Acquire/Release: Medium cost, may require barriers on weak architectures
- Seq_cst: Most expensive, requires full barriers on most architectures
Don’t use seq_cst everywhere just because it’s the default. Use the weakest ordering that’s correct.
READ_ONCE/WRITE_ONCE: The Linux Kernel Approach
The Linux kernel uses the READ_ONCE and WRITE_ONCE macros extensively. They prevent problematic compiler optimizations on shared variables; as we'll see below, they do not by themselves order operations across CPUs.
Definition
#define READ_ONCE(x) \
(*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, val) \
(*(volatile typeof(x) *)&(x) = (val))
Why They’re Needed
1. Prevent register caching:
// Without READ_ONCE, compiler might optimize to infinite loop
while (READ_ONCE(flag) == 0) { }
2. Prevent store tearing:
// Without WRITE_ONCE, compiler might generate multiple stores
WRITE_ONCE(status, 0x12345678); // Single store instruction
3. Prevent load tearing:
uint64_t value = READ_ONCE(big_value); // Single load instruction
4. Signal intent:
// "This variable is shared and might change unexpectedly"
if (READ_ONCE(stop_flag)) {
break;
}
When to Use Them
- Accessing shared variables without locks
- Reading/writing data structures accessed by interrupt handlers
- Implementing lock-free algorithms (though usually need stronger atomics)
- Any time a variable might be concurrently modified
Limitations
READ_ONCE/WRITE_ONCE don’t provide ordering guarantees beyond preventing compiler reordering. For ordering across cores, you need memory barriers or atomic operations.
Example of what goes wrong:
int data = 0;
int ready = 0;
// Writer thread
data = 42;
WRITE_ONCE(ready, 1);
// Reader thread
while (READ_ONCE(ready) == 0) { }
int value = READ_ONCE(data); // WRONG! Might still see data == 0
Even though we use READ_ONCE and WRITE_ONCE, on weakly-ordered architectures (ARM, RISC-V, PowerPC), the reader might observe ready == 1 but still see the old value of data. Why?
- The writer’s stores might be reordered by the CPU’s store buffer
- The reader’s loads might be reordered: it could speculatively read data before checking ready
- Cache coherency delays: the write to data might not have propagated to the reader’s cache yet
READ_ONCE/WRITE_ONCE only prevent the compiler from reordering or optimizing away the accesses. They don’t prevent CPU-level reordering.
Correct solution using acquire-release:
int data = 0;
atomic_int ready = 0;
// Writer thread
data = 42;
atomic_store_explicit(&ready, 1, memory_order_release);
// Reader thread
while (atomic_load_explicit(&ready, memory_order_acquire) == 0) { }
int value = data; // Guaranteed to see data == 42
assert(value == 42); // Will never fail
The memory_order_release store ensures all prior writes (including to data) are visible before the store to ready. The memory_order_acquire load ensures that once we observe ready == 1, all subsequent reads (including data) will see the updated values.
Alternative solution with explicit barriers:
// Writer thread
data = 42;
smp_wmb(); // Store barrier
WRITE_ONCE(ready, 1);
// Reader thread
while (READ_ONCE(ready) == 0) { }
smp_rmb(); // Load barrier
int value = data;
assert(value == 42); // Now safe
The key lesson: READ_ONCE/WRITE_ONCE are necessary but not sufficient for synchronization. They prevent compiler mischief but don’t order memory operations across cores. For that, you need atomic operations with proper memory ordering or explicit barriers.
Data Dependency Ordering: A Special Case
A few architectures (most famously the old DEC Alpha) can reorder even dependent loads; ARM and POWER, by contrast, do respect address dependencies:
// Thread 1
data->value = 42;
smp_wmb();
ptr = data;
// Thread 2
struct data *p = ptr;
// On Alpha, might see p != NULL but p->value != 42!
int val = p->value;
On Alpha, the dereference can return stale data even though p itself is the new pointer: its banked caches can process invalidations out of order, so the load of p->value effectively completes before the load of p.
Solution: Use smp_read_barrier_depends() (folded into READ_ONCE() in modern kernels) or consume ordering:
struct data *p = READ_ONCE(ptr);
smp_read_barrier_depends();
int val = p->value;
Or use RCU (Read-Copy-Update) which handles this automatically.
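In kernel code, the read side of this idiom looks roughly like the following (rcu_read_lock, rcu_dereference, and rcu_read_unlock are the real kernel API; do_something is a stand-in):
rcu_read_lock();
struct data *p = rcu_dereference(ptr); // dependency-ordered load
if (p)
    do_something(p->value); // hypothetical consumer
rcu_read_unlock();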
Practical Synchronization Problems and Solutions
Problem 1: Double-Checked Locking
The naive implementation is broken:
// BROKEN CODE
if (ptr == NULL) {
lock();
if (ptr == NULL) {
ptr = malloc(sizeof(*ptr));
init(ptr);
}
unlock();
}
use(ptr);
Between malloc and init, another thread might see a non-NULL but uninitialized ptr.
Solution: Use atomic operations with appropriate ordering:
struct obj *p = atomic_load_explicit(&ptr, memory_order_acquire);
if (p == NULL) {
lock();
p = atomic_load_explicit(&ptr, memory_order_relaxed);
if (p == NULL) {
p = malloc(sizeof(*p));
init(p);
atomic_store_explicit(&ptr, p, memory_order_release);
}
unlock();
}
use(p);
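If all you need is lazy initialization in user space, C11's call_once (where <threads.h> is available) packages this pattern safely; get_obj and init_ptr here are illustrative:
#include <threads.h>
#include <stdlib.h>

static once_flag once = ONCE_FLAG_INIT;
static struct obj *ptr;

static void init_ptr(void) { // illustrative
    ptr = malloc(sizeof(*ptr));
    init(ptr); // init() as in the example above
}

struct obj *get_obj(void) { // illustrative
    call_once(&once, init_ptr);
    return ptr; // call_once synchronizes with every caller
}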
Problem 2: Spinlock Implementation
A broken spinlock:
// BROKEN
void spin_lock(int *lock) {
while (*lock == 1) { }
*lock = 1;
}
Multiple threads can “acquire” the lock simultaneously.
Correct implementation with atomic operations:
void spin_lock(atomic_int *lock) {
while (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 1) {
while (atomic_load_explicit(lock, memory_order_relaxed) == 1) {
cpu_relax(); // Hint to CPU (pause instruction)
}
}
}
void spin_unlock(atomic_int *lock) {
atomic_store_explicit(lock, 0, memory_order_release);
}
Problem 3: Peterson’s Algorithm
Peterson’s algorithm for mutual exclusion is a classic trap: release/acquire is not enough. Correctness depends on the store to flag[id] being ordered before the later load of flag[other], and that store-load ordering is exactly what acquire/release does not provide (even x86 reorders it). Use sequentially consistent operations, or a full fence between the stores and the loads:
atomic_int flag[2];
atomic_int turn;
void lock(int id) {
int other = 1 - id;
atomic_store_explicit(&flag[id], 1, memory_order_seq_cst);
atomic_store_explicit(&turn, other, memory_order_seq_cst);
while (atomic_load_explicit(&flag[other], memory_order_seq_cst) == 1 &&
atomic_load_explicit(&turn, memory_order_seq_cst) == other) {
cpu_relax();
}
}
void unlock(int id) {
atomic_store_explicit(&flag[id], 0, memory_order_release);
}
Problem 4: Message Passing
Producer-consumer with a ring buffer:
struct ring {
int data[SIZE];
atomic_int head; // Producer writes here
atomic_int tail; // Consumer reads here
};
// Producer
void produce(struct ring *r, int value) {
int h = atomic_load_explicit(&r->head, memory_order_relaxed);
int next = (h + 1) % SIZE;
while (next == atomic_load_explicit(&r->tail, memory_order_acquire)) {
// Ring full, wait
}
r->data[h] = value;
atomic_store_explicit(&r->head, next, memory_order_release);
}
// Consumer
int consume(struct ring *r) {
int t = atomic_load_explicit(&r->tail, memory_order_relaxed);
while (t == atomic_load_explicit(&r->head, memory_order_acquire)) {
// Ring empty, wait
}
int value = r->data[t];
atomic_store_explicit(&r->tail, (t + 1) % SIZE, memory_order_release);
return value;
}
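Note that this ring is single-producer/single-consumer: the relaxed self-loads of head (in produce) and tail (in consume) are safe only because each index is written by exactly one thread. Multiple producers or consumers would need CAS loops or a lock.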
Debugging Techniques
1. Thread Sanitizer (TSan)
Compile with -fsanitize=thread to detect data races:
gcc -fsanitize=thread -g program.c -o program
./program
TSan will report races with stack traces.
2. Stress Testing
Race conditions are probabilistic. Run tests under stress:
for (int i = 0; i < 1000000; i++) {
test_concurrent_operation();
}
3. Memory Model Simulators
Tools like cppmem enumerate all the executions the C/C++11 memory model allows for a small snippet, so you can check whether a bad outcome is actually possible.
4. Assertions and Invariants
Liberally sprinkle assertions:
assert(atomic_load(&refcount) > 0);
5. Architecture-Specific Testing
Test on weak memory order architectures (ARM, RISC-V). Code might work on x86 but fail on ARM.
Architecture-Specific Considerations
x86/x86-64
- Strong memory model helps (but don’t rely on it)
- lock prefix for atomic operations
- mfence, lfence, sfence for barriers
- pause instruction for spin-loop optimization
ARM
- Weak memory model requires careful use of barriers
- dmb (data memory barrier): dmb sy, dmb ld, dmb st
- dsb (data synchronization barrier): stronger than dmb
- isb (instruction synchronization barrier): for instruction cache
- ldrex/strex for atomic operations (older)
- Load-acquire/store-release instructions (ARMv8+): ldar, stlr
RISC-V
- Weak memory model like ARM
- fence instruction with ordering bits
- lr/sc (load-reserved/store-conditional) for atomics
- Atomic extension provides amo operations
Power/PowerPC
- Very weak memory model
- Requires careful barrier placement
- sync, lwsync, isync barriers
- lwarx/stwcx for atomic operations
Best Practices and Rules of Thumb
- Use existing synchronization primitives (mutexes, atomics) instead of rolling your own
- Start with sequential consistency and optimize only if profiling shows it’s necessary
- Document your memory ordering assumptions explicitly in comments
- Test on weak memory order architectures even if your production target is x86
- Use Thread Sanitizer religiously during development
- Avoid data races - if multiple threads access data and at least one writes, use synchronization
- Understand the difference between volatile and atomic - volatile doesn’t provide synchronization
- Pad hot structures to avoid false sharing
- Use READ_ONCE/WRITE_ONCE for shared variables in kernel-style code
- Learn your platform’s memory model - x86, ARM, and RISC-V are all different
Conclusion
Memory ordering, cache coherency, and synchronization are complex topics individually, but their interaction creates emergent complexity that can be genuinely mind-bending. The key insights:
- Compilers reorder for optimization
- CPUs reorder for performance
- Caches create coherency problems that require protocols and barriers
- Different architectures have different memory models - x86 is strong, ARM/RISC-V are weak
- Atomic operations and memory barriers are your tools for controlling this chaos
- Acquire-release semantics are usually what you want for synchronization
The good news is that you rarely need to think about this directly. Use mutexes, condition variables, atomic operations with appropriate ordering, and existing lock-free data structures. But when you do need to dig deep—for kernel code, high-performance systems, or lock-free algorithms—understanding these concepts is essential.
And yes, it’s still close enough to black magic that even experts get it wrong. The difference is that experts know when to be paranoid and which tools to reach for.
Further Reading
- “Memory Barriers: a Hardware View for Software Hackers” by Paul McKenney
- “Linux Kernel Memory Barriers” - Documentation/memory-barriers.txt in Linux source
- “C++ Concurrency in Action” by Anthony Williams
- “Is Parallel Programming Hard, And, If So, What Can You Do About It?” by Paul McKenney
- “The Arm Memory Model” - ARM Architecture Reference Manual
- “A Primer on Memory Consistency” by Sorin, Hill, and Wood
- Jeff Preshing’s blog on lock-free programming
- The RISC-V Memory Model specification