O_DIRECT - The Problem That Grew Up With Multi-Threading

Introduction: A Problem Hiding in Plain Sight

Direct I/O (O_DIRECT) has been a contentious feature in Linux since its introduction. Linus Torvalds famously called it a design “by a deranged monkey on some serious mind-controlling substances” back in 2002. Yet for years, it continued to work—mostly. Applications used it, databases relied on it, and virtual machines benefited from its zero-copy performance.

But something fundamental has changed. As modern software has embraced multi-threading at every level—from applications to filesystems within the kernel itself—a problem that was once manageable has become critical. The truth is stark: with O_DIRECT, there is no way to guarantee that nobody will touch your I/O buffers during the operation.

This affects more than you might think. Btrfs can’t protect your data checksums. ext4 and XFS can’t protect their metadata checksums. MD-RAID can’t keep mirrors synchronized. And there’s no fix coming—it’s an architectural limitation in Linux’s memory management that becomes increasingly problematic as software grows more concurrent.

This isn’t a bug that can be fixed. It’s an architectural limitation that has finally caught up with us.

The Core Problem: Unstable Pages

When you write data to a file using buffered I/O (the normal mode), the kernel copies your data into the page cache. Once copied, the kernel has full control: it can set the AS_STABLE_WRITES flag to prevent anyone from modifying that page while it’s being written to disk. If the filesystem needs to calculate a checksum (like Btrfs does), it knows the data won’t change between checksum calculation and disk write.
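
To make the contrast concrete, here is a minimal sketch (illustrative only, not taken from any real codebase) of why a buffered write is immune to later buffer modifications:

/* Minimal illustration: with buffered I/O, write() copies the data
 * into the page cache before returning, so the application can reuse
 * the buffer immediately without affecting what reaches the disk. */
#include <string.h>
#include <unistd.h>

void buffered_example(int fd)
{
    char buf[4096];
    memset(buf, 'A', sizeof(buf));
    write(fd, buf, sizeof(buf));    /* the kernel now holds its own copy */
    memset(buf, 'B', sizeof(buf));  /* safe: the in-flight data is stable */
}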

Direct I/O bypasses this protection entirely. When you use O_DIRECT:

  1. Your application provides a buffer in user-space memory
  2. The kernel maps that buffer directly for I/O
  3. The buffer remains in user-space, writable by anyone
  4. The filesystem calculates checksums (if needed)
  5. The I/O operation begins…
  6. But your application, another thread, or a guest OS can modify the buffer
  7. The disk receives data that doesn’t match the checksum

The fundamental issue: Linux cannot write-protect anonymous user pages during I/O operations. FreeBSD can do this, but it’s a capability Linux’s memory management subsystem simply doesn’t have.
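
Here is a user-space sketch of the hazard (a hypothetical demonstration with an illustrative file name; error handling is omitted and the outcome depends on timing):

/* One thread issues an O_DIRECT write while another keeps modifying
 * the same buffer.  Which bytes land on disk is unpredictable, and on
 * a checksumming filesystem the stored checksum may not match the
 * stored data. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static char *buf;
static volatile int io_in_flight = 1;

static void *scribbler(void *arg)
{
    (void)arg;
    while (io_in_flight)
        buf[0] ^= 0xff;             /* mutates the buffer mid-I/O */
    return NULL;
}

int main(void)
{
    pthread_t t;
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

    /* O_DIRECT requires aligned buffers; 4096 covers most devices */
    if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096))
        return 1;
    memset(buf, 'A', 4096);

    pthread_create(&t, NULL, scribbler, NULL);
    pwrite(fd, buf, 4096, 0);       /* old bytes? new bytes? a mix? */
    io_in_flight = 0;
    pthread_join(t, NULL);
    close(fd);
    return 0;
}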

When Times Were Simpler

Years ago, this wasn’t as big a deal. Let’s consider why:

Single-Threaded Applications

Early database systems and applications that used O_DIRECT were often single-threaded or carefully managed their threading. A single-threaded application knows exactly when it’s safe to modify a buffer—it just doesn’t do it during I/O. Problem solved, right?

Simple Filesystems

Traditional filesystems like ext3 and early ext4 didn’t compute data checksums. They might checksum metadata, but the actual file data? That was the disk controller’s problem. Without checksums to validate, it didn’t matter if the buffer changed slightly during I/O.

Single-Core CPUs

On a single-core system, even a multi-threaded application had less opportunity for race conditions: the chance of thread 2 modifying the buffer at precisely the wrong moment, while thread 1’s I/O was in flight, was far lower than on today’s many-core machines.

Simpler Storage Stack

The storage stack was simpler. No complex software RAID doing its own checksums, no T10 DIF/DIX integrity checking at the block layer, no virtualization layers where a guest OS inside a VM has its own ideas about buffer management.

The Modern Multi-Threaded Reality

Fast forward to 2025, and everything has changed:

Pervasively Multi-Threaded Applications

Modern applications are inherently concurrent. Your database isn’t just one thread—it’s dozens or hundreds of threads, all working with shared memory. Your virtual machine hypervisor (QEMU, KVM) serves multiple virtual CPUs to guest operating systems that have no idea they’re sharing resources.

When QEMU uses O_DIRECT with cache=none to write a VM’s disk image, here’s what happens:

  1. The guest OS (say, Windows or Linux) writes to its “disk”
  2. QEMU maps that guest memory and issues an O_DIRECT write to the Btrfs image file
  3. Btrfs calculates a checksum of the data
  4. But the guest OS doesn’t know it needs to keep that memory stable
  5. The guest’s filesystem (NTFS, ext4, XFS) might modify pages during writeback—this is a valid optimization for them since they lack data checksums
  6. The modified data hits the disk
  7. Btrfs reads it back and finds a checksum mismatch

This isn’t a bug in the guest OS. It’s not a bug in QEMU. It’s not even really a bug in Btrfs. It’s a fundamental architectural problem with O_DIRECT.

Multi-Threaded Filesystems

Modern filesystems themselves are heavily multi-threaded. Btrfs, XFS, and others use worker threads, asynchronous I/O, and complex background operations. When you start an O_DIRECT write, multiple subsystems might touch that I/O:

  • The VFS layer
  • The filesystem’s direct I/O path
  • The block layer
  • The device mapper (if using RAID)
  • Each individual disk’s driver

All of these run concurrently. And if you’re using MD-RAID (Linux software RAID), the situation gets even worse.

The MD-RAID Disaster

MD-RAID with O_DIRECT is fundamentally broken, and it’s unfixable. Here’s why:

When you write to a RAID-1 mirror with O_DIRECT:

  1. MD-RAID receives the write request with a pointer to your user-space buffer
  2. It forwards that same pointer to each mirror disk
  3. Each disk device independently copies from that buffer
  4. But copying isn’t instantaneous
  5. If another thread in your application modifies the buffer while disk 1 is copying but before disk 2 starts copying…
  6. Different data gets written to each mirror
  7. Your RAID array is now corrupted

An unprivileged user can trivially desynchronize a RAID array. This is Kernel Bug #99171, reported years ago, and it remains unfixed because there’s no reasonable fix that doesn’t destroy the performance benefits of O_DIRECT.
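
The effect is easy to model entirely in user space (a simulation, not kernel code; the usleep() merely widens the window that MD-RAID’s staggered per-disk copies create):

/* Simulation of mirror desync: two "mirrors" copy from the same
 * source buffer at slightly different times while another thread
 * mutates it, so the copies diverge. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static char src[4096], mirror0[4096], mirror1[4096];
static volatile int done;

static void *mutator(void *arg)
{
    (void)arg;
    while (!done)
        src[0]++;                        /* application thread touching the buffer */
    return NULL;
}

int main(void)
{
    pthread_t t;
    memset(src, 'A', sizeof(src));
    pthread_create(&t, NULL, mutator, NULL);

    memcpy(mirror0, src, sizeof(src));   /* "disk 0" copies first... */
    usleep(1000);                        /* ..."disk 1" a moment later */
    memcpy(mirror1, src, sizeof(src));

    done = 1;
    pthread_join(t, NULL);

    printf("mirrors %s\n",
           memcmp(mirror0, mirror1, sizeof(src)) ? "DIVERGED" : "match");
    return 0;
}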

Proxmox, a popular hypervisor platform, removed mdadm software RAID from their installation options entirely because of this bug.

The Kernel Developers Respond

To the kernel developers’ credit, they’ve recognized the problem. In kernel 6.15 (released May 2025), significant changes landed:

Btrfs: Complete Fallback

As of kernel 6.15, Btrfs no longer supports true direct I/O for checksummed files:

/*
 * We can't control the folios being passed in, applications can write
 * to them while a direct IO write is in progress.  This means the
 * content might change after we calculated the data checksum.
 * Therefore we can end up storing a checksum that doesn't match the
 * persisted data.
 *
 * To be extra safe and avoid false data checksum mismatch, if the
 * inode requires data checksum, just fallback to buffered IO.
 */
if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)) {
    btrfs_inode_unlock(BTRFS_I(inode), ilock_flags);
    goto buffered;
}

If your file has checksums enabled (the default), O_DIRECT writes silently fall back to buffered I/O. You get the safety of stable pages but lose the zero-copy performance.

Want actual direct I/O on Btrfs? You need to explicitly disable checksums:

# Mark the image NOCOW, which also disables data checksums on Btrfs.
# Note: the +C attribute only takes effect on new or empty files.
chattr +C /path/to/vm-image.qcow2

XFS: Conditional Fallback

XFS doesn’t do data checksums, but it respects block devices that require stable pages (those with T10 DIF/DIX integrity checking):

/*
 * Simplified from the kernel source: fall back to buffered I/O
 * when the backing block device requires stable pages
 * (e.g. T10 DIF/DIX integrity checking).
 */
if (bdev_stable_writes(bdev))
        goto buffered;

Christoph Hellwig, one of the key filesystem developers, expressed unhappiness about this but acknowledged it was necessary.

The Community Consensus

From the mailing lists and bug reports, a clear picture emerges:

“O_DIRECT is really fundamentally broken. There’s just no way to fix it sanely.”

“Given that we never claimed that you can’t modify the buffer I would not call it buggy, even if the behavior is unfortunate.” - Christoph Hellwig

The kernel community has essentially given up on making O_DIRECT safe with checksumming systems. The fallback approach is a pragmatic admission that the original design doesn’t work in modern concurrent environments.

Why Can’t This Be Fixed?

You might wonder: why not just copy the buffer when checksums are needed? Or pin the pages? Or write-protect them?

Buffer Copying Defeats the Purpose

The entire point of O_DIRECT is zero-copy I/O. If you copy the buffer to make it stable, you’ve lost the main performance benefit. You might as well use buffered I/O at that point (which is exactly what Btrfs now does).

Page Pinning Has Severe Limitations

Linux’s get_user_pages() can pin pages in memory to prevent them from being swapped out, but it cannot write-protect them. Other threads in your application, or the guest OS in a VM, can still modify the contents.
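
In kernel terms, the pinning path looks roughly like this (a simplified fragment, not complete code; exact signatures vary across kernel versions):

/* Simplified sketch of page pinning for direct I/O.  The pin keeps the
 * pages resident and prevents migration, but other threads can still
 * store to them through their own mappings the entire time. */
struct page *pages[16];
int n = pin_user_pages(user_addr, 16, FOLL_WRITE | FOLL_LONGTERM, pages);
/* ... build a bio around pages[] and submit the I/O ... */
unpin_user_pages(pages, n);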

FreeBSD has vm_fault_hold_user_pages() which can write-protect user pages and take page faults on modifications. Linux has no equivalent mechanism, and adding it would require fundamental changes to the memory management subsystem.

The FreeBSD vs Linux Architecture

This isn’t just a missing feature—it’s an architectural difference in how the operating systems handle memory:

FreeBSD: Can place user pages under write protection, guaranteeing buffer stability at the cost of page faults and complexity.

Linux: Cannot write-protect anonymous user pages for I/O purposes. Simpler memory management, but unable to guarantee O_DIRECT buffer stability.

Changing this would be a massive undertaking affecting the core memory management code, with performance implications for all workloads.

The Growing Problem

What makes this situation increasingly problematic is that the trend toward multi-threading isn’t slowing down—it’s accelerating:

  • Multi-core CPUs are standard: Even embedded devices often have 4+ cores now
  • Asynchronous I/O is everywhere: io_uring, libaio, and async frameworks encourage concurrent operations
  • Virtualization is ubiquitous: Running multiple VMs with their own concurrent filesystems is normal
  • Data integrity is expected: Users increasingly expect their filesystems to detect corruption (hence Btrfs, ZFS, and checksumming block devices)

Every one of these trends makes the O_DIRECT unstable pages problem worse. The window for buffer modification is larger, the probability of concurrent access is higher, and the consequences (checksum mismatches, RAID desync) are more severe.

Practical Implications Today

For Virtual Machine Users on Btrfs

If you’re running VMs on Btrfs with default settings:

  • Your cache=none (direct I/O) setting in QEMU silently falls back to buffered I/O in kernel 6.15+
  • You lose the zero-copy performance you thought you were getting
  • You gain protection against checksum mismatch errors

Want actual direct I/O performance? Disable checksums on the image file (as with any +C usage, this must be done while the file is empty, or set on the parent directory before the image is created):

chattr +C /var/lib/libvirt/images/vm.qcow2

But now you’ve lost the data integrity protection that Btrfs provides.

For ext4 and XFS Users: The Metadata Problem

You might think ext4 and XFS are safe since they don’t checksum your file data. You’d be wrong. They checksum their metadata—and this creates the same unstable pages problem, just for filesystem structures instead of your data.

ext4’s Metadata Checksums

Modern ext4 has two layers of checksums, and you can’t escape both:

Journal checksums (jbd2): Mandatory since kernel 3.5. There’s no way to disable them—the mount option was removed and the feature is always on. These protect the integrity of every journal transaction.

Filesystem metadata checksums: Enabled by default in most distributions. These protect superblocks, inode tables, extent trees, directory blocks—everything except your actual file data.

You can disable the filesystem metadata checksums:

umount /mnt/vm-images
tune2fs -O ^metadata_csum /dev/mapper/vg-images
mount /mnt/vm-images

But the mandatory journal checksums remain, and they have the same race condition problem.

XFS’s Metadata CRCs

XFS metadata CRCs have been enabled by default since 2013 (xfsprogs 3.2.3). Unlike ext4, you can only disable them at format time:

mkfs.xfs -m crc=0 /dev/sdX

No mount option exists to change this on an existing filesystem. And disabling CRCs also disables the free inode btree, reflink support, reverse mapping, and big timestamps. You lose too much to make it worthwhile.

The Metadata Unstable Pages Race

When you use O_DIRECT with ext4 or XFS, here’s what happens:

  1. Your write modifies metadata (extent trees, inode tables, allocation bitmaps)
  2. The filesystem calculates checksums of that metadata for the journal
  3. Another thread, or an async filesystem worker, modifies the same metadata
  4. The checksum no longer matches what gets written
  5. On journal replay or recovery: checksum mismatch, filesystem corruption

This is the exact same architectural problem as Btrfs’s data checksums. Linux cannot write-protect kernel metadata structures during checksum calculation. Multiple concurrent operations—O_DIRECT writes, page cache writeback, allocations from other threads—all touch the same structures.

Why You Haven’t Hit This Yet

Three reasons:

  1. Smaller attack surface: Most O_DIRECT workloads write large blocks with relatively few metadata updates. A 1GB database write might touch one inode, a few extent tree blocks, and some allocation bitmaps. Btrfs checksums every 4KB data block.

  2. Narrow race window: ext4 and XFS batch and optimize metadata operations efficiently. You need perfect timing—metadata modification during the exact moment of checksum calculation.

  3. Lucky timing so far: the same multi-threading trends that made Btrfs’s problem critical apply here too. More cores, more threads, faster NVMe storage, io_uring, and denser virtualization all widen the race window.

When Metadata Corruption Strikes

Unlike data corruption (which might affect one file), metadata corruption is often catastrophic:

  • Filesystem won’t mount: “bad geometry” or “corruption detected”
  • Kernel panic during mount or I/O
  • Journal replay fails with checksum errors
  • Files disappear or become inaccessible
  • xfs_repair or e2fsck required, may not fully recover

The kernel developers documented real race conditions in ext4:

  • jbd2 checkpointing race: CPU1 checksums a buffer while CPU2 modifies it, leading to “potential filesystem corruption”
  • dioread_nolock corruption: Direct I/O reads return stale data after truncate operations
  • Superblock checksum races: Reads racing with updates caused systemd boot failures

XFS has its own issues with the Committed Item List (CIL), where lock contention between concurrent in-memory and on-disk logging operations creates windows for inconsistency.

Should You Disable Metadata Checksums?

For ext4: You can disable metadata_csum, but journal checksums remain mandatory. The protection against hardware failures usually outweighs the small race risk. Disable only if you’re running high-density virtualization with extreme metadata churn and have seen corruption.

For XFS: Probably not worth it. You lose too many features (reflink, reverse mapping, etc.), and the CRC overhead is minimal on modern CPUs with hardware CRC32c acceleration.

Better approach: If you’re hitting corruption issues, consider:

  • Using buffered I/O instead of O_DIRECT
  • Switching to ZFS, which handles this correctly
  • Reducing concurrency in your application
  • Using hardware RAID instead of MD-RAID

The kernel developers haven’t added a fallback-to-buffered-I/O path for ext4/XFS metadata checksums the way they did for Btrfs data checksums, because metadata is a much smaller fraction of total I/O, and disabling key filesystem features to work around it isn’t acceptable.

For MD-RAID Users

The advice is simple: don’t use O_DIRECT with MD-RAID. Period.

The bug allows unprivileged userspace to corrupt your RAID arrays. There is no fix. Proxmox removed MD-RAID support entirely. If you need software RAID with O_DIRECT:

  • Use ZFS (properly handles direct I/O through buffer management)
  • Use Btrfs RAID (forces buffered I/O with checksums in 6.15+)
  • Use hardware RAID controllers

For Application Developers

If you’re using O_DIRECT:

  1. Never modify buffers during I/O - wait for completion
  2. Avoid multi-threaded access to I/O buffers
  3. Consider io_uring instead - it has better control over buffer lifecycle (see the sketch after this list)
  4. Test on ARM - weak memory ordering will expose races that x86 hides
  5. Accept that true direct I/O might not be available on checksumming filesystems
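
As a concrete illustration of points 1 and 3, here is a hedged sketch using liburing that submits a write and waits for its completion before the buffer may be reused (error handling omitted; not a complete program):

/* Submit one write, then block until it completes.  The buffer must
 * stay untouched between io_uring_submit() and the completion. */
#include <liburing.h>

int write_and_wait(int fd, void *buf, unsigned len)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    struct io_uring_sqe *sqe;
    int res;

    io_uring_queue_init(8, &ring, 0);

    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, fd, buf, len, 0);
    io_uring_submit(&ring);

    io_uring_wait_cqe(&ring, &cqe);   /* only now is the buffer free again */
    res = cqe->res;                   /* bytes written, or -errno */
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return res;
}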

ZFS: The Exception That Proves the Rule

OpenZFS added direct I/O support in version 2.3.0 (2024), and it actually works safely. How?

On FreeBSD, ZFS can write-protect user pages. On Linux, where that’s not possible, ZFS uses explicit buffer copying and pinning to ensure stability. In other words, ZFS sacrifices some zero-copy performance to maintain correctness.

This is the pragmatic approach: if you can’t guarantee buffer stability, you ensure stability through copying. It’s slower than true zero-copy, but it doesn’t corrupt data.
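
In effect this is the bounce-buffer pattern. A conceptual sketch follows, where checksum() and submit_write() are hypothetical placeholders rather than ZFS internals:

/* Stability by copying: snapshot the user data into a private buffer,
 * then checksum and write the copy.  checksum() and submit_write()
 * are hypothetical placeholders, not ZFS functions. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

uint32_t checksum(const void *p, size_t n);          /* hypothetical */
void submit_write(void *p, size_t n, uint32_t c);    /* hypothetical */

void stable_write(const void *user_buf, size_t len)
{
    void *bounce = malloc(len);
    memcpy(bounce, user_buf, len);    /* the caller may keep scribbling */

    uint32_t csum = checksum(bounce, len);  /* matches what gets written */
    submit_write(bounce, len, csum);        /* I/O runs from the private copy */
}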

The Future of O_DIRECT

The trajectory is clear:

  1. More filesystems will follow Btrfs and XFS in falling back to buffered I/O when integrity features are enabled
  2. The io_uring interface will increasingly replace O_DIRECT for asynchronous I/O
  3. Applications will move toward explicit buffer management rather than hoping the kernel can make unreliable primitives safe

Linus Torvalds’ 2002 criticism has aged remarkably well. He argued that O_DIRECT was a broken interface that required AIO (asynchronous I/O) to be useful, and even then it was fundamentally flawed. The page cache, he maintained, was almost always the right answer.

His exchange with XFS maintainer Dave Chinner in 2019 remains instructive:

“Caches work, Dave. Anybody who thinks caches don’t work is incompetent. 99% of all filesystem accesses are cached, and they never do any IO at all, and the page cache handles them beautifully.”

Conclusion: Embrace the Reality

The uncomfortable truth is that O_DIRECT was designed for a simpler era. It worked when applications were single-threaded, filesystems were simple, and data integrity was the disk controller’s job.

Modern multi-threaded software—where applications, filesystems, and even the kernel’s block layer run dozens of concurrent operations—exposes the fundamental flaw: Linux cannot guarantee that user-space buffers remain stable during I/O.

The kernel developers have made the pragmatic choice: when data integrity requires stable pages, silently fall back to buffered I/O. You get correctness at the cost of performance.

For most workloads, this is the right trade-off. The page cache is fast, and copying data is cheap compared to data corruption. If you truly need zero-copy direct I/O and data integrity, your options are limited: use ZFS, disable filesystem checksums, or accept that your hardware must guarantee integrity.

The problem that was once manageable has grown up with multi-threading. And the kernel’s response acknowledges what has always been true: O_DIRECT cannot be made reliably safe in a concurrent world without fundamental architectural changes.

If you’re writing new code today, consider whether you really need O_DIRECT, or if you’re reaching for it out of habit or cargo-cult performance mythology. The page cache has gotten better. Buffer management has improved. And io_uring provides modern async I/O without the sharp edges.

Sometimes the best way to solve a problem is to stop using the thing that creates it.

This article draws on discussions from the Linux kernel mailing lists and bug trackers, and on the detailed commit messages of the filesystem developers who have grappled with this issue over many years.