The Disruptor Pattern: Mechanical Sympathy in Rust
Implementing lock-free ring buffers with cache-line padding to achieve nanosecond-level inter-thread latency.
The LMAX Disruptor is legendary in the HFT space. Developed at LMAX Exchange and open-sourced in 2011, it changed how we think about inter-thread communication.
Most developers reach for a standard queue:
- Python: `multiprocessing.Queue`
- Go: `chan`
- Rust: `std::sync::mpsc`
These are robust, but they use locks (mutexes) or complex atomic compare-and-swap (CAS) loops. At 100k messages/sec, they are fine. At 10 million/sec, they collapse.
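For contrast, here is a minimal `std::sync::mpsc` baseline (the `channel_sum` helper is illustrative, not from any library). It works perfectly well, but every `send`/`recv` pays its own per-message synchronization cost:

```rust
use std::sync::mpsc;
use std::thread;

// Sum 0..n sent through a standard channel: correct and robust,
// but each message crosses the channel's internal synchronization.
fn channel_sum(n: u64) -> u64 {
    let (tx, rx) = mpsc::channel();
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(i).unwrap();
        }
        // tx is dropped here, closing the channel so rx.iter() ends
    });
    let sum = rx.iter().sum();
    producer.join().unwrap();
    sum
}

fn main() {
    println!("{}", channel_sum(1_000)); // 0 + 1 + ... + 999 = 499500
}
```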
Mechanical Sympathy
Martin Thompson coined the term “Mechanical Sympathy”: you don’t need to be an engineer to drive a race car, but you do need to understand how the engine works to get the best performance out of it.
The Disruptor is built on understanding hardware, specifically CPU Caches.
The Cache Coherency Problem
Modern CPUs have L1, L2, and L3 caches.
- L1: ~1-2ns access.
- RAM: ~60-100ns access.
When Core A writes to memory and Core B reads it, they must coordinate via the MESI protocol. The trap: if Core A and Core B modify different variables that happen to sit on the same cache line (a 64-byte chunk of memory), they still fight. Core A’s write invalidates Core B’s copy of the line, and Core B must re-fetch it (from L3 or RAM). This “ping-pong” effect is called False Sharing.
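The layout fix can be verified without a benchmark. This sketch (the `Shared`, `Padded`, and `Separate` types are hypothetical) shows two atomics landing on one cache line, versus an alignment attribute forcing each onto its own line:

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicU64;

// Two counters that share a cache line: writes from different cores
// to either field invalidate the same 64-byte line.
struct Shared {
    a: AtomicU64,
    b: AtomicU64, // sits within the same 64 bytes as `a`
}

// Forcing 128-byte alignment rounds the size up to 128 bytes too,
// so each wrapped counter occupies its own cache line.
#[repr(align(128))]
struct Padded(AtomicU64);

struct Separate {
    a: Padded,
    b: Padded,
}

fn main() {
    assert!(size_of::<Shared>() <= 64);     // both counters fit in one line
    assert_eq!(align_of::<Padded>(), 128);
    assert_eq!(size_of::<Separate>(), 256); // one full line (or pair) each
}
```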
The Ring Buffer Solution
The Disruptor uses a fixed-size, pre-allocated array (Ring Buffer).
- Pre-allocation: No `malloc` during runtime. No GC pressure.
- Single Writer: Simplifies concurrency (the producer side is wait-free).
- Sequence Numbers: Plain incrementing integers (`i64`) track positions.
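The sequence-to-slot mapping fits in a few lines (the `slot_index` helper is illustrative, not from the Disruptor API). The sequence itself never wraps in practice: an `i64` at 10 million messages/sec lasts roughly 29,000 years. The power-of-two capacity is what lets a bitwise AND replace the modulo:

```rust
// Map an ever-increasing sequence number onto a ring slot.
// Requires capacity to be a power of two so `& (capacity - 1)`
// is equivalent to `% capacity`.
fn slot_index(seq: i64, capacity: usize) -> usize {
    debug_assert!(capacity.is_power_of_two());
    (seq as usize) & (capacity - 1)
}

fn main() {
    let cap = 8;
    assert_eq!(slot_index(0, cap), 0);
    assert_eq!(slot_index(7, cap), 7);
    assert_eq!(slot_index(8, cap), 0);   // wraps around the ring
    assert_eq!(slot_index(105, cap), 1); // 105 % 8 == 1
}
```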
Implementation in Rust
We need to align our cursors to 64 bytes (128 bytes on some architectures, such as Apple Silicon, to be safe) to prevent False Sharing.
```rust
use std::sync::atomic::AtomicI64;

// A zero-sized marker with 128-byte alignment: placing one before a
// field forces that field onto a fresh cache-line boundary.
#[repr(align(128))]
struct CachePad;

// repr(C) is required here: without it the compiler is free to
// reorder fields, which would defeat the padding.
#[repr(C)]
struct RingBuffer<T> {
    // The data itself
    buffer: Vec<T>,
    mask: i64,
    // Padding ensures producer_cursor starts its own cache line
    _pad1: CachePad,
    producer_cursor: AtomicI64,
    // Padding ensures consumer_cursor starts its own cache line
    _pad2: CachePad,
    consumer_cursor: AtomicI64,
}
```
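A constructor sketch, repeating the definitions so the snippet compiles on its own. The `with_capacity` name and the `-1` “nothing published yet” sentinel are my assumptions, not the Disruptor’s API:

```rust
use std::sync::atomic::AtomicI64;

#[repr(align(128))]
struct CachePad;

#[repr(C)] // keep fields in declaration order so the pads work
struct RingBuffer<T> {
    buffer: Vec<T>,
    mask: i64,
    _pad1: CachePad,
    producer_cursor: AtomicI64,
    _pad2: CachePad,
    consumer_cursor: AtomicI64,
}

impl<T: Default> RingBuffer<T> {
    // All allocation happens here, exactly once; capacity must be a
    // power of two so `seq & mask` can replace the modulo.
    fn with_capacity(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        let mut buffer = Vec::with_capacity(capacity);
        buffer.resize_with(capacity, T::default);
        RingBuffer {
            buffer,
            mask: (capacity - 1) as i64,
            _pad1: CachePad,
            // -1 = nothing published yet; the first event is sequence 0
            producer_cursor: AtomicI64::new(-1),
            _pad2: CachePad,
            consumer_cursor: AtomicI64::new(-1),
        }
    }
}

fn main() {
    let rb: RingBuffer<u64> = RingBuffer::with_capacity(1024);
    assert_eq!(rb.mask, 1023);
    assert_eq!(rb.buffer.len(), 1024);
}
```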
The “Batching” Effect
The secret sauce of the Disruptor is Batching.
In a standard queue, if the consumer is slow, the queue fills up. The consumer pops one item, processes it, pops the next… paying for the lock (or a contended CAS) every single time.
In the Disruptor, the Consumer checks the Producer’s cursor:
- Consumer is at 100.
- Producer is at 105.
The Consumer says: “Oh, I see 5 items are ready! I will process 101, 102, 103, 104, 105 in a tight loop, and ONLY THEN update my cursor to 105.”
This amortizes the cost of the application-level “commit” (the atomic write) over multiple messages. It actually gets more efficient under load.
```rust
// Simplified Consumer Logic
loop {
    let claimed = self.consumer_cursor.load(Ordering::Relaxed);
    let published = self.producer_cursor.load(Ordering::Acquire);

    if published > claimed {
        // BATCH PROCESSING
        for seq in (claimed + 1)..=published {
            let item = &self.buffer[(seq & self.mask) as usize];
            process(item);
        }
        // Single atomic store for the whole batch
        self.consumer_cursor.store(published, Ordering::Release);
    } else {
        std::thread::yield_now(); // Backoff strategy
    }
}
```
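Putting both sides together, here is a minimal runnable SPSC sketch. To stay in safe Rust it stores `i64` payloads in atomic slots; a real Disruptor stores arbitrary `T` behind `UnsafeCell`. The names (`Ring`, `publish`) are illustrative:

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;
use std::thread;

struct Ring {
    slots: Vec<AtomicI64>,
    mask: i64,
    producer: AtomicI64,
    consumer: AtomicI64,
}

impl Ring {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Ring {
            slots: (0..capacity).map(|_| AtomicI64::new(0)).collect(),
            mask: (capacity - 1) as i64,
            producer: AtomicI64::new(-1),
            consumer: AtomicI64::new(-1),
        }
    }

    // Single writer: only this thread ever advances producer, so a
    // plain load + Release store is enough on this side.
    fn publish(&self, value: i64) {
        let seq = self.producer.load(Ordering::Relaxed) + 1;
        // Backpressure: never lap the consumer (mask + 1 slots total)
        while seq - self.consumer.load(Ordering::Acquire) > self.mask + 1 {
            thread::yield_now();
        }
        self.slots[(seq & self.mask) as usize].store(value, Ordering::Relaxed);
        // Release makes the slot write visible before the cursor moves
        self.producer.store(seq, Ordering::Release);
    }
}

fn main() {
    let ring = Arc::new(Ring::new(8));
    let r2 = Arc::clone(&ring);

    let consumer = thread::spawn(move || {
        let (mut sum, mut claimed) = (0i64, -1i64);
        while claimed < 999 {
            let published = r2.producer.load(Ordering::Acquire);
            if published > claimed {
                // Batch: drain everything published since last commit
                for seq in (claimed + 1)..=published {
                    sum += r2.slots[(seq & r2.mask) as usize].load(Ordering::Relaxed);
                }
                claimed = published;
                r2.consumer.store(published, Ordering::Release);
            } else {
                thread::yield_now();
            }
        }
        sum
    });

    for v in 0..1000 {
        ring.publish(v);
    }
    let sum = consumer.join().unwrap();
    assert_eq!(sum, 499_500); // 0 + 1 + ... + 999
}
```

Note that the consumer commits once per batch, exactly as described above: the bigger the backlog, the fewer atomic stores per message.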
Production Architecture: The Event Sourcing Core
In the ZeroCopy Sentinel, we use the Disruptor as the backbone of the “Sovereign Node”:
- Ingress Thread (Producer 1): Receives UDP packets, writes to the ring buffer.
- Journaler Thread (Consumer 1): Writes events to disk (Persistence).
- Replication Thread (Consumer 2): Sends events to HA peer.
- Logic Thread (Consumer 3): The Matching Engine.
Crucially, Consumers 1, 2, and 3 can run in parallel. The Logic Thread doesn’t wait for disk I/O. It just reads the same memory.
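With several independent consumers, the producer must gate on the slowest one before reusing a slot: the minimum of all consumer cursors. A sketch, with a hypothetical `gating_sequence` helper:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

// The producer may only publish up to gating_sequence + capacity;
// anything past that would overwrite an event some consumer
// (journaler, replicator, or matching engine) has not yet seen.
fn gating_sequence(consumers: &[AtomicI64]) -> i64 {
    consumers
        .iter()
        .map(|c| c.load(Ordering::Acquire))
        .min()
        .expect("at least one consumer")
}

fn main() {
    let cursors = [
        AtomicI64::new(105), // journaler
        AtomicI64::new(98),  // replicator (slowest)
        AtomicI64::new(103), // matching engine
    ];
    assert_eq!(gating_sequence(&cursors), 98);
}
```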
Summary
- Lock-Free: Use Atomic counters, not Mutexes.
- Cache Friendly: Padding prevents False Sharing.
- Pre-Allocated: No GC, no massive latency spikes.
- Batching: System gets faster as load increases.
Now that we have efficient memory layout and inter-thread comms, let’s fix the biggest bottleneck remaining: The Operating System Network Stack.