The Physics of CPU Latency: Caches, Context Switches & Isolation
Why your code is slow. The physics of CPU Caches (L1/L2/L3), the 4µs cost of a Context Switch, and the `isolcpus` kernel boot parameter.
🎯 What You'll Learn
- Deconstruct the CPU Memory Hierarchy (L1 vs RAM)
- Measure the exact cost of a Context Switch (Syscall Physics)
- Configure Kernel Isolation (`isolcpus`, `nohz_full`)
- Pin processes to specific cores using `taskset`
- Analyze False Sharing (Cache Coherency Physics)
Introduction
In High-Frequency Trading (HFT), we don’t think in milliseconds. We think in Clock Cycles. A 4GHz CPU executes 4 Billion cycles per second. 1 Cycle = 0.25 nanoseconds.
When your code waits for RAM, it wastes 400 cycles. When the OS switches tasks, it wastes 12,000 cycles. This lesson explores the Physics of the CPU—how to keep data hot in L1 cache and how to banish the Kernel Scheduler from your trading cores.
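That arithmetic is worth internalizing. A minimal Python sketch (the helper name `ns_to_cycles` is illustrative, not from any library):

```python
# Convert memory-latency numbers into CPU cycles at a given clock speed.
# Uses the lesson's numbers: 4 GHz means one cycle takes 0.25 ns.

def ns_to_cycles(latency_ns: float, clock_ghz: float = 4.0) -> float:
    """At clock_ghz, each nanosecond of latency costs clock_ghz cycles."""
    return latency_ns * clock_ghz

print(ns_to_cycles(100))  # RAM access: 400 cycles
print(ns_to_cycles(1))    # L1 hit:      4 cycles
```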
The Speed of Light: Cache Physics
Data does not move instantly. It travels through silicon.
| Storage | Latency (ns) | Cycles (4GHz) | Physics Metaphor |
|---|---|---|---|
| L1 Cache | 1 ns | 4 | Picking a pen from your desk. |
| L2 Cache | 4 ns | 16 | Picking a book from the shelf. |
| L3 Cache | 12 ns | 48 | Walking to the next room. |
| RAM | 100 ns | 400 | Walking to the warehouse. |
The Goal: Stick to L1. If you access a random memory address (Linked List), you hit RAM. If you access contiguous memory (Array), the CPU Prefetcher pulls it into L1.
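The access-pattern effect can be sketched even in Python: walk the same data once in sequential order (prefetcher-friendly) and once in shuffled order (pointer-chasing-style). Interpreter overhead dilutes the effect versus C, but on large working sets the gap is still visible. A minimal sketch, standard library only:

```python
# Contiguous, predictable access vs. random "linked-list-style" jumps.
import random
import time

N = 1_000_000
data = list(range(N))

# Sequential walk: the next index is predictable, so hardware can prefetch.
seq_order = list(range(N))

# Shuffled walk: every step jumps to an unpredictable location,
# defeating the prefetcher (like chasing linked-list pointers).
rand_order = seq_order[:]
random.shuffle(rand_order)

def walk(order):
    total = 0
    for i in order:
        total += data[i]
    return total

t0 = time.perf_counter(); s1 = walk(seq_order);  t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); s2 = walk(rand_order); t_rand = time.perf_counter() - t0

assert s1 == s2  # identical work, different access pattern
print(f"sequential: {t_seq:.3f}s  shuffled: {t_rand:.3f}s")
```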
Context Switches: The Invisible Tax
A Context Switch is when the CPU stops your code to run something else (another app, or the Kernel). It is catastrophic for latency.
The Physics:
- Save Registers: CPU cycles are spent saving and restoring register state.
- Pollute L1 Cache: The incoming process evicts your hot data from L1.
- TLB Flush: The Translation Lookaside Buffer (the virtual-to-physical address map) is typically flushed.
Cost: ~2-4 microseconds (8,000-16,000 cycles at 4 GHz). Solution: CPU Pinning.
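Pinning can also be done from inside the process itself. A minimal Linux-only sketch using the standard library (`os.sched_setaffinity`), the programmatic equivalent of `taskset`; it pins to CPU 0 here only so the snippet runs on any machine:

```python
# Pin the calling process to a single core (Linux only).
import os

print("before:", os.sched_getaffinity(0))  # pid 0 = the calling process

os.sched_setaffinity(0, {0})               # restrict to CPU 0
print("after: ", os.sched_getaffinity(0))
```

In production you would pin to an isolated core (e.g. `{2}` after booting with `isolcpus=2,3`).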
Code: CPU Pinning & Isolation
We tell the Linux Scheduler: “Do not touch CPUs 2 and 3.”
1. Boot Parameters (The Nuclear Option)
Edit /etc/default/grub:
```bash
# isolcpus: Remove from scheduler balancing
# nohz_full: Stop scheduling-clock ticks (1000Hz -> 1Hz)
# rcu_nocbs: Move RCU callbacks to housekeeping cores
GRUB_CMDLINE_LINUX="isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3"
```
Note: Run update-grub and reboot.
2. Runtime Pinning (taskset)
CPUs 2 and 3 now sit idle; the scheduler will never place tasks on them, so you must explicitly pin your application there.
```bash
# Launch a python script on CPU 2
taskset -c 2 python3 my_trading_algo.py

# Check affinity (Physics Verification)
pid=$(pgrep -f my_trading_algo)
taskset -p $pid
# output: pid 1234's current affinity mask: 4 (Binary 100 for CPU 2)
```
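The mask is a bitmap: bit N set means CPU N is allowed. A small Python sketch decoding it (the helper `mask_to_cpus` is illustrative, not a library function):

```python
# Decode a taskset-style affinity mask into a set of CPU numbers.
def mask_to_cpus(mask: int) -> set[int]:
    return {i for i in range(mask.bit_length()) if (mask >> i) & 1}

print(mask_to_cpus(0x4))  # {2}     -- matches "affinity mask: 4" above
print(mask_to_cpus(0xC))  # {2, 3}  -- binary 1100
```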
False Sharing: The Concurrency Killer
Imagine two threads on different Cores writing to variables that sit next to each other in RAM. CPUs cache data in 64-byte Cache Lines.
- Thread A writes to Variable X.
- Thread B writes to Variable Y.
- If X and Y land in the same 64-byte line, Core A and Core B fight over ownership of it.
- Physics: The Cache Coherency Protocol (MESI) forces constant L1 invalidations. Code can slow down by 50x.
Fix: Pad your data structures to ensure separation.
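One way to sketch the padding fix from Python is with `ctypes`, which mirrors the `alignas(64)` trick you would use in C/C++ (the struct name and layout here are illustrative):

```python
# Pad a shared structure so two hot counters never share a 64-byte cache line.
import ctypes

CACHE_LINE = 64

class PaddedCounters(ctypes.Structure):
    _fields_ = [
        ("a",      ctypes.c_uint64),                    # written by thread A
        ("_pad_a", ctypes.c_char * (CACHE_LINE - 8)),   # push b to the next line
        ("b",      ctypes.c_uint64),                    # written by thread B
        ("_pad_b", ctypes.c_char * (CACHE_LINE - 8)),
    ]

c = PaddedCounters()
print(ctypes.sizeof(c))           # 128: each counter owns a full line
print(PaddedCounters.b.offset)    # 64: b starts on its own cache line
```

The cost is a little wasted memory; the payoff is that writes by Core A never invalidate Core B's line.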
Practice Exercises
Exercise 1: The Context Switch Cost (Beginner)
Task: Use perf to measure context switches.
Action: Run `perf stat -e context-switches ./my_script`.
Goal: Drive this number toward zero.
Exercise 2: Cache Miss Profiling (Intermediate)
Task: Run perf stat -e L1-dcache-load-misses ./my_script.
Action: Change a Linked List to an Array. Watch misses drop.
Exercise 3: Full Isolation (Advanced)
Task: Isolate CPU 3 via GRUB.
Action: Run a busy-loop on CPU 3.
Observation: Use htop. See that CPU 3 stays at 100% usage, but the Load Average typically doesn’t spike because regular scheduler tasks aren’t fighting for it.
Knowledge Check
- How many cycles does a RAM access cost?
- What is a Cache Line size?
- What does `isolcpus` do?
- Why is a Linked List slower than an Array?
- What is False Sharing?
Answers
- ~400 cycles.
- 64 Bytes.
- Removes a CPU from the kernel scheduler’s balancing algorithms.
- Pointer chasing. Arrays are contiguous and prefetch-friendly; Linked Lists are random memory jumps (RAM hits).
- Two cores fighting over the same Cache Line due to proximity of variables.
Summary
- L1 Cache: The only fast storage.
- Context Switch: A 15,000 cycle penalty.
- isolcpus: Evicting the Scheduler.
- False Sharing: The invisible concurrency bug.
Kubernetes Note: Use the kubelet's `static` CPU Manager policy; containers in the Guaranteed QoS class with integer CPU requests receive exclusive cores. When hyperthreading is enabled, request cores in pairs so that both SMT siblings of a physical core are allocated to the same container.
🏛️ Advanced: For the complete sovereign systems approach to deterministic execution, see The Sovereign Architecture.
Pro Version: For production-grade implementation details, see the full research article: cpu-optimization-linux-latency