AiTechWorlds
AiTechWorlds
A kernel with no synchronization would corrupt data every millisecond. But over-synchronized code can be slower than a single CPU. Linux kernel has evolved from simple spinlocks to sophisticated lock-free data structures read by millions of threads simultaneously.
Consider the routing table. On a busy router running Linux, the kernel consults the routing table for every single packet — millions of times per second across dozens of CPU cores. If reading the routing table required acquiring a mutex, the cores would queue behind each other for every packet lookup. The routing table rarely changes, but reads must be absolutely free from overhead. Writes (adding a new route) happen perhaps once per minute.
The solution the kernel uses is RCU (Read-Copy-Update) — a synchronization primitive where readers pay essentially zero cost, writers are correct but slower, and the whole thing is provably safe. RCU was introduced by Paul McKenney to Linux in 2002 and has since become one of the most widely used synchronization primitives in the entire kernel source. Understanding RCU means understanding the fundamental insight about concurrency that enables modern Linux scalability.
A spinlock is the simplest form of mutual exclusion. The thread trying to acquire the lock loops (spins) in place, testing the lock state, until the lock is free.
spinlock_t my_lock;
spin_lock_init(&my_lock);
spin_lock(&my_lock);
/* critical section — modify shared data */
spin_unlock(&my_lock);
The kernel's spin_lock() uses a single atomic compare-and-swap instruction. On x86-64, this is typically a LOCK XCHG or LOCK CMPXCHG — one CPU instruction that atomically tests and sets the lock word.
In a single-CPU context: if code holds a spinlock and an interrupt fires, and the interrupt handler also tries to acquire the same spinlock — deadlock. The interrupt handler spins forever while the locked code never gets CPU time to release the lock (because the interrupt preempted it).
unsigned long flags;
spin_lock_irqsave(&my_lock, flags); // disable local interrupts, save flags
/* safe even if interrupt handler uses the same lock */
spin_unlock_irqrestore(&my_lock, flags); // restore interrupt state
spin_lock_irqsave() is the standard pattern for spinlocks in device drivers and any code that may be called from interrupt context.
test of the lock variable transfers the cache line between CPUs' L1 caches — this inter-core traffic can become the bottleneck with many waitersThe kernel's MCS spinlock (queued spinlock, used in Linux 4.2+) eliminates cache-line bouncing by having each waiter spin on a local variable rather than the shared lock word.
When a spinlock's waiting period might be long — a filesystem operation, a network call, waiting for I/O — spinning wastes CPU. The mutex allows the waiting thread to sleep.
struct mutex my_mutex;
mutex_init(&my_mutex);
mutex_lock(&my_mutex); // sleep if locked (cannot use in interrupt context)
/* critical section */
mutex_unlock(&my_mutex);
On contention, mutex_lock() calls the scheduler to put the current thread to sleep and run another. When the lock is released, the sleeper is woken and re-scheduled.
Adaptive mutex: Linux mutexes have an optimization: if the lock holder is currently running on another CPU, spin briefly before sleeping (the holder will likely release soon). This avoids the scheduler overhead for short waits while still sleeping for long waits.
struct semaphore generalizes the mutex with a count — N threads may hold it simultaneously (a mutex is a semaphore with count 1). Semaphores are used for producer-consumer patterns and rate limiting in the kernel, but are less common than mutexes for simple mutual exclusion.
rwlock_t allows concurrent readers but exclusive writers:
rwlock_t rw;
read_lock(&rw); // multiple readers can hold simultaneously
/* read shared data */
read_unlock(&rw);
write_lock(&rw); // exclusive — waits for all readers to finish
/* modify shared data */
write_unlock(&rw);
The problem: under heavy read load, writers can starve — readers continuously hold the lock, never leaving a window for the writer. Linux's rwsem (reader-writer semaphore, for sleeping contexts) and rwlock_t (spinning) both address this with writer priority.
For data that is read very frequently but written rarely — and where readers can detect a concurrent write — seqlocks offer a compelling alternative.
seqlock_t jiffies_lock = __SEQLOCK_UNLOCKED(jiffies_lock);
/* Writer: */
write_seqlock(&jiffies_lock);
jiffies_64 += delta;
write_sequnlock(&jiffies_lock);
/* Reader: */
u64 seq, result;
do {
seq = read_seqbegin(&jiffies_lock);
result = jiffies_64;
} while (read_seqretry(&jiffies_lock, seq));
The seqlock maintains an integer sequence counter. Writers increment it before and after their update (odd = write in progress, even = stable). Readers read the counter before and after reading the data. If the counters differ or are odd, a write was in progress — the reader retries.
Key property: Writers never block. Readers retry on contention. This is ideal for jiffies (the kernel's timer tick counter), read billions of times per second on busy systems.
RCU is the kernel's most powerful synchronization primitive, enabling true lock-free reads at massive scale.
For pointer-based data structures (linked lists, hash tables, trees), RCU allows:
/* Reader (lock-free): */
rcu_read_lock(); // just disables preemption (or notes quiescent state)
struct my_data *p = rcu_dereference(global_ptr); // safe pointer read
if (p) use(p->value); // access data through the pointer
rcu_read_unlock(); // re-enables preemption
/* Writer: */
struct my_data *old = global_ptr;
struct my_data *new = kmalloc(sizeof(*new), GFP_KERNEL);
*new = *old; // copy
new->value = updated_value; // modify copy
rcu_assign_pointer(global_ptr, new); // atomic pointer swap (includes memory barrier)
synchronize_rcu(); // wait for all current RCU read-side critical sections to finish
kfree(old); // safe to free: no reader can reach this anymore
synchronize_rcu() waits for a grace period — a time during which every CPU has had at least one "quiescent state" (a moment where it could not be in an RCU read-side critical section). After the grace period, the writer is guaranteed that no reader holds a reference to the old pointer.
On a non-preemptible kernel, a quiescent state is any context switch or idle period. synchronize_rcu() simply waits until every CPU has scheduled at least once.
| Operation | Cost |
|---|---|
rcu_read_lock() | Disable preemption: ~1 instruction |
rcu_dereference() | Load + memory barrier: ~1-2 instructions |
rcu_read_unlock() | Re-enable preemption: ~1 instruction |
rcu_assign_pointer() | Store + memory barrier: ~2 instructions |
synchronize_rcu() | Wait for all CPUs to pass quiescent state: 1–10ms |
Reader overhead is essentially zero. This is why the routing table, the list of network devices, the process credentials struct, and the PID hash table all use RCU — they are read far more often than they are written.
Beyond RCU, Linux uses atomic operations for simple shared counters and flags:
atomic_t counter = ATOMIC_INIT(0);
atomic_inc(&counter); // atomic increment
atomic_dec_and_test(&counter); // decrement, return true if result is 0
atomic_cmpxchg(&counter, old, new); // compare-and-exchange (CAS)
cmpxchg() maps directly to the LOCK CMPXCHG CPU instruction — the foundation of all lock-free algorithms. If the current value equals old, set it to new and return old. Otherwise return current value (indicating failure — retry).
On modern CPUs, memory operations can be reordered by the hardware for performance. Two CPUs may see writes in different orders unless memory barriers are explicitly inserted.
smp_wmb(); // write memory barrier: all prior writes visible before subsequent writes
smp_rmb(); // read memory barrier: all prior reads complete before subsequent reads
smp_mb(); // full memory barrier: all memory operations ordered
rcu_assign_pointer() includes an smp_wmb() — ensuring the new pointer's contents are visible to all CPUs before the pointer itself becomes visible. Without this, a reader might dereference the new pointer and see uninitialized data.
Linux's lockdep validator (enabled with CONFIG_PROVE_LOCKING=y) tracks every lock acquisition and release, building a dependency graph. It detects potential deadlock cycles before they actually occur.
# In a debug kernel build, lockdep violations appear in dmesg:
dmesg | grep "possible circular locking dependency"
# lockdep annotates stack traces with the full lock chain:
# WARNING: possible circular locking dependency detected
# task/12345 is trying to acquire lock:
# ...
# but task/12345 already holds lock:
# ...
# which lock already depends on the new lock.
lockdep has found hundreds of real kernel deadlocks over its history. It is one reason the Linux kernel has very few deadlock bugs despite millions of lines of concurrent code.
| Primitive | Context | Sleeping? | Reader Cost | Writer Cost | Scalability | Canonical Use Case |
|---|---|---|---|---|---|---|
| spinlock | Any (incl. interrupt) | No | O(1) spin | O(1) spin | Poor under contention | Short critical sections, IRQ handlers |
| mutex | Process only | Yes | Sleep + schedule | Sleep + schedule | Good (one holder) | Filesystem operations, driver state |
| rwlock_t | Any | No (spinning) | O(1) atomic | O(n readers) spin | Good for read-heavy | Rarely held write locks |
| semaphore | Process only | Yes | Sleep | Sleep | Fair | Count-limited resources |
| seqlock | Any | Writers no, readers retry | Retry on collision | Never blocks | Excellent | Rarely-written counters (jiffies) |
| RCU | Read: any; Write: process | Writes: yes (grace period) | ~0 (no atomic ops) | Moderate (grace period) | Excellent | Routing tables, process lists, creds |
The evolution of Linux synchronization tells the story of multicore scaling. Early Linux (2.4) used a single Big Kernel Lock (BKL) that serialized nearly all kernel operations. The BKL was progressively replaced with finer-grained locks, then with RCU where reads dominate.
RCU is the kernel's most important contribution to systems programming literature. Its insight — that freeing memory can be deferred until no reader can possibly hold a reference, without those readers ever acquiring a lock — is profound. The grace period mechanism makes this possible without per-reader counters. The result is a read path that costs essentially nothing, enabling the kernel to scale routing lookups, credential checks, and list traversals linearly with CPU count rather than serializing them.
When diagnosing kernel performance problems, lockdep is the first tool to enable on a debug build. perf lock can profile lock contention in production. And if a workload shows high iowait combined with soft lockup warnings, the first question to ask is whether a write-heavy workload is contending with read-heavy workloads on a shared data structure — the likely fix being a migration from rwlock to RCU.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises