A kernel with no synchronization would corrupt data every millisecond. But over-synchronized code can be slower than a single CPU. Linux kernel has evolved from simple spinlocks to sophisticated lock-free data structures read by millions of threads simultaneously.

Consider the routing table. On a busy router running Linux, the kernel consults the routing table for every single packet — millions of times per second across dozens of CPU cores. If reading the routing table required acquiring a mutex, the cores would queue behind each other for every packet lookup. The routing table rarely changes, but reads must be absolutely free from overhead. Writes (adding a new route) happen perhaps once per minute.

The solution the kernel uses is RCU (Read-Copy-Update) — a synchronization primitive where readers pay essentially zero cost, writers are correct but slower, and the whole thing is provably safe. RCU was introduced by Paul McKenney to Linux in 2002 and has since become one of the most widely used synchronization primitives in the entire kernel source. Understanding RCU means understanding the fundamental insight about concurrency that enables modern Linux scalability.

Spinlocks: The Primitive Foundation

A spinlock is the simplest form of mutual exclusion. The thread trying to acquire the lock loops (spins) in place, testing the lock state, until the lock is free.

spinlock_t my_lock;
spin_lock_init(&my_lock);

spin_lock(&my_lock);
    /* critical section — modify shared data */
spin_unlock(&my_lock);

The kernel's spin_lock() uses a single atomic compare-and-swap instruction. On x86-64, this is typically a LOCK XCHG or LOCK CMPXCHG — one CPU instruction that atomically tests and sets the lock word.

Interrupt-Safe Spinlocks

In a single-CPU context: if code holds a spinlock and an interrupt fires, and the interrupt handler also tries to acquire the same spinlock — deadlock. The interrupt handler spins forever while the locked code never gets CPU time to release the lock (because the interrupt preempted it).

unsigned long flags;
spin_lock_irqsave(&my_lock, flags);     // disable local interrupts, save flags
    /* safe even if interrupt handler uses the same lock */
spin_unlock_irqrestore(&my_lock, flags); // restore interrupt state

spin_lock_irqsave() is the standard pattern for spinlocks in device drivers and any code that may be called from interrupt context.

When to Use Spinlocks

Critical section is very short (< 1 microsecond)
Code runs in interrupt context (cannot sleep)
You know the lock will not be contended for long

Spinlock Costs

Wasted CPU cycles: A spinning CPU does nothing useful while waiting
Cache-line bouncing: On multicore, each test of the lock variable transfers the cache line between CPUs' L1 caches — this inter-core traffic can become the bottleneck with many waiters
Non-scalable under high contention: With N CPUs spinning on one spinlock, cache coherency traffic is O(N²)

The kernel's MCS spinlock (queued spinlock, used in Linux 4.2+) eliminates cache-line bouncing by having each waiter spin on a local variable rather than the shared lock word.

Mutexes and Sleeping Locks

When a spinlock's waiting period might be long — a filesystem operation, a network call, waiting for I/O — spinning wastes CPU. The mutex allows the waiting thread to sleep.

struct mutex my_mutex;
mutex_init(&my_mutex);

mutex_lock(&my_mutex);    // sleep if locked (cannot use in interrupt context)
    /* critical section */
mutex_unlock(&my_mutex);

On contention, mutex_lock() calls the scheduler to put the current thread to sleep and run another. When the lock is released, the sleeper is woken and re-scheduled.

Adaptive mutex: Linux mutexes have an optimization: if the lock holder is currently running on another CPU, spin briefly before sleeping (the holder will likely release soon). This avoids the scheduler overhead for short waits while still sleeping for long waits.

Semaphores

struct semaphore generalizes the mutex with a count — N threads may hold it simultaneously (a mutex is a semaphore with count 1). Semaphores are used for producer-consumer patterns and rate limiting in the kernel, but are less common than mutexes for simple mutual exclusion.

Reader-Writer Spinlocks

rwlock_t allows concurrent readers but exclusive writers:

rwlock_t rw;

read_lock(&rw);    // multiple readers can hold simultaneously
    /* read shared data */
read_unlock(&rw);

write_lock(&rw);   // exclusive — waits for all readers to finish
    /* modify shared data */
write_unlock(&rw);

The problem: under heavy read load, writers can starve — readers continuously hold the lock, never leaving a window for the writer. Linux's rwsem (reader-writer semaphore, for sleeping contexts) and rwlock_t (spinning) both address this with writer priority.

Seqlock: Optimistic Reads Without Blocking Writers

For data that is read very frequently but written rarely — and where readers can detect a concurrent write — seqlocks offer a compelling alternative.

seqlock_t jiffies_lock = __SEQLOCK_UNLOCKED(jiffies_lock);

/* Writer: */
write_seqlock(&jiffies_lock);
    jiffies_64 += delta;
write_sequnlock(&jiffies_lock);

/* Reader: */
u64 seq, result;
do {
    seq = read_seqbegin(&jiffies_lock);
    result = jiffies_64;
} while (read_seqretry(&jiffies_lock, seq));

The seqlock maintains an integer sequence counter. Writers increment it before and after their update (odd = write in progress, even = stable). Readers read the counter before and after reading the data. If the counters differ or are odd, a write was in progress — the reader retries.

Key property: Writers never block. Readers retry on contention. This is ideal for jiffies (the kernel's timer tick counter), read billions of times per second on busy systems.

RCU: Read-Copy-Update

RCU is the kernel's most powerful synchronization primitive, enabling true lock-free reads at massive scale.

The Core Idea

For pointer-based data structures (linked lists, hash tables, trees), RCU allows:

Readers: access the structure with zero locking — just prevent preemption
Writers: make a copy of the modified node, update the copy, then atomically swap the pointer

/* Reader (lock-free): */
rcu_read_lock();                    // just disables preemption (or notes quiescent state)
    struct my_data *p = rcu_dereference(global_ptr);  // safe pointer read
    if (p) use(p->value);          // access data through the pointer
rcu_read_unlock();                  // re-enables preemption

/* Writer: */
struct my_data *old = global_ptr;
struct my_data *new = kmalloc(sizeof(*new), GFP_KERNEL);
*new = *old;                        // copy
new->value = updated_value;         // modify copy
rcu_assign_pointer(global_ptr, new); // atomic pointer swap (includes memory barrier)
synchronize_rcu();                   // wait for all current RCU read-side critical sections to finish
kfree(old);                         // safe to free: no reader can reach this anymore

The Grace Period

synchronize_rcu() waits for a grace period — a time during which every CPU has had at least one "quiescent state" (a moment where it could not be in an RCU read-side critical section). After the grace period, the writer is guaranteed that no reader holds a reference to the old pointer.

On a non-preemptible kernel, a quiescent state is any context switch or idle period. synchronize_rcu() simply waits until every CPU has scheduled at least once.

RCU Performance

Operation	Cost
`rcu_read_lock()`	Disable preemption: ~1 instruction
`rcu_dereference()`	Load + memory barrier: ~1-2 instructions
`rcu_read_unlock()`	Re-enable preemption: ~1 instruction
`rcu_assign_pointer()`	Store + memory barrier: ~2 instructions
`synchronize_rcu()`	Wait for all CPUs to pass quiescent state: 1–10ms

Reader overhead is essentially zero. This is why the routing table, the list of network devices, the process credentials struct, and the PID hash table all use RCU — they are read far more often than they are written.

Lock-Free Programming

Beyond RCU, Linux uses atomic operations for simple shared counters and flags:

atomic_t counter = ATOMIC_INIT(0);

atomic_inc(&counter);              // atomic increment
atomic_dec_and_test(&counter);     // decrement, return true if result is 0
atomic_cmpxchg(&counter, old, new); // compare-and-exchange (CAS)

cmpxchg() maps directly to the LOCK CMPXCHG CPU instruction — the foundation of all lock-free algorithms. If the current value equals old, set it to new and return old. Otherwise return current value (indicating failure — retry).

Memory Ordering

On modern CPUs, memory operations can be reordered by the hardware for performance. Two CPUs may see writes in different orders unless memory barriers are explicitly inserted.

smp_wmb();  // write memory barrier: all prior writes visible before subsequent writes
smp_rmb();  // read memory barrier: all prior reads complete before subsequent reads
smp_mb();   // full memory barrier: all memory operations ordered

rcu_assign_pointer() includes an smp_wmb() — ensuring the new pointer's contents are visible to all CPUs before the pointer itself becomes visible. Without this, a reader might dereference the new pointer and see uninitialized data.

lockdep: Kernel Deadlock Detection

Linux's lockdep validator (enabled with CONFIG_PROVE_LOCKING=y) tracks every lock acquisition and release, building a dependency graph. It detects potential deadlock cycles before they actually occur.

# In a debug kernel build, lockdep violations appear in dmesg:
dmesg | grep "possible circular locking dependency"

# lockdep annotates stack traces with the full lock chain:
# WARNING: possible circular locking dependency detected
# task/12345 is trying to acquire lock:
#   ...
# but task/12345 already holds lock:
#   ...
# which lock already depends on the new lock.

lockdep has found hundreds of real kernel deadlocks over its history. It is one reason the Linux kernel has very few deadlock bugs despite millions of lines of concurrent code.

Synchronization Primitive Comparison

Primitive	Context	Sleeping?	Reader Cost	Writer Cost	Scalability	Canonical Use Case
spinlock	Any (incl. interrupt)	No	O(1) spin	O(1) spin	Poor under contention	Short critical sections, IRQ handlers
mutex	Process only	Yes	Sleep + schedule	Sleep + schedule	Good (one holder)	Filesystem operations, driver state
rwlock_t	Any	No (spinning)	O(1) atomic	O(n readers) spin	Good for read-heavy	Rarely held write locks
semaphore	Process only	Yes	Sleep	Sleep	Fair	Count-limited resources
seqlock	Any	Writers no, readers retry	Retry on collision	Never blocks	Excellent	Rarely-written counters (jiffies)
RCU	Read: any; Write: process	Writes: yes (grace period)	~0 (no atomic ops)	Moderate (grace period)	Excellent	Routing tables, process lists, creds

Key Takeaways

The evolution of Linux synchronization tells the story of multicore scaling. Early Linux (2.4) used a single Big Kernel Lock (BKL) that serialized nearly all kernel operations. The BKL was progressively replaced with finer-grained locks, then with RCU where reads dominate.

RCU is the kernel's most important contribution to systems programming literature. Its insight — that freeing memory can be deferred until no reader can possibly hold a reference, without those readers ever acquiring a lock — is profound. The grace period mechanism makes this possible without per-reader counters. The result is a read path that costs essentially nothing, enabling the kernel to scale routing lookups, credential checks, and list traversals linearly with CPU count rather than serializing them.

When diagnosing kernel performance problems, lockdep is the first tool to enable on a debug build. perf lock can profile lock contention in production. And if a workload shows high iowait combined with soft lockup warnings, the first question to ask is whether a write-heavy workload is contending with read-heavy workloads on a shared data structure — the likely fix being a migration from rwlock to RCU.

💬 DiscussionPowered by GitHub Discussions

📱

Get this course's notes on Telegram!

Free cheat sheets, summaries & practice exercises

Get Notes Free →

32 minLesson 14 of 16

Course Contents(16 lessons)

▾

Chapter 1: OS Architecture Internals

OS Kernel Architecture: Monolithic vs Microkernel vs Hybrid25 min

System Calls: The Bridge Between User Space and Kernel28 min

Interrupt Handling: Hardware Interrupts to Kernel Handlers28 min

Chapter 2: Process and Thread Internals

Process Internals: task_struct, PCB, and Kernel Data Structures32 min

Thread Models: POSIX Threads and Kernel Thread Implementation30 min

Context Switching: How the CPU Switches Between Tasks25 min

Chapter 3: Scheduling Internals

CPU Scheduling Deep Dive: Real Algorithms in Production35 min

Linux CFS: The Completely Fair Scheduler Explained30 min

Chapter 4: Memory Management Internals

Linux Memory Management: Zones, Buddy System, Slab Allocator32 min

Demand Paging: Page Fault Handling in Linux35 min

Virtual Memory Areas: mmap, Stack, Heap Internals28 min

Chapter 5: File System Internals

VFS Layer: How Linux Abstracts File Systems30 min

ext4 Internals: Inodes, Extents, and Journaling35 min

Chapter 6: Synchronization and Security

Kernel Synchronization: Spinlocks, Mutexes, RCU32 min

OS Security: Capabilities, Namespaces, cgroups, SELinux28 min

Chapter 7: Final Project

Final Project: OS Internals Analysis and Simulation45 min