Right now, dozens of processes are "running" on your machine. Your browser, your terminal emulator, your music player, a half-dozen system daemons — all appearing to execute at once. On a quad-core machine, at most four are actually executing at any instant. The rest are waiting. The scheduler switches between them thousands of times per second, and each transition must be completely invisible to the interrupted program.

This is the context switch: the operation of freezing one program's execution state in time, restoring another's, and handing the CPU to the new program at the exact instruction where it was last interrupted. Done correctly, the resumed program has no idea it was ever paused. It picks up exactly where it left off, with the same register values, the same stack, the same view of memory.

Done poorly — or done too frequently — context switching becomes a significant performance cost. A context switch between two processes with large, hot working sets evicts cache lines that took milliseconds to accumulate. Understanding exactly what must be saved and restored, and exactly what it costs, is essential knowledge for performance-conscious systems work.

What Must Be Saved: The Complete CPU State

Every CPU core maintains a collection of state that defines the current execution context. All of it must be preserved across a context switch.

General-Purpose Registers

On x86-64, there are 16 general-purpose registers: rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8–r15. Each is 64 bits (8 bytes). Total: 128 bytes.

More critically:

rip (instruction pointer): the address of the next instruction to execute — the most important register
rflags: condition codes (zero flag, carry flag, sign flag), interrupt enable flag
rsp: stack pointer, the address of the top of the current stack

If rip and rsp are not preserved exactly, the resumed process will execute the wrong code or corrupt its stack. Either failure ends in a crash or silent data corruption.

Segment Registers and TLS

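The fs and gs base registers hold per-thread and per-CPU base addresses. On x86-64 Linux, user space uses fs as the base of the thread's TLS block, where errno, the thread descriptor returned by pthread_self(), and other per-thread data live; the kernel uses gs for its own per-CPU data. The user-space values are kept in the task_struct's thread.fsbase and thread.gsbase fields and reloaded on every context switch.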

Floating Point and SIMD State

x87 FPU, SSE (XMM registers, 16 × 128-bit), AVX (YMM registers, 16 × 256-bit), and AVX-512 (ZMM registers, 32 × 512-bit) must all be saved if the task has used them.

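With AVX-512 enabled, the full XSAVE area is about 2,688 bytes per task. Saving and restoring all of it on every context switch for every process, including processes that never touch floating point or SIMD, would be wasteful.

Older Linux kernels avoided this cost with lazy FPU switching: the CR0.TS (Task Switched) bit was set on every context switch, so the first FPU instruction executed by the new task raised a Device Not Available (#NM) exception. The handler saved the previous owner's FPU state, restored the new task's state, cleared CR0.TS, and returned; a task that never used the FPU never paid the cost. Current kernels no longer do this. They switch FPU state eagerly, relying on XSAVEOPT/XSAVES to skip components that are unused or unmodified, and they defer the restore until the task returns to user space. Lazy switching was dropped because nearly every user-space task touches SSE registers anyway, so the trap cost outweighed the savings; it was later also shown to leak register state through the LazyFP side channel (CVE-2018-3665).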
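To put the register widths listed above into bytes, here is a small sketch that computes the raw size of each SIMD register file. It counts only the registers themselves; the save area the kernel keeps per task is larger than the ZMM total because it also holds x87 state, a header, and the AVX-512 mask registers. The helper name `register_file_bytes` is made up for this illustration.

```python
# Raw register-file sizes for the x86-64 SIMD extensions named above.
# This counts registers only, not the layout of the kernel's save area.

def register_file_bytes(count: int, width_bits: int) -> int:
    """Bytes needed to hold `count` registers of `width_bits` bits each."""
    return count * width_bits // 8

SIMD_STATE = {
    "SSE (XMM)": register_file_bytes(16, 128),
    "AVX (YMM)": register_file_bytes(16, 256),
    "AVX-512 (ZMM)": register_file_bytes(32, 512),
}

for name, size in SIMD_STATE.items():
    print(f"{name:14s} {size:5d} bytes")
```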

The Kernel Stack

Every task has a dedicated kernel stack: 16 KB on current x86-64 kernels (8 KB before Linux 3.15), allocated from kernel memory. When a process makes a system call or takes an interrupt, execution moves onto this kernel stack. The kernel stack holds:

The pt_regs structure at the top: saved user-space registers
Call frames for kernel functions invoked during this syscall/interrupt
Local variables for kernel functions

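The top of the current task's kernel stack is published per CPU so that kernel entry code can find it: in the TSS sp0 field, which the CPU loads when an interrupt arrives from user mode, and in the per-CPU variable cpu_current_top_of_stack, which the syscall entry path reads. The exact arrangement varies by kernel version. The IST (Interrupt Stack Table) is a separate mechanism: it provides dedicated per-CPU stacks for exceptions such as NMI, double fault, and machine check, and it does not hold the task's kernel stack pointer.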

Virtual Memory State: The TLB Problem

The most expensive part of a context switch between two different processes is the virtual address space switch.

Every process has its own page tables. The CPU's CR3 register holds the physical address of the current process's top-level page table (PML4 on x86-64). Switching processes requires loading CR3 with the new process's page table.

Without PCID, loading CR3 flushes every non-global TLB (Translation Lookaside Buffer) entry. The TLB is the CPU's cache of virtual-to-physical address translations, so after the flush each first access to a page by the new process misses the TLB and must walk the page table: up to four memory accesses per translation with 4-level paging. With a 100-entry TLB and 4 KB pages, every entry has to be refilled this way before the working set is mapped again.

PCID (Process-Context Identifiers): on modern x86-64 CPUs and Linux 4.14+, the kernel uses PCID to tag TLB entries with a 12-bit address-space identifier. With PCID enabled, loading CR3 leaves entries tagged with other PCIDs in place, and if the no-flush bit (bit 63 of the value written to CR3) is set, entries for the incoming PCID are preserved as well. The TLB can therefore hold entries from multiple processes simultaneously.
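The numbers in this paragraph can be worked through directly. The sketch below computes how much memory a TLB of a given size can map ("TLB reach") and the worst-case number of page-table memory accesses needed to refill it after a full flush. The function names are made up for this illustration, and the model ignores paging-structure caches and huge pages, both of which reduce the real cost.

```python
# Back-of-the-envelope model of TLB reach and refill cost after a flush.

PAGE_SIZE = 4 * 1024  # 4 KB pages, as in the text

def tlb_reach_bytes(entries: int, page_size: int = PAGE_SIZE) -> int:
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_size

def refill_accesses(entries: int, levels: int = 4) -> int:
    """Worst-case page-table reads to repopulate every entry."""
    return entries * levels

reach = tlb_reach_bytes(100)
walks = refill_accesses(100)
print(f"reach: {reach // 1024} KB, worst-case refill: {walks} memory accesses")
```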

Linux maintains a small PCID cache per CPU. When switching to a process whose PCID is cached on this CPU, the TLB flush is skipped. This is a significant optimization for workloads with many context switches between a small set of processes.
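The effect of a small per-CPU PCID cache can be illustrated with a toy model. This is a deliberately simplified sketch, not the kernel's algorithm: the class name, the slot count, and the round-robin eviction are assumptions for illustration, and it ignores the bookkeeping the kernel needs to invalidate stale entries after page-table changes.

```python
# Toy model: a CPU remembers a few address spaces; switching to a
# remembered one needs no TLB flush, switching to a new one does.

class PcidCacheModel:
    def __init__(self, slots: int = 6):
        self.slots = [None] * slots   # address-space IDs cached on this CPU
        self.next_victim = 0          # round-robin eviction pointer
        self.flushes = 0

    def switch_to(self, mm_id: str) -> bool:
        """Switch to address space `mm_id`; return True if a flush was needed."""
        if mm_id in self.slots:
            return False              # warm PCID: keep existing TLB entries
        self.slots[self.next_victim] = mm_id
        self.next_victim = (self.next_victim + 1) % len(self.slots)
        self.flushes += 1
        return True

cache = PcidCacheModel(slots=2)
history = [cache.switch_to(mm) for mm in ["A", "B", "A", "B", "C", "A"]]
print(history, "flushes:", cache.flushes)
```

With only two slots, bouncing between A and B is free after the first two switches, while introducing a third address space forces evictions and flushes again.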

The Linux Context Switch: `__switch_to()` in Detail

The architecture-specific part of the Linux context switch lives in arch/x86/kernel/process_64.c. The scheduler calls context_switch() (in kernel/sched/core.c), which invokes the switch_to() macro and, through it, __switch_to().

The full sequence:

1. Scheduler selects the next task: CFS picks the leftmost node of its red-black tree.
2. context_switch() is called with prev and next as arguments.
3. Address space switch: if next belongs to a different process (not just another thread of the same process), switch_mm_irqs_off() loads the new process's CR3, with or without a TLB flush depending on PCID.
4. The switch_to() macro performs the actual stack swap (via __switch_to_asm in recent kernels):
   - Pushes prev's callee-saved registers (rbp, rbx, r12–r15; older kernels pushed rbp and rflags) onto prev's kernel stack
   - Saves prev's rsp into prev->thread.sp
   - Loads next's rsp from next->thread.sp; the CPU is now running on next's kernel stack
   - Jumps to __switch_to()
5. __switch_to() swaps the remaining per-thread state:
   - Saves prev's FPU/SIMD state
   - Saves prev's fsbase (TLS pointer) to prev->thread.fsbase and loads next's fsbase into the FS_BASE MSR
   - Updates the per-CPU current pointer so that it refers to next
   - Updates the per-CPU top-of-stack value (TSS sp0 or cpu_current_top_of_stack, depending on kernel version) so the next kernel entry lands on next's kernel stack
   - Saves/restores debug registers if either task uses them (ptrace breakpoints)
6. Return: the ret at the end of __switch_to() pops its return address from next's kernel stack, so next resumes at the point where it last called into the scheduler.

After step 6, code that references current sees the new task. The previous task is no longer executing — it will resume the next time the scheduler picks it.
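The essence of the sequence above, saving one task's registers and loading another's so that each resumes exactly where it stopped, can be sketched with a toy model. Everything here is hypothetical illustration: `Task`, `ToyCpu`, and the three-register file stand in for task_struct, a CPU core, and the real register set, and nothing about stacks, address spaces, or FPU state is modelled.

```python
# Toy context switch: save the outgoing task's registers, load the
# incoming task's, and update the "current" pointer.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Task:
    name: str
    saved: dict = field(default_factory=dict)  # plays the role of thread state

class ToyCpu:
    def __init__(self):
        self.regs = {"rip": 0, "rsp": 0, "rbx": 0}
        self.current: Optional[Task] = None

    def switch_to(self, prev: Optional[Task], nxt: Task) -> None:
        if prev is not None:
            prev.saved = dict(self.regs)   # save outgoing context
        self.regs = dict(nxt.saved)        # restore incoming context
        self.current = nxt                 # update the current pointer

a = Task("A", {"rip": 0x1000, "rsp": 0x7000, "rbx": 1})
b = Task("B", {"rip": 0x2000, "rsp": 0x8000, "rbx": 2})

cpu = ToyCpu()
cpu.switch_to(None, a)
cpu.regs["rip"] += 4      # A executes one instruction...
cpu.regs["rbx"] = 42      # ...and changes a register
cpu.switch_to(a, b)       # A is preempted, B runs
cpu.regs["rip"] += 8
cpu.switch_to(b, a)       # A resumes with its state intact
print(cpu.current.name, {k: hex(v) for k, v in cpu.regs.items()})
```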

Context Switch Cost Analysis

Direct costs (rough per-switch estimates; perf stat -e context-switches counts how many switches occur, not what each one costs):

Register save/restore: ~50 CPU cycles
Stack switch: ~10 cycles
TLB flush (without PCID): 500–2,000 cycles (depending on TLB size and CPU)
TLB flush (with PCID, warm entry): ~0 cycles overhead
Total without FPU: ~1–5 µs on modern hardware

Indirect costs (harder to measure, often dominant):

L1/L2 cache pollution: the new task's working set evicts the previous task's hot cache lines. After a context switch, the new task may experience 100–200% more cache misses for several milliseconds until its working set is warm.
Branch predictor state: the branch predictor's history tables are built for the previous task's code paths. The new task starts with cold, wrong predictions.
iTLB pollution: instruction TLB entries for the previous task are evicted

On a server with 100,000 context switches per second (moderate load), the cache pollution effect can account for 10–30% of total CPU time consumed — even if each individual switch takes only 2 µs.
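It helps to see what the direct cost alone adds up to at this switch rate. The sketch below multiplies the switch rate by the per-switch cost quoted above; the function name and the four-core example are assumptions for illustration, and the model deliberately excludes the indirect cache and TLB effects.

```python
# Direct context-switch overhead as a fraction of CPU time.

def direct_overhead_fraction(switches_per_sec: float, switch_cost_us: float) -> float:
    """CPU-seconds spent switching per wall-clock second (one core = 1.0)."""
    return switches_per_sec * switch_cost_us / 1_000_000

one_core = direct_overhead_fraction(100_000, 2.0)
per_core_on_quad = one_core / 4   # hypothetical: load spread over four cores

print(f"direct cost: {one_core:.0%} of one core, "
      f"{per_core_on_quad:.0%} per core on a quad-core machine")
```

At 100,000 switches per second and 2 µs each, the direct cost is 0.2 CPU-seconds per second, so the indirect costs come on top of an overhead that is already far from negligible.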

Context Switch Timeline Diagram

CPU State Save/Restore Reference Table

| CPU State | Where Saved | Size (bytes) | When Restored | Optimization |
|---|---|---|---|---|
| General-purpose registers (rax–r15, excluding rsp) | `pt_regs` on kernel stack | 120 | Every context switch | None possible |
| `rip` (instruction pointer) | `pt_regs` on kernel stack | 8 | Every context switch | None possible |
| `rflags` | `pt_regs` on kernel stack | 8 | Every context switch | None possible |
| `rsp` (stack pointer) | `thread.sp` in task_struct | 8 | Every context switch | None possible |
| `fsbase` (TLS pointer) | `thread.fsbase` in task_struct | 8 | Every context switch | Fast MSR write |
| FPU / SSE / AVX state | `thread.fpu` (task_struct) | 512–2688 | Saved at switch; restored before return to user space (older kernels: lazily, on first FPU use via `CR0.TS`) | XSAVEOPT/XSAVES skip unused components |
| Page table (CR3 / TLB) | `mm_struct->pgd` | 8 (pointer) | Process switch only | PCID avoids full flush |
| Debug registers (DR0–DR7) | `thread.debugreg[]` | 56 | Only if task uses hardware breakpoints | Skip if not set |

Voluntary vs Involuntary Context Switches

A voluntary context switch occurs when the task explicitly relinquishes the CPU: calling sleep(), pthread_cond_wait(), blocking on I/O, or calling sched_yield(). A blocking task moves from TASK_RUNNING to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE; a task that calls sched_yield() stays TASK_RUNNING and simply goes back on the runqueue.

An involuntary context switch occurs when the scheduler preempts a running task — typically because its time quantum expired or a higher-priority task became runnable. The task remains TASK_RUNNING but is moved from the running CPU back to the runqueue.
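A process can observe both kinds of switch for itself. The following is a minimal sketch, assuming a Unix-like system where Python's `resource` module is available: it reads the `ru_nvcsw` (voluntary) and `ru_nivcsw` (involuntary) counters from `getrusage()` before and after a series of short sleeps. The helper name `ctxt_switches` is made up for this example, and the exact counts depend on the machine and its load.

```python
# Observe voluntary vs involuntary context switches of the current process.
import resource
import time

def ctxt_switches() -> tuple:
    """Return (voluntary, involuntary) switch counts for this process."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_nvcsw, usage.ru_nivcsw

before = ctxt_switches()
for _ in range(20):
    time.sleep(0.001)     # each sleep blocks, giving up the CPU voluntarily
after = ctxt_switches()

print("voluntary:  ", after[0] - before[0])
print("involuntary:", after[1] - before[1])
```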

Both are tracked in /proc/[PID]/status:

voluntary_ctxt_switches:    12847
nonvoluntary_ctxt_switches:  3291

A high nonvoluntary_ctxt_switches count relative to voluntary_ctxt_switches suggests CPU contention: the process is repeatedly preempted before it is ready to yield.
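Extracting the two counters and comparing them is straightforward. The sketch below parses text in the format shown above and computes the involuntary-to-voluntary ratio; the function name `parse_ctxt_switches` is an assumption for this example, and the sample values are the ones quoted in this section rather than live measurements.

```python
# Parse the context-switch counters from /proc/[PID]/status-style text.

def parse_ctxt_switches(status_text: str) -> dict:
    """Return the *_ctxt_switches fields as a {name: count} dict."""
    counts = {}
    for line in status_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.endswith("ctxt_switches"):
            counts[key] = int(value)   # int() tolerates surrounding whitespace
    return counts

SAMPLE = (
    "voluntary_ctxt_switches:    12847\n"
    "nonvoluntary_ctxt_switches:  3291\n"
)

counts = parse_ctxt_switches(SAMPLE)
ratio = counts["nonvoluntary_ctxt_switches"] / counts["voluntary_ctxt_switches"]
print(counts, f"involuntary/voluntary = {ratio:.2f}")
```

On Linux, the same function can be fed live data by passing it the contents of /proc/self/status or /proc/[PID]/status.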

Key Takeaways

Context switching is cheap per operation but expensive at scale because of cache pollution. The direct cost (microseconds) is rarely the bottleneck. The indirect cost — working set eviction and branch predictor reset — is what limits context-switch-heavy workloads.

NUMA-aware scheduling, CPU affinity (taskset, sched_setaffinity()), and PCID all exist to mitigate these indirect costs. Pinning a process to a single core with taskset prevents migrations, so its cache and TLB contents are disturbed only by other tasks scheduled on that same core; combined with core isolation, this is useful for latency-sensitive real-time tasks.

When a performance profile shows high system time and high cache miss rates with moderate CPU utilization, excessive context switching is a likely cause. It often comes from too many threads competing for too few CPU cores, or from lock contention driving frequent voluntary yields.
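CPU affinity can also be set programmatically. This is a minimal sketch, assuming Linux, where Python exposes the sched_setaffinity() system call as `os.sched_setaffinity`; the helper name `pin_to_one_cpu` is made up for this example. It pins the calling process to one of the CPUs it is already allowed to use and then restores the original mask.

```python
# Pin the current process to a single CPU, then undo the change.
import os

def pin_to_one_cpu(pid: int = 0) -> int:
    """Restrict `pid` (0 = this process) to its lowest-numbered allowed CPU."""
    allowed = os.sched_getaffinity(pid)
    cpu = min(allowed)
    os.sched_setaffinity(pid, {cpu})
    return cpu

if hasattr(os, "sched_setaffinity"):       # only available on Linux
    original = os.sched_getaffinity(0)
    cpu = pin_to_one_cpu()
    print(f"pinned to CPU {cpu}: {os.sched_getaffinity(0)}")
    os.sched_setaffinity(0, original)      # restore the original mask
else:
    print("sched_setaffinity is not available on this platform")
```

Pinning by itself only stops the scheduler from migrating the process; other tasks can still be scheduled on the chosen core.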


