AiTechWorlds
Right now, dozens of processes are "running" on your machine. Your browser, your terminal emulator, your music player, a half-dozen system daemons — all appearing to execute at once. On a quad-core machine, at most four are actually executing at any instant. The rest are waiting. The scheduler switches between them thousands of times per second, and each transition must be completely invisible to the interrupted program.
This is the context switch: the operation of freezing one program's execution state in time, restoring another's, and handing the CPU to the new program at the exact instruction where it was last interrupted. Done correctly, the resumed program has no idea it was ever paused. It picks up exactly where it left off, with the same register values, the same stack, the same view of memory.
Done poorly — or done too frequently — context switching becomes a significant performance cost. A context switch between two processes with large, hot working sets evicts cache lines that took milliseconds to accumulate. Understanding exactly what must be saved and restored, and exactly what it costs, is essential knowledge for performance-conscious systems work.
Every CPU core maintains a collection of state that defines the current execution context. All of it must be preserved across a context switch.
On x86-64, there are 16 general-purpose registers: rax, rbx, rcx, rdx, rsi, rdi, rbp, rsp, r8–r15. Each is 64 bits (8 bytes). Total: 128 bytes.
More critically:
- rip (instruction pointer): the address of the next instruction to execute, and the most important register
- rflags: condition codes (zero flag, carry flag, sign flag) and the interrupt enable flag
- rsp (stack pointer): points to the top of the current stack

If rip and rsp are not perfectly preserved, the resumed process will execute the wrong code or corrupt its stack. Either failure means a crash or silent data corruption.
fs and gs base registers point to thread-local storage blocks. On x86-64 Linux, fs points to the TLS block — where errno, pthread_self(), and other per-thread data live. These are loaded from the task_struct's thread.fsbase field on every context switch.
x87 FPU, SSE (XMM registers, 16 × 128-bit), AVX (YMM registers, 16 × 256-bit), and AVX-512 (ZMM registers, 32 × 512-bit) must all be saved if the task has used them.
With AVX-512 enabled, the XSAVE area that holds all of this state is about 2,688 bytes. Saving and restoring it on every context switch for every process, including processes that never use floating point, would be wasteful.
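As a rough check on these sizes, the sketch below computes the raw storage implied by the register counts and widths quoted above. It is only arithmetic on those figures: the helper name `register_file_bytes` is made up for illustration, and the results are not additive because the XMM registers are the low halves of the YMM registers, which are in turn the low halves of ZMM0–ZMM15.

```python
# Raw storage implied by the register counts and widths quoted in the text.
# The register files overlap (XMM within YMM within ZMM), so do not sum them.

def register_file_bytes(count: int, width_bits: int) -> int:
    """Bytes needed to hold `count` registers of `width_bits` bits each."""
    return count * width_bits // 8

VECTOR_SETS = {
    "SSE (XMM)": (16, 128),
    "AVX (YMM)": (16, 256),
    "AVX-512 (ZMM)": (32, 512),
}

sizes = {name: register_file_bytes(*shape) for name, shape in VECTOR_SETS.items()}

for name, size in sizes.items():
    print(f"{name:14s} {size:5d} bytes")
```

The ZMM registers alone account for 2,048 bytes; the rest of the save area holds x87 state, headers, and other components.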
Linux has historically handled this with lazy FPU state saving: the CR0.TS (Task Switched) bit is set on every context switch, and the first FPU instruction executed by the new task triggers a Device Not Available (#NM) exception. The exception handler saves the outgoing task's FPU state, restores the new task's FPU state, clears CR0.TS, and returns. A task that never uses the FPU never triggers the exception and never pays the cost. Recent kernels (the 4.x series onward) switch FPU state eagerly instead and rely on XSAVEOPT/XSAVES to skip state components that are unused or unmodified.
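The lazy scheme can be illustrated with a toy model. This is a sketch in Python, not kernel code: `ToyCPU`, `Task`, and the `ts` flag are hypothetical stand-ins for a core, a task, and CR0.TS, and the "FPU registers" are just a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    fpu_state: dict = field(default_factory=dict)  # saved copy of FPU registers


class ToyCPU:
    """Hypothetical model of one core that switches FPU state lazily."""

    def __init__(self):
        self.current = None
        self.ts = False          # stands in for CR0.TS
        self.fpu_owner = None    # task whose state is live in the FPU registers
        self.fpu_regs = {}
        self.context_switches = 0
        self.fpu_swaps = 0

    def context_switch(self, nxt: Task) -> None:
        self.current = nxt
        self.ts = True           # arm the trap; no FPU state is touched here
        self.context_switches += 1

    def fpu_op(self, reg: str, value: float) -> None:
        if self.ts:
            self._device_not_available()
        self.fpu_regs[reg] = value

    def _device_not_available(self) -> None:
        # Trap handler: swap FPU state only if another task owns the registers.
        if self.fpu_owner is not self.current:
            if self.fpu_owner is not None:
                self.fpu_owner.fpu_state = dict(self.fpu_regs)
            self.fpu_regs = dict(self.current.fpu_state)
            self.fpu_owner = self.current
            self.fpu_swaps += 1
        self.ts = False


cpu = ToyCPU()
a, b, c = Task("a"), Task("b"), Task("c")
cpu.context_switch(a); cpu.fpu_op("xmm0", 1.5)
cpu.context_switch(b)                            # b never touches the FPU
cpu.context_switch(a); cpu.fpu_op("xmm0", 2.5)   # a still owns the registers
cpu.context_switch(c); cpu.fpu_op("xmm0", 9.0)
cpu.context_switch(a); cpu.fpu_op("xmm1", 0.0)
print(cpu.context_switches, "context switches,", cpu.fpu_swaps, "FPU state swaps")
```

In this run, task b costs no FPU work at all, and task a's second turn finds its registers still live, which is the saving the lazy scheme is after.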
Every task has a dedicated kernel stack, allocated in kernel memory: 16 KB on current x86-64 kernels (8 KB before Linux 3.15). When a process makes a system call or takes an interrupt, the CPU switches to this kernel stack. The kernel stack holds:
- pt_regs structure at the top: saved user-space registers

The kernel stack pointer is stored in the per-CPU TSS (the sp0 field) for interrupt entry and in per_cpu(cpu_current_top_of_stack) for syscall entry.
The most expensive part of a context switch between two different processes is the virtual address space switch.
Every process has its own page tables. The CPU's CR3 register holds the physical address of the current process's top-level page table (PML4 on x86-64). Switching processes requires loading CR3 with the new process's page table.
Loading CR3 flushes all non-global entries from the TLB (Translation Lookaside Buffer), the CPU's cache of virtual-to-physical address translations. After the flush, the new process's first access to each page misses the TLB and forces a page-table walk, which costs up to four memory accesses per translation with 4-level paging. Even a 100-entry TLB's worth of translations for a working set of 4 KB pages has to be rebuilt from scratch, one miss at a time.
PCID (Process-Context Identifiers): on modern x86-64 CPUs and Linux 4.14+, the kernel uses PCID to tag TLB entries with a 12-bit process-context ID. With PCID enabled, a CR3 load no longer flushes entries tagged with other PCIDs, and setting the no-flush bit (bit 63 of the value written to CR3) preserves the entries of the incoming PCID as well. The TLB can hold entries from multiple processes simultaneously.
Linux maintains a small PCID cache per CPU. When switching to a process whose PCID is cached on this CPU, the TLB flush is skipped. This is a significant optimization for workloads with many context switches between a small set of processes.
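The effect of tagging can be seen in a toy TLB model. This is a sketch under simplifying assumptions: the `ToyTLB` class is hypothetical, its capacity is unlimited, every process is assumed to have a cached PCID, and a "page walk" is just a dictionary insert.

```python
class ToyTLB:
    """Hypothetical TLB keyed by (pcid, virtual page), with unlimited capacity."""

    def __init__(self, use_pcid: bool):
        self.use_pcid = use_pcid
        self.entries = {}
        self.pcid = 0
        self.misses = 0

    def load_cr3(self, pcid: int) -> None:
        if not self.use_pcid:
            self.entries.clear()   # untagged TLB: a CR3 load flushes everything
            pcid = 0               # all entries share a single tag
        self.pcid = pcid

    def access(self, vpage: int) -> None:
        key = (self.pcid, vpage)
        if key not in self.entries:
            self.misses += 1       # stands in for a page-table walk
            self.entries[key] = True


def run(use_pcid: bool, switches: int = 10, pages: int = 64) -> int:
    """Alternate between two processes, each touching `pages` pages per turn."""
    tlb = ToyTLB(use_pcid)
    for i in range(switches):
        tlb.load_cr3(pcid=1 + i % 2)
        for vpage in range(pages):
            tlb.access(vpage)
    return tlb.misses


misses_flush = run(use_pcid=False)
misses_pcid = run(use_pcid=True)
print(f"misses without PCID: {misses_flush}, with PCID: {misses_pcid}")
```

Without tags, every turn starts cold. With tags, each process pays for its working set once and then hits on every later turn, at least until real-world capacity limits evict its entries.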
__switch_to() in Detail
The core of the Linux context switch is in arch/x86/kernel/process_64.c. The scheduler calls context_switch(), which calls the architecture-specific __switch_to().
The full sequence:
1. context_switch() called: with (prev_task, next_task) as arguments
2. switch_mm_irqs_off() loads the new process's CR3 (with or without TLB flush, depending on PCID)
3. __switch_to() executes:
   - Saves prev's fsbase (TLS pointer) to prev->thread.fsbase
   - Loads next's fsbase into the FS_BASE MSR
   - Switches the debug registers if they are in use (for example, ptrace breakpoints)
   - Saves FPU state if CR0.TS indicates FPU was used (lazy save)
   - Updates the TSS with next's kernel stack (stored in per_cpu(cpu_tss_rw, cpu).sp0)
4. switch_to() macro: the actual register swap
   - Pushes prev's rbp and rflags onto prev's kernel stack
   - Saves prev's rsp into prev->thread.sp
   - Loads next's rsp from next->thread.sp; the CPU is now running on next's kernel stack
   - Jumps (via ret instruction) to the address at the top of next's kernel stack, where next was last interrupted
5. current pointer updated: the per-CPU current variable now points to next

After the final step, code that references current sees the new task. The previous task is no longer executing; it will resume the next time the scheduler picks it.
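The save-and-restore idea behind this sequence can be sketched with a toy machine. This is not how the kernel is written; it is a minimal Python model in which each hypothetical `Task` runs a tiny program on a single accumulator, and `switch_to` copies the register set into and out of a per-task save area, playing the role of thread.sp and the kernel stack.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    program: list   # list of ("add", n) or ("mul", n) operations on rax
    thread: dict = field(default_factory=lambda: {"rip": 0, "rax": 0})  # saved context


class ToyCPU:
    def __init__(self):
        self.regs = {"rip": 0, "rax": 0}
        self.current = None

    def switch_to(self, nxt: Task) -> None:
        if self.current is not None:
            self.current.thread = dict(self.regs)   # save prev's registers
        self.regs = dict(nxt.thread)                # restore next's registers
        self.current = nxt

    def step(self) -> None:
        """Execute one instruction of the current task; no-op once it has finished."""
        program, rip = self.current.program, self.regs["rip"]
        if rip >= len(program):
            return
        op, n = program[rip]
        if op == "add":
            self.regs["rax"] += n
        elif op == "mul":
            self.regs["rax"] *= n
        self.regs["rip"] = rip + 1


def run_round_robin(tasks, quantum: int) -> dict:
    cpu = ToyCPU()
    pending = list(tasks)
    while pending:
        task = pending.pop(0)
        cpu.switch_to(task)
        for _ in range(quantum):
            cpu.step()
        if cpu.regs["rip"] < len(task.program):
            pending.append(task)       # preempted: back on the run queue
    cpu.switch_to(Task("idle", []))    # final switch saves the last task's state
    return {t.name: t.thread["rax"] for t in tasks}


def make_tasks():
    return [
        Task("a", [("add", 3), ("mul", 4), ("add", 1), ("mul", 2)]),
        Task("b", [("add", 10), ("mul", 10), ("add", 5)]),
    ]


results = run_round_robin(make_tasks(), quantum=2)
print(results)
```

Whatever quantum is chosen, each task computes the same result it would have computed running alone, which is the invisibility property a context switch has to guarantee.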
Direct costs (measured with perf stat -e context-switches):
Indirect costs (harder to measure, often dominant):
On a server with 100,000 context switches per second (moderate load), the cache pollution effect can account for 10–30% of total CPU time consumed — even if each individual switch takes only 2 µs.
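For comparison with the indirect figure, the direct cost at this rate is easy to work out. The sketch below uses the 100,000 switches per second and 2 µs per switch from the text; the core counts are hypothetical examples and `direct_cost_share` is a name invented here.

```python
def direct_cost_share(switches_per_second: float, cost_us: float, cores: int = 1) -> float:
    """Fraction of total CPU capacity spent inside context switches."""
    busy_seconds_per_second = switches_per_second * cost_us / 1_000_000
    return busy_seconds_per_second / cores


one_core = direct_cost_share(100_000, 2.0)
print(f"direct cost: {one_core:.0%} of one core")

for cores in (4, 16):   # hypothetical machine sizes
    share = direct_cost_share(100_000, 2.0, cores)
    print(f"  spread over {cores:2d} cores: {share:.2%} of total capacity")
```

On a multi-core server the direct cost is a small slice of total capacity, which is why the cache effects described above can end up dominating.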
| CPU State | Where Saved | Size (bytes) | When Restored | Optimization |
|---|---|---|---|---|
| General-purpose registers (rax–r15) | pt_regs on kernel stack | 120 | Every context switch | None possible |
| rip (instruction pointer) | pt_regs on kernel stack | 8 | Every context switch | None possible |
| rflags | pt_regs on kernel stack | 8 | Every context switch | None possible |
| rsp (stack pointer) | thread.sp in task_struct | 8 | Every context switch | None possible |
| fsbase (TLS pointer) | thread.fsbase in task_struct | 8 | Every context switch | Fast MSR write |
| FPU / SSE / AVX state | thread.fpu (task_struct) | 512–2688 | Lazy: first FPU use triggers fault | CR0.TS lazy save |
| Page table (CR3 / TLB) | mm_struct->pgd | 8 (pointer) | Process switch only | PCID avoids full flush |
| Debug registers (DR0–DR7) | thread.debugreg[] | 56 | Only if task uses hardware breakpoints | Skip if not set |
A voluntary context switch occurs when the task explicitly relinquishes the CPU: calling sleep(), pthread_cond_wait(), blocking on I/O, or calling sched_yield(). A task that blocks moves from TASK_RUNNING to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE; a task that calls sched_yield() stays TASK_RUNNING and simply goes back on the runqueue.
An involuntary context switch occurs when the scheduler preempts a running task — typically because its time quantum expired or a higher-priority task became runnable. The task remains TASK_RUNNING but is moved from the running CPU back to the runqueue.
Both are tracked in /proc/[PID]/status:
voluntary_ctxt_switches: 12847
nonvoluntary_ctxt_switches: 3291
A high nonvoluntary_ctxt_switches count relative to voluntary_ctxt_switches points to CPU contention: the process is being preempted before it is ready to yield.
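These counters are easy to read programmatically. The sketch below parses the two fields out of status text and computes the nonvoluntary-to-voluntary ratio; the function names are invented for illustration, and the sample string simply reuses the figures shown above so the example runs on any platform.

```python
from pathlib import Path

_KEYS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")


def parse_ctxt_switches(status_text: str) -> dict:
    """Extract the two context-switch counters from /proc/[PID]/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in _KEYS:
            counts[key] = int(value.strip())
    return counts


def read_ctxt_switches(pid="self") -> dict:
    """Read the counters for a live process (Linux only)."""
    return parse_ctxt_switches(Path(f"/proc/{pid}/status").read_text())


sample = (
    "Name:\tdemo\n"
    "voluntary_ctxt_switches:\t12847\n"
    "nonvoluntary_ctxt_switches:\t3291\n"
)
counts = parse_ctxt_switches(sample)
ratio = counts["nonvoluntary_ctxt_switches"] / counts["voluntary_ctxt_switches"]
print(counts, f"nonvoluntary/voluntary = {ratio:.2f}")
```

On a Linux machine, `read_ctxt_switches()` returns the same dictionary for the calling process. Sampling it twice and comparing the deltas gives a per-interval rate, which is usually more informative than the lifetime totals.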
Context switching is cheap per operation but expensive at scale because of cache pollution. The direct cost (microseconds) is rarely the bottleneck. The indirect cost — working set eviction and branch predictor reset — is what limits context-switch-heavy workloads.
NUMA-aware scheduling, CPU affinity (taskset, sched_setaffinity()), and PCID all exist to mitigate these indirect costs. A process pinned to a single core with taskset never migrates, so it keeps whatever cache and TLB state it has built up on that core. This is useful for latency-sensitive real-time tasks, provided other work is kept off that core.
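Pinning can also be done from inside a program. The sketch below uses Python's `os.sched_setaffinity`, a thin wrapper over the system call of the same name that exists only on Linux; the helper name `pin_to_cpu` is made up for this example, and the original mask is restored afterwards so the demonstration leaves no lasting effect.

```python
import os


def pin_to_cpu(cpu: int, pid: int = 0) -> set:
    """Restrict `pid` (0 means the calling process) to one CPU; return the new mask."""
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


pinned = None
if hasattr(os, "sched_setaffinity"):        # Linux-only API
    original = os.sched_getaffinity(0)
    target = min(original)                  # choose a CPU we are already allowed on
    pinned = pin_to_cpu(target)
    print(f"allowed before: {sorted(original)}, after: {sorted(pinned)}")
    os.sched_setaffinity(0, original)       # restore the original mask
```

Pinning only stops the scheduler from migrating this process. Other tasks can still be scheduled on the same core and evict its cache lines unless they are kept away by other means.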
When a performance profile shows high system time and high cache miss rates with moderate CPU utilization, the diagnosis is usually excessive context switching — often caused by too many threads competing for too few CPU cores, or by lock contention driving frequent voluntary yields.