The scheduler runs every few milliseconds, making one of the most consequential decisions in computing: which of the dozens or hundreds of runnable processes should occupy the next CPU time slice? The answer determines whether your laptop feels responsive or sluggish, whether a web server handles 10,000 requests per second or 1,000, and whether a real-time audio application produces glitch-free sound or stutters.

What makes scheduling hard is not the mechanics — maintaining a queue is straightforward. What makes it hard is that the goals are fundamentally in tension. Maximizing throughput (completed jobs per second) means running long CPU-bound jobs to completion. Minimizing response time (time from input to first output) means constantly interrupting long jobs to give interactive processes quick turns. You cannot fully optimize for both simultaneously.

Every scheduling algorithm is a different answer to the question: whose interests do we sacrifice, and by how much?

Scheduling Goals and Their Tensions

Before examining algorithms, understanding the objective function is essential — because every scheduler design implicitly encodes a priority ordering.

CPU Utilization: keep the CPU busy. Idle CPU is wasted capacity. Goal: 90–100% utilization.

Throughput: maximize jobs completed per unit time. Relevant for batch workloads (scientific computing, video encoding, data processing).

Turnaround time: minimize the time between submitting a job and its completion. Critical for batch systems. Turnaround = finish time − arrival time.

Response time: minimize time from a request to the first response — not the complete answer, just the first byte. This is what makes UI interactions feel instant. A 50ms response feels instantaneous; 500ms feels laggy; 2 seconds breaks user engagement.

Fairness: each process gets a proportional share of CPU. Prevents starvation — a process should not wait indefinitely while others run.

The fundamental tension: optimizing for throughput means long time slices and minimal context switching. Optimizing for response time means short time slices and frequent context switching. These are contradictory. Every scheduler is a compromise.

Classic Scheduling Algorithms

First-Come, First-Served (FCFS)

The simplest conceivable scheduler: run processes in order of arrival. No preemption — a process runs to completion (or until it blocks).

Convoy effect: if a CPU-bound job arrives first and runs for 30 seconds, all subsequent jobs — including trivial 1ms interactive processes — wait 30 seconds. Average waiting time skyrockets. FCFS is catastrophic for interactive workloads.

Use case: batch systems where all jobs are roughly equal in length and interactivity is irrelevant.

Shortest Job First (SJF) / Shortest Time to Completion First (STCF)

SJF runs the job with the smallest estimated remaining CPU time next. It is provably optimal for minimizing average waiting time — no other algorithm can do better given accurate burst time estimates.

Problem: requires knowing job duration in advance. This is impossible in general. The kernel cannot know how long a process will run before it next blocks.

Heuristic: predict next burst from history: τ_n+1 = α × t_n + (1-α) × τ_n (exponential moving average of past bursts). This approximation is useful but not guaranteed accurate.

STCF is the preemptive variant: if a new job arrives with a shorter remaining time than the current job, preempt. Further improves average waiting time at the cost of frequent preemptions for long-running jobs.

Round Robin (RR)

Each process gets a fixed time quantum (the "time slice"), then is preempted and moved to the back of the queue. With N processes and quantum q, each process gets CPU time every N×q milliseconds — bounded response time for any number of processes.

Quantum size matters enormously:

Too small (1ms): context switch overhead dominates (if a switch takes 1ms, 50% of CPU time is wasted switching)
Too large (1000ms): degenerates toward FCFS; interactive processes feel unresponsive
Linux default: ~10ms (configurable via sched_rr_timeslice_ms)

RR maximizes response time at the cost of turnaround time — a single long job that runs for 10 quanta completes 10x later than under FCFS.

Priority Scheduling

Each process has a numeric priority. The highest-priority runnable process always runs next. Two forms:

Static priority: assigned at creation, never changes. Problem: starvation — a low-priority process may never run if higher-priority processes are always present.

Dynamic priority with aging: periodically boost the priority of processes that have been waiting a long time. This guarantees eventual CPU access for any waiting process.

Priority inversion: a classic problem. High-priority task H needs resource R held by low-priority task L. Medium-priority task M preempts L. Now H is blocked waiting for L, but L can't run because M is running — M, which has no dependency on R, is effectively blocking H. Solution: priority inheritance (L temporarily inherits H's priority while holding R). Used in real-time systems.

Multilevel Feedback Queue (MLFQ)

MLFQ is the most practically successful classical scheduling algorithm, used in some form in FreeBSD, macOS (XNU), and influenced Windows NT's scheduler design.

Structure: multiple priority queues (e.g., 8 levels). Higher queues have smaller quanta; lower queues have larger quanta.

Rules:

If Priority(A) > Priority(B): run A
If Priority(A) == Priority(B): round-robin
New jobs start at the highest priority queue
If a job uses its full quantum, demote it one level
If a job voluntarily yields before its quantum expires, keep its priority
Periodically (every S seconds), boost all jobs to the top queue

Intuition: new jobs might be short interactive processes — give them quick service. If they keep using full quanta, they reveal themselves as CPU-bound and get demoted to lower-priority, longer-quantum queues. The periodic boost prevents starvation.

MLFQ approximates SJF without requiring advance knowledge of job length — it infers length from behavior.

Multi-Core Scheduling Challenges

Single-core scheduling theory breaks down on multi-core systems. New problems emerge.

Processor Affinity

If task A was running on core 2 and made heavy use of core 2's L1/L2 cache, migrating task A to core 7 after a reschedule discards all that cached data. Soft affinity: the scheduler prefers to keep tasks on the same core but will migrate under load imbalance. Hard affinity: sched_setaffinity() pins a task to specific cores. Used for latency-critical tasks (high-frequency trading, audio processing).

NUMA-Awareness

In NUMA (Non-Uniform Memory Access) systems — standard on multi-socket servers — memory banks are physically closer to some CPUs than others. A process running on socket 0 accessing memory allocated on socket 1's NUMA node pays 2–4× the memory latency.

Linux's NUMA scheduler (enabled by default since kernel 3.13) tracks "NUMA faults" — when a page access is remote — and migrates pages toward the CPU accessing them, or migrates tasks toward their hot memory. numactl --localalloc forces memory allocation on the local NUMA node. numastat shows NUMA hit/miss rates.

Load Balancing

With multiple CPU cores, work must be distributed. Linux uses a pull migration model: when a CPU becomes idle, it looks for work to steal from other CPUs' runqueues. This is implemented in the load_balance() function called from the idle loop.

The scheduler topology mirrors the hardware topology: SMT domains (hyperthreading pairs), MC domains (cores on the same physical chip sharing L3 cache), NUMA domains (sockets). Load balancing frequency and aggressiveness differs by domain level — frequent rebalancing within a chip (shared cache is cheap), less frequent across NUMA nodes (memory migration is expensive).

Real-Time Scheduling Policies

Linux provides three scheduling policies for real-time workloads, set via sched_setscheduler() or the chrt command.

SCHED_FIFO: a FIFO queue per priority (1–99). A SCHED_FIFO task runs until it blocks, yields, or is preempted by a higher-priority SCHED_FIFO/RR task. It is never preempted by a lower or equal priority task. Maximum determinism.

SCHED_RR: like SCHED_FIFO but with a configurable time quantum. Equal-priority SCHED_RR tasks rotate. More fair than SCHED_FIFO at the cost of slightly less determinism.

SCHED_DEADLINE: the most sophisticated real-time policy (added in Linux 3.14). Each task specifies three parameters: runtime C (how much CPU time it needs), deadline D (by when), and period T (how often). The kernel uses Earliest Deadline First (EDF) scheduling and Constant Bandwidth Server (CBS) for admission control. The kernel rejects tasks if accepting them would make all deadlines unachievable (worst-case guarantee).

chrt --deadline --sched-runtime 1000000 --sched-deadline 10000000 --sched-period 10000000 ./audio_app
# 1ms runtime per 10ms period = 10% CPU reservation with hard deadline guarantee

SCHED_DEADLINE is used in the JACK2 professional audio server and industrial control systems.

MLFQ Architecture Diagram

Scheduling Algorithm Comparison Table

Algorithm	Optimal For	Preemptive?	Starvation Possible?	Used In	Time Complexity
FCFS	Simple batch jobs, equal-length tasks	No	No	Disk I/O elevators (variant)	O(1)
SJF	Minimizing average wait time (batch)	No	Yes (long jobs)	Theoretical baseline	O(log n)
STCF	Minimizing average wait time	Yes	Yes (long jobs)	Theoretical analysis	O(log n)
Round Robin	Interactive response time	Yes	No	Basic time-sharing	O(1)
Priority (static)	Safety-critical tasks	Yes	Yes	RTOS (with SCHED_FIFO)	O(log n)
MLFQ	General-purpose (mix of workloads)	Yes	No (with boost)	FreeBSD, Windows (influenced)	O(log n) per queue
EDF (SCHED_DEADLINE)	Real-time with bandwidth guarantees	Yes	No (with CBS)	Linux RT, industrial control	O(log n)

Measuring Scheduler Performance

# Count context switches per second system-wide
perf stat -e context-switches -a sleep 5

# Measure scheduling latency (how long before a woken task gets CPU)
cyclictest -t 1 -p 80 -n -i 10000

# View per-process scheduling statistics
cat /proc/[PID]/schedstat
# Fields: time on cpu (ns), time waiting on runqueue (ns), number of timeslices run

A typical busy server: 100K–1M context switches/second. A well-tuned real-time system: cyclictest shows < 50µs worst-case latency. A default kernel under load: spikes to 1–5ms.

Key Takeaways

No scheduling algorithm is universally optimal because the objectives are fundamentally contradictory. MLFQ's elegance is that it adapts to workload characteristics — CPU-bound jobs self-demote, I/O-bound jobs self-promote, and the periodic boost prevents starvation — all without requiring advance knowledge of job characteristics.

Multi-core scheduling introduces a layer of complexity that classical theory largely ignores: NUMA topology, cache affinity, load balancing across heterogeneous memory latencies. Modern server workloads spend significant scheduler engineering effort on NUMA awareness because the memory latency gap between local and remote access is 2–4× and cannot be scheduled away.

Real-time scheduling (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE) is the answer when "best effort" is not good enough — when the question is not "how fast on average?" but "what is the worst case?" SCHED_DEADLINE's EDF with CBS provides mathematically bounded worst-case guarantees, which is why it appears in production audio and industrial systems.

💬 DiscussionPowered by GitHub Discussions

📱

Get this course's notes on Telegram!

Free cheat sheets, summaries & practice exercises

Get Notes Free →

35 minLesson 7 of 16

Course Contents(16 lessons)

▾

Chapter 1: OS Architecture Internals

OS Kernel Architecture: Monolithic vs Microkernel vs Hybrid25 min

System Calls: The Bridge Between User Space and Kernel28 min

Interrupt Handling: Hardware Interrupts to Kernel Handlers28 min

Chapter 2: Process and Thread Internals

Process Internals: task_struct, PCB, and Kernel Data Structures32 min

Thread Models: POSIX Threads and Kernel Thread Implementation30 min

Context Switching: How the CPU Switches Between Tasks25 min

Chapter 3: Scheduling Internals

CPU Scheduling Deep Dive: Real Algorithms in Production35 min

Linux CFS: The Completely Fair Scheduler Explained30 min

Chapter 4: Memory Management Internals

Linux Memory Management: Zones, Buddy System, Slab Allocator32 min

Demand Paging: Page Fault Handling in Linux35 min

Virtual Memory Areas: mmap, Stack, Heap Internals28 min

Chapter 5: File System Internals

VFS Layer: How Linux Abstracts File Systems30 min

ext4 Internals: Inodes, Extents, and Journaling35 min

Chapter 6: Synchronization and Security

Kernel Synchronization: Spinlocks, Mutexes, RCU32 min

OS Security: Capabilities, Namespaces, cgroups, SELinux28 min

Chapter 7: Final Project

Final Project: OS Internals Analysis and Simulation45 min