AiTechWorlds
AiTechWorlds
The scheduler runs every few milliseconds, making one of the most consequential decisions in computing: which of the dozens or hundreds of runnable processes should occupy the next CPU time slice? The answer determines whether your laptop feels responsive or sluggish, whether a web server handles 10,000 requests per second or 1,000, and whether a real-time audio application produces glitch-free sound or stutters.
What makes scheduling hard is not the mechanics — maintaining a queue is straightforward. What makes it hard is that the goals are fundamentally in tension. Maximizing throughput (completed jobs per second) means running long CPU-bound jobs to completion. Minimizing response time (time from input to first output) means constantly interrupting long jobs to give interactive processes quick turns. You cannot fully optimize for both simultaneously.
Every scheduling algorithm is a different answer to the question: whose interests do we sacrifice, and by how much?
Before examining algorithms, understanding the objective function is essential — because every scheduler design implicitly encodes a priority ordering.
CPU Utilization: keep the CPU busy. Idle CPU is wasted capacity. Goal: 90–100% utilization.
Throughput: maximize jobs completed per unit time. Relevant for batch workloads (scientific computing, video encoding, data processing).
Turnaround time: minimize the time between submitting a job and its completion. Critical for batch systems. Turnaround = finish time − arrival time.
Response time: minimize time from a request to the first response — not the complete answer, just the first byte. This is what makes UI interactions feel instant. A 50ms response feels instantaneous; 500ms feels laggy; 2 seconds breaks user engagement.
Fairness: each process gets a proportional share of CPU. Prevents starvation — a process should not wait indefinitely while others run.
The fundamental tension: optimizing for throughput means long time slices and minimal context switching. Optimizing for response time means short time slices and frequent context switching. These are contradictory. Every scheduler is a compromise.
The simplest conceivable scheduler: run processes in order of arrival. No preemption — a process runs to completion (or until it blocks).
Convoy effect: if a CPU-bound job arrives first and runs for 30 seconds, all subsequent jobs — including trivial 1ms interactive processes — wait 30 seconds. Average waiting time skyrockets. FCFS is catastrophic for interactive workloads.
Use case: batch systems where all jobs are roughly equal in length and interactivity is irrelevant.
SJF runs the job with the smallest estimated remaining CPU time next. It is provably optimal for minimizing average waiting time — no other algorithm can do better given accurate burst time estimates.
Problem: requires knowing job duration in advance. This is impossible in general. The kernel cannot know how long a process will run before it next blocks.
Heuristic: predict next burst from history: τ_n+1 = α × t_n + (1-α) × τ_n (exponential moving average of past bursts). This approximation is useful but not guaranteed accurate.
STCF is the preemptive variant: if a new job arrives with a shorter remaining time than the current job, preempt. Further improves average waiting time at the cost of frequent preemptions for long-running jobs.
Each process gets a fixed time quantum (the "time slice"), then is preempted and moved to the back of the queue. With N processes and quantum q, each process gets CPU time every N×q milliseconds — bounded response time for any number of processes.
Quantum size matters enormously:
sched_rr_timeslice_ms)RR maximizes response time at the cost of turnaround time — a single long job that runs for 10 quanta completes 10x later than under FCFS.
Each process has a numeric priority. The highest-priority runnable process always runs next. Two forms:
Static priority: assigned at creation, never changes. Problem: starvation — a low-priority process may never run if higher-priority processes are always present.
Dynamic priority with aging: periodically boost the priority of processes that have been waiting a long time. This guarantees eventual CPU access for any waiting process.
Priority inversion: a classic problem. High-priority task H needs resource R held by low-priority task L. Medium-priority task M preempts L. Now H is blocked waiting for L, but L can't run because M is running — M, which has no dependency on R, is effectively blocking H. Solution: priority inheritance (L temporarily inherits H's priority while holding R). Used in real-time systems.
MLFQ is the most practically successful classical scheduling algorithm, used in some form in FreeBSD, macOS (XNU), and influenced Windows NT's scheduler design.
Structure: multiple priority queues (e.g., 8 levels). Higher queues have smaller quanta; lower queues have larger quanta.
Rules:
Intuition: new jobs might be short interactive processes — give them quick service. If they keep using full quanta, they reveal themselves as CPU-bound and get demoted to lower-priority, longer-quantum queues. The periodic boost prevents starvation.
MLFQ approximates SJF without requiring advance knowledge of job length — it infers length from behavior.
Single-core scheduling theory breaks down on multi-core systems. New problems emerge.
If task A was running on core 2 and made heavy use of core 2's L1/L2 cache, migrating task A to core 7 after a reschedule discards all that cached data. Soft affinity: the scheduler prefers to keep tasks on the same core but will migrate under load imbalance. Hard affinity: sched_setaffinity() pins a task to specific cores. Used for latency-critical tasks (high-frequency trading, audio processing).
In NUMA (Non-Uniform Memory Access) systems — standard on multi-socket servers — memory banks are physically closer to some CPUs than others. A process running on socket 0 accessing memory allocated on socket 1's NUMA node pays 2–4× the memory latency.
Linux's NUMA scheduler (enabled by default since kernel 3.13) tracks "NUMA faults" — when a page access is remote — and migrates pages toward the CPU accessing them, or migrates tasks toward their hot memory. numactl --localalloc forces memory allocation on the local NUMA node. numastat shows NUMA hit/miss rates.
With multiple CPU cores, work must be distributed. Linux uses a pull migration model: when a CPU becomes idle, it looks for work to steal from other CPUs' runqueues. This is implemented in the load_balance() function called from the idle loop.
The scheduler topology mirrors the hardware topology: SMT domains (hyperthreading pairs), MC domains (cores on the same physical chip sharing L3 cache), NUMA domains (sockets). Load balancing frequency and aggressiveness differs by domain level — frequent rebalancing within a chip (shared cache is cheap), less frequent across NUMA nodes (memory migration is expensive).
Linux provides three scheduling policies for real-time workloads, set via sched_setscheduler() or the chrt command.
SCHED_FIFO: a FIFO queue per priority (1–99). A SCHED_FIFO task runs until it blocks, yields, or is preempted by a higher-priority SCHED_FIFO/RR task. It is never preempted by a lower or equal priority task. Maximum determinism.
SCHED_RR: like SCHED_FIFO but with a configurable time quantum. Equal-priority SCHED_RR tasks rotate. More fair than SCHED_FIFO at the cost of slightly less determinism.
SCHED_DEADLINE: the most sophisticated real-time policy (added in Linux 3.14). Each task specifies three parameters: runtime C (how much CPU time it needs), deadline D (by when), and period T (how often). The kernel uses Earliest Deadline First (EDF) scheduling and Constant Bandwidth Server (CBS) for admission control. The kernel rejects tasks if accepting them would make all deadlines unachievable (worst-case guarantee).
chrt --deadline --sched-runtime 1000000 --sched-deadline 10000000 --sched-period 10000000 ./audio_app
# 1ms runtime per 10ms period = 10% CPU reservation with hard deadline guarantee
SCHED_DEADLINE is used in the JACK2 professional audio server and industrial control systems.
| Algorithm | Optimal For | Preemptive? | Starvation Possible? | Used In | Time Complexity |
|---|---|---|---|---|---|
| FCFS | Simple batch jobs, equal-length tasks | No | No | Disk I/O elevators (variant) | O(1) |
| SJF | Minimizing average wait time (batch) | No | Yes (long jobs) | Theoretical baseline | O(log n) |
| STCF | Minimizing average wait time | Yes | Yes (long jobs) | Theoretical analysis | O(log n) |
| Round Robin | Interactive response time | Yes | No | Basic time-sharing | O(1) |
| Priority (static) | Safety-critical tasks | Yes | Yes | RTOS (with SCHED_FIFO) | O(log n) |
| MLFQ | General-purpose (mix of workloads) | Yes | No (with boost) | FreeBSD, Windows (influenced) | O(log n) per queue |
| EDF (SCHED_DEADLINE) | Real-time with bandwidth guarantees | Yes | No (with CBS) | Linux RT, industrial control | O(log n) |
# Count context switches per second system-wide
perf stat -e context-switches -a sleep 5
# Measure scheduling latency (how long before a woken task gets CPU)
cyclictest -t 1 -p 80 -n -i 10000
# View per-process scheduling statistics
cat /proc/[PID]/schedstat
# Fields: time on cpu (ns), time waiting on runqueue (ns), number of timeslices run
A typical busy server: 100K–1M context switches/second. A well-tuned real-time system: cyclictest shows < 50µs worst-case latency. A default kernel under load: spikes to 1–5ms.
No scheduling algorithm is universally optimal because the objectives are fundamentally contradictory. MLFQ's elegance is that it adapts to workload characteristics — CPU-bound jobs self-demote, I/O-bound jobs self-promote, and the periodic boost prevents starvation — all without requiring advance knowledge of job characteristics.
Multi-core scheduling introduces a layer of complexity that classical theory largely ignores: NUMA topology, cache affinity, load balancing across heterogeneous memory latencies. Modern server workloads spend significant scheduler engineering effort on NUMA awareness because the memory latency gap between local and remote access is 2–4× and cannot be scheduled away.
Real-time scheduling (SCHED_FIFO, SCHED_RR, SCHED_DEADLINE) is the answer when "best effort" is not good enough — when the question is not "how fast on average?" but "what is the worst case?" SCHED_DEADLINE's EDF with CBS provides mathematically bounded worst-case guarantees, which is why it appears in production audio and industrial systems.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises