AiTechWorlds
AiTechWorlds
In 2007, Ingo Molnár — the Linux kernel developer also responsible for the NPTL threading library and the futex mechanism — committed a complete replacement for Linux's scheduler. His commit message was characteristically direct: "CFS stands for 'Completely Fair Scheduler,' and is the new desktop process scheduler implemented by Ingo Molnár and merged in Linux 2.6.23."
What he replaced was the O(1) scheduler, introduced in Linux 2.5 in 2003 by the same developer. The O(1) scheduler was fast — it ran in constant time regardless of the number of processes. But it achieved interactivity by using a set of heuristics to distinguish "interactive" from "CPU-bound" processes: measuring sleep times, estimating interactivity scores, applying bonuses and penalties. These heuristics worked well on some workloads and catastrophically on others. The code was complex, the edge cases numerous, and the behavior often surprising.
CFS replaced all of it with a single elegant idea: model a perfect multitasking CPU, and track how far each process deviates from that ideal. No heuristics. No interactivity scores. Just one number per process — vruntime — and one rule: run the process with the smallest vruntime.
CFS starts with an idealized model: imagine a CPU that can execute N processes simultaneously, each receiving exactly 1/N of the CPU's processing power at every instant. Every nanosecond, each process advances by 1/N nanoseconds of computation.
No real CPU works this way — hardware executes one instruction stream at a time per core. CFS approximates this ideal by time-multiplexing: running each process for short periods, but ensuring that over any observation window, each process has received approximately its fair share of CPU time.
The approximation quality is determined by the scheduler period (sched_latency_ns): the target time window within which every runnable process will have executed at least once. With 4 runnable processes and a 24ms latency target, each runs for ~6ms per period.
This is not round-robin — it is weighted fair queuing based on accumulated CPU usage.
Every task_struct contains a sched_entity structure, which contains vruntime — a 64-bit nanosecond counter tracking how much CPU time this task has effectively received.
The update rule on every scheduler tick:
vruntime += delta_exec × (NICE_0_LOAD / task_weight)
Where:
delta_exec: nanoseconds of wall-clock CPU time just consumedNICE_0_LOAD: the weight of a process at nice level 0 (fixed at 1024 in Linux 6.x)task_weight: the weight corresponding to this task's nice levelThe effect: a nice-0 task accumulates vruntime at exactly the rate of real time. A nice-5 task (lower priority, weight = 335) accumulates vruntime slower — its vruntime increases by only 335/1024 ≈ 0.33 ns per real ns. A nice -5 task (higher priority, weight = 3121) accumulates vruntime faster — 3121/1024 ≈ 3.05 ns per real ns.
The scheduler's decision: always pick the task with the smallest vruntime. The task with the smallest vruntime is the one that has received the least CPU time relative to its entitlement — it is the most "behind" in the ideal model.
The result: a nice-5 task accumulates vruntime slowly, so it stays near the front of the scheduler's queue and gets scheduled frequently — but each scheduling gives it less real time before another task catches up. A nice -5 task accumulates vruntime quickly, so it gets scheduled infrequently per period — but when scheduled, its time slice is proportionally longer.
This is mathematically equivalent to weighted fair queuing, the same algorithm used in network packet scheduling.
CFS stores all runnable tasks in a self-balancing binary search tree — a red-black tree — keyed by vruntime.
Why a tree, not a sorted list?
rb_leftmost) in the runqueueThe leftmost node is always the task with the smallest vruntime — the next task to run. Picking the next task is O(1). Inserting a newly runnable task takes O(log n) comparisons.
With 1,000 runnable processes, O(log 1000) ≈ 10 comparisons per insert. The O(1) scheduler's constant-time advantage over CFS's O(log n) is irrelevant in practice — 10 comparisons at modern CPU speeds is nanoseconds, while the scheduled task will run for milliseconds.
The green node (Task A, vruntime=20ms) is the leftmost — it runs next.
Two sysctl parameters govern CFS behavior:
kernel.sched_latency_ns (default: 24,000,000 ns = 24ms): the target scheduling period. With N tasks, each gets sched_latency / N nanoseconds per period. This is a soft target — if the resulting time slice would be smaller than sched_min_granularity_ns, the time slice is clamped.
kernel.sched_min_granularity_ns (default: 3,000,000 ns = 3ms): the minimum time slice. Prevents excessive context switching when there are many tasks. With 100 tasks at 24ms latency, each task would get 0.24ms — below the minimum. CFS increases the period: actual_period = sched_min_granularity_ns × nr_tasks, so each still gets 3ms.
Tuning for different workloads:
sched_min_granularity_ns to 1ms. More context switches, faster response.sched_min_granularity_ns to 10ms. Fewer switches, better cache utilization.If a task sleeps for 5 seconds while other tasks run, its vruntime stays frozen. When it wakes, it has a vruntime far behind all current tasks — it would monopolize the CPU for seconds to "catch up."
CFS prevents this: on wakeup, the task's vruntime is set to min_vruntime (the smallest vruntime currently in the runqueue) plus a small placement bonus. The task gets a fresh start at approximately the current minimum, not its historical position.
The placement bonus (configurable via sched_wakeup_granularity_ns, default: 4ms) gives the woken task a slight head start — its vruntime is placed slightly before min_vruntime. This ensures it gets scheduled promptly (important for interactive responsiveness) without giving it so much advantage that it starves other tasks.
This is why interactive applications feel responsive under CFS: a process waiting for keyboard input has a perpetually low effective vruntime at wakeup, so it gets immediate CPU time when an event arrives.
CFS integrates with Linux cgroups (control groups) to provide hierarchical CPU bandwidth control. This is how Docker, Kubernetes, and systemd slice CPU resources.
Concept: instead of scheduling individual tasks, CFS schedules a hierarchy of groups. Each group has a sched_entity, and that entity's vruntime determines when the group gets CPU time. Within a group, tasks are scheduled by their own CFS runqueue.
CFS bandwidth control (CONFIG_CFS_BANDWIDTH): each cgroup can be given a quota and period:
cpu.cfs_quota_us = 50000 and cpu.cfs_period_us = 100000 → 50% CPU limitcpu.cfs_quota_us = 200000 → 200% CPU limit (can use 2 full cores)# Docker example: limit container to 0.5 CPUs
docker run --cpus 0.5 nginx
# Translates to: cfs_quota = 50ms, cfs_period = 100ms
# Kubernetes: resource limits
# resources.limits.cpu: "500m" (500 millicores = 0.5 CPUs)
When a cgroup exhausts its quota within a period, all tasks in it are throttled — they become unrunnable until the next period's quota is replenished. cat /sys/fs/cgroup/cpu/docker/CONTAINER_ID/cpu.stat shows throttled_time — how many nanoseconds the container was throttled.
Debugging throttling: excessive cgroup throttling is a common production performance problem. The symptom is sporadic latency spikes in a container that appears to have headroom (CPU usage below quota average) — caused by burst CPU usage within a period exceeding the quota.
# Comprehensive scheduler debug output
cat /proc/sched_debug
# Per-task scheduling statistics
cat /proc/[PID]/schedstat
# Format: cpu_time wait_time timeslices_run
# Per-CPU runqueue statistics
cat /proc/schedstat
# Nice-to-weight mapping (kernel internal table)
# These are the actual values from kernel/sched/core.c:
# nice -20: weight=88761, nice -10: weight=9548,
# nice 0: weight=1024, nice 10: weight=110, nice 19: weight=15
| Nice Value | Weight | vruntime Rate (relative to nice-0) | Effective CPU Share (vs nice-0) |
|---|---|---|---|
| -20 | 88761 | 86.7× faster accumulation | 86.7× more CPU per period |
| -10 | 9548 | 9.3× faster | 9.3× more CPU |
| -5 | 3121 | 3.1× faster | 3.1× more CPU |
| 0 | 1024 | 1.0× (baseline) | Equal share |
| 5 | 335 | 0.33× slower | 0.33× CPU |
| 10 | 110 | 0.11× slower | 0.11× CPU |
| 19 | 15 | 0.015× slower | 0.015× CPU |
| CFS Parameter | Default Value | Effect of Increasing | Tune For | sysctl Name |
|---|---|---|---|---|
sched_latency_ns | 24,000,000 ns (24ms) | Longer time slices, fewer switches, higher latency | Batch / throughput workloads | kernel.sched_latency_ns |
sched_min_granularity_ns | 3,000,000 ns (3ms) | Larger minimum slice, less context switching | CPU-bound workloads with many tasks | kernel.sched_min_granularity_ns |
sched_wakeup_granularity_ns | 4,000,000 ns (4ms) | Less aggressive preemption on wakeup | Reducing wakeup preemption storms | kernel.sched_wakeup_granularity_ns |
sched_child_runs_first | 0 (disabled) | Child runs before parent after fork | Reducing CoW faults (child execs immediately) | kernel.sched_child_runs_first |
sched_migration_cost_ns | 500,000 ns (0.5ms) | Less eager task migration between cores | NUMA workloads, cache-sensitive tasks | kernel.sched_migration_cost_ns |
CFS's elegance is in reducing the entire scheduling problem to a single invariant: run the task with the least vruntime. The red-black tree makes this O(1) to execute and O(log n) to maintain. Nice values translate to weights that modify the rate of vruntime accumulation, providing proportional CPU sharing without heuristics.
The sleep/wakeup handling — resetting vruntime to near min_vruntime on wakeup — is what gives CFS its interactive responsiveness. A UI thread spending 99% of its time blocked on input gets scheduled within a millisecond of an event because it re-enters the tree with a near-minimal vruntime.
The cgroups integration extends CFS from a per-task scheduler to a hierarchical resource manager — the foundation of container CPU isolation. Understanding throttling behavior (quota exhaustion within a period) is essential for running latency-sensitive workloads in containerized environments. Many production incidents involving "mysterious latency spikes" in Kubernetes trace back to CFS throttling — diagnosable by checking throttled_time in cgroup statistics.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises