When you run a 500MB application, Linux does not load 500MB into RAM immediately — that would take seconds. Instead, it maps the pages and loads them on demand, only when actually accessed. This "lie" is what makes programs start in milliseconds.

Launch a fresh bash process and it starts in under 5 milliseconds. The bash binary is 1.4MB, linked against glibc (~2MB), readline, ncurses, and a dozen other libraries totaling perhaps 8MB of code. Linux reads almost none of it at startup: only the ELF headers needed to set up the mappings. It maps the binary and all its libraries into virtual address space, creating Virtual Memory Areas whose page table entries are still empty, and waits. The first time bash tries to execute an instruction, the CPU finds no present page table entry, raises a page fault, and the kernel brings in the 4KB page containing that instruction. In practice the kernel also applies readahead and maps a few neighboring pages that are already cached (fault-around), but nothing is loaded unless it was asked for or sits next to something that was.

This is demand paging, and understanding its implementation reveals how modern OS memory management achieves both speed and efficiency simultaneously.

Memory Mapping: The mmap System Call

mmap() is the foundation of demand paging in Linux. It creates a mapping between a region of virtual address space and a backing store — either a file or anonymous memory.

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);

File-Backed Mappings

int fd = open("database.db", O_RDONLY);
void *data = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);

The file's contents are mapped directly into the process's virtual address space. When the process reads from data, the kernel loads the relevant 4KB pages from the file into the page cache and maps those same physical pages into the process's page tables. This avoids the extra copy that read() makes from the page cache into a user buffer: the process reads the cached pages in place.
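As a minimal sketch of the file-backed case, the function below maps a whole file read-only and scans it in place. The name count_newlines_mmap is a hypothetical helper chosen for illustration; only the mmap(), fstat(), and munmap() calls come from the system API.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: count '\n' bytes in a file by mapping it read-only.
 * Returns the count, or -1 on error. */
long count_newlines_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    const char *data = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return -1;

    long count = 0;
    for (size_t i = 0; i < len; i++)    /* touching a new page may fault it in */
        if (data[i] == '\n')
            count++;

    munmap((void *)data, len);
    return count;
}
```

No read() call and no user-space buffer appear anywhere: the loop reads straight from pages the kernel maps in as they are touched.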

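LMDB is built entirely around file-backed mmap, and SQLite can optionally use it (PRAGMA mmap_size). Both delegate page management to the kernel's page cache, which can be more efficient than application-level buffer management. PostgreSQL takes the opposite approach: its shared_buffers pool is an application-managed cache in shared memory, filled with ordinary read calls rather than by mapping the data files.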

Anonymous Mappings

void *heap = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);

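Anonymous mappings have no file backing. Their pages are zero-filled on first touch and, under memory pressure, can be written out to the swap device (or compressed by zswap). glibc's malloc() uses brk() for small allocations and mmap(MAP_ANONYMOUS) for allocations at or above the mmap threshold (M_MMAP_THRESHOLD, 128KB by default and adjusted dynamically).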
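One way to watch demand allocation of anonymous memory is mincore(), which reports which pages of a range are resident. The sketch below uses two hypothetical helpers, resident_pages and anon_demo. On a typical Linux kernel the residency count is expected to be zero right after mmap() and to rise only for the pages that were written, though sandboxed kernels may report residency differently.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: number of resident pages in [addr, addr+len), -1 on error.
 * addr must be page-aligned. */
long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    if (!vec)
        return -1;
    if (mincore(addr, len, vec) < 0) {
        free(vec);
        return -1;
    }
    long n = 0;
    for (size_t i = 0; i < npages; i++)
        n += vec[i] & 1;            /* bit 0 set means the page is resident */
    free(vec);
    return n;
}

/* Map npages anonymous pages, write to the first `touch` of them, and report
 * residency before and after the writes. Returns 0 on success, -1 on error. */
int anon_demo(size_t npages, size_t touch, long *before, long *after)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = resident_pages(p, len);
    for (size_t i = 0; i < touch && i < npages; i++)
        p[i * page] = 1;            /* first write faults the page in */
    *after = resident_pages(p, len);

    munmap(p, len);
    return 0;
}
```

Calling anon_demo(16, 4, &before, &after) maps 64KB (with 4KB pages) but should leave only the four touched pages backed by physical memory.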

Shared vs Private Mappings

Flag	Writes affect	Used for
`MAP_SHARED`	Other processes mapping same file; file on disk	IPC shared memory, database files
`MAP_PRIVATE`	Only this process (copy-on-write)	Program code, read-only data with potential modification

MAP_PRIVATE uses copy-on-write (COW): initially all COW pages point to the same physical frames. On the first write, the kernel allocates a new page, copies the contents, and updates the page table to point to the new private copy. This is also how fork() works: it does not copy the parent's memory. Instead it write-protects the writable private pages in both parent and child, so the first write from either side triggers the copy.
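The difference between the two flags, and the COW behavior across fork(), can be seen with a small experiment. The function value_seen_by_parent below is a hypothetical name for this sketch: a child process overwrites a mapped integer, and the parent then reads what is left in its own view.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: map_flag is MAP_PRIVATE or MAP_SHARED.
 * Returns the value the parent reads after a forked child has written 2
 * over the parent's original 1, or -1 on error. */
int value_seen_by_parent(int map_flag)
{
    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     map_flag | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 1;                       /* parent's original value */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {
        *cell = 2;                   /* MAP_PRIVATE: COW gives the child its own copy */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    int seen = *cell;
    munmap(cell, sizeof *cell);
    return seen;
}
```

With MAP_PRIVATE the parent still reads 1, because the child's write landed in a private copy. With MAP_SHARED both processes share one frame and the parent reads 2.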

Page Fault Handling in Linux

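When a process accesses a virtual address whose page table entry is missing or does not permit the access, the CPU raises a #PF (Page Fault) exception. On x86-64 the CPU stores the faulting address in the CR2 register, pushes an error code describing the access (read or write, user or kernel, present or not) onto the stack, and jumps to the kernel's fault handler.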

The Fault Handling Path

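The Linux page fault handler starts in architecture-specific code, do_page_fault() (named exc_page_fault() on recent x86 kernels), which looks up the VMA covering the faulting address and sends SIGSEGV if there is none or if its permissions forbid the access. It then calls the architecture-independent handle_mm_fault(), which walks the page tables and dispatches on the state of the entry: do_anonymous_page() for a first touch of anonymous memory, do_fault() for a file-backed page, do_swap_page() for a page that was swapped out, and do_wp_page() for a write to a write-protected (typically COW) page.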
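The decision made along this path can be modeled in a few lines. The sketch below is a deliberately simplified toy, not kernel code: struct fault_ctx, enum fault_action, and classify_fault are hypothetical names, and the real handler deals with many more cases (huge pages, NUMA hints, stack growth, and so on).

```c
#include <stdbool.h>

/* Simplified, hypothetical view of the state the fault handler inspects. */
struct fault_ctx {
    bool vma_exists;      /* a VMA covers the faulting address */
    bool vma_writable;    /* the VMA's permissions allow writes */
    bool vma_anonymous;   /* anonymous memory rather than file-backed */
    bool pte_none;        /* PTE is empty: page was never populated */
    bool pte_present;     /* PTE points at a frame in RAM */
    bool pte_writable;    /* PTE itself allows writes */
    bool is_write;        /* the faulting access was a write */
};

enum fault_action {
    FAULT_SIGSEGV,        /* invalid access: deliver SIGSEGV */
    FAULT_ANON_ZERO,      /* allocate a zeroed page (do_anonymous_page) */
    FAULT_FILE,           /* fill from the page cache or disk (do_fault) */
    FAULT_SWAP_IN,        /* read the page back from swap (do_swap_page) */
    FAULT_WRITE_PROTECT,  /* write to a read-only PTE, e.g. COW (do_wp_page) */
    FAULT_SPURIOUS        /* nothing to load; only flags need updating */
};

enum fault_action classify_fault(const struct fault_ctx *c)
{
    if (!c->vma_exists)
        return FAULT_SIGSEGV;
    if (c->is_write && !c->vma_writable)
        return FAULT_SIGSEGV;
    if (c->pte_none)
        return c->vma_anonymous ? FAULT_ANON_ZERO : FAULT_FILE;
    if (!c->pte_present)
        return FAULT_SWAP_IN;
    if (c->is_write && !c->pte_writable)
        return FAULT_WRITE_PROTECT;
    return FAULT_SPURIOUS;
}
```

The ordering mirrors the text: validity and permissions are checked against the VMA first, and only then does the state of the page table entry decide how the page is produced.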

Minor vs Major Faults

Minor fault (soft fault): The page is already in physical memory but the page table entry was not set up. This happens after fork() (COW pages are present but marked read-only), after shared library loading (the library's pages may already be in the page cache from another process), or when a new anonymous page is first touched. No I/O required — just an update to the page table entry.

Major fault (hard fault): The page must be loaded from disk. This blocks the process — typically 1–10ms for a spinning disk, 50–100µs for NVMe SSD.

# Count faults for a command
/usr/bin/time -v ls /
# Look for: "Major (requiring I/O) page faults" and "Minor (reclaiming a frame) page faults"

# System-wide fault counters since boot
# (pgfault counts all faults, pgmajfault only the major ones)
grep -E '^(pgfault|pgmajfault) ' /proc/vmstat
# Typical application: >99% minor faults after warm cache

A freshly launched application will have many major faults on first run (cold cache). Subsequent launches have almost exclusively minor faults because the page cache retains the binary's pages.
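A process can also measure its own fault counts with getrusage(), which exposes the same minor and major counters that /usr/bin/time prints. The sketch below uses the hypothetical names struct fault_delta and faults_for_first_touch, and counts the faults caused by the first write to each page of a fresh anonymous mapping.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

struct fault_delta {
    long minor;   /* faults served without I/O */
    long major;   /* faults that required I/O */
};

/* Hypothetical demo: touch npages fresh anonymous pages and report the
 * faults incurred while doing so. Returns 0 on success, -1 on error. */
int faults_for_first_touch(size_t npages, struct fault_delta *out)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < npages; i++)
        p[i * page] = 1;             /* first touch of each page */
    getrusage(RUSAGE_SELF, &after);

    out->minor = after.ru_minflt - before.ru_minflt;
    out->major = after.ru_majflt - before.ru_majflt;

    munmap((void *)p, len);
    return 0;
}
```

On an ordinary Linux system, touching 64 pages this way should report roughly 64 minor faults and no major faults, since a zero-filled page needs no disk read. Exact numbers depend on the kernel and its huge page settings.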

Transparent Huge Pages (THP)

Every TLB (Translation Lookaside Buffer) miss forces a page-table walk, which costs roughly 50–200 CPU cycles when the page table entries themselves miss in cache. For a process with a 4GB working set, standard 4KB pages mean about 1 million distinct translations (4GB / 4KB = 1,048,576), far exceeding the 1024–4096 TLB entries available on modern x86-64 CPUs and causing constant TLB thrashing.
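The arithmetic behind that claim is worth making explicit. The two functions below, pages_needed and tlb_reach, are hypothetical helpers. They treat the TLB as a single pool of entries, which is a simplification, since real CPUs keep separate entry counts for different page sizes.

```c
#include <stdint.h>

/* Number of pages, and so distinct translations, needed to cover a working set. */
uint64_t pages_needed(uint64_t working_set_bytes, uint64_t page_bytes)
{
    return (working_set_bytes + page_bytes - 1) / page_bytes;   /* round up */
}

/* Bytes of address space a fully populated TLB can translate without a miss. */
uint64_t tlb_reach(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}
```

A 4GB working set needs 1,048,576 translations with 4KB pages but only 2,048 with 2MB pages. Seen from the other side, 4096 entries reach 16MB with 4KB pages and 8GB with 2MB pages, under the single-pool assumption above.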

Transparent Huge Pages back suitably aligned 2MB ranges of anonymous memory with a single 2MB huge page instead of 512 separate 4KB pages, either directly at fault time or later by collapsing existing 4KB pages. The process does not need to change any code.

cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# Enable for specific allocations only:
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# In application code:
madvise(ptr, size, MADV_HUGEPAGE);   // hint: use huge pages here
madvise(ptr, size, MADV_NOHUGEPAGE); // hint: don't use huge pages here

khugepaged: a kernel thread that scans anonymous memory looking for aligned 2MB ranges of 4KB pages that can be collapsed into one huge page, copying them into a freshly allocated 2MB block. It runs in the background, which is a large part of what makes THP "transparent."

Performance impact: Workloads with large, densely accessed memory regions benefit most, because one TLB entry then covers 2MB instead of 4KB. Gains depend heavily on the workload and should be measured rather than assumed. Databases often get the same benefit from explicitly reserved huge pages instead (for example, PostgreSQL's huge_pages setting), since THP's background work can hurt their latency.

THP downside: Allocating a 2MB contiguous physical block requires the buddy allocator to have an order-9 block free. On a fragmented system, huge page allocation may fail frequently or force memory compaction. For some workloads (Redis, certain Java applications), THP can cause latency spikes when a page fault stalls on compaction or when a copy-on-write fault has to copy a full 2MB page, so many deployments set THP to madvise mode to give applications control.
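In madvise mode, an application has to mark the regions it wants backed by huge pages. The sketch below shows one way to do that: allocate a buffer aligned to 2MB and pass it to madvise(MADV_HUGEPAGE). The name alloc_thp_hinted and the HUGE_2MB constant are assumptions for this example, and the 2MB size applies to x86-64.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_2MB ((size_t)2 << 20)   /* huge page size assumed for x86-64 */

/* Hypothetical helper: allocate at least `size` bytes, aligned to 2MB and
 * rounded up to a multiple of 2MB, then hint that huge pages are wanted.
 * Returns NULL if the allocation fails. Release the buffer with free(). */
void *alloc_thp_hinted(size_t size)
{
    size_t rounded = (size + HUGE_2MB - 1) & ~(HUGE_2MB - 1);
    void *p = NULL;
    if (posix_memalign(&p, HUGE_2MB, rounded) != 0)
        return NULL;

    /* Only a hint: it can fail (for example on kernels built without THP),
     * and the buffer is still usable with 4KB pages in that case. */
    (void)madvise(p, rounded, MADV_HUGEPAGE);
    return p;
}
```

Alignment matters because only ranges that start on a 2MB boundary and span a full 2MB can be backed by a huge page. Whether the kernel honors the hint can be checked afterwards through the AnonHugePages field in /proc/self/smaps.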

mlock: Preventing Swapping

Real-time audio applications, cryptographic key storage, and latency-sensitive databases cannot afford the unbounded latency of a page fault that triggers disk I/O.

mlock(ptr, size);          // lock specific range in RAM
mlockall(MCL_CURRENT | MCL_FUTURE);  // lock all present and future pages

mlock() keeps pages resident in physical memory: the kernel will not reclaim or swap them out, however severe the memory pressure. It also pre-faults every page in the range immediately, so there is no demand paging for locked memory; all pages are loaded at mlock() time.

An unprivileged process can lock at most RLIMIT_MEMLOCK bytes (check with ulimit -l; the default was 64KB for many years and is 8MB on recent kernels). Locking more requires the CAP_IPC_LOCK capability or a raised limit.

Use cases: GnuPG keeps passphrases and secret keys in mlock'd "secure memory", and real-time audio software such as JACK locks its memory so that the audio thread never stalls on a page fault.
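A minimal sketch of the secret-storage pattern follows: check the limit, lock a page, and wipe it before release. The helpers fits_memlock_limit, alloc_locked_page, and free_locked_page are hypothetical names for illustration, not part of any library.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* 1 if `len` bytes fit under the soft RLIMIT_MEMLOCK, 0 if not, -1 on error. */
int fits_memlock_limit(size_t len)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    return rl.rlim_cur == RLIM_INFINITY || len <= rl.rlim_cur;
}

/* Map one anonymous page and lock it in RAM. Returns NULL on failure. */
void *alloc_locked_page(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    if (mlock(p, len) != 0) {        /* for example, over RLIMIT_MEMLOCK */
        munmap(p, len);
        return NULL;
    }
    return p;
}

/* Wipe, unlock, and unmap a page obtained from alloc_locked_page(). */
void free_locked_page(void *p)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile unsigned char *b = p;
    for (size_t i = 0; i < len; i++)
        b[i] = 0;                    /* volatile keeps the wipe from being optimized out */
    munlock(p, len);
    munmap(p, len);
}
```

Using a dedicated mapping rather than a malloc() buffer keeps the locked region page-aligned and avoids locking unrelated heap data that happens to share the page.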

KSM: Kernel Samepage Merging

On a server running 100 virtual machines with KVM/QEMU, each VM's memory contains large identical regions — every Linux VM has the same kernel code pages, the same libc pages, the same zero pages. KSM deduplicates these.

KSM scans anonymous pages in regions that have opted in, uses checksums to skip pages whose contents are still changing, and compares candidates byte for byte. Identical pages are physically merged: all virtual mappings point to a single physical page marked read-only. On write, the page is COW-split again.

echo 1 > /sys/kernel/mm/ksm/run       # enable KSM
cat /sys/kernel/mm/ksm/pages_shared   # deduplicated pages kept in use
cat /sys/kernel/mm/ksm/pages_sharing  # additional mappings sharing them (roughly, pages saved)

Typical savings on KVM hosts: 20–40% memory reduction for similar Linux VMs. QEMU marks guest memory as MADV_MERGEABLE to opt into KSM.
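Any process can opt in the same way QEMU does. The sketch below, with the hypothetical helper map_mergeable, fills an anonymous region with identical pages and marks it with madvise(MADV_MERGEABLE). The call fails on kernels built without KSM, and merging only happens once KSM has been switched on through /sys/kernel/mm/ksm/run.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map npages anonymous pages with identical contents and
 * mark them as KSM candidates. Returns the mapping, or NULL if mmap() fails.
 * *advised is set to 1 if the kernel accepted the hint, 0 otherwise. */
void *map_mergeable(size_t npages, int *advised)
{
    size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    memset(p, 0x5a, len);            /* every page now holds the same bytes */
    *advised = (madvise(p, len, MADV_MERGEABLE) == 0);
    return p;
}
```

After the ksmd thread has scanned the region, the identical pages should collapse into one shared frame and pages_sharing should rise. A later write to any of them splits that page off again through COW.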

Page Fault Type Reference

Fault Type	Cause	I/O Required	Latency	Example
Minor — anonymous new	First touch of heap/stack page	No	~1µs	`malloc()` first write
Minor — file cache hit	Library page in cache, PTE not set	No	~1µs	Second `bash` launch
Minor — COW	Write to shared/COW page	No	~1µs	`fork()` + write
Major — file cold	Code/data page not in cache	Yes (read)	50µs–10ms	First program launch
Major — swap in	Anonymous page swapped out	Yes (read)	50µs–10ms	Under memory pressure
Protection fault	Write to read-only region	No	Immediate SIGSEGV	Wild pointer write
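The last row of the table can be reproduced safely by letting a child process take the fault. In the sketch below, signal_from_readonly_write is a hypothetical name: the child writes to a page mapped with PROT_READ only, and the parent reports which signal ended it.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the signal that killed a child which wrote to a
 * read-only page, 0 if the child exited normally, or -1 on error. */
int signal_from_readonly_write(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, len);
        return -1;
    }
    if (pid == 0) {
        struct rlimit no_core = {0, 0};
        setrlimit(RLIMIT_CORE, &no_core);   /* avoid leaving a core file behind */
        signal(SIGSEGV, SIG_DFL);           /* make sure the default action applies */
        *(volatile char *)p = 1;            /* VMA forbids writes: protection fault */
        _exit(0);                           /* not reached if the fault is fatal */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(p, len);
        return -1;
    }
    munmap(p, len);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

The return value is expected to be SIGSEGV. No page is loaded or copied in this case: the kernel sees that the VMA itself forbids writing and delivers the signal instead of resolving the fault.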

Key Takeaways

Demand paging is not a single mechanism but a collaboration: mmap() creates a virtual address space promise, page tables record whether that promise has been fulfilled for each 4KB page, the CPU's fault mechanism detects unfulfilled promises, and the kernel's fault handler fulfills them — loading from file, allocating zero pages, or swapping in from disk.

The distinction between minor and major faults is often the key diagnostic when applications feel slow on first launch but fast afterward. The page cache warms up after the first run, converting future major faults to minor faults. A system with persistent major fault rates in a steady-state workload is a system under memory pressure: either the working set exceeds physical RAM, or memory is being reclaimed too aggressively. /usr/bin/time -v, the pgfault and pgmajfault counters in /proc/vmstat, and perf stat -e major-faults,minor-faults will tell you which category you are in.
