AiTechWorlds
When you run a 500MB application, Linux does not load 500MB into RAM immediately — that would take seconds. Instead, it maps the pages and loads them on demand, only when actually accessed. This "lie" is what makes programs start in milliseconds.
Launch a fresh bash process and it starts in under 5 milliseconds. The bash binary is 1.4MB, linked against glibc (~2MB), readline, ncurses, and a dozen other libraries totaling perhaps 8MB of code. Linux reads almost none of it at startup. It maps the binary and all its libraries into virtual address space, creating Virtual Memory Areas (VMAs) whose page table entries are still empty, and waits. The first time bash tries to execute an instruction, the CPU finds no present page table entry, raises a page fault, and the kernel brings in the page containing that instruction (4KB on x86-64). In practice the kernel also maps a few neighboring pages that are already cached (fault-around) and may read ahead from disk, but pages that are never touched are never loaded.
This is demand paging, and understanding its implementation reveals how modern OS memory management achieves both speed and efficiency simultaneously.
mmap() is the foundation of demand paging in Linux. It creates a mapping between a region of virtual address space and a backing store — either a file or anonymous memory.
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int fd = open("database.db", O_RDONLY);
void *data = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
The file's contents are mapped directly into the process's virtual address space. When the process reads from data, the kernel loads the relevant 4KB pages from the file into the page cache. This is zero-copy I/O: the data goes from disk to the page cache to the process's virtual address space — never copied to a separate buffer.
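A minimal sketch of the file-backed case described above, assuming a Linux/glibc environment. The helper name `sum_file_bytes` is hypothetical and exists only for illustration; it maps a file read-only and walks it byte by byte, so each page is pulled in from the page cache the first time the loop reaches it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: sums every byte of a file through a read-only
 * file-backed mapping. Returns -1 on error. */
long sum_file_bytes(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  /* mmap rejects length 0 */

    unsigned char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                               MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (data == MAP_FAILED) { perror("mmap"); return -1; }

    long sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];  /* the first access to each page may fault it in */

    munmap(data, (size_t)st.st_size);
    return sum;
}
```

No read() call and no user-space buffer appear anywhere: the loop indexes the mapping like an ordinary array, and the kernel decides when each page is actually fetched.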
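LMDB is built around file-backed mmap, and SQLite can optionally use it (PRAGMA mmap_size), because it delegates page management to the kernel's page cache, which can be more efficient than application-level buffer management. PostgreSQL takes the opposite approach: shared_buffers is an application-managed buffer pool held in shared memory and filled with ordinary read and write calls.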
void *heap = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
Anonymous mappings have no file backing. Their pages start out zero-filled, and under memory pressure they can only be evicted to the swap device (or compressed by zswap). glibc's malloc() uses brk() for small allocations and mmap(MAP_ANONYMOUS) for allocations above 128KB by default (MMAP_THRESHOLD).
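To see that an anonymous mapping consumes no physical memory until it is touched, the sketch below asks the kernel which pages are resident using mincore(). The helper names `resident_pages` and `touch_and_count` are hypothetical, and the behavior described assumes a stock Linux kernel (some sandboxes report residency differently).

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: counts how many pages of [addr, addr + len) are
 * resident in RAM according to mincore(). addr must be page-aligned.
 * Returns -1 on error. */
long resident_pages(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = (len + page - 1) / page;
    unsigned char *vec = malloc(n);
    if (vec == NULL) return -1;
    if (mincore(addr, len, vec) != 0) { free(vec); return -1; }

    long count = 0;
    for (size_t i = 0; i < n; i++)
        count += vec[i] & 1;  /* bit 0 set means the page is resident */
    free(vec);
    return count;
}

/* Hypothetical helper: maps `pages` anonymous pages, writes one byte to the
 * first `touched` of them, and returns how many pages are then resident. */
long touch_and_count(size_t pages, size_t touched)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = pages * page;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    for (size_t i = 0; i < touched && i < pages; i++)
        p[i * page] = 1;  /* each first write triggers one page fault */

    long resident = resident_pages(p, len);
    munmap(p, len);
    return resident;
}
```

On an ordinary Linux system, the count returned by `touch_and_count(16, 3)` should track the number of pages written rather than the number of pages mapped.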
| Flag | Writes affect | Used for |
|---|---|---|
| MAP_SHARED | Other processes mapping same file; file on disk | IPC shared memory, database files |
| MAP_PRIVATE | Only this process (copy-on-write) | Program code, read-only data with potential modification |
MAP_PRIVATE uses copy-on-write (COW): initially all COW pages point to the same physical frames. On the first write, the kernel allocates a new page, copies the contents, and updates the page table to point to the new private copy. This is also how fork() works — it does not copy the parent's memory, it marks all pages as COW.
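The difference between the two flags can be observed directly with fork(). The following is a small sketch, with a hypothetical helper named `parent_sees_child_write`: it maps one anonymous integer, lets a child process overwrite it, and reports whether the parent sees the change.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a write made by a forked child is visible
 * to the parent, 0 if it is not, and -1 on error.
 * `sharing` is either MAP_SHARED or MAP_PRIVATE. */
int parent_sees_child_write(int sharing)
{
    int *value = mmap(NULL, sizeof *value, PROT_READ | PROT_WRITE,
                      sharing | MAP_ANONYMOUS, -1, 0);
    if (value == MAP_FAILED) return -1;
    *value = 1;

    pid_t pid = fork();
    if (pid < 0) { munmap(value, sizeof *value); return -1; }
    if (pid == 0) {
        *value = 2;  /* with MAP_PRIVATE this write lands in a private copy */
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) { munmap(value, sizeof *value); return -1; }

    int seen = (*value == 2);
    munmap(value, sizeof *value);
    return seen;
}
```

With MAP_SHARED both processes keep using the same physical frame, so the parent reads 2. With MAP_PRIVATE the child's write faults, the kernel copies the page for the child, and the parent still reads 1.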
When a process accesses a virtual address that has no valid page table entry, or whose entry forbids the access, the CPU raises a #PF (Page Fault) exception. On x86-64 the CPU stores the faulting address in the CR2 register, pushes an error code and the interrupted context onto the kernel stack, and jumps to the kernel's fault handler.
The Linux page fault handler is do_page_fault() (architecture-specific) → handle_mm_fault() (architecture-independent):
Minor fault (soft fault): The page is already in physical memory but the page table entry was not set up. This happens after fork() (COW pages are present but marked read-only), after shared library loading (the library's pages may already be in the page cache from another process), or when a new anonymous page is first touched. No I/O required — just an update to the page table entry.
Major fault (hard fault): The page must be loaded from disk. This blocks the process — typically 1–10ms for a spinning disk, 50–100µs for NVMe SSD.
# Count faults for a command
/usr/bin/time -v ls /
# Look for: "Major (requiring I/O) page faults" and "Minor (reclaiming a frame) page faults"
# System-wide fault counters (cumulative since boot)
grep -E '^pg(maj)?fault ' /proc/vmstat
# Typical application: >99% minor faults after warm cache
A freshly launched application will have many major faults on first run (cold cache). Subsequent launches have almost exclusively minor faults because the page cache retains the binary's pages.
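The same counters are available from inside a process through getrusage(). The sketch below, with hypothetical helper names `fault_counts` and `minor_faults_from_touching`, measures how many minor faults are incurred by writing to freshly mapped anonymous pages. It assumes Linux, where ru_minflt and ru_majflt are maintained.

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical helper: reads this process's cumulative fault counters.
 * Returns 0 on success, -1 on error. */
int fault_counts(long *minor, long *major)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
    *minor = ru.ru_minflt;  /* faults served without I/O */
    *major = ru.ru_majflt;  /* faults that required I/O */
    return 0;
}

/* Hypothetical helper: maps `pages` anonymous pages, writes one byte to each,
 * and returns the number of minor faults that took. Returns -1 on error. */
long minor_faults_from_touching(size_t pages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = pages * page;
    long min_before, maj_before, min_after, maj_after;

    volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    if (fault_counts(&min_before, &maj_before) != 0) {
        munmap((void *)p, len);
        return -1;
    }
    for (size_t i = 0; i < pages; i++)
        p[i * page] = 1;  /* first touch: the kernel allocates a zeroed frame */
    if (fault_counts(&min_after, &maj_after) != 0) {
        munmap((void *)p, len);
        return -1;
    }

    munmap((void *)p, len);
    return min_after - min_before;
}
```

For small regions the result is usually close to the number of pages touched. For regions of 2MB or more with huge pages enabled it can be much lower, because one fault may populate a whole 2MB page.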
Every TLB (Translation Lookaside Buffer) miss forces a page table walk, which costs roughly 50–200 CPU cycles when the walk itself misses in the CPU caches. Covering a 4GB working set with standard 4KB pages takes about 1 million translations (4GB / 4KB = 1,048,576), far more than the 1024–4096 TLB entries available on modern x86-64 CPUs, so the TLB thrashes constantly.
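Transparent Huge Pages (THP) let the kernel back 2MB-aligned ranges of a process's memory with single 2MB huge pages instead of 512 separate 4KB pages, either directly at fault time or by collapsing existing 4KB pages later, provided a contiguous 2MB block of physical memory is available. The process does not need to change any code.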
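The arithmetic behind the TLB argument is easy to check. The helpers below are illustrative only (`pages_needed` and `tlb_reach` are hypothetical names), and the comparison assumes the same number of TLB entries for both page sizes, which real CPUs do not guarantee.

```c
/* Number of pages, and therefore translations, needed to cover a working set. */
unsigned long long pages_needed(unsigned long long working_set_bytes,
                                unsigned long long page_bytes)
{
    return (working_set_bytes + page_bytes - 1) / page_bytes;  /* round up */
}

/* Amount of memory a TLB with `entries` entries can cover without a miss. */
unsigned long long tlb_reach(unsigned long long entries,
                             unsigned long long page_bytes)
{
    return entries * page_bytes;
}
```

For a 4GB working set, `pages_needed` gives 1,048,576 translations with 4KB pages but only 2,048 with 2MB pages. A 4096-entry TLB reaches 16MB with 4KB pages, while the same number of 2MB entries would reach 8GB.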
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never
# Enable for specific allocations only:
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled
# In application code:
madvise(ptr, size, MADV_HUGEPAGE); // hint: use huge pages here
madvise(ptr, size, MADV_NOHUGEPAGE); // hint: don't use huge pages here
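For the hint to be useful, the region should start on a 2MB boundary and span whole 2MB units. A minimal sketch under those assumptions follows; `alloc_with_thp_hint` and `HUGE_PAGE_SIZE` are hypothetical names, and the 2MB size is the x86-64 value used throughout this article.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE (2UL * 1024 * 1024)  /* assumed: x86-64 huge page size */

/* Hypothetical helper: allocates at least `len` bytes aligned to 2MB and asks
 * the kernel to back the region with huge pages. *hint_accepted is set to 1 if
 * madvise() succeeded and 0 otherwise (for example, THP not compiled in).
 * Returns NULL on allocation failure; release the memory with free(). */
void *alloc_with_thp_hint(size_t len, int *hint_accepted)
{
    /* Round up so the region covers whole 2MB units. */
    size_t rounded = (len + HUGE_PAGE_SIZE - 1) / HUGE_PAGE_SIZE * HUGE_PAGE_SIZE;

    void *p = NULL;
    if (posix_memalign(&p, HUGE_PAGE_SIZE, rounded) != 0)
        return NULL;

    /* Only a hint: the kernel may still use 4KB pages if no 2MB block is free. */
    int rc = madvise(p, rounded, MADV_HUGEPAGE);
    if (hint_accepted != NULL)
        *hint_accepted = (rc == 0);
    return p;
}
```

A successful madvise() call only records the preference. Whether the region actually ends up on huge pages can be checked afterwards through the AnonHugePages field in /proc/self/smaps.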
khugepaged: a kernel thread that scans anonymous memory looking for aligned 4KB page ranges that can be collapsed into a 2MB huge page. This runs in the background and is the mechanism that makes THP "transparent."
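Performance impact: gains depend heavily on the workload. Improvements of 10–30% in query performance are commonly cited for PostgreSQL and MySQL with THP enabled, because the database buffer pool is a large, densely accessed working set, which is exactly what huge pages optimize. Measure on your own workload before relying on such figures.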
THP downside: Allocating a 2MB contiguous physical block requires the buddy allocator to have an order-9 block free. On a fragmented system, khugepaged may fail frequently. For some workloads (Redis, certain Java applications), THP can cause latency spikes when khugepaged runs compaction — many deployments set THP to madvise mode to give applications control.
Real-time audio applications, cryptographic key storage, and latency-sensitive databases cannot afford the unbounded latency of a page fault that triggers disk I/O.
mlock(ptr, size); // lock specific range in RAM
mlockall(MCL_CURRENT | MCL_FUTURE); // lock all present and future pages
mlock() pins pages in physical memory — kswapd will not evict them under any pressure. The kernel also pre-faults all pages in the range immediately (no demand paging — all pages loaded at mlock() time).
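Locking more memory than RLIMIT_MEMLOCK allows requires the CAP_IPC_LOCK capability or a raised limit (ulimit -l). The kernel enforces RLIMIT_MEMLOCK for unprivileged processes; the default was 64KB for many years and is higher on newer kernels and some distributions.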
Use cases: sshd mlocks its private keys, rtkit uses mlock for real-time audio threads, gpg-agent mlocks the passphrase buffer.
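A short sketch of how an application might check its limit and lock a buffer. The helper names `memlock_soft_limit` and `lock_region` are hypothetical, and the error handling is deliberately minimal.

```c
#include <errno.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical helper: soft RLIMIT_MEMLOCK in bytes, or -1 if the limit is
 * unlimited or cannot be read. */
long long memlock_soft_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0 || rl.rlim_cur == RLIM_INFINITY)
        return -1;
    return (long long)rl.rlim_cur;
}

/* Hypothetical helper: pins [addr, addr + len) in RAM.
 * Returns 0 on success or the errno value reported by mlock(), typically
 * ENOMEM or EPERM when the request exceeds RLIMIT_MEMLOCK. */
int lock_region(void *addr, size_t len)
{
    return mlock(addr, len) == 0 ? 0 : errno;
}
```

A caller would typically lock the buffer before writing a secret into it, overwrite the secret when finished, and only then call munlock().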
On a server running 100 virtual machines with KVM/QEMU, each VM's memory contains large identical regions — every Linux VM has the same kernel code pages, the same libc pages, the same zero pages. KSM deduplicates these.
KSM scans anonymous pages, computes checksums, groups candidates, and for pages that are byte-for-byte identical, physically merges them: all virtual mappings point to a single physical page marked read-only. On write, the page is COW-split again.
echo 1 > /sys/kernel/mm/ksm/run # enable KSM
cat /sys/kernel/mm/ksm/pages_shared # deduplicated physical pages currently in use
cat /sys/kernel/mm/ksm/pages_sharing # extra mappings pointing at them (roughly, pages saved)
Typical savings on KVM hosts: 20–40% memory reduction for similar Linux VMs. QEMU marks guest memory as MADV_MERGEABLE to opt into KSM.
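Any process can opt in the same way QEMU does. The sketch below uses a hypothetical helper, `map_mergeable`, that creates an anonymous region, fills it with a constant byte so that its pages are identical, and marks it with MADV_MERGEABLE. It assumes a kernel built with KSM support; otherwise madvise() reports an error, which the helper passes back.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

/* Hypothetical helper: maps `len` bytes of anonymous memory, fills it with
 * `fill`, and marks it as a KSM candidate. *advice_errno receives 0 if
 * madvise() succeeded, or its errno value otherwise.
 * Returns NULL if the mapping itself fails; release with munmap(). */
void *map_mergeable(size_t len, unsigned char fill, int *advice_errno)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;

    memset(p, fill, len);  /* identical content makes the pages mergeable */

    int rc = madvise(p, len, MADV_MERGEABLE);
    if (advice_errno != NULL)
        *advice_errno = (rc == 0) ? 0 : errno;
    return p;
}
```

Two regions created with the same fill byte are candidates for merging once KSM is running. Merging happens asynchronously as the KSM thread scans, and a later write to either region splits the affected page again through copy-on-write, so the other region keeps its contents.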
| Fault Type | Cause | I/O Required | Latency | Example |
|---|---|---|---|---|
| Minor — anonymous new | First touch of heap/stack page | No | ~1µs | malloc() first write |
| Minor — file cache hit | Library page in cache, PTE not set | No | ~1µs | Second bash launch |
| Minor — COW | Write to shared/COW page | No | ~1µs | fork() + write |
| Major — file cold | Code/data page not in cache | Yes (read) | 50µs–10ms | First program launch |
| Major — swap in | Anonymous page swapped out | Yes (read) | 50µs–10ms | Under memory pressure |
| Protection fault | Write to read-only region | No | Immediate SIGSEGV | Wild pointer write |
Demand paging is not a single mechanism but a collaboration: mmap() creates a virtual address space promise, page tables record whether that promise has been fulfilled for each 4KB page, the CPU's fault mechanism detects unfulfilled promises, and the kernel's fault handler fulfills them — loading from file, allocating zero pages, or swapping in from disk.
The distinction between minor and major faults is often the key diagnostic when applications feel slow on first launch but fast afterward. The page cache warms up after the first run, converting future major faults to minor faults. A system with persistent major fault rates in a steady-state workload is a system under memory pressure: either the working set exceeds physical RAM, or memory is being reclaimed too aggressively. /usr/bin/time -v, the pgfault and pgmajfault counters in /proc/vmstat, and perf stat -e major-faults,minor-faults will tell you which category you are in.