A Linux server with 128GB RAM might run 1000 processes. Each process thinks it has exclusive access to gigabytes of address space. The kernel's memory management subsystem maintains this illusion continuously, and does so with remarkable elegance.

The PostgreSQL database process believes it owns 32GB. The Nginx web server believes it owns 4GB. The 50 worker threads each believe they own independent stacks. None of these beliefs correspond to reality. What actually exists is a single pool of physical memory, meticulously tracked by the kernel's page allocator, slab caches, and page reclaim daemon — handing out memory, reclaiming it, compressing it, and occasionally, when pushed to the limit, killing the process that is consuming the most.

Understanding this machinery is what separates a developer who "uses Linux" from one who can diagnose a production server at 3 AM when it starts killing processes.

NUMA Topology: The Physical Foundation

Modern multi-socket servers do not have uniform memory access. A processor on socket 0 can access RAM attached to socket 0 in approximately 70ns, but accessing RAM attached to socket 1 requires crossing the interconnect — closer to 130ns. This is Non-Uniform Memory Access (NUMA), and Linux models it explicitly.
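To get a feel for what the latency gap costs, the sketch below computes the average memory latency for a given share of local accesses. It uses the approximate 70ns and 130ns figures quoted above; the function name is made up for illustration, and real latencies depend on the hardware.

```python
LOCAL_NS = 70    # approximate local-node latency quoted above
REMOTE_NS = 130  # approximate remote-node latency quoted above


def average_latency_ns(local_fraction: float) -> float:
    """Expected latency when local_fraction of accesses hit the local node."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS


if __name__ == "__main__":
    for fraction in (1.0, 0.75, 0.5):
        latency = average_latency_ns(fraction)
        print(f"{fraction:.0%} local accesses: about {latency:.0f} ns on average")
```

A process whose memory is split evenly across two sockets pays roughly 100ns per access under these assumptions, which is why local allocation matters.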

Nodes

A NUMA node corresponds to the physical memory attached to one processor socket. On a 4-socket server with 32GB RAM per socket, Linux sees 4 NUMA nodes with 32GB each. By default, the kernel prefers to allocate memory from the node local to the CPU running the process; the policy can be changed per process with numactl or set_mempolicy().

numactl --hardware        # show NUMA topology
cat /proc/buddyinfo       # free pages per node per zone per order

Zones

Within each NUMA node, memory is partitioned into zones based on the physical address range and hardware constraints:

ZONE_DMA (< 16MB): Legacy ISA devices can only DMA to the first 16MB of physical address space. This zone is kept reserved for those drivers.
ZONE_DMA32 (< 4GB): 32-bit devices that can DMA to the first 4GB. Relevant on x86-64 systems.
ZONE_NORMAL: The main zone for regular kernel and user allocations — everything above 4GB on 64-bit systems.
ZONE_HIGHMEM: Only exists on 32-bit kernels. Because the 32-bit kernel virtual address space tops out at 1GB (with the classic 3GB/1GB split), physical pages above 896MB could not be permanently mapped. Linux 6.x on x86-64 does not use ZONE_HIGHMEM.
ZONE_MOVABLE: Pages that can be physically migrated — used for memory hot-plug and transparent huge page compaction.
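The address-based zones above can be summarized as a small classifier. This is a minimal sketch that assumes the x86-64 boundaries listed in this section; the function name is hypothetical, and ZONE_MOVABLE is left out because it is set up by configuration rather than by a fixed address range.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def zone_for_physical_address(addr: int) -> str:
    """Classify a physical address using the x86-64 zone boundaries above."""
    if addr < 0:
        raise ValueError("physical addresses are non-negative")
    if addr < 16 * MIB:
        return "ZONE_DMA"      # reachable by legacy ISA DMA
    if addr < 4 * GIB:
        return "ZONE_DMA32"    # reachable by 32-bit DMA devices
    return "ZONE_NORMAL"       # everything else on a 64-bit kernel


if __name__ == "__main__":
    for addr in (0x1000, 64 * MIB, 8 * GIB):
        print(hex(addr), zone_for_physical_address(addr))
```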

Pages: The Atomic Unit

The kernel allocates physical memory in pages. On x86-64 Linux, the standard page size is 4KB. The kernel also supports:

Huge Pages (2MB): Used via Transparent Huge Pages (THP) or explicitly via hugetlbfs. Reduce TLB pressure dramatically for large working sets.
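Gigantic Pages (1GB): Best reserved at boot time (hugepagesz=1G hugepages=32 on the kernel command line). Newer kernels can also allocate and free them at runtime, but finding 1GB of contiguous free memory on a running system often fails.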

Each physical page is tracked by a struct page — a 64-byte metadata structure. For 128GB of RAM, that is 32 million page structs, consuming approximately 2GB of kernel memory just for tracking.
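The overhead figure is simple arithmetic, shown here as a short sketch. It assumes the 4KB page size and 64-byte struct page stated above; the actual struct size can vary with kernel configuration.

```python
PAGE_SIZE = 4096         # bytes per page on x86-64
STRUCT_PAGE_SIZE = 64    # bytes of metadata per page (assumed, as stated above)


def struct_page_overhead(ram_bytes: int) -> tuple[int, int]:
    """Return (number of pages, bytes of struct page metadata) for ram_bytes of RAM."""
    pages = ram_bytes // PAGE_SIZE
    return pages, pages * STRUCT_PAGE_SIZE


if __name__ == "__main__":
    pages, overhead = struct_page_overhead(128 * 1024 ** 3)
    print(f"{pages} pages, {overhead / 1024 ** 3:.1f} GiB of struct page metadata")
```

The ratio is fixed at 64 / 4096, so the metadata costs about 1.6% of RAM regardless of machine size.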

The Buddy Allocator: Physical Page Management

The buddy allocator is the Linux kernel's primary physical memory allocator. Its job: hand out contiguous blocks of physical pages and receive them back when freed.

Power-of-2 Orders

The buddy allocator manages pages in orders. Order N represents a contiguous block of 2^N pages:

Order	Pages	Size
0	1	4 KB
1	2	8 KB
2	4	16 KB
3	8	32 KB
4	16	64 KB
5	32	128 KB
6	64	256 KB
7	128	512 KB
8	256	1 MB
9	512	2 MB
10	1024	4 MB

Each zone maintains 11 free lists (order 0 through order 10), each holding blocks of that size.
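A request for an arbitrary size has to be rounded up to one of these orders. The sketch below shows one way to compute the smallest sufficient order, assuming 4KB pages and a maximum order of 10 as in the table. The helper name is hypothetical; it plays the role the kernel gives to its own order-calculation helper.

```python
PAGE_SIZE = 4096
MAX_ORDER = 10  # largest order in the table above


def order_for_size(size_bytes: int) -> int:
    """Smallest order whose block (2**order pages) can hold size_bytes."""
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    pages = -(-size_bytes // PAGE_SIZE)   # ceiling division
    order = (pages - 1).bit_length()      # round page count up to a power of two
    if order > MAX_ORDER:
        raise ValueError("request exceeds the largest buddy block")
    return order


if __name__ == "__main__":
    for size in (100, 4096, 4097, 12288, 2 * 1024 ** 2):
        order = order_for_size(size)
        print(f"{size} bytes -> order {order} ({PAGE_SIZE << order} bytes)")
```

Note the rounding cost: a 12KB request needs 3 pages but receives an order-2 block of 4 pages.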

Buddy Splitting and Merging

Allocating 8KB (order 1):

Check order 1 free list — empty
Check order 2 free list — found a 16KB block
Split: divide the 16KB block into two 8KB "buddies"
Return one 8KB block to the caller; put the other on the order 1 free list

Freeing an 8KB block:

Find the buddy of the freed block (buddy address = freed_addr XOR (1 << (order + PAGE_SHIFT)); in page frame numbers, buddy_pfn = pfn XOR (1 << order))
Is the buddy free? Yes → merge them into one 16KB block
Try to merge the 16KB block with its buddy → and so on up the tree

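This coalescing is what makes the buddy allocator resistant to external fragmentation: whenever both halves of a block are free, they merge back into the larger block. It is not immune, though. A single long-lived page in use pins its whole chain of parent blocks, so a zone can have plenty of free order-0 pages and still have none at the higher orders.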
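The split and merge steps above can be sketched as a toy allocator. It works in page frame numbers (PFNs), where the buddy of a block at a given order is pfn XOR (1 << order). The class and method names are invented for illustration, and the sketch ignores zones, migrate types, and per-CPU page lists.

```python
class BuddyAllocator:
    """Toy buddy allocator over page frame numbers; not the kernel's code."""

    def __init__(self, max_order: int = 10):
        self.max_order = max_order
        # One set of free block start PFNs per order.
        self.free = [set() for _ in range(max_order + 1)]
        self.free[max_order].add(0)  # start with a single max-order block

    def alloc(self, order: int) -> int:
        """Return the start PFN of a free block of 2**order pages."""
        if not 0 <= order <= self.max_order:
            raise ValueError("order out of range")
        # Find the smallest order at or above the request with a free block.
        for current in range(order, self.max_order + 1):
            if self.free[current]:
                break
        else:
            raise MemoryError("no free block large enough")
        pfn = min(self.free[current])
        self.free[current].remove(pfn)
        # Split down to the requested order, freeing the upper half each time.
        while current > order:
            current -= 1
            self.free[current].add(pfn + (1 << current))
        return pfn

    def free_block(self, pfn: int, order: int) -> None:
        """Free a block and merge with its buddy for as long as the buddy is free."""
        while order < self.max_order:
            buddy = pfn ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            pfn = min(pfn, buddy)  # the merged block starts at the lower PFN
            order += 1
        self.free[order].add(pfn)


if __name__ == "__main__":
    allocator = BuddyAllocator(max_order=2)
    block = allocator.alloc(1)
    print("allocated order-1 block at PFN", block, "free lists:", allocator.free)
    allocator.free_block(block, 1)
    print("after free:", allocator.free)
```

Using a set per order keeps the buddy lookup trivial. The kernel instead finds the buddy's struct page directly from the PFN and checks a flag on it.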

Inspecting the Buddy Allocator

cat /proc/buddyinfo

Example output:

Node 0, zone      DMA    0   0   0   1   2   1   1   0   1   1   3
Node 0, zone   Normal 4023 812 302  96  40  12   5   2   0   1   6

Each column is a free list count: the first number after the zone name is order 0 (4KB), and the last is order 10 (4MB). A zone with zeros in the high orders has fragmented physical memory, and huge page allocations from it are likely to fail until compaction or reclaim frees larger blocks.
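Reading these lines by eye gets tedious, so here is a small parsing sketch. It assumes the line layout shown in the example output (node number, zone name, then one count per order) and 4KB pages; the function names are hypothetical.

```python
PAGE_SIZE_KB = 4


def parse_buddyinfo_line(line: str) -> tuple[int, str, list[int]]:
    """Split one /proc/buddyinfo line into (node, zone, per-order free counts)."""
    fields = line.split()
    # Layout: "Node", "0,", "zone", "<name>", then one count per order.
    node = int(fields[1].rstrip(","))
    zone = fields[3]
    counts = [int(value) for value in fields[4:]]
    return node, zone, counts


def free_kb(counts: list[int]) -> int:
    """Total free memory in KB represented by the per-order counts."""
    return sum(count * (PAGE_SIZE_KB << order) for order, count in enumerate(counts))


if __name__ == "__main__":
    sample = "Node 0, zone   Normal 4023 812 302  96  40  12   5   2   0   1   6"
    node, zone, counts = parse_buddyinfo_line(sample)
    print(f"node {node} zone {zone}: {free_kb(counts)} KB free")
    print("free blocks of order 9 or higher:", sum(counts[9:]))
```

On a live system the same functions can be applied to each line read from /proc/buddyinfo.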

The Slab Allocator: Kernel Object Caches

The buddy allocator solves the physical page problem, but the kernel creates millions of small objects: task_struct (~9.5KB), dentry (192 bytes), inode (~600 bytes), socket (~768 bytes). Allocating a full 4KB page for a 192-byte dentry would waste 95% of memory.

The slab allocator solves this with per-object caches.

SLAB / SLUB / SLOB

Linux has had three slab implementations:

SLAB (original): Per-object caches with coloring to spread objects across cache lines. Complex implementation. Deprecated in Linux 6.5 and removed in 6.8.
SLUB (default since Linux 2.6.23): Simplified design. Better cache utilization on NUMA. Lower overhead per object. The default in all modern Linux distributions, and the only implementation remaining in current 6.x kernels.
SLOB (Simple List Of Blocks): Minimalist allocator for embedded systems with very small memory (< 32MB). Not used on servers. Removed in Linux 6.4.

How SLUB Works

For each object type, SLUB maintains a cache (struct kmem_cache). Each cache contains one or more slabs. A slab is one or more contiguous pages containing a fixed number of same-sized objects.

Key properties:

Caches created with a constructor keep objects initialized while they sit on the free list, avoiding constructor overhead on every allocation
kmem_cache_alloc() returns an object in O(1) on the fast path: it just pops from the per-CPU free list
kmem_cache_free() returns the object to the free list — no memory is actually freed to the buddy allocator unless the entire slab becomes empty
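The free-list behaviour described above can be sketched with a toy object cache. This is an illustration only: the class is invented, objects are represented as (slab, slot) pairs rather than memory, and it omits the per-CPU lists, partial-slab tracking, and return of empty slabs that the real allocator handles.

```python
class ToyObjectCache:
    """Fixed-size object cache with a free list; a toy, not SLUB itself."""

    def __init__(self, object_size: int, slab_size: int = 4096):
        self.object_size = object_size
        self.objects_per_slab = slab_size // object_size
        self.wasted_per_slab = slab_size - self.objects_per_slab * object_size
        self.free_list = []  # (slab index, slot) pairs ready to hand out
        self.slabs = 0       # slabs requested from the page allocator so far

    def _grow(self) -> None:
        """Take one more slab and put all of its slots on the free list."""
        slab = self.slabs
        self.slabs += 1
        self.free_list.extend((slab, slot) for slot in range(self.objects_per_slab))

    def alloc(self) -> tuple[int, int]:
        """Pop an object from the free list, growing the cache only when empty."""
        if not self.free_list:
            self._grow()
        return self.free_list.pop()

    def free(self, obj: tuple[int, int]) -> None:
        """Return an object to the free list; the slab itself stays allocated."""
        self.free_list.append(obj)


if __name__ == "__main__":
    dentry_cache = ToyObjectCache(object_size=192)  # dentry size quoted earlier
    print("objects per 4KB slab:", dentry_cache.objects_per_slab)
    print("bytes wasted per slab:", dentry_cache.wasted_per_slab)
```

Packing 192-byte objects into a 4KB slab leaves only a small tail unused, compared with the 95% waste of one object per page.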

sudo head -20 /proc/slabinfo   # readable by root only on most systems
# Shows: name | active_objs | num_objs | objsize | ...

# Or with slabtop:
slabtop

Common high-count slab caches on a busy server: dentry, inode_cache, buffer_head, vm_area_struct, task_struct.

[Figure: Memory Hierarchy Diagram]

Page Frame Reclaim Algorithm (PFRA)

When free memory in a zone drops below its low watermark, the kernel's kswapd daemon wakes up and reclaims pages in the background until free memory is back above the high watermark. This is the kernel's most complex ongoing maintenance task.

Two LRU Lists

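Linux classically maintains two LRU (Least Recently Used) lists, kept separately for file-backed and anonymous pages. Since Linux 4.8 they are tracked per NUMA node (and per memory cgroup) rather than per memory zone: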

Active list: Pages recently accessed. Protected from immediate reclaim.
Inactive list: Pages not recently accessed. Primary candidates for reclaim.

The kernel uses the accessed bit in each page table entry. On access, the CPU hardware sets this bit. The kernel's clock algorithm periodically clears these bits and moves pages between lists based on whether the bit was set before clearing.
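The interplay between the two lists and the accessed bit can be sketched as follows. This is a simplified model, not the kernel's algorithm: the class and method names are invented, the accessed bit is modelled as set membership, and list balancing heuristics are ignored.

```python
from collections import deque


class TwoListLRU:
    """Toy active/inactive LRU driven by a simulated accessed bit."""

    def __init__(self):
        self.active = deque()    # left end = most recently promoted
        self.inactive = deque()  # left end = most recently added
        self.accessed = set()    # pages whose accessed bit is currently set

    def add(self, page) -> None:
        """New pages start on the inactive list."""
        self.inactive.appendleft(page)

    def touch(self, page) -> None:
        """Simulate the hardware setting the accessed bit on a memory access."""
        self.accessed.add(page)

    def shrink_inactive(self, nr_to_scan: int) -> list:
        """Scan the cold end of the inactive list; return the pages reclaimed."""
        reclaimed = []
        for _ in range(min(nr_to_scan, len(self.inactive))):
            page = self.inactive.pop()
            if page in self.accessed:
                self.accessed.discard(page)   # clear the bit
                self.active.appendleft(page)  # recently used: promote
            else:
                reclaimed.append(page)        # not used since last scan: reclaim
        return reclaimed

    def shrink_active(self, nr_to_scan: int) -> None:
        """Demote active pages that were not accessed since the last scan."""
        for _ in range(min(nr_to_scan, len(self.active))):
            page = self.active.pop()
            if page in self.accessed:
                self.accessed.discard(page)
                self.active.appendleft(page)    # second chance on the active list
            else:
                self.inactive.appendleft(page)  # demote to inactive


if __name__ == "__main__":
    lru = TwoListLRU()
    for page in ("A", "B", "C"):
        lru.add(page)
    lru.touch("A")
    print("reclaimed:", lru.shrink_inactive(2))
    print("active:", list(lru.active), "inactive:", list(lru.inactive))
```

A page has to be seen with its bit set during a scan to be promoted, and seen with it clear twice in a row (once on each list) to be reclaimed.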

What kswapd Reclaims

Clean file cache pages: Pages cached from disk reads. These can be dropped immediately; if needed again, the kernel re-reads them from disk. This is the preferred reclaim path because nothing has to be written out first.
Dirty file cache pages: Must be written to disk before reclaiming. The writeback thread handles this asynchronously.
Anonymous pages (swap): Heap and stack pages not backed by any file. To reclaim these, the kernel writes them to the swap device (swap partition or swap file). If zswap is enabled (it is off by default on many distributions), swapped pages are first compressed into a pool in RAM and only written to disk when that pool fills.

The OOM Killer

When reclaim fails completely (neither kswapd nor direct reclaim by the allocating task can free enough memory, and the allocation cannot simply fail), the kernel invokes the Out-of-Memory Killer.

The OOM killer selects a process using a scoring algorithm:

Memory footprint: Resident set size plus swap entries and page-table pages. A larger footprint means a higher score (more to gain from killing it).
Nice value and time running: Kernels before 2.6.36 also factored in niceness and how long a process had been running. The current heuristic ignores both.
OOM score adjustment: /proc/<pid>/oom_score_adj (range: -1000 to +1000). Set to -1000 to exempt a process from OOM selection. Kubernetes sets containers in Guaranteed QoS pods to -997.
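The effect of oom_score_adj is easiest to see in a worked sketch. The one below models only the memory footprint and the adjustment, loosely following the shape of the kernel's badness calculation as I understand it; the function names, the task table, and the page counts are all made up for illustration, and real kernels differ in detail between versions.

```python
def oom_badness(rss_pages, swap_pages, pgtable_pages, oom_score_adj, total_pages):
    """Simplified badness: footprint in pages, shifted by oom_score_adj.

    Returns None for tasks that are exempt (oom_score_adj == -1000).
    """
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    if oom_score_adj == -1000:
        return None
    points = rss_pages + swap_pages + pgtable_pages
    # Each adjustment unit is worth one thousandth of total memory.
    points += oom_score_adj * (total_pages // 1000)
    return points


def pick_victim(tasks, total_pages):
    """Return the name of the task with the highest badness, or None."""
    scores = {
        name: oom_badness(*stats, total_pages=total_pages)
        for name, stats in tasks.items()
    }
    eligible = {name: score for name, score in scores.items() if score is not None}
    return max(eligible, key=eligible.get) if eligible else None


if __name__ == "__main__":
    total = 1_000_000  # hypothetical machine with one million pages
    tasks = {
        # name: (rss_pages, swap_pages, pgtable_pages, oom_score_adj)
        "postgres": (600_000, 0, 0, -500),
        "batch-job": (200_000, 0, 0, 500),
        "sshd": (900_000, 0, 0, -1000),
    }
    print("victim:", pick_victim(tasks, total))
```

In this made-up example the smallest process is chosen, because its +500 adjustment outweighs the larger footprint of the protected ones.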

cat /proc/$(pgrep -o postgres)/oom_score          # current OOM score of the oldest postgres process
echo -500 | sudo tee /proc/$(pgrep -o postgres)/oom_score_adj  # protect postgres (lowering needs root)
sudo dmesg | grep -i "out of memory"              # check if OOM killed anything

Memory Allocator Comparison Table

Allocator	Use Case	Granularity	Speed	Fragmentation	API
Buddy Allocator	Physical pages (kernel and user)	4KB–4MB (order 0–10)	Fast (O(log n) splits/merges)	External fragmentation between orders	`alloc_pages()`, `__get_free_pages()`
SLUB	Kernel objects (structs, dentries)	Object-sized (8B–8KB typical)	Very fast O(1) via free list	Minimal — per-slab alignment waste only	`kmem_cache_alloc()`, `kmem_cache_free()`
vmalloc	Large, virtually-contiguous kernel alloc	Page-granular, not physically contiguous	Slow (TLB flush needed)	None in virtual space	`vmalloc()`, `vfree()`
kmalloc	General kernel allocations	8B–4MB	Fast (backed by SLUB power-of-2 caches)	Small internal fragmentation	`kmalloc()`, `kfree()`
THP	Large user-space pages	2MB transparent	Allocated at page fault or collapsed later by khugepaged	Needs a free order-9 block; falls back to 4KB pages when memory is fragmented	Automatic or `madvise(MADV_HUGEPAGE)`

Key Takeaways

The Linux memory management system is a four-layer hierarchy: NUMA nodes model physical topology, zones partition each node by hardware constraints, the buddy allocator manages physical pages in power-of-2 blocks, and the slab allocator provides fast sub-page allocation for the kernel's millions of small objects.

What makes it remarkable is the reclaim machinery running underneath. kswapd continuously monitors memory pressure, silently writing dirty pages and evicting cold file cache. Only when all else fails does the OOM killer emerge. In a well-tuned system — appropriate swap, reasonable memory limits per container, correct OOM scores for critical processes — the OOM killer should never fire in production. If it does, it is a signal that the system was allocated more work than it can perform, and that is an architectural problem no kernel tuning can fully solve.
