AiTechWorlds
A Linux server with 128GB of RAM might run 1,000 processes, and each one behaves as if it had exclusive access to gigabytes of address space. The kernel's memory management subsystem maintains this illusion continuously, and does so with remarkable elegance.
The PostgreSQL database process believes it owns 32GB. The Nginx web server believes it owns 4GB. The 50 worker threads each believe they own independent stacks. None of these beliefs correspond to reality. What actually exists is a single pool of physical memory, meticulously tracked by the kernel's page allocator, slab caches, and page reclaim daemon — handing out memory, reclaiming it, compressing it, and occasionally, when pushed to the limit, killing the process that is consuming the most.
Understanding this machinery is what separates a developer who "uses Linux" from one who can diagnose a production server at 3 AM when it starts killing processes.
Modern multi-socket servers do not have uniform memory access. A processor on socket 0 can access RAM attached to socket 0 in approximately 70ns, but accessing RAM attached to socket 1 requires crossing the interconnect — closer to 130ns. This is Non-Uniform Memory Access (NUMA), and Linux models it explicitly.
A NUMA node corresponds to a physical memory bank attached to one processor socket. On a 4-socket server with 32GB of RAM per socket, Linux sees 4 NUMA nodes with 32GB each. By default, the kernel prefers to allocate memory from the node local to the CPU running the process, and falls back to a remote node only when the local one is short of free pages.
numactl --hardware # show NUMA topology
cat /proc/buddyinfo # free pages per node per zone per order
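To see why local allocation matters, here is a rough back-of-the-envelope sketch in Python. It assumes the approximate 70ns and 130ns figures quoted above and a plain weighted average. `average_latency_ns` is a hypothetical helper for illustration, not a kernel or numactl interface, and real workloads are also shaped by CPU caches and interconnect contention.

```python
LOCAL_NS = 70.0    # approximate local-node latency quoted in the text
REMOTE_NS = 130.0  # approximate remote-node latency quoted in the text


def average_latency_ns(local_fraction: float) -> float:
    """Expected latency per access when `local_fraction` of accesses are node-local."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * LOCAL_NS + (1.0 - local_fraction) * REMOTE_NS


if __name__ == "__main__":
    for fraction in (1.0, 0.75, 0.5, 0.0):
        print(f"{fraction:.0%} local: {average_latency_ns(fraction):.0f} ns average")
```

Under these assumptions, a process whose accesses are split evenly between two nodes sees about 100ns per access, roughly 40% slower than one whose memory is entirely local.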
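Within each NUMA node, memory is partitioned into zones based on the physical address range and hardware constraints. On x86-64 the main zones are ZONE_DMA (the first 16MB, for legacy devices that can only address low memory), ZONE_DMA32 (below 4GB, for devices limited to 32-bit addresses), and ZONE_NORMAL (everything else).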
The kernel allocates physical memory in pages. On x86-64 Linux, the standard page size is 4KB. The kernel also supports:
- 2MB huge pages, available for example through hugetlbfs. Reduce TLB pressure dramatically for large working sets.
- 1GB huge pages, reserved at boot (hugepagesz=1G hugepages=32 on the kernel command line). Cannot be freed at runtime.
Each physical page is tracked by a struct page, a 64-byte metadata structure. For 128GB of RAM at 4KB per page, that is about 33.5 million page structs, consuming approximately 2GB of kernel memory just for tracking.
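The tracking overhead is simple arithmetic, sketched below. The 64-byte struct page size is the figure given above (the real size depends on kernel configuration), and `page_metadata` is a hypothetical helper name.

```python
GIB = 1024 ** 3
PAGE_SIZE = 4 * 1024    # standard 4KB page on x86-64
STRUCT_PAGE_SIZE = 64   # bytes per struct page, as stated in the text


def page_metadata(ram_bytes: int, page_size: int = PAGE_SIZE) -> tuple[int, int]:
    """Return (number of pages, bytes of struct page metadata) for a given RAM size."""
    pages = ram_bytes // page_size
    return pages, pages * STRUCT_PAGE_SIZE


if __name__ == "__main__":
    pages, overhead = page_metadata(128 * GIB)
    print(f"{pages:,} pages, {overhead / GIB:.1f} GiB of struct page metadata")
```

The overhead is a fixed 64 / 4096, or about 1.6% of RAM, regardless of how much memory is installed.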
The buddy allocator is the Linux kernel's primary physical memory allocator. Its job: hand out contiguous blocks of physical pages and receive them back when freed.
The buddy allocator manages pages in orders. Order N represents a contiguous block of 2^N pages:
| Order | Pages | Size |
|---|---|---|
| 0 | 1 | 4 KB |
| 1 | 2 | 8 KB |
| 2 | 4 | 16 KB |
| 3 | 8 | 32 KB |
| 4 | 16 | 64 KB |
| 5 | 32 | 128 KB |
| 6 | 64 | 256 KB |
| 7 | 128 | 512 KB |
| 8 | 256 | 1 MB |
| 9 | 512 | 2 MB |
| 10 | 1024 | 4 MB |
Each zone maintains 11 free lists (order 0 through order 10), each holding blocks of that size.
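As an illustration of how a request size maps onto the table above, the sketch below rounds a byte count up to the smallest order that can hold it. `order_for` is a hypothetical name; inside the kernel, the helper `get_order()` plays a similar role.

```python
PAGE_SIZE = 4096
MAX_ORDER = 10  # highest order in the table above (4MB blocks)


def order_for(size_bytes: int) -> int:
    """Smallest order whose block of 2**order pages can hold size_bytes."""
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    pages = -(-size_bytes // PAGE_SIZE)   # ceiling division
    order = (pages - 1).bit_length()      # round page count up to a power of two
    if order > MAX_ORDER:
        raise ValueError("request exceeds the largest buddy block")
    return order


if __name__ == "__main__":
    for size in (100, 4096, 8192, 12 * 1024, 4 * 1024 * 1024):
        order = order_for(size)
        print(f"{size:>8} bytes -> order {order} ({PAGE_SIZE << order} bytes)")
```

Note the rounding cost: a 12KB request needs 3 pages but receives an order-2 block of 4 pages, so one page goes unused.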
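Allocating 8KB (order 1):
1. Check the order-1 free list. If it holds a block, remove it and return it.
2. Otherwise, take a block from the nearest non-empty higher order and split it in half. One half (the buddy) goes on the free list for its order, and the other half is split again, until an order-1 block remains.
3. Return that order-1 block.
Freeing an 8KB block:
1. Locate the block's buddy, the adjacent block of the same size that it was originally split from.
2. If the buddy is also free, remove it from the order-1 free list and merge the two into a single order-2 block.
3. Repeat at each higher order until the buddy is in use or order 10 is reached, then place the block on the free list for its final order.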
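This coalescing is what makes the buddy allocator resistant to external fragmentation: two free buddies never sit side by side unmerged. Fragmentation still builds up when scattered long-lived allocations prevent merging, which shows up as plenty of free low-order blocks but few or no high-order ones.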
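The split-on-allocate and merge-on-free behaviour can be sketched as a toy model. This is a simplified illustration, not the kernel's implementation: it tracks page numbers rather than struct page, starts from a single free order-10 block, and ignores zones, migrate types, and per-CPU page lists. The class and method names are hypothetical.

```python
class BuddyAllocator:
    """Toy buddy allocator over 2**max_order pages, addressed by page number."""

    def __init__(self, max_order: int = 10):
        self.max_order = max_order
        # free_lists[order] holds the starting page number of each free block
        self.free_lists = [set() for _ in range(max_order + 1)]
        self.free_lists[max_order].add(0)  # assumption: start with one max-order block

    def alloc(self, order: int) -> int:
        # Find the smallest order >= requested that has a free block.
        for current in range(order, self.max_order + 1):
            if self.free_lists[current]:
                break
        else:
            raise MemoryError(f"no free block of order {order} or higher")
        start = min(self.free_lists[current])
        self.free_lists[current].remove(start)
        # Split down to the requested order, freeing the upper half each time.
        while current > order:
            current -= 1
            self.free_lists[current].add(start + (1 << current))
        return start

    def free(self, start: int, order: int) -> None:
        # Merge with the buddy for as long as the buddy is also free.
        while order < self.max_order:
            buddy = start ^ (1 << order)  # buddy differs only in bit `order`
            if buddy not in self.free_lists[order]:
                break
            self.free_lists[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free_lists[order].add(start)

    def free_counts(self) -> list[int]:
        """Number of free blocks per order, in the same layout as /proc/buddyinfo."""
        return [len(blocks) for blocks in self.free_lists]


if __name__ == "__main__":
    buddy = BuddyAllocator()
    block = buddy.alloc(1)               # 8KB request
    print("after alloc:", buddy.free_counts())
    buddy.free(block, 1)
    print("after free: ", buddy.free_counts())
```

The XOR in `free` relies on every block of order N being aligned to 2^N pages, which is what lets the buddy's address be computed rather than searched for.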
cat /proc/buddyinfo
Example output:
Node 0, zone DMA 0 0 0 1 2 1 1 0 1 1 3
Node 0, zone Normal 4023 812 302 96 40 12 5 2 0 1 6
Each number is a free-list count: the first number after the zone name is order 0 (4KB blocks), and the last is order 10 (4MB blocks). A zone with zeros in all the high orders has fragmented physical memory, and huge page allocations are likely to fail.
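A short sketch of reading this output programmatically, using the two sample lines above as input so that it runs anywhere. On a live system, the same parser could be fed the contents of /proc/buddyinfo. The function names are hypothetical.

```python
PAGE_KB = 4  # assuming the standard 4KB page size

SAMPLE = """\
Node 0, zone DMA 0 0 0 1 2 1 1 0 1 1 3
Node 0, zone Normal 4023 812 302 96 40 12 5 2 0 1 6
"""


def parse_buddyinfo(text: str) -> dict[tuple[int, str], list[int]]:
    """Map (node, zone) to the list of free-block counts for orders 0, 1, 2, ..."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node":
            continue  # skip blank or unexpected lines
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(n) for n in parts[4:]]
    return zones


def free_pages(counts: list[int]) -> int:
    """Total free pages: each block of order N contributes 2**N pages."""
    return sum(count << order for order, count in enumerate(counts))


if __name__ == "__main__":
    for (node, zone), counts in parse_buddyinfo(SAMPLE).items():
        pages = free_pages(counts)
        print(f"node {node} zone {zone}: {pages} free pages ({pages * PAGE_KB} KB), "
              f"{counts[-1]} free order-{len(counts) - 1} blocks")
```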
The buddy allocator solves the physical page problem, but the kernel creates millions of small objects: task_struct (~9.5KB), dentry (192 bytes), inode (~600 bytes), socket (~768 bytes). Allocating a full 4KB page for a 192-byte dentry would waste 95% of memory.
The slab allocator solves this with per-object caches.
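The saving from packing objects together can be estimated with a little arithmetic. This sketch assumes a single 4KB page per slab and ignores the per-object metadata and alignment padding that a real slab cache adds. `packing` is a hypothetical helper.

```python
PAGE_SIZE = 4096


def packing(obj_size: int, slab_pages: int = 1) -> tuple[int, int]:
    """Return (objects per slab, leftover bytes) for same-sized objects in one slab."""
    slab_bytes = slab_pages * PAGE_SIZE
    objects = slab_bytes // obj_size
    return objects, slab_bytes - objects * obj_size


if __name__ == "__main__":
    dentry_size = 192  # dentry size quoted in the text
    objects, leftover = packing(dentry_size)
    naive_waste = (PAGE_SIZE - dentry_size) / PAGE_SIZE
    print(f"one dentry per page wastes {naive_waste:.1%} of the page")
    print(f"packed: {objects} dentries per page, {leftover} bytes left over")
```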
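Linux has had three slab implementations: SLAB (the original), SLOB (a minimal allocator for memory-constrained embedded systems), and SLUB (the default since kernel 2.6.23, and the one described here).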
For each object type, SLUB maintains a cache (struct kmem_cache). Each cache contains one or more slabs. A slab is one or more contiguous pages containing a fixed number of same-sized objects.
Key properties:
- kmem_cache_alloc() returns an initialized object in O(1) by popping it from the free list.
- kmem_cache_free() returns the object to the free list. No memory is actually freed to the buddy allocator unless the entire slab becomes empty.

cat /proc/slabinfo | head -20
# Shows: cache name | active_objs | total_objs | obj_size | ...
# Or with slabtop:
slabtop
Common high-count slab caches on a busy server: dentry, inode_cache, buffer_head, vm_area_struct, task_struct.
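The free-list behaviour described under "Key properties" can be sketched as a toy model. It is deliberately simplified and is not how SLUB is implemented: it uses one global free list instead of per-CPU and per-slab lists, and it releases an empty slab immediately, whereas the kernel may keep some empty slabs around for reuse. All names are hypothetical.

```python
class SlabCache:
    """Toy model of one object cache: fixed-size slots carved out of slabs."""

    def __init__(self, obj_size: int, slab_bytes: int = 4096):
        self.objs_per_slab = slab_bytes // obj_size
        self.free_list = []       # (slab_id, index) slots ready for reuse
        self.in_use = {}          # slab_id -> number of allocated objects
        self.next_slab_id = 0
        self.slabs_requested = 0  # slabs obtained from the page allocator
        self.slabs_returned = 0   # slabs handed back to the page allocator

    def _grow(self) -> None:
        slab_id = self.next_slab_id
        self.next_slab_id += 1
        self.slabs_requested += 1
        self.in_use[slab_id] = 0
        self.free_list.extend((slab_id, i) for i in range(self.objs_per_slab))

    def alloc(self) -> tuple[int, int]:
        if not self.free_list:
            self._grow()                # slow path: a new slab is needed
        slot = self.free_list.pop()     # fast path: O(1) pop from the free list
        self.in_use[slot[0]] += 1
        return slot

    def free(self, slot: tuple[int, int]) -> None:
        slab_id = slot[0]
        self.free_list.append(slot)
        self.in_use[slab_id] -= 1
        if self.in_use[slab_id] == 0:
            # Whole slab is empty: drop its slots and release the pages.
            self.free_list = [s for s in self.free_list if s[0] != slab_id]
            del self.in_use[slab_id]
            self.slabs_returned += 1


if __name__ == "__main__":
    cache = SlabCache(obj_size=192)
    slots = [cache.alloc() for _ in range(22)]   # 21 fit in one slab, so 2 slabs
    print("slabs requested:", cache.slabs_requested)
    cache.free(slots[-1])                        # empties the second slab
    print("slabs returned: ", cache.slabs_returned)
```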
When free memory drops below a threshold, the kernel's kswapd daemon wakes up to reclaim pages. This is the kernel's most complex ongoing maintenance task.
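Linux maintains two LRU (Least Recently Used) lists per memory zone: an active list of pages that have been referenced recently, and an inactive list of pages that are candidates for reclaim. (Current kernels keep these lists per NUMA node rather than per zone, with separate pairs for file-backed and anonymous pages.)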
The kernel uses the accessed bit in each page table entry. On access, the CPU hardware sets this bit. The kernel's clock algorithm periodically clears these bits and moves pages between lists based on whether the bit was set before clearing.
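The interplay between the accessed bit and the two lists can be sketched as follows. This is a simplified illustration of the second-chance idea, not the kernel's actual reclaim code. The class and method names are hypothetical, and the scan sizes are arbitrary.

```python
from collections import deque


class Page:
    def __init__(self, name: str):
        self.name = name
        self.accessed = False  # stands in for the accessed bit in the page table entry


class TwoListLRU:
    """Active and inactive lists; the left end is the head, the right end the tail."""

    def __init__(self):
        self.active = deque()
        self.inactive = deque()

    def add(self, page: Page) -> None:
        self.inactive.appendleft(page)   # new pages start on the inactive list

    def touch(self, page: Page) -> None:
        page.accessed = True             # the hardware would set this bit on access

    def age_active(self, scan: int) -> None:
        """Scan the active tail: clear set bits, demote pages that were not accessed."""
        for _ in range(min(scan, len(self.active))):
            page = self.active.pop()
            if page.accessed:
                page.accessed = False
                self.active.appendleft(page)
            else:
                self.inactive.appendleft(page)

    def reclaim(self, scan: int) -> list[str]:
        """Scan the inactive tail: promote accessed pages, evict the rest."""
        evicted = []
        for _ in range(min(scan, len(self.inactive))):
            page = self.inactive.pop()
            if page.accessed:
                page.accessed = False
                self.active.appendleft(page)
            else:
                evicted.append(page.name)
        return evicted


if __name__ == "__main__":
    lru = TwoListLRU()
    a, b, c = Page("a"), Page("b"), Page("c")
    for page in (a, b, c):
        lru.add(page)
    lru.touch(b)                  # only b is used after being added
    print(lru.reclaim(3))         # a and c are evicted, b is promoted
```

A page must be idle for two scans to be evicted from the active list: one to clear its bit and demote it, and another to find the bit still clear on the inactive list.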
Clean file cache pages: Pages cached from disk reads. These can be dropped immediately, because the kernel can re-read them from disk if they are needed again. This is the preferred reclaim path because dropping a clean page requires no write I/O.
Dirty file cache pages: Must be written to disk before reclaiming. The writeback thread handles this asynchronously.
Anonymous pages (swap): Heap and stack pages not backed by any file. To reclaim these, the kernel writes them to the swap device (swap partition or swap file). Where zswap is enabled, pages are first compressed into an in-memory pool and only reach the disk when that pool fills up.
When reclaim fails completely — kswapd cannot free enough memory and the allocation is urgent — the kernel invokes the Out-of-Memory Killer.
The OOM killer selects a process using a scoring algorithm:
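- The base score is proportional to the memory the process is using (resident pages, swap, and page tables) relative to total memory.
- The score is then adjusted by /proc/<pid>/oom_score_adj (range: -1000 to +1000). Set to -1000 to exempt a process from the OOM killer. Kubernetes sets critical system pods to -997.

cat /proc/$(pgrep -o postgres)/oom_score # current OOM score of the oldest postgres process
echo -500 > /proc/$(pgrep -o postgres)/oom_score_adj # protect postgres (requires root)
dmesg | grep "Out of memory" # check if OOM killed anything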
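The effect of oom_score_adj can be illustrated with a simplified model. This sketch assumes the score is the process's share of total memory on a 0 to 1000 scale plus the adjustment, floored at zero. The exact formula differs between kernel versions, so the numbers will not match /proc/<pid>/oom_score exactly. The function names and the process list are hypothetical.

```python
GIB = 1024 ** 3


def oom_score(usage_bytes: int, total_bytes: int, oom_score_adj: int = 0):
    """Simplified score; returns None when the process is exempt (adj == -1000)."""
    if not -1000 <= oom_score_adj <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    if oom_score_adj == -1000:
        return None
    base = 1000 * usage_bytes // total_bytes   # share of total memory, scaled to 0..1000
    return max(0, base + oom_score_adj)


def pick_victim(processes: dict, total_bytes: int):
    """Return the name with the highest score, skipping exempt processes."""
    scores = {
        name: oom_score(usage, total_bytes, adj)
        for name, (usage, adj) in processes.items()
    }
    candidates = {name: s for name, s in scores.items() if s is not None}
    return max(candidates, key=candidates.get) if candidates else None


if __name__ == "__main__":
    total = 128 * GIB
    processes = {                      # name -> (memory used, oom_score_adj)
        "postgres": (32 * GIB, -500),  # largest consumer, but protected
        "nginx": (4 * GIB, 0),
        "worker": (16 * GIB, 0),
    }
    print("victim:", pick_victim(processes, total))
```

In this model the protected database, although it is the largest consumer, scores below the unprotected worker, which is the point of lowering oom_score_adj for critical services.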
| Allocator | Use Case | Granularity | Speed | Fragmentation | API |
|---|---|---|---|---|---|
| Buddy Allocator | Physical pages (kernel and user) | 4KB–4MB (order 0–10) | Fast (O(log n) splits/merges) | External fragmentation between orders | alloc_pages(), __get_free_pages() |
| SLUB | Kernel objects (structs, dentries) | Object-sized (8B–8KB typical) | Very fast O(1) via free list | Minimal: per-slab alignment waste only | kmem_cache_alloc(), kmem_cache_free() |
| vmalloc | Large, virtually-contiguous kernel alloc | Page-granular, not physically contiguous | Slow (TLB flush needed) | None in virtual space | vmalloc(), vfree() |
| kmalloc | General kernel allocations | 8B–4MB | Fast (backed by SLUB power-of-2 caches) | Small internal fragmentation | kmalloc(), kzalloc(), kfree() |
| THP | Large user-space pages | 2MB transparent | Automatic via khugepaged | Reduced TLB fragmentation | Automatic or madvise(MADV_HUGEPAGE) |
The Linux memory management system is a four-layer hierarchy: NUMA nodes model physical topology, zones partition each node by hardware constraints, the buddy allocator manages physical pages in power-of-2 blocks, and the slab allocator provides fast sub-page allocation for the kernel's millions of small objects.
What makes it remarkable is the reclaim machinery running underneath. kswapd continuously monitors memory pressure, silently writing dirty pages and evicting cold file cache. Only when all else fails does the OOM killer emerge. In a well-tuned system — appropriate swap, reasonable memory limits per container, correct OOM scores for critical processes — the OOM killer should never fire in production. If it does, it is a signal that the system was allocated more work than it can perform, and that is an architectural problem no kernel tuning can fully solve.