AiTechWorlds
It is 2:14 AM. Your monitoring system fires an alert: a Linux container running a Node.js web application is reporting p99 latency spiking to 8 seconds. CPU utilization is 94%. Memory usage is at 97% of the container limit. I/O wait is 23%. Three instances have already been OOM-killed in the past hour.
You SSH into the host. You have everything you have learned in this course, and you have the Linux kernel's own diagnostic tooling. This capstone project is your systematic diagnosis and resolution — working through scheduling, memory, I/O, security hardening, and kernel design in sequence.
This is not a toy exercise. Every tool, every /proc file, every kernel parameter referenced here is real, verified against Linux 6.x, and applicable to actual production systems.
Start with the scheduler. High CPU utilization can mean two very different things: useful work being done efficiently, or processes fighting each other for CPU time.
# Step 1: Check overall context switch rates
vmstat 1 10
# procs: r=runnable, b=blocked
# cs: context switches per second
# us/sy/id/wa: user/system/idle/iowait CPU %
# Example output showing a problem:
# r b swpd free ... cs us sy id wa
# 12 3 0 45000 ... 180000 87 8 2 3
# r=12 (12 processes wanting CPU, only 8 cores) → scheduling contention
# cs=180000 (180K context switches/sec) → extremely high
# Step 2: Identify involuntary vs voluntary context switches
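for pid in $(ps -eo pid --no-headers); do
    awk -v p="$pid" '/^voluntary_ctxt_switches/{vol=$2} /^nonvoluntary_ctxt_switches/{nonvol=$2}
    END{if(nonvol>1000) print p, "voluntary:", vol, "nonvoluntary:", nonvol}' \
    "/proc/$pid/status" 2>/dev/null
done | sort -k5 -rn | head -20
# The patterns are anchored with ^ because "voluntary_ctxt_switches" is also a substring of the nonvoluntary line
# sort -k5 orders by the nonvoluntary count (field 5 of each printed line)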
# High nonvoluntary = process is being preempted (CPU-bound, competing)
# High voluntary = process is often waiting (I/O-bound or over-synchronized)
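To make the voluntary/nonvoluntary comparison concrete, here is a minimal sketch that turns the two counters into a single percentage. The function name `ctxt_share` and the sample counter values are illustrative assumptions; only the two field names come from `/proc/<pid>/status`.

```shell
#!/bin/sh
# ctxt_share: read /proc/<pid>/status-style text on stdin and print the
# percentage of context switches that were nonvoluntary (preemptions).
ctxt_share() {
    awk '
        /^voluntary_ctxt_switches:/    { v = $2 }
        /^nonvoluntary_ctxt_switches:/ { n = $2 }
        END {
            if (v + n > 0) printf "%.1f\n", 100 * n / (v + n)
            else print "0.0"
        }'
}

# Hypothetical sample values, not measurements from the incident:
printf 'voluntary_ctxt_switches:\t1500\nnonvoluntary_ctxt_switches:\t4500\n' | ctxt_share
# On a live host: ctxt_share < /proc/<pid>/status
```

A result well above 50% suggests the process is mostly being preempted rather than yielding, which matches the CPU-bound pattern described above.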
# Per-CPU scheduler statistics:
cat /proc/schedstat
# Format (schedstat version 15 and later): cpu<N> <yld_count> <legacy, always 0> <sched_count> <sched_goidle>
# <ttwu_count> <ttwu_local> <run_time_ns> <run_delay_ns> <pcount>
# run_delay_ns = total nanoseconds tasks spent runnable but waiting to run on this CPU
# Per-process scheduler stats:
cat /proc/<pid>/schedstat
# time_on_cpu_ns wait_for_cpu_ns timeslices_run
# If wait_for_cpu_ns >> time_on_cpu_ns: process is starving for CPU
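# Per-task scheduler stats (needs scheduler debug support in the kernel; EEVDF replaced CFS in Linux 6.6, but the file remains):
cat /proc/<pid>/sched
# Shows: nr_voluntary_switches, nr_involuntary_switches,
# se.sum_exec_runtime, and wait_sum when schedstats are enabled (exact fields are kernel build dependent)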
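The "wait_for_cpu_ns >> time_on_cpu_ns" rule of thumb can be reduced to one number. The sketch below assumes the three-field layout of `/proc/<pid>/schedstat` described above; the function name `wait_ratio` and the sample values are hypothetical.

```shell
#!/bin/sh
# wait_ratio RUN_NS WAIT_NS: print the share of (run + wait) time that the
# task spent runnable but waiting for a CPU, as a percentage.
wait_ratio() {
    awk -v r="$1" -v w="$2" 'BEGIN {
        total = r + w
        if (total == 0) { print "0.0"; exit }
        printf "%.1f\n", 100 * w / total
    }'
}

# Hypothetical sample in schedstat layout: run_ns wait_ns timeslices
sample="4000000000 12000000000 52000"
set -- $sample
echo "waiting for CPU: $(wait_ratio "$1" "$2")%"
# On a live host: set -- $(cat /proc/<pid>/schedstat)
```

Sampling this twice and comparing the deltas is more informative than the lifetime totals, since both counters accumulate from process start.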
Problem identified: 3 Node.js worker processes are CPU-bound and competing with the logging daemon and health check processes for 8 cores.
# Solution 1: Pin Node.js workers to specific CPUs (CPU affinity)
taskset -cp 0-5 <node-pid> # pin to CPUs 0-5
# or at launch:
taskset -c 0-5 node server.js
# Solution 2: Lower priority of non-critical processes
renice +10 <logging-pid> # logging is less critical than serving requests
renice +15 <health-check-pid>
# Solution 3: cgroup CPU bandwidth (preferred for containers)
# Allow Node workers to use 600% of CPU (6 full cores out of 8):
echo "600000 100000" > /sys/fs/cgroup/node-app/cpu.max
# Allow logging to use only 50% of one core:
echo "50000 100000" > /sys/fs/cgroup/node-logging/cpu.max
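The `cpu.max` file takes a quota and a period in microseconds, so "N cores" is simply N times the period. A small sketch of that arithmetic, assuming the common 100000 µs period; `cpu_max_for_cores` is a hypothetical helper name.

```shell
#!/bin/sh
# cpu_max_for_cores CORES [PERIOD_US]: print the "<quota> <period>" string
# to write into a cgroup v2 cpu.max file for the given number of cores.
cpu_max_for_cores() {
    cores="$1"
    period="${2:-100000}"
    awk -v c="$cores" -v p="$period" 'BEGIN { printf "%.0f %.0f\n", c * p, p }'
}

cpu_max_for_cores 6     # six full cores
cpu_max_for_cores 0.5   # half of one core
# Usage sketch (the cgroup path is the one used in the text above):
#   cpu_max_for_cores 6 > /sys/fs/cgroup/node-app/cpu.max
```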
cat /proc/meminfo
# MemTotal: 131072000 kB -- total physical RAM
# MemFree: 45320 kB -- truly unused
# MemAvailable: 12405400 kB -- free + reclaimable cache (the number that matters)
# Buffers: 234560 kB -- block device buffers
# Cached: 89234560 kB -- page cache
# SwapCached: 123400 kB -- pages in swap AND still in RAM (recently swapped in)
# Active: 45234560 kB -- recently used, less likely to reclaim
# Inactive: 34234560 kB -- not recently used, candidate for reclaim
# SwapTotal: 8388608 kB
# SwapFree: 3456789 kB -- 59% of swap used -- concerning
# Dirty: 123456 kB -- dirty pages waiting for writeback
# Writeback: 12340 kB -- dirty pages currently being written
# Slab: 4234560 kB -- kernel slab allocations
# SReclaimable: 3234560 kB -- portion of slab that can be reclaimed
# SUnreclaim: 1000000 kB -- slab memory that cannot be reclaimed
# CommitLimit: 73924608 kB -- how much total memory can be committed
# Committed_AS: 124567890 kB -- currently committed (overcommit in use)
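Reading this output: MemAvailable (not MemFree) is the real "how much memory is left" number. Here, only ~12 GB is available out of ~131 GB (MemTotal is 131072000 kB), so the system is under memory pressure. About 59% of swap is in use (SwapFree is 41% of SwapTotal), which shows the kernel has been pushing pages out; check the si/so columns of vmstat to confirm that it is still actively swapping.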
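The two percentages that matter in this reading can be computed directly from the `/proc/meminfo` fields. This is a minimal sketch using the sample values shown above; `meminfo_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# meminfo_summary: read /proc/meminfo-style text on stdin and print the
# available-memory and swap-used percentages.
meminfo_summary() {
    awk '
        $1 == "MemTotal:"     { total  = $2 }
        $1 == "MemAvailable:" { avail  = $2 }
        $1 == "SwapTotal:"    { stotal = $2 }
        $1 == "SwapFree:"     { sfree  = $2 }
        END {
            if (total > 0)  printf "mem_available_pct=%.1f\n", 100 * avail / total
            if (stotal > 0) printf "swap_used_pct=%.1f\n", 100 * (stotal - sfree) / stotal
        }'
}

# Sample values copied from the output above:
meminfo_summary <<'EOF'
MemTotal:       131072000 kB
MemAvailable:    12405400 kB
SwapTotal:        8388608 kB
SwapFree:         3456789 kB
EOF
# On a live host: meminfo_summary < /proc/meminfo
```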
cat /proc/buddyinfo
# Node 0, zone Normal 892 145 32 12 4 1 0 0 0 0 0
# 4KB 8KB 16KB 32KB 64KB ... 4MB
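# Many small blocks, zero large blocks = fragmented
# Huge page allocations (2MB = order 9) cannot be satisfied until compaction or reclaim produces a contiguous block
# Force memory compaction (migrates movable pages to build large contiguous free blocks; it does not free memory):
echo 1 > /proc/sys/vm/compact_memory # compact all zones on all nodes (runs compaction directly; requires root)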
# Check THP compaction stats:
grep -i huge /proc/vmstat
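# thp_fault_alloc: huge pages allocated directly on a page fault
# thp_collapse_alloc: huge pages allocated by khugepaged when collapsing small pages
# thp_split_page: huge pages split back into base pages (a steadily rising count can point to partial unmaps or memory pressure)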
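To quantify fragmentation from a `/proc/buddyinfo` line, one can sum the free memory held in blocks of at least a given order. The sketch below assumes 4 KiB pages and the column layout shown above (counts start at the fifth field); `buddy_free_kb` is a hypothetical helper name.

```shell
#!/bin/sh
# buddy_free_kb MIN_ORDER: read /proc/buddyinfo-style lines on stdin and print
# the free memory (KiB) held in blocks of order >= MIN_ORDER, assuming 4 KiB pages.
buddy_free_kb() {
    awk -v min="$1" '
        {
            # Fields 1-4 are "Node", "<n>,", "zone", "<name>"; counts start at field 5.
            for (i = 5; i <= NF; i++) {
                order = i - 5
                if (order >= min) kb += $i * 4 * 2 ^ order
            }
        }
        END { printf "%d\n", kb }'
}

line="Node 0, zone   Normal 892 145 32 12 4 1 0 0 0 0 0"
echo "free in any block:       $(echo "$line" | buddy_free_kb 0) KiB"
echo "free in order >= 9 (2MB): $(echo "$line" | buddy_free_kb 9) KiB"
```

For the sample line, all free memory in the zone sits in blocks smaller than 2 MB, which is the situation the comments above describe.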
# Watch a process's RSS and VMA count over time:
while true; do
    ps -p <pid> -o pid,rss,vsz --no-headers
    grep VmRSS /proc/<pid>/status
    grep -E "^(Rss|Anonymous):" /proc/<pid>/smaps_rollup
    wc -l < /proc/<pid>/maps # VMA count (one line per VMA)
    sleep 10
done
# If Anonymous memory grows monotonically without corresponding file activity:
# → likely heap leak (malloc without free)
# Detailed VMA breakdown:
awk '/^[0-9a-f]+-[0-9a-f]+ /{vma=$0} /^Rss:/{print $2, vma}' /proc/<pid>/smaps | sort -rn | head -20
# Prints each VMA's RSS (kB) first so the numeric sort works, followed by the VMA header line
# Check for VMA count explosion (each distinct mmap region is typically one VMA):
grep VmPTE /proc/<pid>/status # page table size in kB (grows with the mapped address space)
wc -l < /proc/<pid>/maps # number of VMAs (one line per VMA)
sysctl vm.max_map_count # per-process VMA limit; mmap() fails with ENOMEM once it is reached
Finding: The Node.js process has 847 VMAs, growing by ~10 per minute. Each new HTTP request creates an anonymous mmap that is never freed. This is a Node.js native addon leaking mmap() calls. Fix: upgrade the addon or add explicit munmap() in the cleanup path.
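A leak of this kind eventually hits the per-process VMA limit. The sketch below estimates how long that takes, assuming the traditional default `vm.max_map_count` of 65530 (some distributions raise it, so check `sysctl vm.max_map_count` on the host). `minutes_to_map_limit` is a hypothetical helper name.

```shell
#!/bin/sh
# minutes_to_map_limit CURRENT_VMAS GROWTH_PER_MIN [LIMIT]
# Print the whole minutes left before the VMA count reaches the limit.
minutes_to_map_limit() {
    current="$1"
    per_min="$2"
    limit="${3:-65530}"   # assumed default; read the real value with sysctl
    [ "$per_min" -gt 0 ] || { echo "never"; return; }
    echo $(( (limit - current) / per_min ))
}

mins=$(minutes_to_map_limit 847 10)
echo "limit reached in about $mins minutes ($(( mins / 60 )) hours)"
```

With the numbers from the finding, the process has roughly four and a half days before `mmap()` starts failing, so the leak is slow but not harmless.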
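# Overall I/O statistics (1-second samples, 5 times; -m reports throughput in MB/s):
iostat -xm 1 5
# Simplified columns (recent sysstat releases split await into r_await/w_await and no longer print svctm):
# Device r/s w/s rMB/s wMB/s await svctm %util
# nvme0n1 1200 890 18.4 42.1 12.4 0.8 98.3
# %util=98.3 → disk is saturated (for HDDs; NVMe serves many requests in parallel, so %util alone is only a hint)
# await=12.4ms → average wait time (queue + service), high for NVMe (should be <1ms)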
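One way to sanity-check an await figure is Little's law: the average number of requests in flight equals the request rate multiplied by the average time each request spends in the system. The sketch below applies it to the sample row above; `avg_inflight` is a hypothetical helper name.

```shell
#!/bin/sh
# avg_inflight READS_PER_S WRITES_PER_S AWAIT_MS
# Little's law: in-flight requests = arrival rate (req/s) * time in system (s).
avg_inflight() {
    awk -v r="$1" -v w="$2" -v a="$3" 'BEGIN { printf "%.1f\n", (r + w) * a / 1000 }'
}

echo "average requests in flight: $(avg_inflight 1200 890 12.4)"
```

Roughly 26 requests queued at only about 2,000 IOPS is far below what an NVMe device normally sustains, which supports the reading that the latency comes from queueing rather than from raw device limits.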
# Per-process I/O:
iotop -oa # -o: only processes actually doing I/O; -a: accumulated totals since iotop started
# Shows: process, accumulated read/write bytes, I/O%, PRIO
# I/O scheduler for each device:
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
# "none" is correct for NVMe (let NVMe's own queue management handle it)
# "mq-deadline" is good for mixed workloads on SATA SSDs
# Switch I/O scheduler:
echo "mq-deadline" > /sys/block/sda/queue/scheduler # for SATA SSD
echo "none" > /sys/block/nvme0n1/queue/scheduler # for NVMe (no scheduler overhead)
# See interrupt counts per CPU:
cat /proc/interrupts
# CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
# 24: 134521 89234 2341 2341 2341 2341 2341 2341 nvme-irq0
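# Problem: CPU0 and CPU1 handle 134K and 89K NVMe interrupts vs ~2K on the others → interrupt load is concentrated on two cores
# IRQ affinity: spread NVMe interrupts across CPUs
cat /proc/irq/24/smp_affinity # current CPU affinity mask (hex bitmask)
echo ff > /proc/irq/24/smp_affinity # allow all 8 CPUs (0xFF = all cores)
# Note: NVMe MSI-X queue vectors usually have kernel-managed affinity; for those, this write is rejected with an I/O error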
# Or use irqbalance daemon:
systemctl enable --now irqbalance
# Soft IRQ distribution:
cat /proc/softirqs
# BLOCK: columns per CPU showing block I/O softirq counts
# NET_RX: network receive processing
# TASKLET: driver tasklets
# If one CPU shows 10x others: IRQ affinity imbalance
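The `smp_affinity` file expects a hexadecimal bitmask in which bit N stands for CPU N. A minimal sketch for building that mask from a list of CPU numbers follows; `cpu_mask` is a hypothetical helper name, and the sketch assumes fewer than 32 CPUs (larger systems use comma-separated 32-bit groups).

```shell
#!/bin/sh
# cpu_mask CPU...: print the hex bitmask with one bit set per listed CPU.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpu_mask 0 1 2 3 4 5 6 7   # all eight CPUs
cpu_mask 2 3               # CPUs 2 and 3 only
# Usage sketch: cpu_mask 2 3 > /proc/irq/24/smp_affinity
```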
Apply everything learned from the security mechanisms lesson to the production container:
# Audit current capabilities:
docker inspect my-container | jq '.[0].HostConfig.CapAdd, .[0].HostConfig.CapDrop'
# Minimum capability set for a Node.js web server
# (NET_BIND_SERVICE is needed only if binding to a port < 1024):
docker run \
--cap-drop ALL \
--cap-add NET_BIND_SERVICE \
my-node-app
# Verify running process capabilities (pidof -s returns a single PID even when several node processes exist):
grep Cap /proc/$(pidof -s node)/status
capsh --decode=$(awk '/^CapEff/{print $2}' /proc/$(pidof -s node)/status)
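When `capsh` is not installed, a single capability can still be checked by testing its bit in the hex mask. The sketch below relies on CAP_NET_BIND_SERVICE being bit 10 and CAP_SYS_ADMIN being bit 21 in `linux/capability.h`; `has_cap` is a hypothetical helper name.

```shell
#!/bin/sh
# has_cap HEX_MASK BIT: succeed if capability bit BIT is set in HEX_MASK
# (HEX_MASK as printed in the CapEff line, without a 0x prefix).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Mask produced by --cap-drop ALL --cap-add NET_BIND_SERVICE: only bit 10 is set.
capeff="0000000000000400"
if has_cap "$capeff" 10; then echo "CAP_NET_BIND_SERVICE: present"; fi
if has_cap "$capeff" 21; then echo "CAP_SYS_ADMIN: present"; else echo "CAP_SYS_ADMIN: absent"; fi
```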
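# Docker applies its default seccomp profile automatically (it already blocks around 44 risky syscalls).
# Pass --security-opt seccomp=<file> only to substitute a profile file of your own, for example:
docker run --security-opt seccomp=/etc/docker/seccomp-default.json my-node-app
# Or generate a minimal profile using strace profiling (-o writes one PID-prefixed line per call, including child threads):
strace -f -o /tmp/node.strace node server.js
awk '{sub(/^[0-9]+ +/, "")} match($0, /^[a-z_0-9]+\(/){print substr($0, 1, RLENGTH-1)}' /tmp/node.strace | sort -u
# → list of all syscalls the app actually uses → whitelist only these (exercise every code path first, or rare syscalls will be missed)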
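A syscall list like that can be turned into an allowlist profile. The sketch below emits the basic JSON shape Docker accepts (a default action plus one allow rule); a production profile would usually also declare architectures and the syscalls the container runtime itself needs, so treat the output as a starting point. `seccomp_profile` is a hypothetical helper name.

```shell
#!/bin/sh
# seccomp_profile: read one syscall name per line on stdin and print a
# deny-by-default seccomp profile that allows only those syscalls.
seccomp_profile() {
    awk '
        NF {
            sep = (names == "") ? "" : ", "
            names = names sep "\"" $1 "\""
        }
        END {
            printf "{\"defaultAction\": \"SCMP_ACT_ERRNO\", "
            printf "\"syscalls\": [{\"names\": [%s], \"action\": \"SCMP_ACT_ALLOW\"}]}\n", names
        }'
}

# Illustrative three-syscall list, far too small for a real Node.js server:
printf 'read\nwrite\nepoll_wait\n' | seccomp_profile
```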
# Resource limits in docker-compose.yml:
# deploy:
# resources:
# limits:
# cpus: '6.0'
# memory: 4G
# reservations:
# cpus: '2.0'
# memory: 2G
# Or in docker run (--memory-swap equal to --memory disables swap for this container;
# --pids-limit caps the process count to contain a fork bomb):
docker run \
--cpus="6.0" \
--memory="4g" \
--memory-swap="4g" \
--pids-limit=500 \
my-node-app
# Verify cgroup limits applied:
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/cpu.max
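The `memory.max` file reports bytes, so it helps to know what value a flag such as `--memory="4g"` should produce. A minimal sketch of the conversion, assuming integer sizes with binary k/m/g suffixes; `mem_to_bytes` is a hypothetical helper name.

```shell
#!/bin/sh
# mem_to_bytes SIZE: convert a size such as 512m or 4g to bytes
# (binary units, integer values only).
mem_to_bytes() {
    num="${1%[kKmMgG]}"
    case "$1" in
        *[kK]) echo $(( num * 1024 )) ;;
        *[mM]) echo $(( num * 1024 * 1024 )) ;;
        *[gG]) echo $(( num * 1024 * 1024 * 1024 )) ;;
        *)     echo "$num" ;;
    esac
}

echo "expected memory.max for --memory=4g: $(mem_to_bytes 4g)"
# Compare with: cat /sys/fs/cgroup/system.slice/docker-<id>.scope/memory.max
```

The cgroup path depends on the cgroup driver in use; the `system.slice/docker-<id>.scope` form shown above is what the systemd driver creates.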
# Check SELinux status:
getenforce # Enforcing / Permissive / Disabled
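docker inspect my-container | jq '.[0].HostConfig.SecurityOpt, .[0].AppArmorProfile'
# SecurityOpt lists only explicitly passed options (e.g. "label=type:container_t" for SELinux); it can be null when defaults apply
# AppArmorProfile should read "docker-default" on AppArmor hosts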
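# Docker's default AppArmor profile denies, among other things:
# - Writing to /proc/sysrq-trigger
# - mount
# Loading kernel modules and direct hardware device access are blocked mainly by
# dropped capabilities and the device cgroup rather than by AppArmor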
# Check AppArmor status:
aa-status | grep docker
This is not compilable kernel code — it is a precise pseudocode design showing the structure of a kernel module that implements a character device for monitoring process memory statistics.
MODULE: proc_memstat_device
DESCRIPTION: Character device at /dev/memstat that returns memory statistics
for a specified PID on read()
CONCURRENCY: Spinlock protects shared per-device state
=== MODULE DATA STRUCTURES ===
struct memstat_device {
spinlock_t lock; // protects last_queried_pid
pid_t last_queried_pid;
struct cdev char_dev; // character device struct
dev_t dev_number; // major:minor device number
};
static struct memstat_device g_dev; // single global device instance
=== MODULE INIT FUNCTION ===
int memstat_init(void):
// Every call below can fail: real code checks each return value and, on error,
// undoes the earlier steps in reverse order before returning the error code
// 1. Allocate major/minor device number
alloc_chrdev_region(&g_dev.dev_number, 0, 1, "memstat")
// 2. Initialize spinlock
spin_lock_init(&g_dev.lock)
g_dev.last_queried_pid = 0
// 3. Initialize and register character device
cdev_init(&g_dev.char_dev, &memstat_fops)
cdev_add(&g_dev.char_dev, g_dev.dev_number, 1)
// 4. Create the device class, then the device (udev/devtmpfs creates /dev/memstat from it)
memstat_class = class_create("memstat") // single-argument form since Linux 6.4
device_create(memstat_class, NULL, g_dev.dev_number, NULL, "memstat")
// 5. Register interrupt handler for demonstration
// (hypothetical hardware event IRQ 45)
request_irq(45, memstat_irq_handler, IRQF_SHARED, "memstat", &g_dev)
return 0 // success
=== FILE OPERATIONS ===
struct file_operations memstat_fops = {
.owner = THIS_MODULE,
.open = memstat_open,
.read = memstat_read,
.write = memstat_write,
.release = memstat_release,
};
int memstat_open(struct inode *inode, struct file *filp):
// Store device reference in file's private data
filp->private_data = container_of(inode->i_cdev, struct memstat_device, char_dev)
return 0
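// write(): user writes a PID as ASCII string → store it
ssize_t memstat_write(struct file *filp, const char __user *buf, size_t count, loff_t *ppos):
char kbuf[16]
int pid
unsigned long flags
if count > 15: return -EINVAL
// Copy from user space (NEVER dereference user pointers directly in kernel)
if copy_from_user(kbuf, buf, count): return -EFAULT
kbuf[count] = '\0'
// kstrtoint() rejects malformed input and tolerates one trailing newline
// (simple_strtol is deprecated and silently ignores trailing garbage)
if kstrtoint(kbuf, 10, &pid) != 0 or pid <= 0: return -EINVAL
// The IRQ handler takes the same lock, so process context must disable local
// interrupts while holding it; otherwise the handler could preempt us on this
// CPU and spin forever on a lock we already hold
spin_lock_irqsave(&g_dev.lock, flags)
g_dev.last_queried_pid = pid
spin_unlock_irqrestore(&g_dev.lock, flags)
return count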
// read(): return memory stats for stored PID
ssize_t memstat_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos):
struct task_struct *task
struct mm_struct *mm
char output[256]
size_t len
pid_t pid
unsigned long flags
// Read shared state under the lock (irqsave because the IRQ handler takes the same lock)
spin_lock_irqsave(&g_dev.lock, flags)
pid = g_dev.last_queried_pid
spin_unlock_irqrestore(&g_dev.lock, flags)
if pid == 0: return -EINVAL
// Look up task_struct by PID; the caller must hold rcu_read_lock()
// (find_task_by_vpid is not exported to loadable modules; a real module would
// call pid_task(find_vpid(pid), PIDTYPE_PID) inside the same RCU section)
rcu_read_lock()
task = find_task_by_vpid(pid) // searches the caller's PID namespace
if task == NULL:
rcu_read_unlock()
return -ESRCH // no such process
// Get mm_struct (process memory descriptor)
mm = get_task_mm(task) // increments mm reference count
rcu_read_unlock()
if mm == NULL:
return -EINVAL // kernel thread, no mm
// Read memory stats (mmap_lock protects mm fields)
mmap_read_lock(mm)
len = scnprintf(output, sizeof(output), // returns the number of bytes actually written
"pid=%d rss_kb=%lu vss_kb=%lu map_count=%d\n",
pid,
get_mm_rss(mm) * PAGE_SIZE / 1024, // resident set size in KB
mm->total_vm * PAGE_SIZE / 1024, // virtual set size in KB
mm->map_count) // number of VMAs
mmap_read_unlock(mm)
mmput(mm) // decrement reference count
// Copies at most count bytes starting at *ppos, advances *ppos, and returns 0 at EOF,
// so a short user buffer is never overrun
return simple_read_from_buffer(buf, count, ppos, output, len)
=== INTERRUPT HANDLER ===
irqreturn_t memstat_irq_handler(int irq, void *dev_id):
struct memstat_device *dev = dev_id
// Hard IRQ handlers run with interrupts disabled on the local CPU, so plain
// spin_lock is sufficient here (spin_lock_irqsave would also work but adds needless overhead)
spin_lock(&dev->lock)
// In a real module: handle hardware event, update statistics
// e.g., increment per-CPU counter, signal waitqueue
spin_unlock(&dev->lock)
return IRQ_HANDLED
=== MODULE CLEANUP ===
void memstat_exit(void):
// Reverse order of init
free_irq(45, &g_dev) // unregister IRQ handler
device_destroy(memstat_class, g_dev.dev_number) // remove /dev/memstat
class_destroy(memstat_class) // remove the device class
cdev_del(&g_dev.char_dev) // unregister char device
unregister_chrdev_region(g_dev.dev_number, 1) // release major/minor
// Spinlock needs no explicit cleanup (it lives inside the statically allocated g_dev)
| Concern | Design Decision | Why |
|---|---|---|
| User pointer access | copy_from_user() / copy_to_user() | User pointers may fault; kernel must handle gracefully |
| Shared state protection | spinlock_t | IRQ handler cannot sleep, so mutex forbidden |
| RCU for task lookup | rcu_read_lock() around find_task_by_vpid() | task_struct list protected by RCU |
| mm reference counting | get_task_mm() + mmput() | Process can exit while we hold mm pointer |
| mmap_lock for mm fields | mmap_read_lock(mm) | Protects mm->map_count, mm->total_vm from concurrent modification |
| Module cleanup order | Reverse of init | Prevents use-after-free during unload |
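From user space, a device like this would be driven by writing a PID and then reading the result, for example `echo 1234 > /dev/memstat; cat /dev/memstat`. Since the module is pseudocode, the sketch below only shows how a script might pick one field out of the `key=value` line that `memstat_read` formats; `parse_memstat` and the sample numbers are hypothetical.

```shell
#!/bin/sh
# parse_memstat KEY: read a "key=value key=value ..." line on stdin and
# print the value belonging to KEY.
parse_memstat() {
    tr ' ' '\n' | awk -F= -v k="$1" '$1 == k { print $2 }'
}

# Hypothetical line in the format produced by the pseudocode read() handler:
line="pid=1234 rss_kb=204800 vss_kb=1048576 map_count=847"
echo "rss_kb:    $(echo "$line" | parse_memstat rss_kb)"
echo "map_count: $(echo "$line" | parse_memstat map_count)"
```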
This capstone brought together every major topic from the OS Internals series. Here is the full map of what you now understand:
| Topic | Core Concept | Production Application |
|---|---|---|
| Kernel Architecture | Monolithic vs microkernel trade-offs | Why Linux won: deployment > theoretical purity |
| Process Scheduling | CFS, runqueue, voluntary vs involuntary switches | CPU affinity, nice values, cgroup bandwidth |
| Interrupts & IRQs | Hardware → IRQ → top half → bottom half | IRQ affinity for NVMe, softirq balancing |
| System Calls | User/kernel boundary, syscall table, vsyscall | strace, seccomp filter design |
| Process Management | task_struct, PCB, fork/exec/wait | PID namespaces, process trees |
| Linux Memory Management | Buddy, slab, NUMA nodes, zones | /proc/buddyinfo, kswapd tuning, OOM scores |
| Demand Paging | Page fault: minor vs major, mmap, page cache | vmstat faults, THP, mlock for real-time |
| Virtual Memory Areas | VMA tree, process address space layout, ASLR | /proc/pid/maps, smaps memory leak analysis |
| VFS | superblock, inode, dentry, file, dcache | mount namespaces, bind mounts, dcache tuning |
| ext4 Internals | Block groups, extents, journaling modes | debugfs, e4defrag, journal mode selection |
| Kernel Synchronization & RCU | Spinlock, mutex, seqlock, RCU grace period | lockdep, perf lock, RCU usage patterns |
| OS Security Mechanisms | Capabilities, namespaces, cgroups, seccomp, SELinux | Container hardening, capability audit |
The production incident scenario in this capstone was not contrived. Every symptom — CPU contention, memory pressure with active swapping, NVMe IRQ imbalance, insecure container configuration — appears in real production postmortems. What changed after working through this course is the vocabulary and the tooling to diagnose each layer independently.
The kernel is not a black box. Every /proc file, every sysctl, every debugfs command is a window into live kernel data structures. /proc/buddyinfo exposes the buddy allocator's free lists. /proc/<pid>/smaps exposes the VMA tree. /proc/interrupts exposes the IRQ dispatch table. The kernel documents itself in real time, and systems engineers who know how to read that documentation have a decisive advantage when diagnosing performance and reliability problems.
The pseudocode kernel module in Part 5 is the synthesizing exercise: it required you to know task_struct (process management), mm_struct (memory management), spinlock and mmap_lock (synchronization), copy_from_user (the user/kernel boundary), and cdev/file_operations (the VFS device interface), all in roughly 100 lines of pseudocode. That is the shape of kernel programming: a small amount of code that touches every subsystem simultaneously, where a single mistake in any of them can cause a kernel panic.
Systems programming at this level is difficult, consequential, and deeply satisfying. The kernel is where all abstractions end and hardware begins — and you now have the foundation to work in that space.