It is 2:14 AM. Your monitoring system fires an alert: a Linux container running a Node.js web application is reporting p99 latency spiking to 8 seconds. CPU utilization is 94%. Memory usage is at 97% of the container limit. I/O wait is 23%. Three instances have already been OOM-killed in the past hour.

You SSH into the host. You have everything you have learned in this course, and you have the Linux kernel's own diagnostic tooling. This capstone project is your systematic diagnosis and resolution — working through scheduling, memory, I/O, security hardening, and kernel design in sequence.

This is not a toy exercise. Every tool, every /proc file, every kernel parameter referenced here is real, verified against Linux 6.x, and applicable to actual production systems.

Part 1: Process Scheduling Analysis

Diagnosing Scheduling Problems

Start with the scheduler. High CPU utilization can mean two very different things: useful work being done efficiently, or processes fighting each other for CPU time.

# Step 1: Check overall context switch rates
vmstat 1 10
# procs: r=runnable, b=blocked
# cs: context switches per second
# us/sy/id/wa: user/system/idle/iowait CPU %

# Example output showing a problem:
# r  b   swpd   free  ...  cs  us sy id wa
# 12 3      0  45000  ... 180000  87  8  2  3
# r=12 (12 processes wanting CPU, only 8 cores) → scheduling contention
# cs=180000 (180K context switches/sec) → extremely high

# Step 2: Identify involuntary vs voluntary context switches
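for pid in $(ps -eo pid --no-headers); do
    awk -v p="$pid" '/^voluntary_ctxt_switches/{vol=$2} /^nonvoluntary_ctxt_switches/{nonvol=$2}
    END{if(nonvol>1000) print p, "voluntary:", vol, "nonvoluntary:", nonvol}' \
    "/proc/$pid/status" 2>/dev/null
done | sort -k5 -rn | head -20
# The ^ anchors matter: an unanchored /voluntary_ctxt_switches/ also matches the nonvoluntary line
# sort -k5 sorts on the nonvoluntary count (the fifth printed field)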

# High nonvoluntary = process is being preempted (CPU-bound, competing)
# High voluntary = process is often waiting (I/O-bound or over-synchronized)
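The same classification can be sketched in Python against text in the `/proc/<pid>/status` format. This is a minimal illustration: the sample values and the 2x cutoff in `classify` are assumptions chosen for the example, not thresholds from the kernel or from this course.

```python
import re

# Sample excerpt in the format of /proc/<pid>/status (values are illustrative).
SAMPLE_STATUS = (
    "Name:\tnode\n"
    "Pid:\t4242\n"
    "voluntary_ctxt_switches:\t1500\n"
    "nonvoluntary_ctxt_switches:\t98000\n"
)

def ctxt_switches(status_text):
    """Return (voluntary, nonvoluntary) counts from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        # Anchored match: "nonvoluntary_..." contains "voluntary_..." as a substring.
        m = re.match(r"^(voluntary|nonvoluntary)_ctxt_switches:\s+(\d+)", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts.get("voluntary", 0), counts.get("nonvoluntary", 0)

def classify(vol, nonvol):
    """Rough label; the 2x ratio is an arbitrary illustrative cutoff."""
    if nonvol > 2 * vol:
        return "mostly preempted (CPU-bound, contended)"
    if vol > 2 * nonvol:
        return "mostly waiting (I/O-bound or lock-heavy)"
    return "mixed"

vol, nonvol = ctxt_switches(SAMPLE_STATUS)
print(vol, nonvol, classify(vol, nonvol))
```

To use it on a live host, pass the contents of `/proc/<pid>/status` instead of `SAMPLE_STATUS`.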

Scheduler Statistics

# Per-CPU scheduler statistics:
cat /proc/schedstat
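# Format of each cpu line (check the "version" line at the top of the file; this is version 15):
#         cpu<N> <yld_count> <legacy, always 0> <sched_count> <sched_goidle>
#         <ttwu_count> <ttwu_local> <rq_cpu_time_ns> <run_delay_ns> <pcount>
# rq_cpu_time_ns = total nanoseconds tasks spent running on this CPU
# run_delay_ns   = total nanoseconds tasks waited to run on this CPU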

# Per-process scheduler stats:
cat /proc/<pid>/schedstat
# time_on_cpu_ns  wait_for_cpu_ns  timeslices_run
# If wait_for_cpu_ns >> time_on_cpu_ns: process is starving for CPU

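# Per-task fair-scheduler stats (CFS up to Linux 6.5, EEVDF from 6.6):
cat /proc/<pid>/sched
# Shows: nr_switches, nr_voluntary_switches, nr_involuntary_switches, se.sum_exec_runtime
#        Latency fields such as wait_sum and wait_max appear only when schedstats is enabled
#        (sysctl kernel.sched_schedstats=1); the exact field set is kernel build dependent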
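The "wait much larger than run" rule for `/proc/<pid>/schedstat` is simple arithmetic, sketched here in Python. The sample line is made up for illustration, and what counts as "starving" is left to you rather than encoded as a threshold.

```python
def parse_schedstat(text):
    """Parse the three fields of /proc/<pid>/schedstat into a dict."""
    on_cpu_ns, wait_ns, timeslices = (int(field) for field in text.split()[:3])
    return {"on_cpu_ns": on_cpu_ns, "wait_ns": wait_ns, "timeslices": timeslices}

def wait_fraction(stat):
    """Fraction of runnable time spent waiting for a CPU instead of running."""
    total = stat["on_cpu_ns"] + stat["wait_ns"]
    return stat["wait_ns"] / total if total else 0.0

# Illustrative sample: 8 s on CPU, 24 s waiting, 150,000 timeslices.
SAMPLE = "8000000000 24000000000 150000\n"

stat = parse_schedstat(SAMPLE)
print(f"runnable time spent waiting: {wait_fraction(stat):.0%}")
print(f"average wait per timeslice: {stat['wait_ns'] / stat['timeslices'] / 1000:.0f} us")
```

Because the counters are cumulative since the task started, sampling twice and taking the difference gives a better picture of current contention than a single read.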

Solutions: CPU Affinity, Nice Values, and cgroup CPU Bandwidth

Problem identified: 3 Node.js worker processes are CPU-bound and competing with the logging daemon and health check processes for 8 cores.

# Solution 1: Pin Node.js workers to specific CPUs (CPU affinity)
taskset -cp 0-5 <node-pid>     # pin to CPUs 0-5
# or at launch:
taskset -c 0-5 node server.js

# Solution 2: Lower priority of non-critical processes
renice -n 10 -p <logging-pid>       # logging is less critical than serving requests
renice -n 15 -p <health-check-pid>

# Solution 3: cgroup CPU bandwidth (preferred for containers)
# Allow Node workers to use 600% of CPU (6 full cores out of 8):
echo "600000 100000" > /sys/fs/cgroup/node-app/cpu.max
# Allow logging to use only 50% of one core:
echo "50000 100000" > /sys/fs/cgroup/node-logging/cpu.max
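The `cpu.max` file holds a quota and a period in microseconds, and the ratio of the two is the number of cores the group may consume. A small Python sketch of that arithmetic follows; the helper names are hypothetical, and the 8-core host comes from the scenario above.

```python
def cpu_max_to_cores(text):
    """Convert cgroup v2 cpu.max contents ("<quota> <period>") to a core count.

    Returns None when the quota is "max", meaning no bandwidth limit.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)

def share_of_host(text, host_cpus):
    """Fraction of the host's total CPU capacity the group is allowed to use."""
    cores = cpu_max_to_cores(text)
    return 1.0 if cores is None else min(cores / host_cpus, 1.0)

print(cpu_max_to_cores("600000 100000"))   # Node workers
print(cpu_max_to_cores("50000 100000"))    # logging
print(share_of_host("600000 100000", 8))   # share of the 8-core host
```

A quota larger than the period is how a group is granted more than one core: 600000 µs of CPU time per 100000 µs window is six cores' worth.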

[Diagram: Scheduling Analysis Summary]

Part 2: Memory Analysis

Reading /proc/meminfo

cat /proc/meminfo
# MemTotal:       131072000 kB   -- total physical RAM
# MemFree:            45320 kB   -- truly unused
# MemAvailable:    12405400 kB   -- free + reclaimable cache (the number that matters)
# Buffers:           234560 kB   -- block device buffers
# Cached:          89234560 kB   -- page cache
# SwapCached:        123400 kB   -- pages in swap AND still in RAM (recently swapped in)
# Active:          45234560 kB   -- recently used, less likely to reclaim
# Inactive:        34234560 kB   -- not recently used, candidate for reclaim
# SwapTotal:        8388608 kB
# SwapFree:         3456789 kB   -- 59% of swap used -- concerning
# Dirty:             123456 kB   -- dirty pages waiting for writeback
# Writeback:          12340 kB   -- dirty pages currently being written
# Slab:             4234560 kB   -- kernel slab allocations
# SReclaimable:     3234560 kB   -- portion of slab that can be reclaimed
# SUnreclaim:       1000000 kB   -- slab memory that cannot be reclaimed
# CommitLimit:     73924608 kB   -- how much total memory can be committed
# Committed_AS:   124567890 kB   -- currently committed (overcommit in use)

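Reading this output: MemAvailable (not MemFree) is the real "how much memory is left" number. Here, only about 12 GB of a nominal 128 GB is available (roughly 9.5%), so the system is under memory pressure. SwapFree shows only 41% of swap still free, meaning 59% is in use. Swap occupancy alone does not prove the kernel is swapping right now; confirm with the si/so columns of vmstat 1, which report the current swap-in and swap-out rates.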
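The percentages behind that reading can be reproduced with a few lines of Python. This sketch parses a subset of the sample output above; the function names are hypothetical and the three derived figures are just the ratios discussed in the text.

```python
# Subset of the sample /proc/meminfo output shown above.
SAMPLE_MEMINFO = """\
MemTotal:       131072000 kB
MemFree:            45320 kB
MemAvailable:    12405400 kB
SwapTotal:        8388608 kB
SwapFree:         3456789 kB
CommitLimit:     73924608 kB
Committed_AS:   124567890 kB
"""

def parse_meminfo(text):
    """Map each field name to its value in kB."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if rest.strip():
            info[name.strip()] = int(rest.split()[0])
    return info

def pressure_summary(info):
    """Derive the three ratios used when reading meminfo under pressure."""
    return {
        "available_pct": 100.0 * info["MemAvailable"] / info["MemTotal"],
        "swap_used_pct": 100.0 * (info["SwapTotal"] - info["SwapFree"]) / info["SwapTotal"],
        "commit_ratio": info["Committed_AS"] / info["CommitLimit"],
    }

for key, value in pressure_summary(parse_meminfo(SAMPLE_MEMINFO)).items():
    print(f"{key}: {value:.2f}")
```

A commit ratio above 1.0 means more memory has been promised than CommitLimit allows under strict accounting, which the kernel only enforces when vm.overcommit_memory is set to 2.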

Buddy Allocator Fragmentation

cat /proc/buddyinfo
# Node 0, zone   Normal   892  145   32   12   4   1   0   0   0   0   0
#                         4KB  8KB  16KB 32KB 64KB ...                 4MB
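# Many small blocks, zero free blocks at order 6 and above = fragmented
# Huge page allocations (2MB = order 9) cannot be served from the free lists;
# they succeed only if compaction or reclaim can assemble a contiguous block first

# Force memory compaction (migrates movable pages to build larger contiguous free blocks):
echo 1 > /proc/sys/vm/compact_memory   # synchronously compact all zones (requires root)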

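# Check THP and compaction stats (the counters are named thp_* and compact_*,
# so grepping for "huge" would miss them):
grep -E '^(thp_|compact_)' /proc/vmstat
# thp_fault_alloc: huge pages allocated directly on a page fault
# thp_collapse_alloc: huge pages allocated by khugepaged when collapsing base pages
# thp_split_page: huge pages split back into base pages (a steadily rising count is a warning sign)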
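Reading a `/proc/buddyinfo` row comes down to weighting each column by its block size. The Python sketch below does that for the sample row above, assuming 4 kB base pages; the helper names are hypothetical.

```python
SAMPLE_LINE = "Node 0, zone   Normal   892  145   32   12   4   1   0   0   0   0   0"

def parse_buddyinfo_line(line):
    """Return (zone_name, counts) where counts[i] is the number of free order-i blocks."""
    parts = line.split()
    zone_idx = parts.index("zone")
    return parts[zone_idx + 1], [int(x) for x in parts[zone_idx + 2:]]

def free_kb(counts, page_kb=4):
    """Total free memory on the buddy lists: each order-i block is 2**i pages."""
    return sum(n * (2 ** order) * page_kb for order, n in enumerate(counts))

def largest_free_order(counts):
    """Highest order with at least one free block, or None if nothing is free."""
    orders = [order for order, n in enumerate(counts) if n > 0]
    return max(orders) if orders else None

def can_serve_from_free_lists(counts, order):
    """True if a block of this order (or larger, which can be split) is free."""
    return any(n > 0 for n in counts[order:])

zone, counts = parse_buddyinfo_line(SAMPLE_LINE)
print(zone, free_kb(counts), "kB free; largest free order:", largest_free_order(counts))
print("order-9 (2 MB) available without compaction:", can_serve_from_free_lists(counts, 9))
```

In the sample row the zone has free memory, but none of it sits in a block larger than order 5 (128 kB), which is what fragmentation looks like in this file.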

Detecting Memory Leaks via VMA Growth

# Watch a process's RSS, anonymous memory, and VMA count over time:
while true; do
    date +%T
    grep VmRSS /proc/<pid>/status
    grep -E '^(Rss|Anonymous):' /proc/<pid>/smaps_rollup
    echo "VMAs: $(wc -l < /proc/<pid>/maps)"
    sleep 10
done

# If Anonymous memory grows monotonically without corresponding file activity:
# → likely heap leak (malloc without free)

# Detailed VMA breakdown:
awk '/^[0-9a-f]+-[0-9a-f]+ /{vma=$0} /^Rss:/{print $2, vma}' /proc/<pid>/smaps | sort -k1,1 -rn | head -20
# Shows which VMAs are consuming the most RSS (first column, in kB);
# Rss is printed first so the numeric sort key does not land on the permissions field

# Check for VMA count explosion (each line of maps is one VMA):
wc -l < /proc/<pid>/maps              # number of VMAs
grep VmPTE /proc/<pid>/status         # page-table memory; grows with mapped and touched address space
sysctl vm.max_map_count               # per-process VMA limit (default 65530); mmap() fails with ENOMEM beyond it
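The VMA ranking can also be sketched in Python, which makes the pairing of each header line with its `Rss:` field explicit. The sample text is a shortened, invented excerpt in the smaps format (real entries carry many more fields), and the helper name is hypothetical.

```python
import re

# Shortened, illustrative excerpt in /proc/<pid>/smaps format.
SAMPLE_SMAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/node
Size:                328 kB
Rss:                 300 kB
VmFlags: rd ex mr mw me
7f2c00000000-7f2c04000000 rw-p 00000000 00:00 0
Size:              65536 kB
Rss:               61440 kB
VmFlags: rd wr mr mw me ac
7ffd1a2b0000-7ffd1a2d1000 rw-p 00000000 00:00 0 [stack]
Size:                132 kB
Rss:                  24 kB
VmFlags: rd wr mr mw me gd ac
"""

# Header line: start-end perms offset dev inode [path]
HEADER = re.compile(r"^([0-9a-f]+-[0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\S+\s*(.*)$")

def rss_by_vma(text):
    """Return [(rss_kb, address_range, perms, name)] sorted by RSS, largest first."""
    rows, current = [], None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            current = (m.group(1), m.group(2), m.group(3) or "[anon]")
        elif line.startswith("Rss:") and current is not None:
            rows.append((int(line.split()[1]),) + current)
    return sorted(rows, reverse=True)

for rss_kb, addr, perms, name in rss_by_vma(SAMPLE_SMAPS):
    print(f"{rss_kb:>8} kB  {perms}  {addr}  {name}")
```

A large, growing, writable `[anon]` region at the top of this ranking is the pattern to look for when hunting an mmap or heap leak.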

Finding: The Node.js process has 847 VMAs, growing by about 10 per minute. A Node.js native addon calls mmap() for an anonymous region while handling HTTP requests and never unmaps it. Fix: upgrade the addon or add an explicit munmap() in its cleanup path.

Part 3: I/O and Interrupt Analysis

I/O Diagnostics

# Overall I/O statistics (1-second samples, 5 times):
iostat -x 1 5
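# Device    r/s   w/s  rMB/s  wMB/s  await  svctm  %util
# nvme0n1  1200   890   18.4   42.1   12.4    0.8    98.3
# (The column set varies by sysstat version: recent releases drop svctm and report r_await/w_await instead of await)
# %util=98.3 → the device was busy 98.3% of the time; that means saturation on an HDD, but NVMe serves many requests in parallel, so treat it only as a hint
# await=12.4ms → average time per request (queue + service); high for NVMe, where sub-millisecond is typical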

# Per-process I/O:
iotop -oa     # -o: only processes actually doing I/O; -a: accumulated totals instead of rates
# Shows: PID, PRIO, user, disk read, disk write, SWAPIN, IO%, command

# I/O scheduler for each device:
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
# "none" is correct for NVMe (let NVMe's own queue management handle it)
# "mq-deadline" is good for mixed workloads on SATA SSDs

# Switch I/O scheduler:
echo "mq-deadline" > /sys/block/sda/queue/scheduler  # for SATA SSD
echo "none" > /sys/block/nvme0n1/queue/scheduler     # for NVMe (no scheduler overhead)
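One way to see why the await figure matters more than %util on NVMe is Little's Law: the average number of requests in flight equals the request rate multiplied by the average time each request spends in the system. The Python sketch below applies it to the sample iostat row; treating await as the mean over reads and writes combined is an assumption of this illustration.

```python
def avg_in_flight(reads_per_s, writes_per_s, await_ms):
    """Little's Law: L = lambda * W, with W converted from milliseconds to seconds."""
    return (reads_per_s + writes_per_s) * (await_ms / 1000.0)

# Values from the sample iostat row above.
depth = avg_in_flight(reads_per_s=1200, writes_per_s=890, await_ms=12.4)
print(f"average requests in flight: {depth:.1f}")
```

About 26 outstanding requests at only ~2,000 IOPS suggests requests are queueing or being served slowly, rather than the device running near its throughput ceiling. Recent iostat versions report this quantity directly as aqu-sz.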

Interrupt Analysis

# See interrupt counts per CPU:
cat /proc/interrupts
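#       CPU0    CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7
# 24: 134521   89234   2341   2341   2341   2341   2341   2341  nvme-irq0
# Problem: CPU0 (134K) and CPU1 (89K) handle almost all NVMe interrupts vs ~2K on the others → hot spot on CPU0/CPU1

# IRQ affinity: spread NVMe interrupts across CPUs
cat /proc/irq/24/smp_affinity     # current CPU affinity mask (hex bitmask)
echo ff > /proc/irq/24/smp_affinity  # allow all 8 CPUs (0xff = bits 0-7)
# Note: many modern drivers, NVMe included, use kernel-managed per-queue IRQ affinity;
# for those IRQs the write fails with an I/O error and the spread is fixed by the driver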
# Or use irqbalance daemon:
systemctl enable --now irqbalance

# Soft IRQ distribution:
cat /proc/softirqs
# BLOCK: columns per CPU showing block I/O softirq counts
# NET_RX: network receive processing
# TASKLET: driver tasklets
# If one CPU shows 10x others: IRQ affinity imbalance
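The affinity mask and the "one CPU shows 10x the others" check are both small calculations, sketched here in Python. The helper names are hypothetical, comparing the busiest CPU against the median is one reasonable choice among several, and the mask helper only covers hosts with up to 32 CPUs (larger masks are written as comma-separated 32-bit groups).

```python
from statistics import median

def cpus_to_mask(cpus):
    """Build the hex bitmask used by /proc/irq/<n>/smp_affinity (bit i = CPU i)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")

def imbalance_ratio(per_cpu_counts):
    """Busiest CPU's count divided by the median CPU's count."""
    typical = median(per_cpu_counts)
    return max(per_cpu_counts) / typical if typical else float("inf")

# Per-CPU counts for IRQ 24 from the sample /proc/interrupts row above.
COUNTS = [134521, 89234, 2341, 2341, 2341, 2341, 2341, 2341]

print("mask for CPUs 0-7:", cpus_to_mask(range(8)))
print("mask for CPUs 2-5:", cpus_to_mask([2, 3, 4, 5]))
print(f"imbalance ratio: {imbalance_ratio(COUNTS):.1f}x")
```

Because `/proc/interrupts` counts are cumulative since boot, compare two snapshots taken a few seconds apart before concluding that the imbalance is current.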

[Diagram: I/O Analysis]

Part 4: Security Hardening Checklist

Apply everything learned from the security mechanisms lesson to the production container:

Step 1: Drop Unnecessary Capabilities

# Audit current capabilities:
docker inspect my-container | jq '.[0].HostConfig.CapAdd, .[0].HostConfig.CapDrop'

# Minimum capability set for a Node.js web server
# (NET_BIND_SERVICE is needed only if binding to a port < 1024):
docker run \
  --cap-drop ALL \
  --cap-add NET_BIND_SERVICE \
  my-node-app

# Verify running process capabilities (pgrep -ox picks the oldest process named exactly "node"):
pid=$(pgrep -ox node)
grep Cap "/proc/$pid/status"
capsh --decode="$(awk '/^CapEff/{print $2}' "/proc/$pid/status")"
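The CapEff value is a plain bitmask, so what `capsh --decode` does can be sketched in a few lines of Python. The name table below follows the bit numbering in the kernel's capability header as of recent kernels; bits beyond the table are printed numerically in case a newer kernel has added capabilities.

```python
# Capability names indexed by bit number (CAP_CHOWN = bit 0).
CAP_NAMES = [
    "CHOWN", "DAC_OVERRIDE", "DAC_READ_SEARCH", "FOWNER", "FSETID", "KILL",
    "SETGID", "SETUID", "SETPCAP", "LINUX_IMMUTABLE", "NET_BIND_SERVICE",
    "NET_BROADCAST", "NET_ADMIN", "NET_RAW", "IPC_LOCK", "IPC_OWNER",
    "SYS_MODULE", "SYS_RAWIO", "SYS_CHROOT", "SYS_PTRACE", "SYS_PACCT",
    "SYS_ADMIN", "SYS_BOOT", "SYS_NICE", "SYS_RESOURCE", "SYS_TIME",
    "SYS_TTY_CONFIG", "MKNOD", "LEASE", "AUDIT_WRITE", "AUDIT_CONTROL",
    "SETFCAP", "MAC_OVERRIDE", "MAC_ADMIN", "SYSLOG", "WAKE_ALARM",
    "BLOCK_SUSPEND", "AUDIT_READ", "PERFMON", "BPF", "CHECKPOINT_RESTORE",
]

def decode_caps(hex_mask):
    """Return the capability names whose bits are set in a CapEff-style hex string."""
    value = int(hex_mask, 16)
    names = []
    bit = 0
    while value:
        if value & 1:
            if bit < len(CAP_NAMES):
                names.append("CAP_" + CAP_NAMES[bit])
            else:
                names.append(f"CAP_{bit}")  # unknown to this table
        value >>= 1
        bit += 1
    return names

print(decode_caps("0000000000000400"))        # after --cap-drop ALL --cap-add NET_BIND_SERVICE
print(len(decode_caps("00000000a80425fb")))   # a commonly seen default container mask
```

A hardened container should decode to a very short list; anything like CAP_SYS_ADMIN in the output deserves an explanation.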

Step 2: Apply seccomp Filter

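# Docker applies its default seccomp profile automatically (it blocks roughly 44 risky syscalls),
# so no flag is needed. To use a custom profile instead, pass its path:
docker run --security-opt seccomp=/path/to/profile.json my-node-app
# Confirm a filter is active: "Seccomp: 2" means filter mode
grep Seccomp /proc/$(pgrep -ox node)/status

# Or derive a minimal profile by tracing the app (only code paths exercised during the trace are seen):
strace -f -o /tmp/node.trace node server.js
awk '$2 ~ /^[a-z_0-9]+\(/ {sub(/\(.*/, "", $2); print $2}' /tmp/node.trace | sort -u
# → list of syscalls the app used during the trace → allow only these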
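Turning a traced syscall list into an allowlist profile can be sketched in Python as follows. The sample trace lines are invented, the profile uses the basic `defaultAction` / `syscalls` layout of Docker's seccomp JSON, and the single x86-64 architecture entry is an assumption; a real profile needs review before use, since a syscall missing from the trace will fail at runtime.

```python
import json
import re

# Invented sample in the format written by `strace -f -o <file>` (PID prefix per line).
SAMPLE_TRACE = """\
4242 openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3
4242 read(3, "...", 4096) = 20
4243 epoll_wait(5,  <unfinished ...>
4242 close(3) = 0
4243 <... epoll_wait resumed>[], 1024, 0) = 0
4242 +++ exited with 0 +++
"""

SYSCALL = re.compile(r"^\d+\s+([a-z_][a-z0-9_]*)\(")

def syscalls_used(trace_text):
    """Return the sorted set of syscall names that start a trace line."""
    names = set()
    for line in trace_text.splitlines():
        m = SYSCALL.match(line)
        if m:
            names.add(m.group(1))
    return sorted(names)

def build_profile(names):
    """Deny by default (return an errno) and allow only the listed syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [{"names": list(names), "action": "SCMP_ACT_ALLOW"}],
    }

profile = build_profile(syscalls_used(SAMPLE_TRACE))
print(json.dumps(profile, indent=2))
```

The resulting JSON is what you would save to a file and pass with `--security-opt seccomp=<file>`.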

Step 3: Set cgroup Limits

# Resource limits in docker-compose.yml:
# deploy:
#   resources:
#     limits:
#       cpus: '6.0'
#       memory: 4G
#     reservations:
#       cpus: '2.0'
#       memory: 2G

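# Or in docker run (a comment cannot follow a line-continuation backslash, so the notes go here):
#   --memory-swap equal to --memory disables swap for this container
#   --pids-limit caps the number of processes and so contains a fork bomb
docker run \
  --cpus="6.0" \
  --memory="4g" \
  --memory-swap="4g" \
  --pids-limit=500 \
  my-node-app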

# Verify cgroup limits applied:
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/memory.max
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/cpu.max
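To know what those files should contain, you can compute the expected values from the `docker run` flags. The Python sketch below assumes cgroup v2, Docker's binary size suffixes (so `4g` is 4 GiB), and the default 100 ms CPU period; the helper names are hypothetical.

```python
UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def size_to_bytes(text):
    """Convert a size such as "4g" or "512m" to bytes (binary multiples)."""
    text = text.strip().lower()
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)

def expected_cgroup_v2(memory, memory_swap, cpus, period_us=100000):
    """Expected cgroup v2 file contents for the given docker run limits."""
    mem_bytes = size_to_bytes(memory)
    # --memory-swap is memory plus swap, so the swap allowance is the difference.
    swap_bytes = size_to_bytes(memory_swap) - mem_bytes
    return {
        "memory.max": str(mem_bytes),
        "memory.swap.max": str(swap_bytes),
        "cpu.max": f"{int(cpus * period_us)} {period_us}",
    }

for name, value in expected_cgroup_v2("4g", "4g", 6.0).items():
    print(f"{name}: {value}")
```

If the values read from the container's cgroup directory differ from these, the limits were not applied as intended.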

Step 4: Verify SELinux/AppArmor

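# Check SELinux status:
getenforce                              # Enforcing / Permissive / Disabled
# HostConfig.SecurityOpt lists only options passed explicitly, so also check the labels Docker actually applied:
docker inspect my-container | jq '.[0].HostConfig.SecurityOpt, .[0].AppArmorProfile, .[0].ProcessLabel'
# Expect an SELinux process label containing container_t, or the AppArmor profile "docker-default"

# Docker's default AppArmor profile denies, among other things:
# - Writing to /proc/sysrq-trigger and to most of /proc/sys
# - mount
# Loading kernel modules and direct hardware device access are blocked by other layers
# (dropped capabilities such as CAP_SYS_MODULE, and the device cgroup), not by AppArmor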

# Check AppArmor status:
aa-status | grep docker

Part 5: Conceptual Kernel Module Design

This is not compilable kernel code — it is a precise pseudocode design showing the structure of a kernel module that implements a character device for monitoring process memory statistics.

MODULE: proc_memstat_device
DESCRIPTION: Character device at /dev/memstat that returns memory statistics
             for a specified PID on read()
CONCURRENCY: Spinlock protects shared per-device state

=== MODULE DATA STRUCTURES ===

struct memstat_device {
    spinlock_t    lock;           // protects last_queried_pid
    pid_t         last_queried_pid;
    struct cdev   char_dev;       // character device struct
    dev_t         dev_number;     // major:minor device number
};

static struct memstat_device g_dev;  // single global device instance

=== MODULE INIT FUNCTION ===

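int memstat_init(void):
    // Every step below must check its return value and unwind the earlier
    // steps on failure (the usual goto-based error ladder); elided for brevity

    // 1. Allocate major/minor device number
    alloc_chrdev_region(&g_dev.dev_number, 0, 1, "memstat")

    // 2. Initialize spinlock
    spin_lock_init(&g_dev.lock)
    g_dev.last_queried_pid = 0

    // 3. Initialize and register character device
    cdev_init(&g_dev.char_dev, &memstat_fops)
    cdev_add(&g_dev.char_dev, g_dev.dev_number, 1)

    // 4. Create the device class, then the device
    //    (udev/devtmpfs creates /dev/memstat from the resulting sysfs entry)
    memstat_class = class_create("memstat")   // kernels before 6.4: class_create(THIS_MODULE, "memstat")
    device_create(memstat_class, NULL, g_dev.dev_number, NULL, "memstat")

    // 5. Register interrupt handler for demonstration
    //    (hypothetical hardware event IRQ 45)
    request_irq(45, memstat_irq_handler, IRQF_SHARED, "memstat", &g_dev)

    return 0  // success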

=== FILE OPERATIONS ===

struct file_operations memstat_fops = {
    .owner   = THIS_MODULE,
    .open    = memstat_open,
    .read    = memstat_read,
    .write   = memstat_write,
    .release = memstat_release,
};

int memstat_open(struct inode *inode, struct file *filp):
    // Store device reference in file's private data
    filp->private_data = container_of(inode->i_cdev, struct memstat_device, char_dev)
    return 0

// write(): user writes a PID as ASCII string → store it
ssize_t memstat_write(struct file *filp, const char __user *buf, size_t count, loff_t *ppos):
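    char kbuf[16]
    int pid
    unsigned long flags

    if count > 15: return -EINVAL

    // Copy from user space (NEVER dereference user pointers directly in kernel)
    if copy_from_user(kbuf, buf, count): return -EFAULT
    kbuf[count] = '\0'

    // kstrtoint() rejects malformed input (the deprecated simple_strtol() does not)
    // and tolerates the single trailing newline that echo appends
    if kstrtoint(kbuf, 10, &pid) != 0 or pid <= 0: return -EINVAL

    // Protect shared state. The IRQ handler takes the same lock, so process
    // context must disable local IRQs while holding it or risk self-deadlock
    spin_lock_irqsave(&g_dev.lock, flags)
    g_dev.last_queried_pid = pid
    spin_unlock_irqrestore(&g_dev.lock, flags)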

    return count

// read(): return memory stats for stored PID
ssize_t memstat_read(struct file *filp, char __user *buf, size_t count, loff_t *ppos):
    struct task_struct *task
    struct mm_struct *mm
    char output[256]
    size_t len
    pid_t pid
    unsigned long flags

    // Read shared state; IRQs are disabled because the IRQ handler shares this lock
    spin_lock_irqsave(&g_dev.lock, flags)
    pid = g_dev.last_queried_pid
    spin_unlock_irqrestore(&g_dev.lock, flags)

    if pid == 0: return -EINVAL

    // Look up task_struct by PID under RCU.
    // find_task_by_vpid() is not exported to loadable modules in mainline kernels,
    // so the lookup is spelled out with find_vpid() + pid_task() instead
    rcu_read_lock()
    task = pid_task(find_vpid(pid), PIDTYPE_PID)  // searches the caller's PID namespace
    if task == NULL:
        rcu_read_unlock()
        return -ESRCH  // no such process

    // Get mm_struct (process memory descriptor)
    mm = get_task_mm(task)  // increments mm reference count
    rcu_read_unlock()

    if mm == NULL:
        return -EINVAL  // kernel thread, no mm

    // Read memory stats (mmap_lock gives a stable total_vm and map_count)
    mmap_read_lock(mm)
    len = scnprintf(output, sizeof(output),
        "pid=%d rss_kb=%lu vss_kb=%lu map_count=%d\n",
        pid,
        get_mm_rss(mm) * PAGE_SIZE / 1024,  // resident set size in KB
        mm->total_vm * PAGE_SIZE / 1024,     // virtual set size in KB
        mm->map_count)                        // number of VMAs
    mmap_read_unlock(mm)

    mmput(mm)  // decrement reference count

    // Copy at most count bytes starting at *ppos, advance *ppos, and return 0
    // at EOF. Copying len bytes unconditionally would overrun a smaller user buffer.
    return simple_read_from_buffer(buf, count, ppos, output, len)

=== INTERRUPT HANDLER ===

irqreturn_t memstat_irq_handler(int irq, void *dev_id):
    struct memstat_device *dev = dev_id

    // A hard IRQ handler runs with local interrupts disabled, so plain
    // spin_lock() is sufficient here. Process-context code that shares this
    // lock must use spin_lock_irqsave(), or an IRQ arriving on the same CPU deadlocks
    spin_lock(&dev->lock)
    // In a real module: handle hardware event, update statistics
    // e.g., increment per-CPU counter, signal waitqueue
    spin_unlock(&dev->lock)

    return IRQ_HANDLED

=== MODULE CLEANUP ===

void memstat_exit(void):
    // Reverse order of init
    free_irq(45, &g_dev)                            // unregister IRQ handler
    device_destroy(memstat_class, g_dev.dev_number) // remove /dev/memstat
    class_destroy(memstat_class)                    // remove the device class
    cdev_del(&g_dev.char_dev)                       // unregister char device
    unregister_chrdev_region(g_dev.dev_number, 1)   // release major/minor
    // Spinlock needs no explicit cleanup (it lives in the statically allocated g_dev)
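
The offset handling in read() is easy to get wrong, so here is a small userspace model of it in Python. It mimics the documented behavior of the kernel helper simple_read_from_buffer() (copy at most count bytes from the current offset, advance the offset, return nothing at EOF); the sample output string and the function names are made up for illustration, and no real /dev/memstat device is involved.

```python
def read_from_buffer(data, count, pos):
    """Model one read() call: return (chunk, new_pos)."""
    if pos < 0:
        raise ValueError("negative offset")    # the kernel helper returns -EINVAL
    if pos >= len(data) or count == 0:
        return b"", pos                         # EOF, or nothing requested
    chunk = data[pos:pos + count]               # never more than count bytes
    return chunk, pos + len(chunk)

def read_all(data, bufsize):
    """Model a caller such as cat: keep reading until a read returns 0 bytes."""
    pos, parts = 0, []
    while True:
        chunk, pos = read_from_buffer(data, bufsize, pos)
        if not chunk:
            return b"".join(parts)
        parts.append(chunk)

# Illustrative line in the format the pseudocode module produces.
OUTPUT = b"pid=4242 rss_kb=61440 vss_kb=1048576 map_count=847\n"

print(read_all(OUTPUT, 16) == OUTPUT)   # a 16-byte reader still sees the whole line
```

The point of the model: a reader with a small buffer gets the data in pieces across several calls, and the driver must never copy more than the caller asked for.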

Key Module Design Observations

Concern	Design Decision	Why
User pointer access	`copy_from_user()` / `copy_to_user()`	User pointers may fault; kernel must handle gracefully
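Shared state protection	`spinlock_t`, taken with `spin_lock_irqsave()` outside the IRQ handler	IRQ handler cannot sleep, so a mutex is forbidden; other holders must keep the IRQ from interrupting them on the same CPU
RCU for task lookup	`rcu_read_lock()` around the PID-to-task lookup	PID lookup and `task_struct` lifetime are protected by RCU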
mm reference counting	`get_task_mm()` + `mmput()`	Process can exit while we hold mm pointer
mmap_lock for mm fields	`mmap_read_lock(mm)`	Protects mm->map_count, mm->total_vm from concurrent modification
Module cleanup order	Reverse of init	Prevents use-after-free during unload

What You Learned in This Course

This capstone brought together every major topic from the OS Internals series. Here is the full map of what you now understand:

Topic	Core Concept	Production Application
Kernel Architecture	Monolithic vs microkernel trade-offs	Why Linux won: deployment > theoretical purity
Process Scheduling	CFS, runqueue, voluntary vs involuntary switches	CPU affinity, nice values, cgroup bandwidth
Interrupts & IRQs	Hardware → IRQ → top half → bottom half	IRQ affinity for NVMe, softirq balancing
System Calls	User/kernel boundary, syscall table, vsyscall	strace, seccomp filter design
Process Management	task_struct, PCB, fork/exec/wait	PID namespaces, process trees
Linux Memory Management	Buddy, slab, NUMA nodes, zones	/proc/buddyinfo, kswapd tuning, OOM scores
Demand Paging	Page fault: minor vs major, mmap, page cache	vmstat faults, THP, mlock for real-time
Virtual Memory Areas	VMA tree, process address space layout, ASLR	/proc/pid/maps, smaps memory leak analysis
VFS	superblock, inode, dentry, file, dcache	mount namespaces, bind mounts, dcache tuning
ext4 Internals	Block groups, extents, journaling modes	debugfs, e4defrag, journal mode selection
Kernel Synchronization & RCU	Spinlock, mutex, seqlock, RCU grace period	lockdep, perf lock, RCU usage patterns
OS Security Mechanisms	Capabilities, namespaces, cgroups, seccomp, SELinux	Container hardening, capability audit

Key Takeaways

The production incident scenario in this capstone was not contrived. Every symptom — CPU contention, memory pressure with active swapping, NVMe IRQ imbalance, insecure container configuration — appears in real production postmortems. What changed after working through this course is the vocabulary and the tooling to diagnose each layer independently.

The kernel is not a black box. Every /proc file, every sysctl, every debugfs command is a window into live kernel data structures. /proc/buddyinfo exposes the buddy allocator's free lists. /proc/<pid>/smaps exposes the VMA tree. /proc/interrupts exposes the IRQ dispatch table. The kernel documents itself in real time, and systems engineers who know how to read that documentation have a decisive advantage when diagnosing performance and reliability problems.

The pseudocode kernel module in Part 5 is the synthesizing exercise: it required you to know task_struct (process management), mm_struct (memory management), spinlock and mmap_lock (synchronization), copy_from_user (the user/kernel boundary), and cdev/file_operations (the VFS device interface), all in roughly 150 lines of pseudocode. That is the shape of kernel programming: a small amount of code that touches every subsystem simultaneously, where a single mistake in any of them can deadlock the machine, corrupt memory, or panic the kernel.

Systems programming at this level is difficult, consequential, and deeply satisfying. The kernel is where all abstractions end and hardware begins — and you now have the foundation to work in that space.
