In 2013, Edward Snowden's disclosures showed that NSA programs exploited vulnerabilities at the operating-system level. In the decade that followed, Linux hardening features such as capabilities, namespaces, seccomp, SELinux, and container isolation matured and saw far wider deployment. Understanding these OS security mechanisms means understanding what makes modern systems hard to compromise, and where they still fall short.

The challenge was not new. The classic Unix model, dating from 1969, was binary: root (UID 0) can do anything, and everyone else can do almost nothing security-sensitive. For 30 years, a daemon that needed any single privilege (sending raw network packets, binding to port 80, reading other users' files) had to start as root, and many kept running as root. Compromise one such daemon and you owned the entire machine. The explosion of networked services in the 1990s made this catastrophic: with Apache running as root, a buffer overflow in the HTTP parser gave an attacker a root shell.

Since the 2010s, the Linux security model has been layered defense-in-depth. Capabilities split root's privilege into fine-grained tokens. Namespaces give a group of processes its own isolated view of kernel resources. cgroups limit resource consumption. seccomp restricts which system calls a process may make. SELinux enforces mandatory access control that processes cannot override. Docker, Kubernetes, and the common container runtimes are built on these five mechanisms. Understanding them is understanding the foundation of modern cloud infrastructure.

Traditional Unix Security: The All-or-Nothing Problem

Classic Unix security has three components:

UID/GID: Every process has a user ID and group ID. Every file has an owner and group.
Permission bits: rwxr-xr-x — owner, group, and others each get read/write/execute bits.
Superuser (UID 0): Bypasses all permission checks. Unlimited access to everything.

This works perfectly for a timesharing system where you trust the administrator. It fails badly for a networked server:

Apache needs to bind to port 80 (requires CAP_NET_BIND_SERVICE) — so Apache runs as root
ping needs raw sockets (requires CAP_NET_RAW) — so ping has the setuid-root bit
ntpd needs to set system time (requires CAP_SYS_TIME) — so ntpd runs as root
Any exploitable bug in any of these programs = root shell
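The classic check described above can be modeled in a few lines of Python. This is a simplified sketch, not kernel code: the name `dac_allows` and its parameters are made up for illustration, and the model ignores ACLs as well as the special case that even root needs at least one execute bit to run a file.

```python
import stat

R, W, X = 4, 2, 1  # bits within one rwx triplet

def dac_allows(uid, gids, file_uid, file_gid, mode, want):
    """Simplified classic Unix check: may (uid, gids) perform `want` on the file?"""
    if uid == 0:
        return True                      # superuser bypasses the permission bits
    if uid == file_uid:
        triplet = (mode >> 6) & 0o7      # owner bits
    elif file_gid in gids:
        triplet = (mode >> 3) & 0o7      # group bits
    else:
        triplet = mode & 0o7             # "others" bits
    return (triplet & want) == want

mode = 0o100755                                      # regular file, rwxr-xr-x
print(stat.filemode(mode))                           # -rwxr-xr-x
print(dac_allows(1000, {1000}, 0, 0, mode, W))       # False: others lack write
print(dac_allows(0, {0}, 1000, 1000, 0o100600, R))   # True: root ignores the bits
```

The first branch is the whole problem in one line: once a process has UID 0, none of the remaining checks apply.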

Linux Capabilities: Splitting the Superuser

Since Linux kernel 2.2 (1999), the superuser privilege has been divided into capabilities — individual tokens that grant specific privileges. Linux 6.x defines 41 capabilities.

Key Capabilities

Capability	What It Grants	Example User
`CAP_NET_BIND_SERVICE`	Bind to ports < 1024	Web server, SSH daemon
`CAP_NET_RAW`	Use raw sockets, packet capture	ping, tcpdump, Wireshark
`CAP_NET_ADMIN`	Configure network interfaces, iptables	Network management daemons
`CAP_SYS_PTRACE`	Trace/debug arbitrary processes, including other users'	gdb, strace (when attaching to other users' processes)
`CAP_SYS_ADMIN`	Broad administrative functions (mount, ioctl, etc.)	systemd, Docker daemon
`CAP_CHOWN`	Change file owner/group to any value	chown command
`CAP_KILL`	Send signals to any process	systemd, kill commands
`CAP_SYS_TIME`	Set system clock	ntpd, chronyd
`CAP_DAC_OVERRIDE`	Bypass file read/write/execute permission checks	Backup utilities
`CAP_SETUID`	Change UID to any value	su, sudo

File Capabilities

# Grant a binary specific capabilities without making it setuid-root:
setcap cap_net_bind_service+ep /usr/bin/node    # Node.js can bind port 80
setcap cap_net_raw+ep /usr/bin/ping              # ping can use raw sockets

# View capabilities on a file:
getcap /usr/bin/ping
# /usr/bin/ping cap_net_raw=ep

# View capabilities of a running process (pidof -s returns a single PID; nginx usually runs several processes):
grep Cap /proc/$(pidof -s nginx)/status
# CapInh: 0000000000000000
# CapPrm: 0000000000000400
# CapEff: 0000000000000400  (decode: capsh --decode=0000000000000400)
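The hex masks in /proc/<pid>/status are bitmaps in which bit N corresponds to capability number N. As a rough illustration of what capsh --decode does, here is a minimal Python sketch. The function name `decode_caps` is hypothetical, and the name table is transcribed from the kernel's capability numbering (0 through 40), so treat it as an assumption to verify against your kernel headers.

```python
# Capability names indexed by bit number (linux/capability.h numbering, 41 entries).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ", "CAP_PERFMON", "CAP_BPF",
    "CAP_CHECKPOINT_RESTORE",
]

def decode_caps(hex_mask):
    """Turn a CapPrm/CapEff hex string into a list of capability names."""
    mask = int(hex_mask, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if (mask >> bit) & 1]

print(decode_caps("0000000000000400"))  # ['CAP_NET_BIND_SERVICE']
```

The mask 0x400 is 1 << 10, and capability number 10 is CAP_NET_BIND_SERVICE, which matches the nginx example above.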

Container Capability Hardening

Docker does not give containers full root privilege. By default it starts them with a reduced capability set (14 capabilities in current releases), including CAP_NET_BIND_SERVICE, CAP_CHOWN, CAP_SETUID, and CAP_NET_RAW, and withholds CAP_SYS_ADMIN, CAP_SYS_PTRACE, CAP_NET_ADMIN, and the rest. For tighter hardening, drop everything and add back only what the workload needs:

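docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx  # minimal capabilities
# Note: images that switch to an unprivileged worker user at startup (the stock nginx image does)
# typically also need --cap-add CHOWN --cap-add SETGID --cap-add SETUID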
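To reason about what a given combination of --cap-drop and --cap-add leaves a container with, the set arithmetic can be sketched as follows. This is a simplified model under stated assumptions: `resolve_caps` is a hypothetical helper, the default set is the 14-capability list from Docker's documentation, and the special value --cap-add ALL is deliberately not modeled.

```python
# Docker's documented default capability set (assumed current; check your Docker version).
DOCKER_DEFAULT_CAPS = {
    "AUDIT_WRITE", "CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID", "KILL",
    "MKNOD", "NET_BIND_SERVICE", "NET_RAW", "SETFCAP", "SETGID", "SETPCAP",
    "SETUID", "SYS_CHROOT",
}

def _norm(name):
    """Accept both 'CAP_CHOWN' and 'chown' spellings."""
    name = name.upper()
    return name[4:] if name.startswith("CAP_") else name

def resolve_caps(cap_add=(), cap_drop=()):
    """Approximate the capability set a container ends up with."""
    add = {_norm(c) for c in cap_add}
    drop = {_norm(c) for c in cap_drop}
    base = set() if "ALL" in drop else DOCKER_DEFAULT_CAPS - drop
    return sorted(base | (add - {"ALL"}))   # '--cap-add ALL' is ignored in this sketch

print(resolve_caps(cap_add=["NET_BIND_SERVICE"], cap_drop=["ALL"]))
# ['NET_BIND_SERVICE']
```

Calling `resolve_caps()` with no arguments returns the default set, which is a quick way to see that CAP_SYS_ADMIN is absent unless someone adds it explicitly.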

Linux Namespaces: Isolation Without Virtualization

Namespaces create isolated views of kernel resources. A process in a namespace sees only the resources visible within that namespace, not the host's global view.

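Linux 6.x implements 8 namespace types: mount, UTS, IPC, PID, network, user, cgroup, and time. Three of them matter most for container isolation: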

PID Namespace

Inside a container, ps aux shows only the container's own processes, numbered from PID 1. The container's init process is PID 1 inside the namespace and (for example) PID 34821 on the host. The container cannot see host PIDs, so it cannot signal host processes. The host, however, can still see and signal every process in the container.

# From host: see container's real PID
docker inspect --format='{{.State.Pid}}' my-container

# From container: only sees its own PID namespace
docker exec my-container ps aux
# PID 1 = the container's CMD
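The two views of the same process are also visible in one place: on kernels 4.1 and later, /proc/<pid>/status on the host contains an NSpid line listing the PID in each nested PID namespace, outermost first. A minimal Python sketch of reading it follows. The helper name `nspid` and the sample status text are illustrative assumptions, not output captured from a real system.

```python
def nspid(status_text):
    """Return the PIDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []   # field missing (kernel older than 4.1)

# Hypothetical excerpt of /proc/34821/status as read from the host:
sample = "Name:\tnginx\nPid:\t34821\nNSpid:\t34821\t1\n"
print(nspid(sample))   # [34821, 1]
```

On a real host you would pass in the contents of /proc/<pid>/status. A process that is not in a nested PID namespace has a single entry in the list.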

Network Namespace

Each network namespace has its own:

Network interfaces (lo, eth0, virtual veth pairs)
IP addresses and routing tables
iptables/nftables rules
Socket table (no cross-namespace socket visibility)

Docker creates a veth pair: one end in the container's network namespace, one end in the host namespace connected to a bridge (docker0). The container sends packets through its eth0 → host's veth-peer → docker0 bridge → NAT → external network.

User Namespace

The most powerful namespace for security: maps a range of container UIDs to non-root host UIDs. UID 0 inside the container corresponds to an unprivileged UID (e.g., 100000) on the host.

# Unprivileged container: UID 0 inside maps to UID 100000 outside
cat /proc/<container-pid>/uid_map
# 0    100000    65536
# (container UIDs 0-65535 map to host UIDs 100000-165535)

This enables rootless containers: Podman and rootless Docker run entire container runtimes without any host root privileges. Even if a container process escapes the namespace, it has no host capabilities.
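Each uid_map line has three fields: the first ID inside the namespace, the first ID outside, and the length of the range. The translation is simple offset arithmetic, sketched here in Python. The function names `parse_id_map` and `to_host_id` are hypothetical helpers for illustration.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside_start, outside_start, length) tuples."""
    return [tuple(int(f) for f in line.split())
            for line in text.splitlines() if line.strip()]

def to_host_id(inside_id, id_map):
    """Translate an ID inside the user namespace to the host ID, or None if unmapped."""
    for inside_start, outside_start, length in id_map:
        if inside_start <= inside_id < inside_start + length:
            return outside_start + (inside_id - inside_start)
    return None

id_map = parse_id_map("         0     100000      65536\n")
print(to_host_id(0, id_map))       # 100000  (container root)
print(to_host_id(1000, id_map))    # 101000
print(to_host_id(70000, id_map))   # None    (outside the mapped range)
```

An unmapped ID has no identity on the host at all, which is why files owned by it typically show up inside the container as the overflow user (nobody).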

cgroups v2: Resource Limits and Accounting

Control Groups (cgroups) limit and account for resource usage. cgroups v2, stable in the kernel since Linux 4.5 and the default on most current systemd-based distributions, provides a single unified hierarchy.

Structure

cgroups v2 uses a single hierarchy of cgroup directories under /sys/fs/cgroup/. Each directory is a cgroup; files within it control resource limits.

ls /sys/fs/cgroup/system.slice/docker-abc123.scope/
# cgroup.controllers  cpu.max  memory.max  io.max  pids.max  ...

Resource Controllers

# CPU bandwidth: allow 50% of one CPU core
echo "50000 100000" > /sys/fs/cgroup/my-app/cpu.max
# Format: quota period (microseconds). 50000/100000 = 50%

# Memory limit: hard limit at 512 MiB
echo "536870912" > /sys/fs/cgroup/my-app/memory.max

# Swap limit: disable swap for this cgroup
echo "0" > /sys/fs/cgroup/my-app/memory.swap.max

# I/O bandwidth: write limit of 10 MiB/s on device 8:0
echo "8:0 wbps=10485760" > /sys/fs/cgroup/my-app/io.max

# PID limit: at most 100 processes
echo "100" > /sys/fs/cgroup/my-app/pids.max

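Docker's --cpus, --memory, --device-write-bps, and --pids-limit flags write these cgroup files (cpu.max, memory.max, io.max, and pids.max respectively); --blkio-weight sets a proportional I/O weight rather than a hard limit. Kubernetes resource limits map to the same files, while CPU requests become a relative share in cpu.weight.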
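The numbers written in the examples above follow from simple arithmetic: a CPU fraction becomes a quota over a period in microseconds, and sizes are given in bytes. The following Python sketch reproduces them. The helper names are hypothetical, and the 100000-microsecond period is an assumption that matches the example; the helpers only build the strings and do not write to /sys/fs/cgroup.

```python
MIB = 1024 * 1024

def cpu_max(cpus, period_us=100_000):
    """Build the 'quota period' string for cpu.max from a number of CPUs."""
    quota_us = int(round(cpus * period_us))
    return f"{quota_us} {period_us}"

def memory_max(mib):
    """Build the byte value for memory.max from a size in MiB."""
    return str(mib * MIB)

def io_max_write(major, minor, mib_per_s):
    """Build the io.max line that caps write bandwidth on one block device."""
    return f"{major}:{minor} wbps={mib_per_s * MIB}"

print(cpu_max(0.5))            # 50000 100000
print(memory_max(512))         # 536870912
print(io_max_write(8, 0, 10))  # 8:0 wbps=10485760
```

A value such as `cpu_max(2)` yields a quota larger than the period, which is how cgroups v2 expresses "up to two full cores".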

seccomp: System Call Filtering

Even with namespaces and capabilities, a process can still call any Linux system call. seccomp (Secure Computing mode) whitelists the system calls a process is allowed to make.

# View a process's seccomp mode (substitute the PID of one Chrome renderer; pidof chrome returns many PIDs):
grep Seccomp /proc/<renderer-pid>/status
# Seccomp: 2
# 0 = not in seccomp, 1 = strict (only read/write/exit/sigreturn), 2 = filter mode

seccomp BPF Filters

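seccomp filters are written as classic BPF (Berkeley Packet Filter) programs, which run in a restricted in-kernel virtual machine. A filter inspects the system call number and the raw argument values (it cannot dereference pointers, so it cannot read a path string) and returns an action such as ALLOW, KILL, ERRNO, or TRAP: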

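// Pseudocode: make openat() fail with EACCES when a file is opened for writing.
// flags is the third argument of openat(); both O_WRONLY and O_RDWR request
// write access, so test the access-mode bits rather than a single flag.
if (syscall == SYS_openat && (flags & O_ACCMODE) != O_RDONLY)
    return SECCOMP_RET_ERRNO | EACCES;
return SECCOMP_RET_ALLOW;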
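The decision logic of such a filter can be modeled in plain Python to see how it treats different calls. This is only a toy stand-in: it does not install a real seccomp filter, the function name `evaluate` is hypothetical, and the flag constants are the Linux values, hard-coded here as an assumption. It tests the access-mode bits so that O_RDWR is caught along with O_WRONLY.

```python
import errno

# Linux values of the open(2) access-mode flags (assumed; see fcntl.h).
O_RDONLY, O_WRONLY, O_RDWR, O_ACCMODE = 0, 1, 2, 3

def evaluate(syscall, args):
    """Toy model of a seccomp filter: returns (action, data) for one system call."""
    if syscall == "openat":
        flags = args[2]                       # openat(dirfd, path, flags, mode)
        if (flags & O_ACCMODE) != O_RDONLY:   # O_WRONLY or O_RDWR
            return ("ERRNO", errno.EACCES)
    return ("ALLOW", 0)

# The path argument is a pointer; a real filter sees only its numeric value.
print(evaluate("openat", (-100, 0x7F00, O_RDONLY))[0])   # ALLOW
print(evaluate("openat", (-100, 0x7F00, O_RDWR))[0])     # ERRNO
print(evaluate("read", (3, 0x7F00, 4096))[0])            # ALLOW
```

Because the filter cannot look at the path string, it can only decide on the flags; restricting which files may be written is a job for a MAC system such as SELinux.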

Docker Default seccomp Profile

Docker ships a default seccomp profile. It is written as an allowlist, and in effect it blocks approximately 44 of the 300+ system calls available. Blocked or restricted calls include:

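reboot, kexec_load: could affect the host
ptrace: process debugging/injection (blocked only on kernels older than 4.8)
mount, umount2: filesystem manipulation (permitted only when the container holds CAP_SYS_ADMIN)
create_module, init_module: kernel module loading
clock_settime: changes the system clock (permitted only when the container holds CAP_SYS_TIME)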

# Run with custom seccomp profile:
docker run --security-opt seccomp=/path/to/profile.json nginx

# Run without seccomp (dangerous):
docker run --security-opt seccomp=unconfined nginx
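A profile file passed with --security-opt seccomp=... is a JSON document with a default action and a list of per-syscall rules. The Python sketch below generates a small one. It is illustrative only: `deny_profile` is a hypothetical helper, and the result is a denylist (allow everything except the named calls), whereas Docker's own default profile is an allowlist. Since a custom profile replaces the default, a denylist like this is weaker than the default and should not be used as-is.

```python
import json

def deny_profile(syscalls):
    """Build a minimal Docker seccomp profile that rejects the given system calls."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",          # everything else is permitted
        "syscalls": [
            {"names": sorted(syscalls), "action": "SCMP_ACT_ERRNO"},
        ],
    }

profile = deny_profile(["reboot", "kexec_load", "init_module", "clock_settime"])
print(json.dumps(profile, indent=2))
```

Writing the output to a file and passing its path to docker run applies it. A safer starting point for real use is to copy Docker's default profile and remove entries from its allowlist.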

Chrome and Firefox use seccomp to sandbox their renderer (content) processes: each renderer may call only a small allowlist of system calls, on the order of 70. Even an attacker who achieves full code execution inside the renderer cannot call execve() to spawn a shell, because execve() is not on the allowlist.

SELinux: Mandatory Access Control

Discretionary Access Control (DAC) — Unix permissions — is controlled by the file owner. A process running as a user can do anything that user can do. Mandatory Access Control (MAC) means the kernel enforces a policy that neither the process nor its owner can override.

SELinux assigns a security label to every file, process, network port, and device. The policy defines which labels can interact. A compromised Apache process can access only the types that the policy grants to httpd_t, such as files labeled httpd_sys_content_t. Even if it is running as root, it cannot read /etc/shadow (labeled shadow_t) unless the policy explicitly allows it.

# View SELinux labels:
ls -Z /var/www/html/
# -rw-r--r--. root root unconfined_u:object_r:httpd_sys_content_t:s0 index.html

ps -eZ | grep httpd
# system_u:system_r:httpd_t:s0   1234 httpd

# Check if SELinux is enforcing:
getenforce    # Enforcing / Permissive / Disabled
sestatus      # detailed status

# Why was something denied?
audit2why < /var/log/audit/audit.log
sealert -a /var/log/audit/audit.log

# Allow a specific operation (generate policy module):
audit2allow -a -M my-policy
semodule -i my-policy.pp

SELinux Contexts

A label has format user:role:type:level. The type is the most important component — SELinux policy is largely about type enforcement (TE). Each type has defined allowed operations against other types.

httpd_t (Apache's type) is allowed to read httpd_sys_content_t by policy. It is NOT allowed access to shadow_t, sshd_key_t, or user_home_t. Even a root shell started by Apache stays in the httpd_t domain and cannot escape these constraints: SELinux checks are made in addition to the UID and capability checks, so passing those checks as root does not bypass the policy.
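Type enforcement is default-deny: an access is permitted only if some rule names the source type, target type, object class, and permission. The Python sketch below illustrates the idea with a one-rule stand-in for a policy. Everything here is hypothetical scaffolding (`parse_context`, `te_allows`, `ALLOW_RULES`); real policies contain many thousands of rules and are evaluated in the kernel. The parser assumes labels carry a level field, as on MLS/MCS-enabled systems.

```python
from typing import NamedTuple

class Context(NamedTuple):
    user: str
    role: str
    type: str
    level: str

def parse_context(label):
    """Split user:role:type:level; the level itself may contain ':' (e.g. s0:c1,c2)."""
    user, role, type_, level = label.split(":", 3)
    return Context(user, role, type_, level)

# Hypothetical one-rule policy: (source type, target type, object class, permission).
ALLOW_RULES = {("httpd_t", "httpd_sys_content_t", "file", "read")}

def te_allows(source_label, target_label, tclass, perm):
    """Default-deny lookup: allowed only if an explicit rule matches."""
    key = (parse_context(source_label).type, parse_context(target_label).type, tclass, perm)
    return key in ALLOW_RULES

httpd = "system_u:system_r:httpd_t:s0"
print(te_allows(httpd, "unconfined_u:object_r:httpd_sys_content_t:s0", "file", "read"))  # True
print(te_allows(httpd, "system_u:object_r:shadow_t:s0", "file", "read"))                 # False
```

Note that neither the UID nor the user and role fields take part in the lookup, which is why running as root changes nothing about the outcome.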

[Figure: Linux security layers diagram]

Security Mechanism Comparison

Mechanism	What It Limits	Granularity	Used By	Overhead
Capabilities	Specific privileged operations	Per-capability (41 tokens)	All Linux processes, containers	Negligible (checked only when a privileged operation is attempted)
Namespaces	Visibility of kernel resources	Per-namespace-type per process	Docker, Podman, systemd	Low (network namespaces add a veth/bridge/NAT hop)
cgroups v2	CPU, memory, I/O, PID count	Per-cgroup-hierarchy	Docker, Kubernetes, systemd	Low (accounting at kernel boundaries)
seccomp	System call filter (allowlist or denylist)	Per-syscall (+ argument filtering)	Chrome, Docker, Firefox, systemd	Roughly 1% syscall overhead (BPF evaluation), workload-dependent
SELinux	File/process/network label interactions	Per-type-per-operation (MAC policy)	RHEL, Fedora, Android	Roughly 2–5% file I/O overhead, workload-dependent
AppArmor	File and network access per program	Per-program profile	Ubuntu, SUSE, Debian	Similar to SELinux

Key Takeaways

Modern Linux security is not one mechanism — it is five overlapping mechanisms, each designed to limit a different attack surface. The layering is intentional: capabilities reduce the blast radius of privileged code, namespaces prevent lateral movement across isolation boundaries, cgroups prevent resource exhaustion attacks, seccomp eliminates entire classes of kernel attack surface, and SELinux enforces policy that survives even a root compromise.

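Container security is the practical application of all five. A Docker container with default settings runs with a reduced capability set, in six namespaces (PID, network, mount, UTS, IPC, and, on cgroup v2 hosts, cgroup), under cgroup accounting and whatever resource limits you set, under the default seccomp filter, and (on RHEL/Fedora) under SELinux confinement. An attacker usually has to defeat several of these layers to escape a container, which is the point of defense-in-depth. The layers are not fully independent, though: every container shares the host kernel, so a single kernel vulnerability reachable through an allowed system call can bypass all of them. The layering is why containers are considered safe enough for production despite sharing a kernel with the host.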

The weakest link remains CAP_SYS_ADMIN. It is so broad that granting it to a container is nearly equivalent to running privileged. Audit your container security posture by checking which capabilities are granted (docker inspect --format='{{.HostConfig.CapAdd}}' <container>), which seccomp profile is applied (docker inspect --format='{{.HostConfig.SecurityOpt}}' <container>), and whether SELinux is enforcing (getenforce). These checks will tell you more about your actual security posture than any compliance checklist.
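Such an audit is easy to script against the JSON that docker inspect prints. The following Python sketch flags the most common ways a container's defaults get weakened. The function name `audit` and the sample document are hypothetical; the HostConfig fields it reads (Privileged, CapAdd, SecurityOpt) are the ones docker inspect reports, but verify the exact strings against your Docker version.

```python
import json

def audit(container):
    """Inspect one element of the `docker inspect` JSON array and list risky settings."""
    host = container.get("HostConfig") or {}
    caps = {c.upper().replace("CAP_", "", 1) for c in host.get("CapAdd") or []}
    opts = host.get("SecurityOpt") or []
    findings = []
    if host.get("Privileged"):
        findings.append("privileged mode")
    if "SYS_ADMIN" in caps or "ALL" in caps:
        findings.append("CAP_SYS_ADMIN granted")
    if "seccomp=unconfined" in opts:
        findings.append("seccomp disabled")
    if "label=disable" in opts:
        findings.append("SELinux labeling disabled")
    return findings

# Hypothetical inspect output for a container started with weakened settings:
sample = json.loads(
    '[{"HostConfig": {"Privileged": false, "CapAdd": ["SYS_ADMIN"],'
    ' "SecurityOpt": ["seccomp=unconfined"]}}]'
)
print(audit(sample[0]))   # ['CAP_SYS_ADMIN granted', 'seccomp disabled']
```

An empty result means none of these particular switches were flipped. It does not by itself prove the container is well confined.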
