AiTechWorlds
In 2013, Edward Snowden revealed that NSA programs exploited OS-level vulnerabilities. The Linux response accelerated: capabilities, namespaces, seccomp, SELinux, and container security all got dramatically stronger. Understanding OS security mechanisms means understanding why modern systems are actually secure — and where they still are not.
The challenge was not new. The Unix model from 1969 was binary: root (UID 0) can do anything, everyone else can do almost nothing security-sensitive. For 30 years, every daemon that needed any privilege — sending network packets, binding to port 80, reading other users' files — had to run as root. Compromise one daemon and you owned the entire machine. The explosion of networked services in the 1990s made this catastrophic. Apache running as root meant a buffer overflow in an HTTP parser gave an attacker a root shell.
Since the 2010s, the Linux security model has been layered defense-in-depth. Capabilities split root privilege into fine-grained tokens. Namespaces give a group of processes its own isolated view of kernel resources. cgroups limit resource consumption. seccomp whitelists the allowed system calls. SELinux enforces mandatory access control that processes cannot override. Docker, Kubernetes, and the mainstream container runtimes are built on these five mechanisms. Understanding them is understanding the foundation of modern cloud infrastructure.
Classic Unix security is three components:
- rwxr-xr-x: owner, group, and others each get read/write/execute bits.
This works perfectly for a timesharing system where you trust the administrator. It fails badly for a networked server:
- ping needs raw sockets (requires CAP_NET_RAW), so ping has the setuid-root bit
- ntpd needs to set system time (requires CAP_SYS_TIME), so ntpd runs as root
Since Linux kernel 2.2 (1999), the superuser privilege has been divided into capabilities: individual tokens that grant specific privileges. Linux 6.x defines 41 capabilities.
| Capability | What It Grants | Example User |
|---|---|---|
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Web server, SSH daemon |
| CAP_NET_RAW | Use raw sockets, packet capture | ping, tcpdump, Wireshark |
| CAP_NET_ADMIN | Configure network interfaces, iptables | Network management daemons |
| CAP_SYS_PTRACE | Trace/debug any process | gdb, strace |
| CAP_SYS_ADMIN | Broad administrative functions (mount, ioctl, etc.) | systemd, Docker daemon |
| CAP_CHOWN | Change file owner/group to any value | chown command |
| CAP_KILL | Send signals to any process | systemd, kill commands |
| CAP_SYS_TIME | Set system clock | ntpd, chronyd |
| CAP_DAC_OVERRIDE | Bypass file read/write/execute permission checks | Backup utilities |
| CAP_SETUID | Change UID to any value | su, sudo |
# Grant a binary specific capabilities without making it setuid-root:
setcap cap_net_bind_service+ep /usr/bin/node # Node.js can bind port 80
setcap cap_net_raw+ep /usr/bin/ping # ping can use raw sockets
# View capabilities on a file:
getcap /usr/bin/ping
# /usr/bin/ping cap_net_raw=ep
# View capabilities of a running process:
grep Cap /proc/$(pidof -s nginx)/status # -s returns a single PID; nginx runs several processes
# CapInh: 0000000000000000
# CapPrm: 0000000000000400
# CapEff: 0000000000000400 (decode: capsh --decode=0000000000000400)
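Each hexadecimal mask in /proc/<pid>/status is a bitmap in which bit n stands for capability number n. The following is a minimal sketch of the decoding that capsh --decode performs. The CAP_NAMES list follows the numbering in linux/capability.h for Linux 6.x, and decode_caps is a hypothetical helper name, not part of any library.

```python
# Capability names indexed by bit number (numbering from linux/capability.h).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid",
    "cap_setpcap", "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(hex_mask):
    """Return the names of the capabilities whose bits are set in hex_mask."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask:
        if mask & 1:
            # Fall back to a numeric name for bits newer than this table.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"cap_{bit}")
        mask >>= 1
        bit += 1
    return names


print(decode_caps("0000000000000400"))  # ['cap_net_bind_service']
```

Bit 10 is the only bit set in 0x400, which is why the nginx example above decodes to CAP_NET_BIND_SERVICE alone.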
Docker does not give containers the full root capability set. Its default set keeps 14 capabilities, including CAP_NET_BIND_SERVICE, CAP_CHOWN, and CAP_SETUID, and drops the rest, among them CAP_SYS_ADMIN, CAP_SYS_PTRACE, and CAP_NET_ADMIN. You can drop everything and add back only what the workload needs:
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx # minimal capabilities
Namespaces create isolated views of kernel resources. A process in a namespace sees only the resources visible within that namespace, not the host's global view.
Linux 6.x implements 8 namespace types: mount, UTS, IPC, PID, network, user, cgroup, and time.
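In a PID namespace, process numbering starts over: inside a container, ps aux shows processes starting at PID 1. The container's init process is PID 1 inside the namespace and (for example) PID 34821 on the host. The container cannot see host PIDs, and its processes cannot signal anything outside their own namespace. The host, by contrast, can still see and signal the container's processes.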
# From host: see container's real PID
docker inspect --format='{{.State.Pid}}' my-container
# From container: only sees its own PID namespace
docker exec my-container ps aux
# PID 1 = the container's CMD
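The kernel reports both views of a process in the NSpid field of /proc/<pid>/status (available since Linux 4.1), listing the PID in each nested namespace from outermost to innermost. The sketch below parses that field. The sample text is made up for illustration and reuses the example PID 34821 from above; parse_nspid is a hypothetical helper name.

```python
def parse_nspid(status_text):
    """Return the PIDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


# Illustrative excerpt of /proc/34821/status as read from the host.
sample = "Name:\tnginx\nPid:\t34821\nNSpid:\t34821\t1\n"

host_pid, *inner_pids = parse_nspid(sample)
print(host_pid, inner_pids)  # 34821 [1]
```

A process that is not in a nested PID namespace has a single value on this line, so inner_pids would be empty.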
Each network namespace has its own:
- Network interfaces (lo, eth0, virtual veth pairs)
Docker creates a veth pair: one end in the container's network namespace, one end in the host namespace connected to a bridge (docker0). The container sends packets through its eth0 → host's veth-peer → docker0 bridge → NAT → external network.
The user namespace is the most powerful namespace for security: it maps a range of container UIDs to non-root host UIDs. UID 0 inside the container corresponds to an unprivileged UID (e.g., 100000) on the host.
# Unprivileged container: UID 0 inside maps to UID 100000 outside
cat /proc/<container-pid>/uid_map
# 0 100000 65536
# (container UIDs 0-65535 map to host UIDs 100000-165535)
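Each uid_map line has three fields: the first UID inside the namespace, the first UID outside, and the length of the range. A minimal sketch of the translation, assuming the map text has already been read from /proc/<pid>/uid_map (map_uid is a hypothetical helper name):

```python
def map_uid(uid_map_text, inside_uid):
    """Translate a UID inside the namespace to the host UID, or None if unmapped."""
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside_start, outside_start, length = map(int, fields)
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None


uid_map = "0 100000 65536"
print(map_uid(uid_map, 0))      # 100000
print(map_uid(uid_map, 1000))   # 101000
print(map_uid(uid_map, 65536))  # None
```

A UID with no mapping has no identity on the host at all, which is why the last lookup returns None rather than a fallback value.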
This enables rootless containers: Podman and rootless Docker run entire container runtimes without any host root privileges. Even if a container process escapes the namespace, it has no host capabilities.
Control Groups (cgroups) limit and account for resource usage. cgroups v2, stable since kernel 4.5 and now the default on most major distributions, provides a unified hierarchy.
cgroups v2 uses a single hierarchy of cgroup directories under /sys/fs/cgroup/. Each directory is a cgroup; files within it control resource limits.
ls /sys/fs/cgroup/system.slice/docker-abc123.scope/
# cgroup.controllers cpu.max memory.max io.max pids.max ...
# CPU bandwidth: allow 50% of one CPU core
echo "50000 100000" > /sys/fs/cgroup/my-app/cpu.max
# Format: quota period (microseconds). 50000/100000 = 50%
# Memory limit: hard limit at 512MB
echo "536870912" > /sys/fs/cgroup/my-app/memory.max
# Swap limit: disable swap for this cgroup
echo "0" > /sys/fs/cgroup/my-app/memory.swap.max
# I/O bandwidth: write limit of 10MB/s on device 8:0
echo "8:0 wbps=10485760" > /sys/fs/cgroup/my-app/io.max
# PID limit: at most 100 processes
echo "100" > /sys/fs/cgroup/my-app/pids.max
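The values written above are plain arithmetic: a CPU quota is a fraction of the period in microseconds, and byte limits are powers of 1024. The sketch below reproduces them. The helper names cpu_max and mib are hypothetical, and the 100000 microsecond period is simply the one used in the example above.

```python
def cpu_max(cores, period_us=100_000):
    """Format a cpu.max value that grants `cores` CPUs' worth of time per period."""
    quota_us = round(cores * period_us)
    return f"{quota_us} {period_us}"


def mib(n):
    """Convert mebibytes to bytes."""
    return n * 1024 * 1024


print(cpu_max(0.5))           # 50000 100000
print(mib(512))               # 536870912
print(f"8:0 wbps={mib(10)}")  # 8:0 wbps=10485760
```

A quota larger than the period is valid and means more than one core, so cpu_max(2) yields "200000 100000".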
Docker's --cpus, --memory, and --pids-limit flags write cpu.max, memory.max, and pids.max, and --blkio-weight sets the cgroup's I/O weight. Kubernetes resource requests and limits are enforced through the same cgroup v2 controllers.
Even with namespaces and capabilities, a process can still call any Linux system call. seccomp (Secure Computing mode) whitelists the system calls a process is allowed to make.
# View a process's seccomp mode:
grep Seccomp /proc/$(pidof -s chrome)/status # -s returns a single PID; chrome runs many processes
# Seccomp: 2
# 0 = not in seccomp, 1 = strict (only read/write/exit/sigreturn), 2 = filter mode
seccomp filters are written as BPF (Berkeley Packet Filter) programs — a restricted virtual machine that evaluates system call arguments and returns an action (ALLOW, KILL, ERRNO, TRAP):
// Pseudocode: fail openat() calls that open a file write-only
if (syscall == SYS_openat && (flags & O_ACCMODE) == O_WRONLY)
    return SECCOMP_RET_ERRNO | EACCES;
return SECCOMP_RET_ALLOW;
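The decision logic of such a filter can be modelled in a few lines of Python. This is only a model of the verdict, not a real filter: actual filters are BPF programs installed with prctl(2) or seccomp(2), usually through libseccomp. The sketch compares the access-mode bits (flags & O_ACCMODE) against O_WRONLY, because the access mode is a two-bit field in which O_RDONLY is 0, not a set of independent flags. filter_decision is a hypothetical name.

```python
import errno
import os


def filter_decision(syscall, flags=0):
    """Model the filter's verdict: deny write-only openat, allow everything else."""
    write_only = (flags & os.O_ACCMODE) == os.O_WRONLY
    if syscall == "openat" and write_only:
        return ("ERRNO", errno.EACCES)
    return ("ALLOW", None)


print(filter_decision("openat", os.O_WRONLY | os.O_CREAT))  # denied with EACCES
print(filter_decision("openat", os.O_RDONLY))               # allowed
print(filter_decision("read"))                              # allowed
```

Note that this rule lets O_RDWR through; a filter meant to block all writes would have to test for both O_WRONLY and O_RDWR.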
Docker ships a default seccomp profile that blocks approximately 44 system calls out of the ~400 available. Blocked calls include:
- reboot, kexec_load: could affect the host
- ptrace: process debugging/injection
- mount, umount2: filesystem manipulation
- create_module, init_module: kernel module loading
- clock_settime: change system clock
# Run with custom seccomp profile:
docker run --security-opt seccomp=/path/to/profile.json nginx
# Run without seccomp (dangerous):
docker run --security-opt seccomp=unconfined nginx
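A custom profile is a JSON document with a default action and a list of per-syscall rules. The sketch below generates a minimal allowlist profile using the defaultAction, syscalls, names, and action keys of Docker's profile format. The syscall set shown is illustrative only and far too small to start a real container; make_profile is a hypothetical helper name.

```python
import json


def make_profile(allowed_syscalls):
    """Build an allowlist profile: listed syscalls pass, all others fail with an errno."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(allowed_syscalls), "action": "SCMP_ACT_ALLOW"}
        ],
    }


# Illustrative set only; a working container needs many more syscalls.
profile = make_profile({"read", "write", "exit_group", "rt_sigreturn"})
print(json.dumps(profile, indent=2))
```

In practice it is usually safer to start from Docker's published default profile and remove syscalls than to build an allowlist from nothing.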
Chrome and Firefox use seccomp to isolate each renderer process: the renderer can only call a whitelist of ~70 syscalls. Even an attacker with full code execution inside the renderer cannot call execve() to spawn a shell, because it is not in the whitelist.
Discretionary Access Control (DAC) — Unix permissions — is controlled by the file owner. A process running as a user can do anything that user can do. Mandatory Access Control (MAC) means the kernel enforces a policy that neither the process nor its owner can override.
SELinux assigns a security label to every file, process, network port, and device. The policy defines which labels can interact. A compromised Apache process can only access files whose labels the policy permits, such as httpd_sys_content_t. Even if it is running as root, it cannot read /etc/shadow (labeled shadow_t) unless the policy explicitly allows it.
# View SELinux labels:
ls -Z /var/www/html/
# -rw-r--r--. root root unconfined_u:object_r:httpd_sys_content_t:s0 index.html
ps -eZ | grep httpd
# system_u:system_r:httpd_t:s0 1234 httpd
# Check if SELinux is enforcing:
getenforce # Enforcing / Permissive / Disabled
sestatus # detailed status
# Why was something denied?
audit2why < /var/log/audit/audit.log
sealert -a /var/log/audit/audit.log
# Allow a specific operation (generate policy module):
audit2allow -a -M my-policy
semodule -i my-policy.pp
A label has the format user:role:type:level. The type is the most important component: SELinux policy is largely about type enforcement (TE). Each type has a defined set of operations it may perform on other types.
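Splitting a label into its four fields takes some care, because the level can itself contain colons when MCS categories are present (for example s0:c1,c2). A minimal sketch, assuming labels always carry a level field as on MLS/MCS-enabled systems; SELinuxContext and parse_context are hypothetical names:

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str


def parse_context(label):
    """Split user:role:type:level, keeping any colons inside the level."""
    user, role, type_, level = label.split(":", 3)
    return SELinuxContext(user, role, type_, level)


print(parse_context("system_u:system_r:httpd_t:s0").type)             # httpd_t
print(parse_context("system_u:system_r:container_t:s0:c1,c2").level)  # s0:c1,c2
```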
httpd_t (Apache's type) is allowed read on httpd_sys_content_t by policy. It is NOT allowed access to shadow_t, sshd_key_t, or user_home_t. Even a root shell started by Apache cannot escape these constraints: SELinux checks are applied in addition to DAC and capability checks, so holding root's capabilities does not bypass them.
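Type enforcement is default-deny: an access succeeds only if some allow rule names the source type, the target type, the object class, and the permission. The toy model below captures that lookup. The single rule in ALLOW_RULES is illustrative and not copied from any shipped policy, and is_allowed is a hypothetical name.

```python
# Illustrative rules only: (source type, target type, object class) -> permissions.
ALLOW_RULES = {
    ("httpd_t", "httpd_sys_content_t", "file"): {"read", "getattr", "open"},
}


def is_allowed(source_type, target_type, obj_class, perm):
    """Default-deny check: permit only what an allow rule explicitly lists."""
    return perm in ALLOW_RULES.get((source_type, target_type, obj_class), set())


print(is_allowed("httpd_t", "httpd_sys_content_t", "file", "read"))   # True
print(is_allowed("httpd_t", "httpd_sys_content_t", "file", "write"))  # False
print(is_allowed("httpd_t", "shadow_t", "file", "read"))              # False
```

The UID of the process never appears in the lookup, which is the point: running as root changes nothing about which rule matches.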
| Mechanism | What It Limits | Granularity | Used By | Overhead |
|---|---|---|---|---|
| Capabilities | Specific privileged operations | Per-capability (41 tokens) | All Linux processes, containers | Negligible (capability check per syscall) |
| Namespaces | Visibility of kernel resources | Per-namespace-type per process | Docker, Podman, systemd | Moderate (namespace lookup per access) |
| cgroups v2 | CPU, memory, I/O, PID count | Per-cgroup-hierarchy | Docker, Kubernetes, systemd | Low (accounting at kernel boundaries) |
| seccomp | System call whitelist | Per-syscall (+ argument filtering) | Chrome, Docker, Firefox, systemd | ~1% syscall overhead (BPF evaluation) |
| SELinux | File/process/network label interactions | Per-type-per-operation (MAC policy) | RHEL, Fedora, Android | 2–5% file I/O overhead |
| AppArmor | File and network access per program | Per-program profile | Ubuntu, SUSE, Debian | Similar to SELinux |
Modern Linux security is not one mechanism — it is five overlapping mechanisms, each designed to limit a different attack surface. The layering is intentional: capabilities reduce the blast radius of privileged code, namespaces prevent lateral movement across isolation boundaries, cgroups prevent resource exhaustion attacks, seccomp eliminates entire classes of kernel attack surface, and SELinux enforces policy that survives even a root compromise.
Container security is the practical application of all five. A Docker container with default settings runs with a reduced capability set, in six namespaces, under cgroup resource limits, under the default seccomp filter, and (on RHEL/Fedora) under SELinux confinement. Five independent mechanisms must all fail simultaneously for a container escape to succeed. That is the definition of defense-in-depth — and it is why containers are considered secure in production despite sharing a kernel with the host.
The weakest link remains CAP_SYS_ADMIN. It is so broad that granting it to a container is nearly equivalent to running privileged. Audit your container security posture by checking which capabilities are granted (docker inspect --format='{{.HostConfig.CapAdd}}' <container>), which seccomp profile is applied, and whether SELinux labels are enforced. These three checks will tell you more about your actual security posture than any compliance checklist.
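These checks can be scripted against the JSON that docker inspect prints. The sketch below reads the HostConfig fields Privileged, CapAdd, and SecurityOpt and reports the riskiest settings. The warning strings and the audit_container name are hypothetical, and the set of checks is a starting point, not a complete audit.

```python
import json


def audit_container(inspect_json):
    """Return warnings for the first container in `docker inspect` JSON output."""
    host_config = json.loads(inspect_json)[0]["HostConfig"]
    # CapAdd and SecurityOpt are null when nothing was configured.
    cap_add = [cap.upper() for cap in (host_config.get("CapAdd") or [])]
    security_opt = host_config.get("SecurityOpt") or []

    warnings = []
    if host_config.get("Privileged"):
        warnings.append("runs privileged")
    if any(cap in ("SYS_ADMIN", "CAP_SYS_ADMIN", "ALL") for cap in cap_add):
        warnings.append("CAP_SYS_ADMIN granted")
    if "seccomp=unconfined" in security_opt:
        warnings.append("seccomp disabled")
    if "label=disable" in security_opt:
        warnings.append("SELinux labeling disabled")
    return warnings


# Made-up inspect output for illustration.
sample = json.dumps([{"HostConfig": {
    "Privileged": False,
    "CapAdd": ["SYS_ADMIN"],
    "SecurityOpt": ["seccomp=unconfined"],
}}])
print(audit_container(sample))  # ['CAP_SYS_ADMIN granted', 'seccomp disabled']
```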