AiTechWorlds
When you call open("file.txt", "r") in Python, you are not accessing the filesystem. You are asking permission to access it. Your Python process runs in Ring 3 — user space — with no direct access to hardware, no ability to read arbitrary memory, and no way to touch the filesystem on its own. The filesystem lives in kernel space, where Ring 0 code runs with full hardware privilege.
That call crosses a boundary that exists for one reason: security. If user code could access the filesystem directly, any program could read any file, overwrite kernel data structures, and crash the entire system. The system call is the carefully controlled gate in the wall between user space and kernel space.
When open() executes, here is what actually happens: Python calls the C library's open() wrapper, which loads a syscall number into a CPU register and executes a single instruction (syscall on x86-64) that transfers control to the kernel. The kernel validates the request, performs the operation, and returns a result. Your Python code resumes. The transition into the kernel and back costs roughly 100 nanoseconds, plus whatever time the kernel spends doing the actual work.
Understanding this mechanism is not academic. It determines performance bottlenecks, security boundaries, and the fundamental architecture of every application you will ever write.
System calls are the formal API between user programs and the OS kernel. They are the only legitimate way for user-space code to request privileged operations.
Linux 6.x defines approximately 350 system calls. Windows defines over 1,000 (though many are internal and undocumented). The POSIX standard specifies a portable subset that works across Linux, macOS, and BSD — the interface your C standard library wraps.
You almost never call system calls directly. The C library (glibc on Linux) provides wrapper functions: open(), read(), write(). These wrappers handle argument marshaling and error translation (the kernel returns a negative error code; glibc stores its positive value in the thread-local errno variable and returns -1), and in some cases they bypass the kernel entirely.
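The error translation described above can be seen from Python, which layers one more step on top: CPython turns the wrapper's failure and errno value into an OSError. The following is a minimal sketch; the helper name try_open and the path are hypothetical and chosen only for illustration.

```python
import errno
import os


def try_open(path):
    """Return (fd, None) on success or (None, errno_code) on failure."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError as exc:
        # CPython builds this exception from the errno value left by the C wrapper.
        return None, exc.errno
    return fd, None


# Hypothetical path that is assumed not to exist on this machine.
fd, err = try_open("/nonexistent/hypothetical-file")
if err is not None:
    print("open failed:", errno.errorcode[err], os.strerror(err))
else:
    os.close(fd)
```

The numeric code in exc.errno is the same value a C program would read from errno after the wrapper returned -1.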
Process management: fork() creates a new process by duplicating the caller. exec() (execve) replaces the current process image with a new program. exit() terminates the process. wait() / waitpid() waits for a child process to change state. getpid() returns the process ID.
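As a rough illustration of the fork/exit/wait cycle, the sketch below uses Python's thin os wrappers on a Unix-like system. The helper name run_child is an assumption made for this example, and exec() is left out to keep it short.

```python
import os


def run_child(exit_code):
    """Fork, let the child exit with exit_code, and return what the parent observes."""
    pid = os.fork()
    if pid == 0:
        # Child process: terminate immediately, skipping the parent's cleanup handlers.
        os._exit(exit_code)
    # Parent process: block until the child changes state.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


observed = run_child(7)
print("parent pid:", os.getpid(), "child exit code:", observed)
```

A shell does essentially this for each command, with an exec call in the child between fork and exit.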
File I/O: open() / openat() opens a file and returns a file descriptor. read() and write() transfer data. close() releases the descriptor. lseek() repositions the file offset. mmap() maps a file or device into the process's virtual address space.
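The descriptor-based calls above map almost one-to-one onto functions in Python's os module. This sketch assumes a writable temporary directory; the helper name roundtrip and the file name demo.txt are hypothetical.

```python
import os
import tempfile


def roundtrip(payload):
    """Write payload through a raw file descriptor, seek back, and read it again."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "demo.txt")
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)  # returns an integer descriptor
    try:
        written = os.write(fd, payload)   # number of bytes actually written
        os.lseek(fd, 0, os.SEEK_SET)      # reposition the offset to the start
        data = os.read(fd, written)
    finally:
        os.close(fd)                      # release the descriptor
        os.unlink(path)
        os.rmdir(directory)
    return data


print(roundtrip(b"hello, kernel"))
```

Running this under strace should show the corresponding kernel calls, although the exact trace depends on the Python build and C library.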
Memory management: brk() and sbrk() adjust the heap boundary (used internally by malloc). mmap() is the modern interface for both file mapping and anonymous memory allocation. munmap() releases mappings.
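Anonymous memory allocation through mmap() can be tried from Python's mmap module, where a file descriptor of -1 requests a mapping with no file behind it. This is a small sketch; the helper name anonymous_page_demo is an assumption for this example.

```python
import mmap


def anonymous_page_demo(message):
    """Map one anonymous page, store message in it, and read it back."""
    # fileno=-1 asks for memory that is not backed by any file.
    with mmap.mmap(-1, mmap.PAGESIZE) as region:
        region[:len(message)] = message
        return bytes(region[:len(message)]), len(region)
    # Leaving the with-block closes the mapping, which releases the memory.


data, size = anonymous_page_demo(b"heap-free allocation")
print(data, size)
```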
Networking: socket() creates a communication endpoint. bind(), listen(), accept() set up a server. connect() initiates a client connection. send() / recv() transfer data. setsockopt() configures socket behavior.
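The server and client sequences above can be exercised in a single process over the loopback interface. The sketch assumes loopback networking is available and that a short message arrives in one recv() call; the helper name echo_once is hypothetical.

```python
import socket


def echo_once(message):
    """Run one client/server exchange over 127.0.0.1 and return the echoed bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        server.bind(("127.0.0.1", 0))      # port 0: let the kernel pick a free port
        server.listen(1)
        port = server.getsockname()[1]

        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
            client.connect(("127.0.0.1", port))  # completes via the listen backlog
            conn, _ = server.accept()
            with conn:
                client.sendall(message)
                received = conn.recv(len(message))
                conn.sendall(received)           # echo it back
                return client.recv(len(message))


print(echo_once(b"ping"))
```

A real server would loop on recv() and handle partial reads; that is omitted here to keep the call sequence visible.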
Signals: kill() sends a signal to a process. sigaction() installs a signal handler. signal() is the simplified (and less reliable) predecessor.
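A process can send a signal to itself to see handler installation and delivery together. The sketch below assumes a Unix-like system and that it runs in the main thread, which Python requires for installing handlers. The names on_usr1 and send_self_signal are hypothetical.

```python
import os
import signal
import time

received = []


def on_usr1(signum, frame):
    # Runs in the main thread once the interpreter notices the pending signal.
    received.append(signum)


def send_self_signal(timeout=1.0):
    """Install a SIGUSR1 handler, signal this process, and return what arrived."""
    received.clear()
    previous = signal.signal(signal.SIGUSR1, on_usr1)
    try:
        os.kill(os.getpid(), signal.SIGUSR1)
        deadline = time.monotonic() + timeout
        # Wait briefly in case the handler has not run yet.
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        signal.signal(signal.SIGUSR1, previous)  # restore the earlier handler
    return list(received)


print(send_self_signal())
```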
IPC: pipe() creates a unidirectional byte channel. msgget() / msgsnd() / msgrcv() implement System V message queues. semget() creates semaphores. shmget() allocates shared memory segments.
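Of the IPC mechanisms listed, pipe() is the simplest to try. The sketch below writes into one end and reads from the other within a single process. It assumes the message is smaller than the pipe's kernel buffer, otherwise the write would block; the helper name pipe_roundtrip is hypothetical.

```python
import os


def pipe_roundtrip(message):
    """Send message through a pipe and return the bytes read from the other end."""
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, message)
    finally:
        os.close(write_fd)   # closing the write end lets the reader see end-of-file
    chunks = []
    try:
        while True:
            chunk = os.read(read_fd, 4096)
            if not chunk:    # empty read means every writer has closed
                break
            chunks.append(chunk)
    finally:
        os.close(read_fd)
    return b"".join(chunks)


print(pipe_roundtrip(b"one-way bytes"))
```

In practice the two ends usually live in different processes, typically a parent and a child created with fork().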
The following describes the exact hardware path of a system call on x86-64 Linux 6.x.
In user space, before the syscall instruction:
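- Load the syscall number into rax (e.g., rax = 2 for open)
- Load the arguments into rdi (arg1), rsi (arg2), rdx (arg3), r10 (arg4), r8 (arg5), r9 (arg6)
- Execute the syscall instruction

Hardware actions on syscall:
- The CPU saves rip (return address) into rcx and rflags into r11
- It loads the new rip from the LSTAR MSR (Model-Specific Register)
- Execution continues at entry_SYSCALL_64 in the Linux kernel

In kernel space:
- The entry code calls do_syscall_64()
- do_syscall_64() indexes sys_call_table[rax], an array of function pointers
- The selected handler runs (e.g., sys_openat)
- The return value is placed in rax

Returning to user space:
- The sysret instruction executes
- The CPU restores rip from rcx and rflags from r11
- User code reads the return value from rax

The entire mechanism costs approximately 100 nanoseconds under normal conditions. This seems fast, but a tight loop making syscalls can spend 50%+ of its time in kernel transitions.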
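To make the role of the syscall number concrete, the sketch below invokes getpid by number through the C library's generic syscall() function instead of a named wrapper. It assumes x86-64 Linux, where getpid is number 39, and returns None elsewhere because numbers differ per architecture. The names SYS_GETPID_X86_64 and raw_getpid are hypothetical.

```python
import ctypes
import os
import platform

SYS_GETPID_X86_64 = 39  # assumption: valid only for the x86-64 Linux syscall table


def raw_getpid():
    """Call getpid by syscall number; return None on other platforms."""
    if platform.system() != "Linux" or platform.machine() != "x86_64":
        return None
    libc = ctypes.CDLL(None, use_errno=True)   # symbols already loaded in this process
    libc.syscall.restype = ctypes.c_long
    # libc's syscall() puts the number in rax and executes the syscall instruction.
    return libc.syscall(ctypes.c_long(SYS_GETPID_X86_64))


print("raw syscall:", raw_getpid(), "os.getpid():", os.getpid())
```

On a matching platform both values should agree, since os.getpid() ends up at the same kernel handler.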
The Virtual Dynamic Shared Object (VDSO) is a kernel mechanism that maps a small region of kernel-provided code directly into every process's virtual address space. User code calls it like a regular function, but the code runs in user space and only reads data that the kernel maintains.
The classic example is gettimeofday(). Getting the current time normally requires a system call — but time is not sensitive data that needs access control. The kernel maintains a shared memory page (vvar) that contains the current time and clock parameters. The VDSO's gettimeofday() reads from this page without ever executing syscall.
Result: gettimeofday() via VDSO takes approximately 5 nanoseconds. Via a real system call, approximately 100 nanoseconds — a 20x difference.
Other calls accelerated by VDSO: clock_gettime(), clock_getres(), time(), and getcpu().
You can inspect a process's VDSO mapping in /proc/[PID]/maps — look for the line tagged [vdso].
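A short script can do the same lookup programmatically. The sketch below parses the text of a maps file and returns the address range of the [vdso] line. The helper name find_vdso is hypothetical, and the script falls back to a message on systems without /proc.

```python
def find_vdso(maps_text):
    """Return (start, end) addresses of the [vdso] mapping, or None if absent."""
    for line in maps_text.splitlines():
        fields = line.split()
        # The last column of a maps line is the backing path or a bracketed tag.
        if fields and fields[-1] == "[vdso]":
            start, end = fields[0].split("-")
            return int(start, 16), int(end, 16)
    return None


try:
    with open("/proc/self/maps") as maps_file:
        span = find_vdso(maps_file.read())
    if span is None:
        print("no [vdso] mapping found")
    else:
        print("vdso at %#x-%#x (%d bytes)" % (span[0], span[1], span[1] - span[0]))
except OSError:
    print("/proc/self/maps is not available on this system")
```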
strace intercepts every system call made by a process and prints them. It is invaluable for debugging and performance analysis.
strace ls /tmp
The output shows the full system call trace: execve loads the ls binary, openat opens the directory, getdents64 reads directory entries, write outputs the results, exit_group terminates. You see the exact sequence of kernel interactions for a simple directory listing.
strace -c ls /tmp produces a summary with call counts and cumulative time per syscall. strace -p PID attaches to a running process.
Understanding strace output lets you diagnose: why a program is slow (too many syscalls), what files it accesses (security audit), and why it is hanging (blocked on which syscall).
/proc is a virtual filesystem — it has no disk backing. It is a window into kernel data structures, rendered as files and directories.
- /proc/[PID]/maps: the complete virtual memory map of a process (every mapped region, its permissions, and what file backs it)
- /proc/[PID]/fd/: a directory of symlinks, one per open file descriptor
- /proc/[PID]/status: human-readable task_struct fields (state, memory usage, context switches)
- /proc/cpuinfo: per-core CPU information including model, flags, and cache sizes
- /proc/meminfo: system-wide memory statistics (total, free, buffers, swap)
- /proc/sys/kernel/: tunable kernel parameters (writable; this is the sysctl interface)

The /proc filesystem is not just for inspection. It is the primary configuration interface for a running Linux system. Writing 1 to /proc/sys/net/ipv4/ip_forward enables IP forwarding without rebooting. On kernels before 5.13, writing to /proc/sys/kernel/sched_latency_ns changed the scheduler's targeted latency period; later kernels moved that tunable to debugfs.
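Because these entries are plain text, ordinary file reads are enough to consume them. The sketch below parses the "Key: value" layout of a status file into a dictionary and lists the current process's open descriptors. The helper name parse_status is hypothetical, and the script assumes a Linux /proc but degrades to a message elsewhere.

```python
import os


def parse_status(text):
    """Turn 'Key:\tvalue' lines from a /proc/[PID]/status file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


try:
    with open("/proc/self/status") as status_file:
        status = parse_status(status_file.read())
    for key in ("Name", "State", "VmRSS", "voluntary_ctxt_switches"):
        print(key, "=", status.get(key, "(not reported)"))
    # Each entry in fd/ is a symlink named after an open descriptor number.
    print("open fds:", sorted(os.listdir("/proc/self/fd"), key=int))
except OSError:
    print("/proc is not available on this system")
```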
/sys (sysfs) is a newer, more structured counterpart that exposes device and driver information as a hierarchy mirroring the kernel's internal object model.
| System Call | Category | Description | Typical Usage | glibc Wrapper |
|---|---|---|---|---|
| openat(2) | File I/O | Open file, return file descriptor | File access, log writing | open() |
| read(2) | File I/O | Read bytes from file descriptor | Consuming input, parsing files | read() |
| write(2) | File I/O | Write bytes to file descriptor | Output, logging | write() |
| mmap(2) | Memory | Map file or anonymous memory | malloc internals, file mapping | mmap() |
| fork(2) | Process | Duplicate calling process | Shell launching commands | fork() |
| execve(2) | Process | Replace process image with new program | Running programs | execv() family |
| clone(2) | Process | Create child that shares selected resources (address space, file descriptors) with the caller | Thread creation | pthread_create() |
| socket(2) | Network | Create network communication endpoint | TCP/UDP servers and clients | socket() |
| futex(2) | Sync | Fast user-space mutex | Mutex/condvar in pthreads | pthread_mutex_lock() |
| epoll_wait(2) | I/O Mux | Wait for events on multiple fds | High-performance servers (nginx) | epoll_wait() |
System calls are not just an API — they are the security boundary that makes multi-process operating systems possible. Every time you write to a file, allocate memory, or open a socket, you are crossing from user space into kernel space and back. The mechanisms involved — privilege rings, the syscall instruction, sys_call_table dispatch — are worth knowing precisely because they explain performance characteristics (why network I/O is expensive, why VDSO calls are cheap) and security properties (why a user process cannot read another's memory without explicit sharing).
When you run strace on a misbehaving program and find it making 10,000 stat() calls per second looking for a configuration file that does not exist, you are not looking at an abstraction — you are watching 1,000,000 nanoseconds of unnecessary privilege transitions per second. That knowledge is actionable.