AiTechWorlds
Run ps aux on any Linux system. You see dozens of entries: system daemons, user shells, background services. Each entry corresponds to a task_struct instance, a C structure defined in include/linux/sched.h in the Linux kernel source tree. In Linux 6.x the definition of task_struct runs to several hundred lines of source, and the exact set of fields depends on the kernel configuration.
Every process, every thread, every kernel worker — all are task_struct. Linux makes no kernel-level distinction between a process and a thread. Both are "tasks." The difference between them is which resources they share, encoded in flags passed to the clone() system call.
Understanding task_struct is understanding how Linux manages the entire computational workload. It is the node in every scheduler queue, the key in every PID lookup, and the repository of everything the kernel knows about a running program. When a process crashes and the kernel logs a backtrace, it is reading from the dying process's task_struct and kernel stack.
task_struct is defined in include/linux/sched.h. The following are the fields you need to understand to reason about process behavior.
Identity and state:
- pid_t pid: the process ID, a unique identifier for this specific task
- pid_t tgid: the thread group ID. For the main thread it equals pid; for other threads it equals the main thread's pid. This is what getpid() returns for all threads in a group
- unsigned int __state: current execution state (see Process States below)
- char comm[TASK_COMM_LEN]: the executable name (16 bytes, including the terminating NUL), visible in ps output

Memory:
- struct mm_struct *mm: pointer to the virtual memory descriptor: page tables, virtual memory areas (VMAs), memory statistics. Threads in the same process share one mm_struct
- struct mm_struct *active_mm: the memory context actually in use; kernel threads have no mm of their own and borrow the previous userspace mm through this field

Files and filesystem:
- struct files_struct *files: the open file descriptor table, shared by all threads in a process
- struct fs_struct *fs: root directory, current working directory, umask

Signals:
- struct signal_struct *signal: signal state shared by the whole thread group, including the shared pending-signal queue
- struct sighand_struct *sighand: the signal disposition table (handler function pointers), with reference counting

Scheduling:
- int prio, static_prio, normal_prio: current, configured, and normal priority values
- unsigned int policy: scheduling policy (SCHED_NORMAL, SCHED_FIFO, SCHED_RR, SCHED_DEADLINE)
- struct sched_entity se: the CFS scheduler entity. It contains vruntime, the key CFS field, and run_node, the red-black tree node (struct rb_node) used for insertion into the CFS runqueue

CPU context:
- struct thread_struct thread: architecture-specific CPU state (x86-64: segment registers, FPU state, syscall entry stack pointer)
- void *stack: pointer to the kernel stack. Every task has a dedicated kernel stack (16KB by default on x86-64)

Relationships:
- struct task_struct *parent: pointer to the parent task
- struct list_head children: list of child tasks
- struct list_head sibling: position in the parent's children list
- struct list_head tasks: entry in the global task list (all tasks linked together)

Accounting:
- u64 utime, stime: CPU time spent in user mode and kernel mode (in nanoseconds)
- struct prev_cputime prev_cputime: previous accounting snapshot
- unsigned long nvcsw, nivcsw: voluntary and involuntary context switch counts

A task's __state field determines where it sits in the kernel's data structures.
- TASK_UNINTERRUPTIBLE: shown as the D state in ps output. A process stuck in D is typically waiting for a storage device that is not responding. You cannot kill a D-state process: SIGKILL cannot interrupt this sleep
- EXIT_ZOMBIE: the process has called exit(). Its memory and most resources are freed, but its task_struct remains until the parent calls wait() to collect the exit status

When a process calls fork(), the kernel executes copy_process():
- A new task_struct is allocated (from the slab allocator)
- fork() returns 0 in the child; the parent receives the child's PID

The critical insight is Copy-on-Write memory. When copy_process() duplicates the parent's mm_struct, it does not copy the actual memory contents. Instead, both parent and child page tables point to the same physical frames, marked read-only. When either process writes to a page, the CPU triggers a page fault, the kernel allocates a new physical frame, copies the page content, and updates the page table. Only then does the write complete.
This means fork()ing a process with 4GB of mapped memory copies page tables and kernel bookkeeping rather than 4GB of data, so it completes far faster than a full copy would. Redis's BGSAVE command forks the main process to write a snapshot to disk, relying on CoW to avoid copying the dataset at fork time.
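The visible effect of Copy-on-Write can be sketched from user space. In the minimal example below, the function name fork_cow_demo and the values 1 and 2 are illustrative assumptions, not anything from the kernel. The child writes to a variable it inherited from the parent, and the parent's copy stays unchanged because the write lands on the child's private copy of the page.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: forks, lets the child modify an inherited variable,
 * and reports what each side sees. Returns 0 on success, -1 on error.
 */
int fork_cow_demo(int *parent_view, int *child_view)
{
    int value = 1;               /* lives on a page shared after fork() */

    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: fork() returned 0. This write faults and the kernel
         * gives the child its own copy of the page. */
        value = 2;
        _exit(value);            /* hand the child's view back as exit status */
    }

    /* Parent: fork() returned the child's PID. */
    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;

    *child_view = WEXITSTATUS(status);
    *parent_view = value;        /* unaffected by the child's write */
    return 0;
}
```

Calling fork_cow_demo(&p, &c) from a main() should leave p at 1 while c reports the child's modified value.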
fork() creates a copy of the parent. exec() (implemented as execve(2)) replaces the current process image:
- do_execve() in the kernel opens and reads the new binary
- The old address space is released (mm_release(), anonymous pages freed)

After a successful execve(), there is no return: the old code is gone.
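The fork-then-exec pattern looks roughly like this from user space. This is a sketch that assumes /bin/sh exists on the system; the function name run_shell_exit and the exit codes 7 and 127 are chosen only for illustration. The line after execl() runs only if the exec fails, which is the "no return" property in practice.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: runs `sh -c "exit 7"` in a child process and
 * returns the child's exit status, or -1 on error.
 */
int run_shell_exit(void)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: replace this process image with /bin/sh (assumed path). */
        execl("/bin/sh", "sh", "-c", "exit 7", (char *)NULL);
        /* Reached only if execl() failed; the old image is still here. */
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The PID of the child is the same before and after the exec; only the program running inside the task changes.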
When a process exits, its task_struct is not immediately freed. The process enters EXIT_ZOMBIE state: resources (memory, file descriptors) are released, but the task_struct remains, holding the exit status (exit_code) until the parent retrieves it via wait() or waitpid().
Why? The parent may need the exit status. The child's PID must remain valid until the parent has called wait(). Without this, a parent could call waitpid(child_pid, ...) and get the wrong result because the PID has been reused.
Zombies accumulate when parents do not call wait(). Each zombie still occupies a PID, so a process with thousands of zombie children can eventually exhaust the available PIDs. This is a real bug category in production systems; check for it when the number of tasks approaches the limit in /proc/sys/kernel/pid_max.
Orphan processes: if a parent exits before its children, those children become orphans. The kernel assigns them to the nearest subreaper (set via prctl(PR_SET_CHILD_SUBREAPER)), or to PID 1 (systemd/init). PID 1 calls wait() in a loop, reaping all orphaned zombies. This is one of init's fundamental responsibilities.
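Reparenting to a subreaper can be sketched as follows. The function name, the report struct, and the 1 ms polling interval are assumptions made for the example. The caller marks itself as a child subreaper, creates a child that immediately exits and orphans a grandchild, and the grandchild reports who its parent is once the original parent is gone.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

struct report { pid_t self; pid_t new_parent; };

/* Returns 1 if the orphan was reparented to the caller, 0 if not, -1 on error. */
int subreaper_adopts_orphan(void)
{
    int fds[2];
    if (prctl(PR_SET_CHILD_SUBREAPER, 1UL) == -1 || pipe(fds) == -1)
        return -1;

    pid_t middle = fork();
    if (middle == -1)
        return -1;
    if (middle == 0) {
        pid_t middle_pid = getpid();
        if (fork() == 0) {
            /* Grandchild: wait until the original parent has exited. */
            struct timespec nap = { 0, 1000000 };   /* 1 ms */
            close(fds[0]);
            while (getppid() == middle_pid)
                nanosleep(&nap, NULL);
            struct report r = { getpid(), getppid() };
            _exit(write(fds[1], &r, sizeof r) == (ssize_t)sizeof r ? 0 : 1);
        }
        _exit(0);                /* middle process exits, orphaning the grandchild */
    }

    close(fds[1]);
    struct report r = { 0, 0 };
    int ok = waitpid(middle, NULL, 0) == middle &&
             read(fds[0], &r, sizeof r) == (ssize_t)sizeof r;
    close(fds[0]);

    int adopted = ok && r.new_parent == getpid();
    if (adopted)
        waitpid(r.self, NULL, 0);            /* reap the adopted grandchild */
    prctl(PR_SET_CHILD_SUBREAPER, 0UL);      /* restore default behavior */
    return ok ? adopted : -1;
}
```

Without the PR_SET_CHILD_SUBREAPER call, the grandchild would instead be handed to an ancestor subreaper or to PID 1, and the function would return 0.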
Linux's implementation insight: a "thread" is just a process that shares certain resources with its parent. There is no separate kernel object for threads.
The clone() system call takes a flags argument that controls what is shared:
clone(CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND | CLONE_THREAD, ...)
- CLONE_VM: share the mm_struct (same virtual address space), which is what makes the new task a thread
- CLONE_FILES: share the file descriptor table
- CLONE_SIGHAND: share signal handlers
- CLONE_THREAD: put the new task in the same thread group (same tgid)

Without these flags, clone() with no sharing gives behavior similar to fork(): a new process with independent copies.
pthread_create() in glibc calls clone() with the appropriate flags. fork() calls clone() with minimal sharing flags. Both are the same kernel primitive.
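The effect of a single sharing flag can be shown with the glibc clone() wrapper. This is a minimal sketch, not how pthread_create() is implemented: it uses only CLONE_VM (plus SIGCHLD so the parent can wait with waitpid()), and the names run_child, child_fn, and the 64KB stack size are assumptions for illustration.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (64 * 1024)

static volatile int shared_value;

/* Runs in the new task: writes to a global and exits. */
static int child_fn(void *arg)
{
    (void)arg;
    shared_value = 42;
    return 0;
}

/*
 * Creates a child with or without CLONE_VM and returns the value of
 * shared_value that the parent sees after the child exits (-1 on error).
 */
int run_child(int share_vm)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    if (!stack)
        return -1;

    shared_value = 0;
    int flags = SIGCHLD | (share_vm ? CLONE_VM : 0);

    /* The stack grows down on x86-64, so pass the top of the buffer. */
    pid_t pid = clone(child_fn, stack + CHILD_STACK_SIZE, flags, NULL);
    if (pid == -1 || waitpid(pid, NULL, 0) == -1) {
        free(stack);
        return -1;
    }

    free(stack);
    return shared_value;
}
```

With CLONE_VM the parent observes the child's write, because both tasks use the same mm_struct; without it the child writes to its own Copy-on-Write copy and the parent still sees 0.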
| State | __state Value | Meaning | What Wakes It | Schedulable? |
|---|---|---|---|---|
| TASK_RUNNING | 0 | Running or queued to run | N/A (already runnable) | Yes |
| TASK_INTERRUPTIBLE | 1 | Sleeping, wakes on signal | Signal or event | No (until woken) |
| TASK_UNINTERRUPTIBLE | 2 | Sleeping, signal-immune | Event only (usually I/O) | No |
| TASK_STOPPED | 4 | Stopped by SIGSTOP/SIGTSTP | SIGCONT | No |
| TASK_TRACED | 8 | Stopped by debugger (ptrace) | Debugger continue | No |
| EXIT_ZOMBIE | 32 (held in exit_state) | Exited, awaiting parent wait() | Parent calls wait() | No |
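The stopped state and its wake-up condition can be exercised from user space. In this sketch the function name stop_and_continue and the exit code 3 are illustrative assumptions. The child stops itself with SIGSTOP, the parent observes the stop through waitpid() with WUNTRACED, and SIGCONT makes the child runnable again.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reports the signal that stopped the child and its final exit code. 0 on success. */
int stop_and_continue(int *stop_signal, int *exit_code)
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        raise(SIGSTOP);      /* child enters the stopped state here */
        _exit(3);            /* runs only after SIGCONT arrives */
    }

    int status;
    /* WUNTRACED makes waitpid() report stopped children, not only exits. */
    if (waitpid(pid, &status, WUNTRACED) == -1 || !WIFSTOPPED(status))
        return -1;
    *stop_signal = WSTOPSIG(status);

    if (kill(pid, SIGCONT) == -1)       /* the wake-up event for a stopped task */
        return -1;

    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    *exit_code = WEXITSTATUS(status);
    return 0;
}
```

While the child is stopped, ps shows it with the T state letter.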
task_struct is the central data structure of the Linux kernel. Every scheduler decision, every signal delivery, every memory allocation, every file I/O — all of it ultimately reads from or writes to a task_struct.
The fork-exec model, inherited from Unix and optimized with Copy-on-Write, is why Unix-like systems are so efficient at process creation. The unification of processes and threads under clone() is why Linux's threading implementation is both simpler and more flexible than most alternatives. Understanding zombie processes is not just trivia — zombie accumulation is a real operational problem in production systems with bugs in signal handling or parent process management.
When you use ps, top, or htop, you are reading task_struct fields. When you use strace -p PID, strace attaches with ptrace(), and the kernel puts the target into TASK_TRACED each time it stops for the tracer. Everything maps back to this structure.
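One field that is easy to poke at from user space is comm. The sketch below uses prctl() with PR_SET_NAME and PR_GET_NAME to set and read back the calling thread's name; the helper name set_and_get_comm and the COMM_LEN macro are assumptions that mirror the 16-byte TASK_COMM_LEN. Names longer than 15 characters are silently truncated.

```c
#include <string.h>
#include <sys/prctl.h>

#define COMM_LEN 16   /* mirrors the kernel's TASK_COMM_LEN */

/*
 * Sets the calling thread's comm to `wanted`, reads it back into `out`,
 * then restores the previous name. Returns 0 on success, -1 on error.
 */
int set_and_get_comm(const char *wanted, char out[COMM_LEN])
{
    char saved[COMM_LEN] = { 0 };

    if (prctl(PR_GET_NAME, (unsigned long)saved) == -1)
        return -1;
    if (prctl(PR_SET_NAME, (unsigned long)wanted) == -1)
        return -1;

    memset(out, 0, COMM_LEN);
    if (prctl(PR_GET_NAME, (unsigned long)out) == -1)
        return -1;

    /* Put the original name back so the caller is left unchanged. */
    prctl(PR_SET_NAME, (unsigned long)saved);
    return 0;
}
```

Passing a 21-character name such as "a_very_long_task_name" should read back as its first 15 characters, which is also what ps would display for the task.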