AiTechWorlds
In 1995, POSIX.1c (IEEE Std 1003.1c-1995) standardized the threads API. The committee agreed on the interface: pthread_create(), pthread_mutex_lock(), pthread_cond_wait(). They agreed on the semantics: threads share address space, files, and signal handlers. They agreed that threads needed to exist.
What they could not agree on — and what the API conspicuously omits — is where threads should live. Should threads be a user-space library abstraction? Entities the kernel manages directly? Some combination of both?
This was not a theoretical debate. The answer determines whether a blocking network call in one thread freezes all threads in the process. It determines whether a process with 10,000 threads consumes 10,000 kernel scheduler slots. It determines whether two threads can truly execute in parallel on two CPU cores, or merely take turns on one.
Different operating systems and runtimes made different choices, with performance implications that still ripple through every concurrent program written today. The Go language's goroutines and the Java virtual thread model introduced in JDK 21 are both direct responses to the limitations of the kernel thread model: choices made in the mid-1990s are still driving architectural decisions in 2024.
In the many-to-one (N:1) model, N user-space threads map to exactly 1 kernel thread. The user-space runtime library (not the kernel) handles thread creation, scheduling, and context switching between threads.
How it works: the runtime maintains its own runqueue and context switches between threads by saving and restoring registers in user space, with no system call, no privilege change, and no kernel involvement. The result is extremely fast thread creation and switching.
The fatal limitation: any blocking system call suspends the entire process. When one thread calls read() on a socket that has no data, the kernel blocks the single underlying OS thread. All other user-space threads become unscheduled — they are logically runnable but cannot progress because their carrier OS thread is asleep. There is no parallelism across CPU cores.
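The user-space switching described above can be sketched with the `ucontext` family of functions. This is a minimal illustration, assuming Linux with glibc; the names `thread_a`, `thread_b`, `record`, and `run_green_threads` are hypothetical. Note that glibc's `swapcontext()` also saves and restores the signal mask, which costs a system call, so production runtimes typically use hand-written assembly switches instead.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, ctx_a, ctx_b;
static char stack_a[STACK_SIZE], stack_b[STACK_SIZE];
static char trace[16];      /* records which "thread" ran, in order */
static size_t trace_len;

static void record(char who) {
    assert(trace_len + 1 < sizeof trace);
    trace[trace_len++] = who;
    trace[trace_len] = '\0';
}

static void thread_a(void) {
    record('A');
    swapcontext(&ctx_a, &ctx_b);   /* cooperative yield: save A, resume B */
    record('A');
}                                  /* returning resumes uc_link (ctx_b) */

static void thread_b(void) {
    record('B');
    swapcontext(&ctx_b, &ctx_a);   /* yield back to A */
    record('B');
}                                  /* returning resumes uc_link (main_ctx) */

static void init_context(ucontext_t *ctx, char *stack,
                         ucontext_t *next, void (*fn)(void)) {
    int rc = getcontext(ctx);
    assert(rc == 0);
    ctx->uc_stack.ss_sp = stack;
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = next;           /* context resumed when fn returns */
    makecontext(ctx, fn, 0);
}

/* Runs both user-space threads on the calling kernel thread. */
const char *run_green_threads(void) {
    trace_len = 0;
    memset(trace, 0, sizeof trace);
    init_context(&ctx_b, stack_b, &main_ctx, thread_b);
    init_context(&ctx_a, stack_a, &ctx_b, thread_a);
    swapcontext(&main_ctx, &ctx_a);  /* start A; returns when B finishes */
    return trace;
}
```

Both "threads" interleave on a single kernel thread, so if either one made a blocking system call, the other could not run until it returned.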
Historical example: Java's "green threads" (JDK 1.0–1.2). Early Java applications ran on platforms where creating native OS threads was expensive or unavailable. The JVM implemented its own user-space thread scheduler. A thread that blocked in native code, or that never yielded, could starve the others. Synchronized blocks were simpler to implement but couldn't take advantage of multiple processors.
Java dropped green threads in JDK 1.3 (2000) in favor of native threads. JDK 21 reintroduced user-space scheduling as Virtual Threads — but correctly, using the M:N model described below.
In the one-to-one (1:1) model, each user-space thread maps to exactly one kernel thread (task_struct on Linux). Thread creation calls into the kernel. Scheduling is done entirely by the kernel scheduler.
True parallelism: thread A and thread B can execute simultaneously on core 0 and core 1. One thread blocking on I/O does not affect others. The kernel can preempt any thread at any time.
Cost: thread creation invokes clone() — a system call with kernel allocations (kernel stack, task_struct, PID allocation). On Linux, creating a thread takes roughly 5–15 µs. A process with 10,000 threads places 10,000 entries in the kernel scheduler's data structures.
This is what Linux uses. pthread_create() on Linux calls clone(CLONE_VM | CLONE_FILES | ...). Every POSIX thread is a Linux task_struct. The kernel scheduler manages them identically to processes — threads are processes that share memory.
Windows also uses 1:1 (kernel threads). The Win32 CreateThread() API maps to a kernel thread object. This is standard across all major modern operating systems.
In the many-to-many (M:N) model, N user-space threads map to M kernel threads, where M ≤ N. A user-space scheduler multiplexes N threads onto M OS threads. The runtime can create thousands of lightweight threads while only keeping M OS threads active with the kernel.
Best of both worlds (in theory): fast thread creation (user space), true parallelism (M kernel threads across cores), no kernel involvement for thread switches between user-space threads that share a kernel thread.
Fiendishly complex to implement correctly: the user-space scheduler and kernel scheduler must cooperate. If a user-space thread blocks on a system call, the kernel thread blocks too — the scheduler must detect this and activate another kernel thread ("scheduler activation" or "upcalls"). This requires kernel support or intrusive signal handling. Getting it right without introducing deadlocks, priority inversions, and race conditions is extremely hard.
Solaris (Sun/Oracle Unix) used M:N with its lightweight process (LWP) model for many years before switching to 1:1 in Solaris 9. Go's goroutines implement M:N correctly — this is discussed in detail below.
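pthreads is the POSIX threads API, implemented in glibc on Linux. Thread creation goes through glibc, which on Linux issues a clone() system call (clone3() in newer glibc releases); synchronization calls such as mutex operations are built on futex(2) instead.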
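Thread lifecycle:
```c
pthread_t tid;
pthread_create(&tid, NULL, thread_function, arg); // create; returns 0 on success
// Then use exactly one of the following for a given thread, never both:
pthread_join(tid, &retval); // wait for completion and collect the return value
pthread_detach(tid); // or let it release its own resources when it exits
```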
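A complete create-and-join round trip might look like the following sketch. The names `square` and `run_and_join` are hypothetical, and passing the integer through the `void *` argument is a shortcut chosen here for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Thread body: receives an integer packed into the pointer argument
 * and returns its square the same way. */
static void *square(void *arg) {
    intptr_t n = (intptr_t)arg;
    return (void *)(n * n);
}

/* Creates one thread, waits for it, and returns the value it produced. */
long run_and_join(long n) {
    pthread_t tid;
    void *retval = NULL;

    int rc = pthread_create(&tid, NULL, square, (void *)(intptr_t)n);
    assert(rc == 0);
    rc = pthread_join(tid, &retval);   /* also releases the thread's resources */
    assert(rc == 0);

    return (long)(intptr_t)retval;
}
```

A joinable thread that is never joined or detached keeps its resources allocated after it exits, which is why one of the two calls is needed for every thread.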
Mutual exclusion (mutex):
```c
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&lock);
// critical section
pthread_mutex_unlock(&lock);
```
pthread_mutex_lock() is built on futex(2). In the uncontended case the lock is taken with an atomic operation on a user-space variable and no system call is made; the futex system call is used only when a thread has to sleep because the mutex is held, or when an unlocking thread has to wake a waiter.
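As a small illustration of the mutex pattern, the sketch below has several threads increment one shared counter. The names `counter_lock`, `shared_counter`, `add_many`, and `count_with_mutex` are hypothetical, and the fixed `MAX_THREADS` bound is an assumption made to avoid dynamic allocation.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 16

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

static void *add_many(void *arg) {
    long iterations = *(const long *)arg;
    for (long i = 0; i < iterations; i++) {
        pthread_mutex_lock(&counter_lock);
        shared_counter++;              /* critical section */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs nthreads threads that each add `iterations` to the counter. */
long count_with_mutex(int nthreads, long iterations) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tids[MAX_THREADS];

    shared_counter = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, add_many, &iterations);
        assert(rc == 0);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);

    return shared_counter;             /* all writers have been joined */
}
```

Without the lock, the increments from different threads would race and the final total would usually come out lower than `nthreads * iterations`.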
Condition variables:
```c
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
// Call with lock held, inside a loop that rechecks the awaited condition.
pthread_cond_wait(&cond, &lock); // atomically unlock and sleep; relocks before returning
pthread_cond_signal(&cond); // wake one waiter
pthread_cond_broadcast(&cond); // wake all waiters
```
The pthread_cond_wait() / pthread_cond_signal() pair is the foundation of producer-consumer patterns, bounded queues, and thread pool implementations.
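The bounded-queue pattern mentioned above can be sketched as follows. This is one possible layout, not a canonical one: `bounded_queue`, `queue_push`, `queue_pop`, and `produce_and_consume` are hypothetical names, and the capacity of 4 is arbitrary. The `while` loops around `pthread_cond_wait()` guard against spurious wakeups.

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAP 4

typedef struct {
    int items[QUEUE_CAP];
    int head, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} bounded_queue;

static void queue_init(bounded_queue *q) {
    q->head = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

static void queue_push(bounded_queue *q, int value) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QUEUE_CAP)               /* recheck after every wakeup */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[(q->head + q->count) % QUEUE_CAP] = value;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

static int queue_pop(bounded_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int value = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return value;
}

typedef struct { bounded_queue *q; int n; } producer_args;

static void *producer(void *arg) {
    producer_args *p = arg;
    for (int i = 1; i <= p->n; i++)
        queue_push(p->q, i);
    return NULL;
}

/* One producer pushes 1..n; the caller consumes and sums the values. */
long produce_and_consume(int n) {
    bounded_queue q;
    queue_init(&q);
    producer_args args = { &q, n };
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, producer, &args);
    assert(rc == 0);

    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += queue_pop(&q);
    pthread_join(tid, NULL);
    return sum;
}
```

Because the queue holds only four items, the producer repeatedly blocks on `not_full` and the consumer on `not_empty`, exercising both condition variables.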
Thread-local storage (TLS): __thread in C (a GCC/Clang extension; C11 standardizes it as _Thread_local), thread_local in C++11.
__thread int per_thread_counter = 0;
Each thread gets its own copy. On x86-64 Linux, the fs segment register points at the thread's control block, and the executable's TLS variables sit at fixed negative offsets from it. errno is defined as a per-thread location, which is why it is safe to use in multithreaded programs.
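The per-thread copies can be observed directly. In this sketch, assuming GCC or Clang for the `__thread` keyword, the helper names `bump`, `count_in_new_thread`, and `counter_in_calling_thread` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

static __thread int per_thread_counter = 0;

static void *bump(void *arg) {
    int *times = arg;
    for (int i = 0; i < *times; i++)
        per_thread_counter++;          /* touches only this thread's copy */
    *times = per_thread_counter;       /* report the private total back */
    return NULL;
}

/* Starts a fresh thread, lets it increment its own counter `times`
 * times, and returns the value that thread saw. */
int count_in_new_thread(int times) {
    pthread_t tid;
    int rc = pthread_create(&tid, NULL, bump, &times);
    assert(rc == 0);
    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return times;
}

/* Reads the calling thread's own copy of the counter. */
int counter_in_calling_thread(void) {
    return per_thread_counter;
}
```

Each new thread starts from the initial value 0, and the calling thread's copy is never modified by the workers.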
Go's goroutine model is the most successful production M:N implementation. Understanding it illuminates why the theoretical benefits of M:N are worth the implementation complexity.
The GMP model:
- G (goroutine): a lightweight thread of execution, created with go func().
- M (machine): an OS thread that executes goroutines.
- P (processor): a scheduling context with its own local runqueue. There are GOMAXPROCS Ps (default: number of CPU cores).

Scheduling: each M is associated with one P. The M runs goroutines from P's local runqueue. When M's goroutine makes a blocking system call, Go's runtime detaches P from M and attaches P to another M (or creates a new M). The blocking M and goroutine are parked waiting for the syscall. When it completes, the goroutine returns to a P's runqueue.
Work stealing: when P's local runqueue is empty, it steals half the runnable goroutines from another P's queue. This is classic work stealing, the same idea that underlies Java's ForkJoinPool.
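The steal-half step can be illustrated with a deliberately simplified, single-threaded sketch in C. It is not Go's implementation, which uses lock-free ring buffers; the `runqueue` type, `runq_push`, `steal_half`, and `demo_steal` are hypothetical, and rounding the stolen half up is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define RUNQ_CAP 256

typedef struct {
    int tasks[RUNQ_CAP];   /* hypothetical goroutine IDs, oldest first */
    size_t len;
} runqueue;

void runq_push(runqueue *q, int id) {
    assert(q->len < RUNQ_CAP);
    q->tasks[q->len++] = id;
}

/* Moves half of the victim's tasks (rounded up) to the thief and
 * returns how many were moved. */
size_t steal_half(runqueue *thief, runqueue *victim) {
    size_t n = victim->len - victim->len / 2;
    if (n > RUNQ_CAP - thief->len)
        n = RUNQ_CAP - thief->len;     /* do not overflow the thief */

    for (size_t i = 0; i < n; i++)     /* take the oldest tasks */
        thief->tasks[thief->len++] = victim->tasks[i];
    for (size_t i = n; i < victim->len; i++)
        victim->tasks[i - n] = victim->tasks[i];
    victim->len -= n;
    return n;
}

/* Fills one queue with `victim_tasks` tasks and lets an empty queue steal. */
size_t demo_steal(size_t victim_tasks) {
    runqueue busy = { {0}, 0 };
    runqueue idle = { {0}, 0 };
    for (size_t i = 0; i < victim_tasks; i++)
        runq_push(&busy, (int)i);

    size_t stolen = steal_half(&idle, &busy);
    assert(idle.len + busy.len == victim_tasks);   /* nothing lost */
    return stolen;
}
```

Taking half at once, rather than a single task, means an idle worker needs to steal less often, which reduces contention on busy queues.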
Result: 1 million goroutines is practical. Each goroutine starts with a stack of about 2 KB that grows on demand, versus a default stack reservation of about 8 MB for a POSIX thread on Linux (virtual address space, committed lazily), plus a full task_struct in the kernel. A server handling 1 million concurrent connections is realistic with goroutines; it requires creative engineering with kernel threads.
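For comparison, the per-thread stack reservation of a POSIX thread can be inspected and reduced through thread attributes. The sketch below uses only standard pthread attribute calls; the function names `default_thread_stack_size`, `run_thread_with_stack`, and `noop` are hypothetical, and the default size it reports depends on the system's stack limit.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Returns the stack size a thread created with default attributes gets. */
size_t default_thread_stack_size(void) {
    pthread_attr_t attr;
    size_t size = 0;
    int rc = pthread_attr_init(&attr);
    assert(rc == 0);
    rc = pthread_attr_getstacksize(&attr, &size);
    assert(rc == 0);
    pthread_attr_destroy(&attr);
    return size;
}

static void *noop(void *arg) {
    return arg;
}

/* Returns 1 if a thread could be created and joined with the
 * requested stack size, 0 otherwise. */
int run_thread_with_stack(size_t stack_bytes) {
    pthread_attr_t attr;
    pthread_t tid;
    int rc = pthread_attr_init(&attr);
    assert(rc == 0);

    if (pthread_attr_setstacksize(&attr, stack_bytes) != 0) {
        pthread_attr_destroy(&attr);   /* size rejected (too small) */
        return 0;
    }
    int ok = pthread_create(&tid, &attr, noop, NULL) == 0;
    if (ok)
        pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return ok;
}
```

Shrinking the stack lowers the address space each kernel thread reserves, but the thread still costs a kernel task and a fixed stack that cannot grow the way a goroutine's can.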
| Thread Model | Kernel Involvement | True Parallelism | Thread Creation Cost | Blocking I/O Effect | Examples | Go Equivalent |
|---|---|---|---|---|---|---|
| Many-to-One | None for switch | No (1 core max) | Nanoseconds | Blocks all threads | Old Java green threads, early Erlang | N/A |
| One-to-One | Full (per thread) | Yes | 5–15 µs (clone syscall) | Only blocks one thread | Linux pthreads, Windows threads, JDK 1.3+ | N/A |
| Many-to-Many | Partial (M threads) | Yes (up to M cores) | Nanoseconds (user space) | Blocks one kernel thread; runtime moves the other threads to another | Solaris LWPs, Go goroutines, JDK 21 Virtual Threads | Goroutine on GMP scheduler |
Data races occur when two threads access the same memory location concurrently and at least one is writing, without synchronization. They produce undefined behavior in C/C++: the compiler is allowed to assume races never happen, so the result is not limited to a stale or incorrect value and the program may misbehave in arbitrary ways.
Atomic operations (std::atomic<> in C++, sync/atomic in Go) provide hardware-guaranteed read-modify-write operations without locks. They are the building block of lock-free data structures.
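The text names the C++ and Go facilities; the sketch below uses the C11 equivalent, `<stdatomic.h>`, to stay in the same language as the pthreads snippets. The names `hits`, `hit_many`, and `count_with_atomics` are hypothetical, and relaxed ordering is chosen on the assumption that only the final total matters.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 16

static atomic_long hits;

static void *hit_many(void *arg) {
    long iterations = *(const long *)arg;
    for (long i = 0; i < iterations; i++)
        /* One indivisible read-modify-write; no lock is taken. */
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    return NULL;
}

/* Runs nthreads threads that each add `iterations` to the counter. */
long count_with_atomics(int nthreads, long iterations) {
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t tids[MAX_THREADS];

    atomic_store(&hits, 0);
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tids[i], NULL, hit_many, &iterations);
        assert(rc == 0);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);   /* join orders the final read */

    return atomic_load(&hits);
}
```

Replacing `atomic_long` with a plain `long` and `++` would turn this into a data race as defined above.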
Lock-free programming is notoriously difficult to get right due to the ABA problem, memory ordering, and compiler/CPU reordering. The Linux kernel's READ_ONCE() / WRITE_ONCE() macros and smp_mb() memory barriers address these at the kernel level.
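Memory ordering is easiest to see in the publish pattern: one thread writes plain data and then sets a flag, and another waits for the flag before reading. The sketch below uses C11 release and acquire operations in user space, which is an analogy to the kernel macros rather than the same API; `payload`, `ready`, `publisher`, and `read_published_value` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static int payload;        /* plain, non-atomic data */
static atomic_int ready;   /* flag that publishes the payload */

static void *publisher(void *arg) {
    (void)arg;
    payload = 42;
    /* Release: the payload write cannot be reordered after this store. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

/* Starts a publisher thread and returns the payload the caller observes. */
int read_published_value(void) {
    payload = 0;
    atomic_store(&ready, 0);

    pthread_t tid;
    int rc = pthread_create(&tid, NULL, publisher, NULL);
    assert(rc == 0);

    /* Acquire: once the flag is seen, the payload write is visible too. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        sched_yield();
    int seen = payload;

    rc = pthread_join(tid, NULL);
    assert(rc == 0);
    return seen;
}
```

If both operations on `ready` used relaxed ordering instead, the read of `payload` would be a data race, because nothing would order it after the publisher's write.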
The choice of thread model is not just an implementation detail — it shapes the performance characteristics and programming model of everything built on top. The 1:1 model's simplicity comes at the cost of scalability at very high thread counts. The M:N model's efficiency comes at the cost of enormous implementation complexity.
Go's success with goroutines has validated the M:N model for production use — but only because Go's runtime engineers spent years getting work stealing, goroutine preemption, and syscall handling correct. Java's Virtual Threads in JDK 21 made the same architectural choice for the same reasons: the JVM platform was hitting the wall of kernel thread overhead at scale.
Understanding these models means understanding why your Go server can handle 100,000 concurrent connections comfortably, why a naive Java thread-per-request server strains as it approaches 10,000 platform threads, and why Nginx's event loop outperforms Apache's thread-per-connection model for high-concurrency workloads, even though Nginx uses a single-threaded model per worker.