In 1995, POSIX.1c (IEEE Std 1003.1c-1995) standardized the threads API. The committee agreed on the interface: pthread_create(), pthread_mutex_lock(), pthread_cond_wait(). They agreed on the semantics: threads share address space, files, and signal handlers. They agreed that threads needed to exist.

What they could not agree on — and what the API conspicuously omits — is where threads should live. Should threads be a user-space library abstraction? Entities the kernel manages directly? Some combination of both?

This was not a theoretical debate. The answer determines whether a blocking network call in one thread freezes all threads in the process. It determines whether a process with 10,000 threads consumes 10,000 kernel scheduler slots. It determines whether two threads can truly execute in parallel on two CPU cores, or merely take turns on one.

Different operating systems and runtimes made different choices, with performance implications that still ripple through every concurrent program written today. The Go language's goroutines and the Java virtual thread model introduced in JDK 21 are both direct responses to the limitations of the kernel thread model: design questions left open in the mid-1990s were still driving architectural decisions in 2024.

The Three Fundamental Thread Models

Many-to-One: Green Threads

N user-space threads map to exactly 1 kernel thread. The user-space runtime library (not the kernel) handles thread creation, scheduling, and context switching between threads.

How it works: the runtime maintains its own runqueue and context switches between threads by saving and restoring registers in user space — no system call, no privilege change, no kernel involvement. Extremely fast thread creation and switching.
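The switching mechanism can be sketched in C, assuming Linux with glibc's <ucontext.h> (getcontext, makecontext, swapcontext). The names worker_a, worker_b, init_context, and run_green_threads are hypothetical, and a real runtime would use its own switch routine and a scheduler rather than two hard-wired contexts. glibc's swapcontext() also saves the signal mask, which costs a system call, so this illustrates the mechanism rather than the speed.

```c
#include <ucontext.h>

/* Illustrative sizes: each "green thread" gets a small private stack. */
enum { STACK_SIZE = 64 * 1024, STEPS = 3 };

static ucontext_t main_ctx, ctx_a, ctx_b;
static char stack_a[STACK_SIZE], stack_b[STACK_SIZE];
static char trace[2 * STEPS + 1];   /* records which worker ran, in order */
static int trace_len;

/* Each worker records one step, then yields to the other worker. */
static void worker_a(void) {
    for (int i = 0; i < STEPS; i++) {
        trace[trace_len++] = 'A';
        swapcontext(&ctx_a, &ctx_b);   /* save A's registers, resume B */
    }
}   /* returning resumes uc_link, i.e. main_ctx */

static void worker_b(void) {
    for (int i = 0; i < STEPS; i++) {
        trace[trace_len++] = 'B';
        swapcontext(&ctx_b, &ctx_a);   /* save B's registers, resume A */
    }
}

/* Prepare a context that starts in fn on its own stack. */
static void init_context(ucontext_t *ctx, char *stack, void (*fn)(void)) {
    getcontext(ctx);
    ctx->uc_stack.ss_sp = stack;
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = &main_ctx;          /* where to go when fn returns */
    makecontext(ctx, fn, 0);
}

/* Runs both workers on the calling OS thread and returns the trace. */
const char *run_green_threads(void) {
    trace_len = 0;
    init_context(&ctx_a, stack_a, worker_a);
    init_context(&ctx_b, stack_b, worker_b);
    swapcontext(&main_ctx, &ctx_a);    /* start A; we resume when A returns */
    trace[trace_len] = '\0';
    return trace;
}
```

Calling run_green_threads() yields the string "ABABAB": the two workers alternate, yet only one kernel thread ever exists.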

The fatal limitation: any blocking system call suspends the entire process. When one thread calls read() on a socket that has no data, the kernel blocks the single underlying OS thread. All other user-space threads become unscheduled — they are logically runnable but cannot progress because their carrier OS thread is asleep. There is no parallelism across CPU cores.

Historical example: Java's "green threads" (JDK 1.0–1.2). Early Java applications ran on platforms where creating native OS threads was expensive or unavailable. The JVM implemented its own user-space thread scheduler. A thread that spun without yielding, or that blocked inside a native call, could starve the others. Synchronized blocks were simpler to implement, but the JVM couldn't take advantage of multiple processors.

Java dropped green threads in JDK 1.3 (2000) in favor of native threads. JDK 21 reintroduced user-space scheduling as Virtual Threads — but correctly, using the M:N model described below.

One-to-One: Kernel Threads

Each user-space thread maps to exactly one kernel thread (task_struct on Linux). Thread creation calls into the kernel. Scheduling is done entirely by the kernel scheduler.

True parallelism: thread A and thread B can execute simultaneously on core 0 and core 1. One thread blocking on I/O does not affect others. The kernel can preempt any thread at any time.

Cost: thread creation invokes clone() — a system call with kernel allocations (kernel stack, task_struct, PID allocation). On Linux, creating a thread takes roughly 5–15 µs. A process with 10,000 threads places 10,000 entries in the kernel scheduler's data structures.

This is what Linux uses. pthread_create() on Linux calls clone(CLONE_VM | CLONE_FILES | ...). Every POSIX thread is a Linux task_struct. The kernel scheduler manages them identically to processes — threads are processes that share memory.
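One way to observe this from user space, as a Linux-only sketch: a thread reports the same getpid() as its creator but a different kernel task ID from the gettid system call. The names struct ids, record_ids, and thread_is_separate_task are hypothetical.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t pid;   /* process (thread group) ID */
    pid_t tid;   /* kernel task ID of the calling thread */
};

/* Thread body: record the IDs the kernel reports for this thread. */
static void *record_ids(void *arg) {
    struct ids *out = arg;
    out->pid = getpid();
    out->tid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Returns 1 if a new thread shares our PID but has its own TID,
   0 if not, and -1 if the thread could not be created or joined. */
int thread_is_separate_task(void) {
    struct ids self = { getpid(), (pid_t)syscall(SYS_gettid) };
    struct ids child = { 0, 0 };
    pthread_t t;

    if (pthread_create(&t, NULL, record_ids, &child) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    return child.pid == self.pid && child.tid != self.tid;
}
```

On a 1:1 implementation the function should return 1, since the kernel schedules the new thread as its own task.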

Windows also uses 1:1 (kernel threads). The Win32 CreateThread() API maps to a kernel thread object. This is standard across all major modern operating systems.

Many-to-Many: The N:M Model

N user-space threads map to M kernel threads, where M ≤ N. A user-space scheduler multiplexes N threads onto M OS threads. The runtime can create thousands of lightweight threads while only keeping M OS threads active with the kernel.

Best of both worlds (in theory): fast thread creation (user space), true parallelism (M kernel threads across cores), no kernel involvement for thread switches between user-space threads that share a kernel thread.

Fiendishly complex to implement correctly: the user-space scheduler and kernel scheduler must cooperate. If a user-space thread blocks on a system call, the kernel thread blocks too — the scheduler must detect this and activate another kernel thread ("scheduler activation" or "upcalls"). This requires kernel support or intrusive signal handling. Getting it right without introducing deadlocks, priority inversions, and race conditions is extremely hard.

Solaris (Sun/Oracle Unix) used M:N with its lightweight process (LWP) model for many years before switching to 1:1 in Solaris 9. Go's goroutines implement M:N correctly — this is discussed in detail below.

POSIX Threads: The API in Depth

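pthreads is the POSIX threads API, implemented in glibc on Linux. pthread_create() ultimately issues a clone() system call (recent glibc versions may use clone3()); many other calls, such as locking an uncontended mutex, complete entirely in user space.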

Thread lifecycle:

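pthread_t tid;
void *retval;
pthread_create(&tid, NULL, thread_function, arg);  // create; returns 0 on success
pthread_join(tid, &retval);                         // either: wait for completion and collect the result
// pthread_detach(tid);                             // or: let it clean itself up (never both for one thread)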
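As a complete, compilable illustration of the create/join path, here is a minimal sketch that passes a small integer to a thread and reads the result back through pthread_join(). The names square and square_in_thread are hypothetical; packing an integer into the void * argument is a common shortcut that assumes the value fits in intptr_t.

```c
#include <pthread.h>
#include <stdint.h>

/* Thread body: interpret the argument as an integer and return its square. */
static void *square(void *arg) {
    intptr_t n = (intptr_t)arg;
    return (void *)(n * n);
}

/* Computes n * n on a separate thread. Returns -1 if the thread
   could not be created or joined. */
long square_in_thread(long n) {
    pthread_t tid;
    void *retval = NULL;

    if (pthread_create(&tid, NULL, square, (void *)(intptr_t)n) != 0)
        return -1;
    if (pthread_join(tid, &retval) != 0)   /* blocks until square() returns */
        return -1;
    return (long)(intptr_t)retval;
}
```

For example, square_in_thread(7) returns 49. A joined thread's resources are released by the join; a thread that is never joined must be detached instead.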

Mutual exclusion (mutex):

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_lock(&lock);
// critical section
pthread_mutex_unlock(&lock);

pthread_mutex_lock() is built on futex(2). The lock state is a user-space atomic variable, so acquiring a free mutex needs no system call; the thread enters the kernel only when contention requires it to sleep.
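A minimal sketch of the mutex protecting a shared counter follows. The thread and iteration counts are arbitrary, and add_many and count_with_mutex are hypothetical names.

```c
#include <pthread.h>

enum { NTHREADS = 4, ITERS = 100000 };

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter;

/* Thread body: increment the shared counter ITERS times under the lock. */
static void *add_many(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;                          /* critical section */
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs NTHREADS incrementing threads; returns the final count,
   or -1 if not all threads could be started. */
long count_with_mutex(void) {
    pthread_t threads[NTHREADS];
    int started = 0;

    counter = 0;
    while (started < NTHREADS &&
           pthread_create(&threads[started], NULL, add_many, NULL) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return started == NTHREADS ? counter : -1;
}
```

With the lock in place the result is always NTHREADS * ITERS. Removing the lock/unlock pair turns counter++ into a data race, and the total usually comes up short.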

Condition variables:

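pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
pthread_mutex_lock(&lock);             // the waiter must hold the mutex
while (!ready)                         // ready: hypothetical predicate; recheck after spurious wakeups
    pthread_cond_wait(&cond, &lock);   // atomically unlock and sleep; relocks before returning
pthread_mutex_unlock(&lock);
pthread_cond_signal(&cond);            // wake one waiter
pthread_cond_broadcast(&cond);         // wake all waiters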

The pthread_cond_wait() / pthread_cond_signal() pair is the foundation of producer-consumer patterns, bounded queues, and thread pool implementations.
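A bounded queue makes the pattern concrete. This is a minimal sketch with one producer and one consumer; the names struct queue, queue_put, queue_get, and consume_all are hypothetical, and the capacity and item count are arbitrary.

```c
#include <pthread.h>

enum { CAPACITY = 4, ITEMS = 1000 };

struct queue {
    int buf[CAPACITY];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

static struct queue shared_queue = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .not_empty = PTHREAD_COND_INITIALIZER,
    .not_full = PTHREAD_COND_INITIALIZER,
};

/* Blocks while the queue is full, then appends v. */
static void queue_put(struct queue *q, int v) {
    pthread_mutex_lock(&q->lock);
    while (q->count == CAPACITY)                 /* recheck after every wakeup */
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % CAPACITY;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Blocks while the queue is empty, then removes the oldest item. */
static int queue_get(struct queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int v = q->buf[q->head];
    q->head = (q->head + 1) % CAPACITY;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}

/* Producer thread: pushes 1..ITEMS. */
static void *producer(void *arg) {
    for (int i = 1; i <= ITEMS; i++)
        queue_put(arg, i);
    return NULL;
}

/* Consumes everything the producer sends; returns the sum, or -1 on error. */
long consume_all(void) {
    pthread_t prod;
    long sum = 0;

    if (pthread_create(&prod, NULL, producer, &shared_queue) != 0)
        return -1;
    for (int i = 0; i < ITEMS; i++)
        sum += queue_get(&shared_queue);
    pthread_join(prod, NULL);
    return sum;
}
```

consume_all() returns 500500, the sum of 1 through 1000. Both waits sit inside while loops because a wakeup only means the predicate may have changed, not that it holds.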

Thread-local storage (TLS): __thread in C (a GCC/Clang extension; C11 standardizes _Thread_local), thread_local in C++11.

__thread int per_thread_counter = 0;

Each thread gets its own copy. On x86-64 Linux, the fs segment base points at the thread's control block, and statically allocated TLS variables typically sit at fixed negative offsets from it. errno is a TLS variable, which is why it is safe to use in multithreaded programs.
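The per-thread copies can be demonstrated with a short sketch. It uses the GCC/Clang __thread keyword shown above; bump and tls_is_per_thread are hypothetical names.

```c
#include <pthread.h>

static __thread int per_thread_counter = 0;

/* Thread body: bump this thread's private counter *arg times,
   then write the final value back through arg. */
static void *bump(void *arg) {
    int *times = arg;
    for (int i = 0; i < *times; i++)
        per_thread_counter++;
    *times = per_thread_counter;
    return NULL;
}

/* Returns 1 if every thread saw only its own increments,
   0 if not, and -1 if a thread could not be created. */
int tls_is_per_thread(void) {
    int a = 3, b = 5;
    pthread_t ta, tb;

    per_thread_counter = 100;              /* the calling thread's copy */
    if (pthread_create(&ta, NULL, bump, &a) != 0)
        return -1;
    if (pthread_create(&tb, NULL, bump, &b) != 0) {
        pthread_join(ta, NULL);
        return -1;
    }
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    /* Each new thread started from the initializer (0), not from 100. */
    return a == 3 && b == 5 && per_thread_counter == 100;
}
```

The function returns 1: the two workers end at 3 and 5, and the caller's value of 100 is untouched.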

Go's Goroutines: A Correct M:N Implementation

Go's goroutine model is the most successful production M:N implementation. Understanding it illuminates why the theoretical benefits of M:N are worth the implementation complexity.

The GMP model:

G (Goroutine): a user-space "thread." Starts with a 2KB stack that grows dynamically (up to 1GB by default on 64-bit systems). Created with the go statement, e.g. go f().
M (Machine): an OS thread. Typically one per CPU core, but can be more if goroutines block on system calls.
P (Processor): a scheduler context that holds a local runqueue of goroutines. There are GOMAXPROCS Ps (default: number of CPU cores).

Scheduling: each M is associated with one P. The M runs goroutines from P's local runqueue. When M's goroutine makes a blocking system call, Go's runtime detaches P from M and attaches P to another M (or creates a new M). The blocking M and goroutine are parked waiting for the syscall. When it completes, the goroutine returns to a P's runqueue.

Work stealing: when P's local runqueue is empty, it steals half the runnable goroutines from another P's queue. This is classic work stealing, the same family of algorithm used by Java's ForkJoinPool.
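The "steal half" policy can be sketched in C as a single-threaded simulation. This is not Go's implementation, which uses lock-free per-P ring buffers; struct runqueue and steal_half are hypothetical names, and rounding the stolen half up is an assumption of this sketch.

```c
#include <string.h>

enum { QUEUE_CAP = 256 };

struct runqueue {
    int tasks[QUEUE_CAP];   /* task IDs, oldest first */
    int len;
};

/* Moves half of the victim's tasks (rounded up), oldest first, to the end
   of the thief's queue. Returns the number of tasks moved, or 0 if there
   is nothing to steal or the thief has no room. */
int steal_half(struct runqueue *thief, struct runqueue *victim) {
    int n = victim->len - victim->len / 2;

    if (n == 0 || thief->len + n > QUEUE_CAP)
        return 0;
    memcpy(thief->tasks + thief->len, victim->tasks,
           (size_t)n * sizeof victim->tasks[0]);
    thief->len += n;
    /* Close the gap left at the front of the victim's queue. */
    memmove(victim->tasks, victim->tasks + n,
            (size_t)(victim->len - n) * sizeof victim->tasks[0]);
    victim->len -= n;
    return n;
}
```

With a victim holding tasks 1 to 5 and an empty thief, steal_half moves tasks 1, 2, and 3 and leaves 4 and 5 behind, so both queues have work after a single steal.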

Result: 1 million goroutines is practical. Each goroutine starts with about 2KB of stack, whereas a POSIX thread on Linux typically reserves 8MB of virtual address space for its stack by default (committed lazily, page by page) and also needs a full task_struct in the kernel. A server handling 1 million concurrent connections is realistic with goroutines; with kernel threads it requires creative engineering.
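On the pthreads side, the stack reservation is a tunable attribute. The following sketch queries the default size and starts a thread with a smaller, explicitly requested stack; default_thread_stack_size and run_with_stack_size are hypothetical names, and the value reported depends on the C library and the process's stack resource limit.

```c
#include <pthread.h>
#include <stddef.h>

static void *noop(void *arg) { return arg; }

/* Returns the stack size, in bytes, that a default-attribute thread
   would be given, or 0 on error. */
size_t default_thread_stack_size(void) {
    pthread_attr_t attr;
    size_t size = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_getstacksize(&attr, &size) != 0)
        size = 0;
    pthread_attr_destroy(&attr);
    return size;
}

/* Creates and joins a thread with the requested stack size.
   Returns 0 on success or a pthreads error number. */
int run_with_stack_size(size_t bytes) {
    pthread_attr_t attr;
    pthread_t t;
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(&attr, bytes);   /* fails below PTHREAD_STACK_MIN */
    if (rc == 0)
        rc = pthread_create(&t, &attr, noop, NULL);
    pthread_attr_destroy(&attr);
    if (rc == 0)
        rc = pthread_join(t, NULL);
    return rc;
}
```

A call such as run_with_stack_size(256 * 1024) shrinks the per-thread reservation considerably, but each thread still costs a kernel task, which is the part a goroutine avoids.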

[Figure: Thread Model Comparison Diagram]

Thread Model Comparison Table

Thread Model	Kernel Involvement	True Parallelism	Thread Creation Cost	Blocking I/O Effect	Examples	Go Equivalent
Many-to-One	None for switch	No (1 core max)	Nanoseconds	Blocks all threads	Old Java green threads, early Erlang	N/A
One-to-One	Full (per thread)	Yes	5–15 µs (clone syscall)	Only blocks one thread	Linux pthreads, Windows threads, JDK 1.3+	N/A
Many-to-Many	Partial (M threads)	Yes (up to M cores)	Nanoseconds (user space)	Blocked kernel thread is detached; another M takes over	Solaris LWPs, Go goroutines, JDK 21 Virtual Threads	Goroutine on GMP scheduler

Practical Thread Safety Notes

Data races occur when two threads access the same memory location concurrently and at least one is writing, without synchronization. They are undefined behavior in C and C++: the compiler may assume they never happen, so the outcome is not limited to a stale or torn value, and the program can misbehave in arbitrary ways.

Atomic operations (<stdatomic.h> in C11, std::atomic<> in C++, sync/atomic in Go) provide indivisible read-modify-write operations without locks. They are the building block of lock-free data structures.
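A minimal C11 sketch of a lock-free shared counter follows. The names hit_many and count_with_atomics are hypothetical. Relaxed ordering is chosen here on the assumption that only the final total matters; pthread_join() supplies the synchronization needed to read it.

```c
#include <pthread.h>
#include <stdatomic.h>

enum { NUM_THREADS = 4, INCREMENTS = 100000 };

static atomic_long hits;

/* Thread body: each increment is one indivisible read-modify-write. */
static void *hit_many(void *arg) {
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++)
        atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
    return NULL;
}

/* Runs NUM_THREADS incrementing threads without any mutex; returns the
   final count, or -1 if not all threads could be started. */
long count_with_atomics(void) {
    pthread_t threads[NUM_THREADS];
    int started = 0;

    atomic_store(&hits, 0);
    while (started < NUM_THREADS &&
           pthread_create(&threads[started], NULL, hit_many, NULL) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return started == NUM_THREADS ? atomic_load(&hits) : -1;
}
```

The result is always NUM_THREADS * INCREMENTS, with no lock taken and no thread ever put to sleep.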

Lock-free programming is notoriously difficult to get right due to the ABA problem, memory ordering, and compiler/CPU reordering. The Linux kernel's READ_ONCE() / WRITE_ONCE() macros and smp_mb() memory barriers address these at the kernel level.

Key Takeaways

The choice of thread model is not just an implementation detail — it shapes the performance characteristics and programming model of everything built on top. The 1:1 model's simplicity comes at the cost of scalability at very high thread counts. The M:N model's efficiency comes at the cost of enormous implementation complexity.

Go's success with goroutines has validated the M:N model for production use — but only because Go's runtime engineers spent years getting work stealing, goroutine preemption, and syscall handling correct. Java's Virtual Threads in JDK 21 made the same architectural choice for the same reasons: the JVM platform was hitting the wall of kernel thread overhead at scale.

Understanding these models means understanding why your Go server can handle 100,000 concurrent connections comfortably, why a naive Java thread-per-request server can struggle at 10,000 threads, and why Nginx's event loop outperforms Apache's traditional process- or thread-per-connection model for high-concurrency workloads, even though each Nginx worker is single-threaded.
