In 1999, a hospital in Tokyo faced a surgical crisis. Patient volume had tripled. The solution wasn't to find a surgeon who could operate 32 times faster — that's physically impossible. The solution was to build more operating theaters and hire more surgeons.

A single surgeon can only be in one place at once. Thirty-two surgeons in thirty-two operating rooms can handle thirty-two patients simultaneously. The hospital's throughput is thirty-two times greater, even though no individual surgeon is any faster.

This is the story of multicore processors. In 2004, Intel's Prescott processor reached 3.8 GHz and ran so hot that pushing the clock further was no longer practical. Single-core frequency scaling hit a wall, not for lack of engineering ambition, but because of power and heat limits rooted in physics. The industry's response was to stop building faster surgeons and start building more operating rooms.

Why Single-Core Scaling Stopped: The Thermal Wall

From 1970 to 2004, CPU performance doubled roughly every 18 months — a manifestation of Dennard Scaling alongside Moore's Law.

Dennard Scaling predicted that as transistors shrink:

Clock frequency could increase proportionally
Voltage could decrease proportionally
Power density would remain constant

This held true for decades. Then, around 130nm and below, Dennard Scaling broke down:

Transistor leakage current stopped shrinking with the transistors themselves
Voltage couldn't drop below ~0.7V without reliability problems
Dynamic power follows P = C × V² × f (Capacitance × Voltage² × Frequency)
With V stuck and f still increasing, power density exploded
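
The scaling argument in the two lists above can be checked numerically. The following is a minimal sketch, assuming the textbook Dennard model in which a shrink of linear dimensions by a factor k reduces capacitance by 1/k, raises frequency by k, and reduces transistor area by 1/k². The function names and the normalized units are illustrative only.

```python
def dynamic_power(c, v, f):
    """Dynamic switching power: P = C * V^2 * f."""
    return c * v ** 2 * f


def relative_power_density(k, voltage_scales):
    """Power density after shrinking linear dimensions by k, relative to before.

    Assumptions (all quantities normalized to 1.0 before the shrink):
    capacitance scales by 1/k, frequency by k, transistor area by 1/k^2,
    and voltage by 1/k only while Dennard scaling still holds.
    """
    c = 1.0 / k
    f = k
    v = 1.0 / k if voltage_scales else 1.0
    area = 1.0 / k ** 2
    return dynamic_power(c, v, f) / area


if __name__ == "__main__":
    for k in (1.0, 2.0, 4.0):
        ideal = relative_power_density(k, voltage_scales=True)
        stuck = relative_power_density(k, voltage_scales=False)
        print(f"shrink {k:.0f}x: density {ideal:.2f} (V scales), {stuck:.2f} (V stuck)")
```

Under these assumptions the density stays at 1.0 while voltage scales, and grows as k² once voltage stops scaling, which is the explosion described above.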

Intel's Prescott (2004), built at 90nm and clocked at 3.8 GHz, had a thermal design power of 115 watts: a processor the size of a postage stamp producing more heat per square centimeter than a hot plate. Intel cancelled its planned successor, Tejas, in 2004, reportedly because its power draw and heat output would have required cooling impractical for consumer systems.

The Power Wall and Thermal Wall were the same wall. The frequency race ended.

After 2004, clock speeds stagnated. The new strategy: more cores at moderate frequencies.

The Multicore Revolution

Rather than one core at 4 GHz, build four cores at 3 GHz. For single-threaded work, you lose a little. For multi-threaded workloads, you multiply throughput.

Modern CPU core counts:

Processor	Cores	Configuration	Max Boost
Intel Core i9-14900K	24	8 P-cores + 16 E-cores	6.0 GHz
AMD Ryzen 9 7950X	16	16 Zen 4 cores	5.7 GHz
Apple M3 Max	16	12 P-cores + 4 E-cores	4.0 GHz
AMD EPYC 9654 (server)	96	96 Zen 4 cores	3.7 GHz
AWS Graviton4 (ARM)	96	96 Neoverse V2 cores	2.8 GHz

Intel's hybrid architecture (P-cores + E-cores, since Alder Lake 2021) mirrors ARM's big.LITTLE design: large, high-performance cores for intensive workloads, small efficiency cores for background tasks. The OS scheduler is aware of core types and assigns tasks accordingly.

Flynn's Taxonomy (1966): Classifying Parallel Architectures

Michael Flynn's 1966 classification remains the standard framework for parallel computer architectures:

Class	Full Name	Description	Real-World Example
SISD	Single Instruction, Single Data	Classic serial processor — one instruction, one data element	Early Intel 8086, simple microcontrollers
SIMD	Single Instruction, Multiple Data	One instruction applies to many data elements in parallel	CPU vector units (SSE, AVX), GPU shader cores
MISD	Multiple Instructions, Single Data	Multiple processors apply different operations to the same data	Space Shuttle flight computers (fault tolerance through redundancy)
MIMD	Multiple Instructions, Multiple Data	Multiple processors execute independent instruction streams on independent data	Multicore CPUs, HPC clusters, distributed systems

Modern CPUs are MIMD at the core level (multiple independent cores) and SIMD within each core (vector execution units). This combination provides parallelism at two levels simultaneously.
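
The table reduces to two questions: how many instruction streams, and how many data streams? A minimal sketch of that rule follows; `flynn_class` is a hypothetical helper written for this lesson, not part of any library.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn class for the given numbers of concurrent streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"
    data = "S" if data_streams == 1 else "M"
    return f"{instr}I{data}D"


if __name__ == "__main__":
    print(flynn_class(1, 1))    # one scalar core
    print(flynn_class(1, 8))    # one vector instruction over 8 lanes
    print(flynn_class(24, 24))  # 24 cores, each running its own thread
```

Applied to a modern CPU, the same chip gets two answers depending on where you look: MIMD across cores and SIMD inside a core's vector unit.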

SIMD in Practice: SSE, AVX, and Beyond

SIMD (Single Instruction, Multiple Data) allows one instruction to operate on a vector of data elements simultaneously:

Without SIMD: ADD R1, R2 (adds one pair of 32-bit numbers) → 1 operation
With AVX: VADDPS YMM0, YMM1, YMM2 (adds eight pairs of 32-bit floats) → 8 operations

The evolution of x86 SIMD:

Extension	Width	Elements (float32)	Year	Notes
MMX	64-bit	n/a (2 × int32)	1997	Integer only, overlapped FPU registers
SSE	128-bit	4	1999	Dedicated XMM registers
SSE2	128-bit	4	2000	Added double precision, integer ops
SSE4.2	128-bit	4	2008	String processing, CRC32
AVX	256-bit	8	2011	New YMM registers, 3-operand encoding
AVX2	256-bit	8	2013	256-bit integer ops, gather; FMA3 arrived alongside it
AVX-512	512-bit	16	2016	First in Xeon Phi, then Xeon; disabled on Intel's hybrid desktop CPUs

Worked example: image processing, converting 1000 pixels from RGB to grayscale:

Scalar (no SIMD): 3,000 multiply-add operations
AVX2 (8 floats per operation): ~375 operations, an 8× reduction in instruction count in the ideal case
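
The operation counts above can be reproduced with a small model. This is a sketch in plain Python that only counts modeled multiply-add operations. It does not execute real SIMD instructions, and it assumes a planar layout (separate R, G, B arrays) and the BT.601 luma weights, neither of which is specified in this lesson.

```python
LANES = 8                         # float32 lanes in one 256-bit register
WEIGHTS = (0.299, 0.587, 0.114)   # assumption: BT.601 luma weights


def grayscale_scalar(r, g, b):
    """One multiply-add per channel per pixel."""
    ops = 0
    out = []
    for i in range(len(r)):
        acc = 0.0
        for w, channel in zip(WEIGHTS, (r, g, b)):
            acc += w * channel[i]
            ops += 1
        out.append(acc)
    return out, ops


def grayscale_vector(r, g, b, lanes=LANES):
    """One modeled vector multiply-add per channel per group of `lanes` pixels."""
    ops = 0
    out = [0.0] * len(r)
    for start in range(0, len(r), lanes):
        end = min(start + lanes, len(r))
        for w, channel in zip(WEIGHTS, (r, g, b)):
            for i in range(start, end):   # these lanes would run in parallel
                out[i] += w * channel[i]
            ops += 1
    return out, ops


if __name__ == "__main__":
    n = 1000
    r = [float(i % 256) for i in range(n)]
    g = [float((2 * i) % 256) for i in range(n)]
    b = [float((3 * i) % 256) for i in range(n)]
    _, scalar_ops = grayscale_scalar(r, g, b)
    _, vector_ops = grayscale_vector(r, g, b)
    print(scalar_ops, vector_ops)
```

Both functions produce identical pixel values; only the number of modeled operations differs. Real speedups are usually below this ideal because memory bandwidth and data layout also matter.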

Auto-vectorization by compilers (GCC or Clang/LLVM, for example with -O3 -march=native) can apply SIMD automatically to simple loops. Manual intrinsics give full control.

Cache Coherence in Multicore: The MESI Protocol

With multiple cores sharing the same physical memory, each with its own private L1/L2 caches, a fundamental problem emerges: coherence.

If Core 0 reads x into its L1 cache, then Core 1 writes x, Core 0's cached value is now stale. Without coherence, programs produce wrong answers.

The MESI protocol (covered in the cache lesson) assigns each cache line one of four states: Modified, Exclusive, Shared, Invalid. State changes are communicated over the cache interconnect: a shared bus in early designs, a ring bus or mesh in modern CPUs.
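
The stale-read scenario above can be traced with a toy state machine. This is a simplified sketch of MESI for a single cache line, assuming only local reads and writes and ignoring evictions, write-back timing, and the interconnect itself. `CacheLine` and its methods are hypothetical names for illustration.

```python
class CacheLine:
    """Tracks the MESI state of one memory line in each core's private cache."""

    def __init__(self, n_cores):
        self.states = ["I"] * n_cores
        self.invalidations = 0

    def read(self, core):
        if self.states[core] != "I":
            return  # hit: M, E and S all satisfy a local read
        holders = [c for c, s in enumerate(self.states) if s != "I"]
        for c in holders:
            self.states[c] = "S"  # an M or E holder downgrades to Shared
        self.states[core] = "S" if holders else "E"

    def write(self, core):
        for c, s in enumerate(self.states):
            if c != core and s != "I":
                self.states[c] = "I"  # every other copy is invalidated
                self.invalidations += 1
        self.states[core] = "M"


if __name__ == "__main__":
    x = CacheLine(n_cores=2)
    x.read(0)
    print(x.states)   # Core 0 holds the only copy
    x.write(1)
    print(x.states)   # Core 0's copy is invalidated, not left stale
    x.read(0)
    print(x.states)   # Core 0 re-fetches; both copies are now Shared
```

The key step is the write: instead of leaving Core 0 with a stale value, the protocol marks its copy Invalid, so its next read has to fetch the current data.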

False sharing: Two cores access different variables that happen to share the same 64-byte cache line. Every write by one core invalidates the other core's copy, even though neither core is actually touching the other's data. The cache line bounces back and forth between the two cores' caches, and each bounce costs an interconnect round trip, so the threads effectively serialize.

Fix: pad or align data structures so unrelated variables don't share cache lines. In the Linux kernel, the ____cacheline_aligned_in_smp macro does this.
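
The effect of padding on layout can be inspected with Python's `ctypes` module. This sketch only shows field offsets. It assumes a 64-byte cache line and a structure that starts on a cache-line boundary, and it does not measure the performance effect, which needs native threads. The structure and helper names are hypothetical.

```python
import ctypes

CACHE_LINE = 64  # bytes; assumption matching the line size used in this lesson


class Unpadded(ctypes.Structure):
    # Two counters meant for two different threads, packed side by side.
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]


class Padded(ctypes.Structure):
    # Padding pushes "b" to the start of the next cache line.
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]


def shares_cache_line(struct_type, first, second, line=CACHE_LINE):
    """True if both fields fall in the same line, assuming an aligned base address."""
    off1 = getattr(struct_type, first).offset
    off2 = getattr(struct_type, second).offset
    return off1 // line == off2 // line


if __name__ == "__main__":
    print(shares_cache_line(Unpadded, "a", "b"))
    print(shares_cache_line(Padded, "a", "b"))
```

The cost of the fix is memory: the padded structure is several times larger, which is why padding is usually reserved for data that different threads write frequently.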

Amdahl's Law: The Hard Limit on Parallelism

Gene Amdahl (1967) formalized the ceiling on parallel speedup:

$$\text{Speedup}(N) = \frac{1}{S + \frac{(1-S)}{N}}$$

Where:

S = fraction of program that is serial (cannot be parallelized)
N = number of parallel processors
(1 - S) = parallelizable fraction

If 10% of a program is serial (S = 0.1):

2 cores: speedup = 1 / (0.1 + 0.45) = 1.82×
8 cores: speedup = 1 / (0.1 + 0.11) = 4.7×
64 cores: speedup = 1 / (0.1 + 0.014) = 8.8×
Infinite cores: maximum speedup = 10×
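
The figures above follow directly from the formula. A minimal sketch, with hypothetical function names, that reproduces them:

```python
def amdahl_speedup(serial_fraction, n):
    """Speedup on n processors when `serial_fraction` of the work cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n)


def amdahl_limit(serial_fraction):
    """Upper bound on speedup as the processor count grows without limit."""
    if serial_fraction == 0:
        return float("inf")
    return 1.0 / serial_fraction


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(f"{n:>2} cores: {amdahl_speedup(0.1, n):.2f}x")
    print(f"limit: {amdahl_limit(0.1):.0f}x")
```

Trying other serial fractions is instructive: at S = 0.01 the ceiling rises to 100×, which is why shrinking the serial part often pays off more than adding cores.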

Sixty-four processors provide less than 9× speedup because the 10% serial code comes to dominate the runtime. Amdahl's Law explains why massive parallelism doesn't automatically produce massive speedup for most real-world applications.

Gustafson's Law offers a different perspective: if the problem scales with the number of processors (larger problems for more processors), the serial fraction becomes relatively smaller and speedup approaches linear.
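
Gustafson's scaled speedup is commonly written as Speedup(N) = S + (1 - S) × N, where S is the serial fraction of the runtime measured on the N-processor system. That formula is the standard statement of the law rather than something derived in this lesson. A short sketch, with a hypothetical function name:

```python
def gustafson_speedup(serial_fraction, n):
    """Scaled speedup when the parallel part of the problem grows with n.

    `serial_fraction` is measured on the n-processor run, not on one processor.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return serial_fraction + (1.0 - serial_fraction) * n


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(f"{n:>2} processors: {gustafson_speedup(0.1, n):.1f}x")
```

With S = 0.1 and 64 processors this gives about 57.7×, compared with 8.8× under Amdahl's fixed-size assumption. The two laws answer different questions: the same problem solved faster, versus a larger problem solved in the same time.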

Hyperthreading / Simultaneous Multithreading (SMT)

Intel Hyper-Threading (Intel's trademarked name for SMT) makes one physical core appear as two logical cores to the OS:

Two complete sets of architectural state: registers, PC, flags
Both threads share the same execution units (ALU, FPU) and the same caches
When Thread 0 stalls (cache miss, branch misprediction), Thread 1's instructions can use the idle execution resources

Performance gain: typically 10–30% throughput improvement for mixed workloads. Not 2× — the two threads compete for the same execution units.

AMD's Zen architecture also implements 2-way SMT (2 threads per core). Both Intel and AMD settled on 2-way SMT for x86; higher thread counts, such as SMT4 and SMT8 on IBM POWER10, are used in high-throughput server chips.

Summary

The thermal wall of ~2004 ended single-core frequency scaling and forced the industry toward multicore processors. Flynn's Taxonomy classifies parallel architectures as SISD, SIMD, MISD, and MIMD; modern CPUs are simultaneously MIMD (multiple independent cores) and SIMD (vector execution within each core). SIMD extensions (SSE, AVX2, AVX-512) process 4 to 16 float32 elements per instruction, giving up to 4–16× data throughput for vectorizable workloads. Cache coherence (MESI protocol) ensures correctness but introduces false sharing as a performance hazard. Amdahl's Law provides the fundamental ceiling: with 10% serial code, no amount of parallelism can exceed 10× speedup, which makes identifying and reducing serial bottlenecks as important as adding cores.
