AiTechWorlds
In 1999, a hospital in Tokyo faced a surgical crisis. Patient volume had tripled. The solution wasn't to find a surgeon who could operate 32 times faster — that's physically impossible. The solution was to build more operating theaters and hire more surgeons.
A single surgeon can only be in one place at once. Thirty-two surgeons in thirty-two operating rooms can handle thirty-two patients simultaneously. The hospital's throughput is thirty-two times greater, even though no individual surgeon is any faster.
This is the story of multicore processors. In 2004, Intel's Prescott processor reached 3.8 GHz — and came within striking distance of catching fire. The era of single-core frequency scaling hit a wall, not because of a lack of engineering ambition, but because of the fundamental laws of physics. The industry's response was to stop building faster surgeons and start building more operating rooms.
From 1970 to 2004, CPU performance doubled roughly every 18 months — a manifestation of Dennard Scaling alongside Moore's Law.
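Dennard Scaling predicted that as transistors shrink, their supply voltage and current shrink with them, so power density stays roughly constant and each generation can be clocked higher without running hotter.
This held true for decades. Then, around 130nm and below, Dennard Scaling broke down: leakage current grew, and supply voltage could no longer be lowered in step with transistor size, so power density climbed with every shrink.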
Intel's Prescott (2004) at 90nm, 3.8 GHz consumed 115 watts — a processor the size of a postage stamp producing more heat per square centimeter than a hot plate. Intel's planned 4 GHz Tejas processor was cancelled after thermal simulations suggested it would require cooling infrastructure impractical for consumer systems.
The Power Wall and Thermal Wall were the same wall. The frequency race ended.
After 2004, clock speeds stagnated. The new strategy: more cores at moderate frequencies.
Rather than one core at 4 GHz, build four cores at 3 GHz. For single-threaded work, you lose a little. For multi-threaded workloads, you multiply throughput.
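The trade-off above can be put into numbers with a minimal sketch. It assumes, as a deliberate simplification, that per-core performance scales linearly with clock frequency and that threads spread perfectly across cores; the function name `relative_throughput` is hypothetical and exists only for this illustration.

```python
def relative_throughput(cores, ghz, threads, baseline_ghz=4.0):
    """Throughput relative to a single core running at baseline_ghz.

    Assumption: performance is proportional to frequency, and each
    runnable thread gets a whole core until the cores run out.
    """
    busy_cores = min(cores, threads)
    return busy_cores * ghz / baseline_ghz


# Four cores at 3 GHz compared with one core at 4 GHz.
single = relative_throughput(cores=4, ghz=3.0, threads=1)
multi = relative_throughput(cores=4, ghz=3.0, threads=4)

print(f"single-threaded: {single:.2f}x")
print(f"four threads:    {multi:.2f}x")
```

Under these idealized assumptions the single-threaded case falls to 0.75× while the four-thread case reaches 3×. Real workloads land somewhere below that because of shared caches, memory bandwidth, and serial sections.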
Modern CPU core counts:
| Processor | Cores | Configuration | Max Boost |
|---|---|---|---|
| Intel Core i9-14900K | 24 | 8 P-cores + 16 E-cores | 6.0 GHz |
| AMD Ryzen 9 7950X | 16 | 16 Zen 4 cores | 5.7 GHz |
| Apple M3 Max | 16 | 12 P-cores + 4 E-cores | 4.0 GHz |
| AMD EPYC 9654 (server) | 96 | 96 Zen 4 cores | 3.7 GHz |
| AWS Graviton4 (ARM) | 96 | 96 Neoverse V2 cores | 2.8 GHz |
Intel's hybrid architecture (P-cores + E-cores, introduced with Alder Lake in 2021) mirrors ARM's big.LITTLE design: large, high-performance cores for intensive workloads, small efficiency cores for background tasks. The OS scheduler is aware of core types and assigns tasks accordingly.
Michael Flynn's 1966 classification remains the standard framework for parallel computer architectures:
| Class | Full Name | Description | Real-World Example |
|---|---|---|---|
| SISD | Single Instruction, Single Data | Classic serial processor — one instruction, one data element | Early Intel 8086, simple microcontrollers |
| SIMD | Single Instruction, Multiple Data | One instruction applies to many data elements in parallel | CPU vector units (SSE, AVX), GPU shader cores |
| MISD | Multiple Instructions, Single Data | Multiple processors apply different operations to the same data | Space Shuttle flight computers (fault tolerance through redundancy) |
| MIMD | Multiple Instructions, Multiple Data | Multiple processors execute independent instruction streams on independent data | Multicore CPUs, HPC clusters, distributed systems |
Modern CPUs are MIMD at the core level (multiple independent cores) and SIMD within each core (vector execution units). This combination provides parallelism at two levels simultaneously.
SIMD (Single Instruction, Multiple Data) allows one instruction to operate on a vector of data elements simultaneously:
`ADD R1, R2` adds two 32-bit numbers: 1 operation.

`VADDPS YMM0, YMM1, YMM2` adds eight 32-bit floats: 8 operations.

The evolution of x86 SIMD:
| Extension | Width | Elements (float32) | Year | Notes |
|---|---|---|---|---|
| MMX | 64-bit | n/a (2 × int32) | 1997 | Integer only, aliased onto the x87 FPU registers |
| SSE | 128-bit | 4 | 1999 | Dedicated XMM registers |
| SSE2 | 128-bit | 4 | 2000 | Added double precision, integer ops |
| SSE4.2 | 128-bit | 4 | 2008 | String processing, CRC32 |
| AVX | 256-bit | 8 | 2011 | New YMM registers, 3-operand encoding |
| AVX2 | 256-bit | 8 | 2013 | 256-bit integer ops, gather; FMA3 arrived alongside it on Haswell |
| AVX-512 | 512-bit | 16 | 2016 | Intel Xeon; dropped in some desktop CPUs |
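Real speedup example: Image processing, converting 1000 pixels from RGB to grayscale. A scalar loop handles one pixel per iteration and runs 1000 times; an 8-lane AVX loop handles eight pixels per iteration and runs only 125 times. That is an ideal 8× reduction in loop iterations, before memory bandwidth and data-layout costs are counted.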
Auto-vectorization by compilers (GCC with -O3 -march=native, LLVM) can apply SIMD automatically for simple loops. Manual intrinsics give full control.
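The lane idea behind the grayscale example can be modelled in plain Python. This is only a sketch of the iteration counts, not real SIMD: Python executes every multiply one at a time. The luma weights 0.299, 0.587, and 0.114 are an assumption (the common BT.601 values), and `gray_scalar` and `gray_vector` are hypothetical helper names.

```python
LANES = 8  # eight float32 lanes fit in one 256-bit AVX register


def to_gray(r, g, b):
    # Assumed BT.601 luma weights; the lesson does not specify a formula.
    return 0.299 * r + 0.587 * g + 0.114 * b


def gray_scalar(pixels):
    """One pixel per loop iteration, like a scalar instruction stream."""
    out, iterations = [], 0
    for r, g, b in pixels:
        out.append(to_gray(r, g, b))
        iterations += 1
    return out, iterations


def gray_vector(pixels, lanes=LANES):
    """`lanes` pixels per loop iteration, modelling one SIMD instruction stream."""
    out, iterations = [], 0
    for start in range(0, len(pixels), lanes):
        chunk = pixels[start:start + lanes]
        out.extend(to_gray(r, g, b) for r, g, b in chunk)
        iterations += 1
    return out, iterations


pixels = [(i % 256, (2 * i) % 256, (3 * i) % 256) for i in range(1000)]
scalar_out, scalar_iters = gray_scalar(pixels)
vector_out, vector_iters = gray_vector(pixels)

print(scalar_iters, vector_iters)
```

Both versions produce identical results; only the number of loop iterations differs (1000 versus 125). On real hardware the same restructuring is what a vectorizing compiler or hand-written intrinsics perform.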
With multiple cores sharing the same physical memory, each maintaining their own private L1/L2 caches, a fundamental problem emerges: coherence.
If Core 0 reads x into its L1 cache, then Core 1 writes x, Core 0's cached value is now stale. Without coherence, programs produce wrong answers.
The MESI protocol (covered in the cache lesson) assigns each cache line exactly one of four states: Modified, Exclusive, Shared, or Invalid. State transitions are communicated over the cache interconnect (a shared bus in older designs, a ring bus or mesh in modern CPUs).
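A toy model can make the state transitions concrete. The sketch below tracks a single cache line across several cores and handles only plain reads and writes; it ignores write-backs to memory, evictions, and interconnect timing, so treat it as an illustration rather than a faithful protocol implementation. The class names `State` and `CacheLine` are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class CacheLine:
    """MESI state of ONE cache line in each core's private cache."""

    def __init__(self, num_cores):
        self.states = [State.INVALID] * num_cores

    def read(self, core):
        if self.states[core] is not State.INVALID:
            return  # read hit: M, E and S all satisfy a local read
        others_have_copy = False
        for other, state in enumerate(self.states):
            if other != core and state is not State.INVALID:
                # Any current holder (M, E or S) ends up Shared.
                self.states[other] = State.SHARED
                others_have_copy = True
        self.states[core] = State.SHARED if others_have_copy else State.EXCLUSIVE

    def write(self, core):
        for other in range(len(self.states)):
            if other != core:
                self.states[other] = State.INVALID  # invalidate remote copies
        self.states[core] = State.MODIFIED


line = CacheLine(num_cores=2)
line.read(0)
after_first_read = list(line.states)   # core 0 holds the only copy
line.read(1)
after_second_read = list(line.states)  # both cores share the line
line.write(1)
after_write = list(line.states)        # core 0's stale copy is invalidated

for step in (after_first_read, after_second_read, after_write):
    print([s.value for s in step])
```

The final step is the stale-read scenario described earlier: once Core 1 writes, Core 0's copy becomes Invalid, so its next read must fetch the fresh value instead of using the stale one.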
False sharing: Two cores access different variables that happen to share the same 64-byte cache line. Every write by one core invalidates the other core's copy, even though neither core is actually touching the other's data. The cache line bounces back and forth between the two cores' caches over the coherence interconnect, and each bounce stalls the core that is waiting for it, effectively serializing execution.
Fix: pad or align data structures so that unrelated, frequently written variables don't share a cache line. In the Linux kernel, the `____cacheline_aligned_in_smp` attribute macro does this.
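The effect of padding on memory layout can be inspected with Python's `ctypes`, which lays out structures the way a C compiler would. This sketch only demonstrates the layout; it does not measure false sharing, and it assumes the structure itself starts on a cache-line boundary. The struct names `Unpadded` and `Padded` and the helper `line_index` are hypothetical.

```python
import ctypes

CACHE_LINE = 64  # bytes, the line size quoted in the text


class Unpadded(ctypes.Structure):
    # Two counters, e.g. one per thread, packed next to each other.
    _fields_ = [("a", ctypes.c_uint64), ("b", ctypes.c_uint64)]


class Padded(ctypes.Structure):
    # Filler bytes push `b` onto the next cache line.
    _fields_ = [
        ("a", ctypes.c_uint64),
        ("_pad", ctypes.c_char * (CACHE_LINE - ctypes.sizeof(ctypes.c_uint64))),
        ("b", ctypes.c_uint64),
    ]


def line_index(struct_type, field):
    """Cache line a field lands on, assuming the struct starts on a line boundary."""
    return getattr(struct_type, field).offset // CACHE_LINE


unpadded_lines = (line_index(Unpadded, "a"), line_index(Unpadded, "b"))
padded_lines = (line_index(Padded, "a"), line_index(Padded, "b"))

print("unpadded:", unpadded_lines)
print("padded:  ", padded_lines)
```

In the unpadded layout both counters fall on line 0, so two threads updating them would contend for the same line. In the padded layout they fall on lines 0 and 1, at the cost of 56 wasted bytes.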
Gene Amdahl (1967) formalized the ceiling on parallel speedup:
$$\text{Speedup}(N) = \frac{1}{S + \frac{(1-S)}{N}}$$
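Where S is the serial fraction of the program (the part that cannot be parallelized) and N is the number of processors.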
If 10% of a program is serial (S = 0.1), sixty-four processors provide only about 8.8× speedup, because the 10% serial code becomes the dominant bottleneck: as N grows, the parallel term shrinks toward zero and the speedup approaches 1/S = 10×. Amdahl's Law explains why massive parallelism doesn't automatically produce massive speedup for most real-world applications.
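The formula is easy to evaluate directly. The sketch below plugs S = 0.1 into it for a few processor counts; the function name `amdahl_speedup` is hypothetical, and the processor counts are chosen only for illustration.

```python
def amdahl_speedup(serial_fraction, processors):
    """Speedup(N) = 1 / (S + (1 - S) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


S = 0.1
for n in (2, 4, 8, 16, 64):
    print(f"N = {n:>2}: {amdahl_speedup(S, n):.2f}x")

# As N grows without bound, the speedup approaches 1 / S.
ceiling = 1.0 / S
print(f"ceiling: {ceiling:.0f}x")
```

With S = 0.1 the values come out to roughly 1.82×, 3.08×, 4.71×, 6.40×, and 8.77×. Each doubling of the core count buys less than the one before it.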
Gustafson's Law offers a different perspective: if the problem scales with the number of processors (larger problems for more processors), the serial fraction becomes relatively smaller and speedup approaches linear.
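For comparison, Gustafson's scaled speedup is commonly written as S + (1 − S)·N, where S is the serial fraction of the run time measured on the parallel machine. The lesson does not give this formula, so the sketch below relies on that standard form as an assumption; `gustafson_speedup` is a hypothetical name.

```python
def gustafson_speedup(serial_fraction, processors):
    """Scaled speedup: S + (1 - S) * N, with the problem growing alongside N."""
    return serial_fraction + (1.0 - serial_fraction) * processors


S = 0.1
for n in (2, 4, 8, 16, 64):
    print(f"N = {n:>2}: {gustafson_speedup(S, n):.1f}x")
```

With S = 0.1 and 64 processors this gives about 57.7×, far above the fixed-size ceiling of 10×. The two laws do not contradict each other: one holds the problem size fixed, the other lets it grow with the machine.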
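Intel Hyper-Threading (Intel's trademarked name for simultaneous multithreading, SMT) makes one physical core appear as two logical cores to the OS. Each logical core keeps its own architectural state, such as registers and a program counter, while both share the physical core's execution units and caches.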
Performance gain: typically 10–30% throughput improvement for mixed workloads. Not 2× — the two threads compete for the same execution units.
AMD's implementation: SMT2 (2 threads per core) on the Zen architecture. Both companies found 2-way SMT to be the sweet spot; wider configurations such as SMT4 and SMT8 (IBM POWER10) are used in high-throughput server chips.
The thermal wall of ~2004 ended single-core frequency scaling and forced the industry toward multicore processors. Flynn's Taxonomy classifies parallel architectures as SISD, SIMD, MISD, and MIMD — modern CPUs are simultaneously MIMD (multiple independent cores) and SIMD (vector execution within each core). SIMD extensions (SSE, AVX2, AVX-512) provide 4–16× data throughput for vectorizable workloads. Cache coherence (MESI protocol) ensures correctness but introduces false sharing as a performance hazard. Amdahl's Law provides the fundamental ceiling: with 10% serial code, no amount of parallelism can exceed 10× speedup — making the identification and reduction of serial bottlenecks as important as adding cores.