Picture a massive car factory in Detroit, circa 1913. Before Henry Ford's assembly line, one team of workers would build an entire car from scratch — bolting the chassis, installing the engine, fitting the seats, painting the body — before moving to the next car. It was thorough, but agonizingly slow.

Ford's insight was radical: break the work into stations. One station welds the frame. The next fits the engine. The next installs the interior. Each car moves down the line, and every station is always busy with a different car. The factory's throughput exploded.

A modern CPU pipeline is exactly this idea, applied to instructions instead of cars. Rather than fully completing one instruction before touching the next, the processor breaks instruction execution into distinct stages and keeps every stage busy simultaneously. The result is dramatic — instead of taking five clock cycles per instruction, a five-stage pipeline can complete one instruction every single clock cycle once it's full.

This single architectural concept underpins the performance of every modern processor from your laptop's Intel Core to your phone's Apple A18.

Why Pipelining Exists: The Speed Problem

Before we understand how pipelining works, let's understand why it had to be invented.

A CPU's job is to execute instructions. Even the simplest instruction — say, ADD R1, R2, R3 — involves multiple steps: fetch the instruction from memory, figure out what it means, do the arithmetic, and store the result. In a non-pipelined (serial) CPU, the processor must finish every one of those steps before it can even look at the next instruction.

If each step takes one clock cycle and there are five steps, one instruction takes 5 clock cycles. At 1 GHz, that means only 200 million instructions per second — a paltry figure by modern standards.

Now imagine overlapping those steps. While instruction 1 is in its third step (Execute), instruction 2 can be in its second step (Decode), and instruction 3 can be doing its first step (Fetch). All three instructions are making progress simultaneously. Throughput becomes one instruction per clock cycle instead of one per five cycles — a 5× improvement in theory.
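The arithmetic above can be checked with a short sketch. It assumes, as the text does, that every step takes exactly one clock cycle and that the pipeline never stalls; the function and constant names are illustrative only.

```python
# Rough throughput comparison: serial CPU vs. an ideally pipelined CPU.
# Assumes one clock cycle per step and no stalls (the idealized case in the text).

def instructions_per_second(clock_hz: float, cycles_per_instruction: float) -> float:
    """Steady-state instruction rate for a given clock frequency and CPI."""
    return clock_hz / cycles_per_instruction

CLOCK_HZ = 1e9  # 1 GHz, as in the text
STEPS = 5       # five one-cycle steps per instruction

serial_ips = instructions_per_second(CLOCK_HZ, STEPS)   # one instruction per 5 cycles
pipelined_ips = instructions_per_second(CLOCK_HZ, 1.0)  # one instruction per cycle once full

print(f"serial:    {serial_ips / 1e6:.0f} million instructions/s")
print(f"pipelined: {pipelined_ips / 1e6:.0f} million instructions/s")
print(f"ratio:     {pipelined_ips / serial_ips:.1f}x")
```

The 5x ratio is an upper bound: it only holds once the pipeline is full and only if nothing ever forces a stage to wait.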

This is the fundamental promise of pipelining: parallelism in time, not space.

The Classic 5-Stage RISC Pipeline (MIPS)

The textbook pipeline that every computer architecture course teaches comes from the MIPS architecture, developed at Stanford in the early 1980s. It breaks instruction execution into exactly five stages:

Stage	Name	What Happens	Hardware Used	Duration
IF	Instruction Fetch	Read instruction from memory at the Program Counter (PC) address; increment PC	Instruction Cache, PC Register, Instruction Register	1 cycle
ID	Instruction Decode	Identify instruction opcode; read source register values from register file	Control Unit, Register File, Sign Extender	1 cycle
EX	Execute	ALU performs arithmetic, logic, or address calculation	ALU, Multiplier, Barrel Shifter	1 cycle
MEM	Memory Access	Load/store instructions read/write data memory; others pass through	Data Cache, Memory Bus	1 cycle
WB	Write Back	Write result to destination register in register file	Register File (write port)	1 cycle

At cycle 5, all five stages are occupied by different instructions at the same time: the pipeline is full.
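To see how the five stages overlap, here is a minimal sketch that prints a stage-by-cycle diagram. It assumes an ideal pipeline in which one instruction enters per cycle and nothing stalls; `stage_at` and `diagram` are hypothetical helper names, not part of any tool.

```python
from typing import Optional

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def stage_at(instr: int, cycle: int) -> Optional[str]:
    """Stage occupied by instruction `instr` (0-based) during `cycle` (1-based).

    Assumes an ideal pipeline: one instruction enters per cycle, no stalls.
    Returns None if the instruction has not entered yet or has already finished.
    """
    idx = cycle - 1 - instr
    return STAGES[idx] if 0 <= idx < len(STAGES) else None

def diagram(n_instr: int) -> str:
    """Text diagram with one row per instruction and one column per cycle."""
    n_cycles = len(STAGES) + n_instr - 1
    header = "      " + "".join(f"c{c:<4}" for c in range(1, n_cycles + 1))
    rows = [header]
    for i in range(n_instr):
        cells = "".join(f"{(stage_at(i, c) or '.'):<5}" for c in range(1, n_cycles + 1))
        rows.append(f"I{i + 1:<4} " + cells)
    return "\n".join(rows)

print(diagram(5))
```

Reading the column for cycle 5 from top to bottom gives WB, MEM, EX, ID, IF: each stage holds a different instruction, which is what "the pipeline is full" means.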

Pipeline Registers

Between each pair of stages sit pipeline registers — flip-flops that latch the outputs of one stage and feed them into the next on the next clock edge. These are the "conveyor belt" mechanism:

IF/ID register: holds the fetched instruction and next PC
ID/EX register: holds register values, control signals, immediate
EX/MEM register: holds ALU result, branch target, control signals
MEM/WB register: holds memory data or ALU result, destination register
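The latching behaviour can be sketched in a few lines. This is a deliberately simplified model: each pipeline register holds only an instruction name (a plain string) rather than the register values and control signals listed above, and `clock_edge` is a hypothetical helper, not a real simulator API.

```python
from typing import List, Optional

REGISTERS = ["IF/ID", "ID/EX", "EX/MEM", "MEM/WB"]

def clock_edge(latches: List[Optional[str]], fetched: Optional[str]) -> Optional[str]:
    """Advance every latch by one stage on a clock edge.

    latches[i] is the instruction currently held in REGISTERS[i].
    Returns the instruction leaving MEM/WB, i.e. the one whose write-back
    just completed, or None if that slot was empty.
    """
    retiring = latches[-1]
    # Shift from the back so each latch takes the value of the one before it.
    for i in range(len(latches) - 1, 0, -1):
        latches[i] = latches[i - 1]
    latches[0] = fetched  # IF/ID latches whatever was fetched this cycle
    return retiring

program = ["ADD", "SUB", "LW"]           # illustrative instruction names
latches: List[Optional[str]] = [None] * len(REGISTERS)
completed = []                           # (cycle, instruction) pairs

# Feed the program, then empty slots (bubbles) so the pipeline drains.
feed = program + [None] * len(REGISTERS)
for cycle, instr in enumerate(feed, start=1):
    done = clock_edge(latches, instr)
    if done is not None:
        completed.append((cycle, done))

print(completed)
```

In this model the first instruction completes at cycle 5 and the others follow one cycle apart, because every latch hands its contents to the next one on the same clock edge.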

Throughput vs Latency

Pipelining improves throughput (instructions completed per unit time) but not latency (time for one instruction to complete). A single instruction still takes 5 clock cycles from start to finish, so pipelining does not help that instruction individually. What it does ensure is that the first instruction finishes at cycle 5 and, from then on, one instruction finishes every cycle.

Key insight: Pipelining is about keeping all hardware busy at all times, not about speeding up individual instructions.

CPI (Cycles Per Instruction):

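Ideal pipeline: CPI = 1.0
Real-world pipeline: CPI above 1.0 because of pipeline hazards (stalls and flushes); values around 1.1–1.3 are typical for a simple scalar pipeline, and the exact figure depends on the workload
A CPI of 1.2 means that, on average, one instruction completes every 1.2 clock cycles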
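One common way to reason about these figures is to add the average stall cycles per instruction to the ideal CPI. The sketch below does that; the hazard frequencies and penalties are hypothetical numbers chosen only so the result lands in the range quoted above.

```python
from typing import Dict, Tuple

def effective_cpi(ideal_cpi: float, events: Dict[str, Tuple[float, float]]) -> float:
    """Ideal CPI plus average stall cycles per instruction.

    `events` maps a label to (frequency per instruction, penalty in cycles).
    """
    stall_cycles = sum(freq * penalty for freq, penalty in events.values())
    return ideal_cpi + stall_cycles

# Hypothetical workload mix, for illustration only (not measured data).
hazards = {
    "load-use stall":       (0.10, 1),  # 10% of instructions wait 1 cycle
    "branch misprediction": (0.05, 2),  # 5% of instructions cost a 2-cycle flush
}

cpi = effective_cpi(1.0, hazards)
print(f"effective CPI = {cpi:.2f}")
```

With these assumed inputs the stalls add 0.2 cycles per instruction, giving a CPI of 1.2.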

Modern CPUs: Beyond Five Stages

The MIPS 5-stage pipeline is beautiful in its simplicity, but modern processors push this concept to extremes:

Intel Pentium 4 (Northwood, 2002): 20-stage pipeline, reaching 3.06 GHz
Intel Prescott (2004): 31-stage pipeline, reaching 3.8 GHz — infamously hot
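Intel Core microarchitecture (Conroe, 2006): 14-stage pipeline, a deliberate pullback for efficiency
Intel 13th Gen Raptor Lake (2022): approximately 14–19 stages depending on core type (a commonly cited estimate rather than an official figure)
AMD Zen 4 (2022): approximately 19-stage pipeline (a commonly cited estimate rather than an official figure)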

More stages mean each stage does less work, which allows a higher clock frequency. But it also means a branch misprediction (having to flush the pipeline) wastes more cycles. This is the fundamental pipeline depth trade-off.
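The trade-off can be illustrated with a toy model. Everything in it is an assumption made for the sake of the example: a fixed amount of logic delay split evenly across stages, a fixed delay added by each pipeline register, a flush that costs about as many cycles as there are stages, and made-up parameter values. It ignores power and heat entirely, so it is a sketch of the shape of the curve, not a design tool.

```python
def time_per_instruction_ns(
    depth: int,
    logic_ns: float = 10.0,           # assumed total logic delay of one instruction
    latch_ns: float = 0.5,            # assumed delay added by each pipeline register
    flushes_per_instr: float = 0.05,  # assumed pipeline flushes per instruction
) -> float:
    """Average time per instruction for a pipeline with `depth` stages (toy model)."""
    cycle_ns = logic_ns / depth + latch_ns   # deeper pipeline -> shorter cycle
    cpi = 1.0 + flushes_per_instr * depth    # deeper pipeline -> costlier flush
    return cycle_ns * cpi

for depth in (5, 10, 20, 31, 40):
    print(f"{depth:>2} stages: {time_per_instruction_ns(depth):.3f} ns per instruction")

best_depth = min(range(1, 61), key=time_per_instruction_ns)
print("best depth in this toy model:", best_depth)
```

With these particular numbers the time per instruction falls as stages are added, bottoms out, and then rises again as flush costs dominate. Different assumed parameters move the minimum, which is the point: the best depth depends on how often the pipeline has to be flushed.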

Real Performance Numbers

To understand the real-world impact of pipelining, consider this comparison:

Without pipelining (serial execution, 5-stage work):

1,000 instructions × 5 cycles each = 5,000 clock cycles

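With 5-stage pipeline (including the fill):

5 cycles for the first instruction + 999 cycles for the remaining 999 instructions = 1,004 clock cycles
Speedup: 5,000 / 1,004 ≈ 4.98×, approaching 5× as the instruction count grows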
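The cycle counts can be checked with the usual fill-then-stream model: a k-stage pipeline needs k cycles for the first instruction and one more cycle for each instruction after it. The sketch below assumes no stalls; the function names are illustrative.

```python
def serial_cycles(n: int, k: int) -> int:
    """Each of n instructions takes all k cycles before the next one starts."""
    return n * k

def pipelined_cycles(n: int, k: int) -> int:
    """k cycles for the first instruction, then one more per remaining instruction."""
    return k + (n - 1)

def speedup(n: int, k: int) -> float:
    return serial_cycles(n, k) / pipelined_cycles(n, k)

K = 5  # stages
for n in (5, 100, 1_000, 1_000_000):
    print(f"n={n:>9,}: serial={serial_cycles(n, K):>9,}  "
          f"pipelined={pipelined_cycles(n, K):>9,}  speedup={speedup(n, K):.2f}x")
```

For short instruction sequences the fill time matters (5 instructions give well under 5x), while for long ones the speedup approaches the number of stages.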

At 3 GHz with ideal CPI = 1.0:

3 billion instructions per second — that's 3 GIPS (Giga Instructions Per Second)

Modern superscalar processors execute multiple instructions simultaneously in the same pipeline stage (Intel's P-cores can handle up to 6 instructions per cycle), pushing effective IPC (Instructions Per Clock) above 1.0. The figure of 6 is a peak width, not a typical result: cache misses, branch mispredictions, and data dependencies keep the sustained IPC of real workloads well below the peak, and it varies widely from one program to another.
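IPC over a measured interval is simply retired instructions divided by elapsed clock cycles, and CPI is its reciprocal. The sketch below uses hypothetical counter readings (not measurements of any real CPU) to show the calculation and how it relates to a peak width of 6.

```python
def ipc(instructions_retired: int, cycles: int) -> float:
    """Sustained instructions per clock over a measured interval."""
    return instructions_retired / cycles

def rate_per_second(ipc_value: float, clock_hz: float) -> float:
    """Instructions per second at a given IPC and clock frequency."""
    return ipc_value * clock_hz

# Hypothetical counter readings for one program run (illustration only).
retired = 9_000_000_000
elapsed_cycles = 4_500_000_000

PEAK_WIDTH = 6   # peak instructions per cycle quoted in the text
CLOCK_HZ = 3e9   # 3 GHz, as in the example above

measured = ipc(retired, elapsed_cycles)
print(f"sustained IPC: {measured:.2f} (CPI {1 / measured:.2f})")
print(f"fraction of peak width used: {measured / PEAK_WIDTH:.0%}")
print(f"rate: {rate_per_second(measured, CLOCK_HZ) / 1e9:.1f} GIPS")
```

An IPC above 1.0 corresponds to a CPI below 1.0, which a single-issue pipeline cannot reach no matter how well it avoids stalls.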

Summary

The CPU pipeline transforms instruction execution from a strictly sequential process into an overlapped, assembly-line process. The classic 5-stage MIPS pipeline (IF → ID → EX → MEM → WB) achieves a theoretical throughput of one instruction per clock cycle. Later processors have used anywhere from about 14 to 31 stages to enable multi-GHz clock speeds, and current designs combine deep pipelines with superscalar execution to push IPC above 1.0.

Pipeline hazards (structural, data, and control conflicts) prevent the ideal from being achieved in practice, keeping the CPI of a simple scalar pipeline at roughly 1.1 to 1.3. Hazards are the subject of the next lesson.

Key numbers to remember:

Classic RISC pipeline: 5 stages
Ideal CPI: 1.0
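Real-world CPI (simple scalar pipeline): roughly 1.1–1.3
Modern Intel pipeline depth: approximately 14–19 stages
Modern Intel peak width: up to 6 instructions per cycle (superscalar + out-of-order); sustained IPC is lower and depends on the workload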
