Capstone Project: Real CPU Architecture Analysis & Application

Why This Project Exists

You've studied pipeline stages, hazards, cache mapping, virtual memory, I/O systems, parallel processing, and modern CPU architectures. But reading about concepts and applying them to real hardware specifications are two different cognitive experiences.

This capstone project bridges that gap. You will analyze two of the most significant real-world processors in current use, design your own comparative framework, apply pipeline and cache concepts to concrete problems, and verify that every number you write traces back to a concept from this course.

Engineers who can read a datasheet and immediately identify which architectural choices produced which specification — that's the skill this project develops.

Part 1: Intel Core i7-13700K Deep Analysis

The Intel Core i7-13700K (Raptor Lake, Q4 2022) is a mainstream desktop processor that embodies Intel's hybrid architecture design.

Verified Specifications

Specification	Value
Core Configuration	8 Performance-cores (P-cores) + 8 Efficiency-cores (E-cores)
Thread Count	24 threads (P-cores are hyperthreaded: 8×2 = 16; E-cores: 8×1 = 8)
P-core Base / Boost	3.4 GHz base / 5.4 GHz max turbo
E-core Base / Boost	2.5 GHz base / 4.2 GHz boost
L1 Cache (P-core)	32 KB instruction + 48 KB data per core
L1 Cache (E-core)	64 KB instruction + 32 KB data per core
L2 Cache	2 MB per P-core; 4 MB per E-core cluster (4 cores share)
L3 Cache	30 MB shared (Intel Smart Cache)
Process Node	Intel 7 (enhanced 10nm ESF)
Transistors	Not officially disclosed by Intel (~25 billion is an unofficial estimate)
TDP	125W base; 253W max turbo power (MTP)
Memory Support	DDR4-3200 or DDR5-5600 dual-channel
PCIe	PCIe 5.0 ×16 (GPU) + PCIe 4.0 ×4 (NVMe)
Socket	LGA1700

Connecting Specifications to Course Concepts

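Pipeline (Lesson: CPU Pipeline Stages): Intel does not publish exact pipeline depths; the P-cores (Raptor Cove, a refinement of Golden Cove) are commonly estimated at 14–19 stages. A deeper pipeline does less work per stage, so each stage is shorter and the clock can run faster, which is part of what makes the 5.4 GHz boost possible. The cost is a larger branch misprediction penalty. The E-cores (Gracemont) use a shorter pipeline (roughly 12 stages), prioritizing power efficiency over peak clock.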

Hazards (Lesson: Pipeline Hazards & Solutions): Intel does not document Raptor Cove's branch predictor. It is widely assumed to be a TAGE-style design rather than a classic tournament predictor, and predictors of this class reach roughly 95–99% accuracy on typical workloads. The reorder buffer (ROB) holds 512 entries, so up to 512 micro-ops can be "in flight" simultaneously while the scheduler finds independent work to hide latency.
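To see why the last few percent of prediction accuracy matter on a deep pipeline, here is a minimal sketch of the usual first-order CPI model. The base CPI of 1.0, the 20% branch frequency, and the 17-cycle misprediction penalty are illustrative assumptions, not measured values for this processor.

```python
def cpi_with_branches(base_cpi, branch_fraction, accuracy, penalty_cycles):
    """First-order model: every mispredicted branch adds a fixed flush penalty."""
    mispredict_rate = 1.0 - accuracy
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Assumed workload: 20% of instructions are branches, 17-cycle flush penalty.
for accuracy in (0.95, 0.99):
    cpi = cpi_with_branches(1.0, 0.20, accuracy, 17)
    print(f"accuracy {accuracy:.0%}: effective CPI = {cpi:.3f}")
```

Under these assumptions, moving from 95% to 99% accuracy cuts the branch overhead from 0.17 to about 0.03 cycles per instruction.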

Cache (Lesson: Cache Memory & Mapping):

L1 I-cache: 32 KB, 8-way set associative, 12-cycle fill latency
L1 D-cache: 48 KB, 12-way set associative, 5-cycle hit latency
L2: 2 MB, 16-way set associative, ~14-cycle hit latency
L3: 30 MB, 24-way set associative, ~40-cycle hit latency
All use PLRU (Pseudo-LRU) replacement policy
MESIF protocol (Intel's extension of MESI that adds a Forward state) for cache coherence across all 16 physical cores
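The hit latencies listed above can be combined into an average memory access time (AMAT). The sketch below is a simplified model: the miss rates and the 300-cycle DRAM latency are assumptions chosen for illustration, and each level's latency is treated as an additional cost paid only when the previous level misses.

```python
def amat(levels, dram_latency):
    """Average memory access time in cycles.

    levels: list of (hit_latency_cycles, local_miss_rate), ordered from L1 outward.
    """
    total = dram_latency
    # Work inward from DRAM: each level costs its hit latency plus,
    # on a miss, the average cost of going one level further out.
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total


# Latencies from the list above; miss rates are assumed, not measured.
hierarchy = [(5, 0.05), (14, 0.30), (40, 0.40)]
print(f"AMAT = {amat(hierarchy, 300):.1f} cycles")
```

With these assumed miss rates the result is about 8 cycles, far closer to the L1 latency than to DRAM latency, which is the whole point of the hierarchy.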

Virtual Memory (Lesson: Virtual Memory & Paging): Supports 4-level and 5-level paging (LA57 for 57-bit virtual addresses — 128 PB virtual space). Includes a 2,048-entry L2 TLB for 4KB pages.
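The address-space figures follow directly from the x86-64 page-table layout: a 12-bit page offset plus 9 index bits per table level. A short sketch of that arithmetic, including the reach of a 2,048-entry TLB with 4 KB pages:

```python
PAGE_OFFSET_BITS = 12   # 4 KB pages
BITS_PER_LEVEL = 9      # 512 entries per page table


def virtual_address_bits(levels):
    """Virtual address width for a page table with the given number of levels."""
    return PAGE_OFFSET_BITS + levels * BITS_PER_LEVEL


def tlb_reach_bytes(entries, page_bytes=4096):
    """Amount of memory a TLB can map without a page-table walk."""
    return entries * page_bytes


print(virtual_address_bits(4), "bits ->", 2**virtual_address_bits(4) // 2**40, "TiB")
print(virtual_address_bits(5), "bits ->", 2**virtual_address_bits(5) // 2**50, "PiB")
print("L2 TLB reach:", tlb_reach_bytes(2048) // 2**20, "MiB")
```

Four levels give 48 bits (256 TiB) and five levels give 57 bits (128 PiB). The 8 MiB TLB reach is small next to the 30 MB L3, which is one reason large pages matter for big working sets.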

I/O (Lesson: I/O Systems, Interrupts & DMA): The APIC handles 256 interrupt vectors with hardware priority levels. PCIe 5.0 ×16 provides about 63 GB/s of DMA bandwidth in each direction to the discrete GPU (often rounded to 64 GB/s); PCIe 4.0 ×4 to NVMe SSDs provides about 7.9 GB/s in theory, or roughly 7 GB/s in practice on fast drives.
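The PCIe figures can be reproduced from the per-lane signaling rate. This sketch assumes the standard rates of 16 GT/s (PCIe 4.0) and 32 GT/s (PCIe 5.0) with 128b/130b encoding, and it ignores packet-level protocol overhead, so real throughput is somewhat lower.

```python
def pcie_bandwidth_gb_s(gt_per_s, lanes, encoding=128 / 130):
    """Theoretical one-direction bandwidth in GB/s (1 GB = 1e9 bytes)."""
    # Each transfer carries one bit per lane; the encoding factor removes line-code overhead.
    return gt_per_s * lanes * encoding / 8


print(f"PCIe 5.0 x16: {pcie_bandwidth_gb_s(32, 16):.1f} GB/s")
print(f"PCIe 4.0 x4:  {pcie_bandwidth_gb_s(16, 4):.2f} GB/s")
```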

Parallel Processing (Lesson: Parallel Processing & Multicore): Like any multicore CPU, the i7-13700K is a MIMD machine in Flynn's taxonomy: each core runs its own instruction stream on its own data. The hybrid P+E design adds heterogeneity on top of that. Intel's Thread Director (hardware telemetry) helps the Windows 11 or Linux 5.18+ scheduler assign threads to the appropriate core type. Hyper-Threading on the P-cores provides 2-way SMT.

Part 2: Apple M3 Pro Analysis & Intel Comparison

The Apple M3 Pro (November 2023, TSMC 3nm N3B) represents the ARM architectural philosophy applied to mainstream laptop computing.

Verified Specifications

Specification	Value
Core Configuration	12 CPU cores: 6 P-cores + 6 E-cores
GPU Cores	18-core Apple GPU
Neural Engine	16-core, 18 TOPS
Memory	Unified: 18 GB or 36 GB LPDDR5 on-package
Memory Bandwidth	150 GB/s
Process	TSMC 3nm (N3B)
Transistors	37 billion
TDP	~30W sustained (thermal envelope)
L1 Cache (P-core)	192 KB instruction + 128 KB data per core
L2 Cache (P-cluster)	24 MB shared L2 (6 P-cores)
System Level Cache	24 MB SLC (L3 equivalent)
PCIe	PCIe 4.0 ×4 to NVMe

Architectural Trade-off Comparison

Key trade-offs Apple made:

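Unified Memory vs Discrete GPU VRAM: Apple's UMA means the 18 GB (or 36 GB) pool is shared between CPU and GPU. A desktop with a discrete GPU such as the RTX 4090 has 24 GB of dedicated VRAM in addition to 32 GB or more of system RAM. For AI inference and video editing, UMA is often faster because data never has to be copied across PCIe. For high-end gaming, a discrete GPU with dedicated VRAM and much higher memory bandwidth (about 1 TB/s on the RTX 4090) wins.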
ARM ISA vs x86-64: The M3 Pro delivers better performance per watt than the i7-13700K; ~30% is a rough figure that varies by workload. But native x86-64 compatibility matters: x86-only macOS software runs through Rosetta 2 translation, and Windows-only applications (including older CAD tools) require virtualization or emulation, both of which add overhead.
150 GB/s vs ~90 GB/s bandwidth: The M3 Pro's on-package LPDDR5 provides about 1.7× the peak bandwidth of the i7-13700K's off-package dual-channel DDR5-5600 (~89.6 GB/s). This benefits GPU rendering, video transcoding, and other memory-bound workloads.
Neural Engine: The 16-core Neural Engine handles Core ML inference workloads (image recognition, LLM inference) at 18 TOPS with vastly better power efficiency than running the same work on CPU or GPU. The i7-13700K has no dedicated NPU (Neural Processing Unit) — Intel's Meteor Lake added one in 2023.
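The bandwidth figures in the comparison above come from a simple product: transfer rate times bus width. In the sketch below, the 128-bit bus for dual-channel DDR5 is standard, while the 192-bit bus and 6400 MT/s rate for the M3 Pro are assumptions based on commonly reported figures rather than numbers from this lesson.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_width_bits):
    """Peak theoretical memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Dual-channel DDR5-5600: two 64-bit channels = 128 bits total.
print(f"i7-13700K DDR5-5600: {peak_bandwidth_gb_s(5600, 128):.1f} GB/s")
# Assumed M3 Pro configuration: LPDDR5-6400 on a 192-bit bus.
print(f"M3 Pro LPDDR5:       {peak_bandwidth_gb_s(6400, 192):.1f} GB/s")
```

Under these assumptions the M3 Pro result is 153.6 GB/s, consistent with the 150 GB/s in the specification table.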

Part 3: Architecture Comparison Table (Your Framework)

Use this table as a starting point. Extend it with your own research:

Attribute	Intel (i9-14900K)	AMD (Ryzen 9 7950X)	ARM (Cortex-X4)	Apple Silicon (M4)	RISC-V (SiFive P670)
ISA	x86-64	x86-64	ARMv9-A	ARMv9-A	RV64GC
Transistors	Not disclosed	~13B (2 CCDs) + ~3.4B (IOD)	~300M (core)	28B	~150M
Process Node	Intel 7 (10nm)	TSMC 5nm / 6nm IOD	TSMC 4nm	TSMC 3nm	TSMC 7nm
TDP / Power	125W–253W	170W	1–5W (mobile)	~30W (M4 Pro)	~0.5W
Perf / Watt	Moderate	Good	Excellent	Exceptional	Good
Memory	DDR5-5600	DDR5-5200	LPDDR5X	LPDDR5X unified	DDR4
Primary Use Case	Desktop gaming/workstation	Desktop gaming/workstation	Smartphones (Google Pixel, Galaxy)	Mac laptops/desktops	Embedded/IoT/edge AI

Part 4: Pipeline Hazard Application Exercise

Consider these five sequential instructions on a 5-stage RISC pipeline (MIPS-like):

I1: ADD  R1, R2, R3   # R1 = R2 + R3
I2: SUB  R4, R1, R5   # R4 = R1 - R5   ← reads R1 (written by I1)
I3: AND  R6, R4, R7   # R6 = R4 & R7   ← reads R4 (written by I2)
I4: LW   R8, 0(R6)    # R8 = Memory[R6]← reads R6 (written by I3)
I5: ADD  R9, R8, R1   # R9 = R8 + R1   ← reads R8 (written by I4), reads R1 (written by I1)

RAW Hazard Analysis

Dependency	Instructions	Gap (cycles)	Resolvable by Forwarding?	Stalls Needed
R1: I1 → I2	ADD writes R1; SUB reads R1	1 cycle apart	Yes (EX/MEM → EX forward)	0 stalls
R4: I2 → I3	SUB writes R4; AND reads R4	1 cycle apart	Yes (EX/MEM → EX forward)	0 stalls
R6: I3 → I4	AND writes R6; LW reads R6	1 cycle apart	Yes (EX/MEM → EX forward — address calc)	0 stalls
R8: I4 → I5	LW writes R8; ADD reads R8	1 cycle apart	Partial: load-use hazard (MEM/WB → EX forward after the stall)	1 stall required
R1: I1 → I5	ADD writes R1; ADD reads R1	4 cycles apart	No forwarding needed (R1 is already in the register file)	0 stalls

Key finding: Only the I4 → I5 load-use hazard requires an unavoidable stall. All other RAW hazards are resolved by forwarding (EX/MEM → EX path). Without forwarding, and assuming the register file is written in the first half of a cycle and read in the second, each of the four adjacent dependencies (I1→I2, I2→I3, I3→I4, I4→I5) would require 2 stall cycles: a total of 8 wasted cycles reduced to 1.
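The stall counts can be checked mechanically. The following is a minimal sketch of an in-order 5-stage pipeline model, not a full simulator: it tracks only the cycle in which each instruction reaches EX. It assumes the register file is written in the first half of WB and read in the second half of ID, so without forwarding a consumer's ID can overlap the producer's WB. The `Instr` class and `count_stalls` function are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    op: str
    dest: str
    srcs: tuple


PROGRAM = [
    Instr("I1", "ADD", "R1", ("R2", "R3")),
    Instr("I2", "SUB", "R4", ("R1", "R5")),
    Instr("I3", "AND", "R6", ("R4", "R7")),
    Instr("I4", "LW", "R8", ("R6",)),
    Instr("I5", "ADD", "R9", ("R8", "R1")),
]


def count_stalls(program, forwarding):
    """Total stall cycles caused by RAW hazards in a 5-stage pipeline."""
    producers = {}   # register -> (EX cycle of last writer, writer is a load)
    prev_ex = 2      # so the first instruction reaches EX in cycle 3
    stalls = 0
    for ins in program:
        earliest = prev_ex + 1          # EX cycle if nothing stalls
        ex = earliest
        for reg in ins.srcs:
            if reg not in producers:
                continue
            prod_ex, is_load = producers[reg]
            if not forwarding:
                ready = prod_ex + 3     # consumer ID must overlap producer WB
            elif is_load:
                ready = prod_ex + 2     # value forwarded from MEM/WB
            else:
                ready = prod_ex + 1     # value forwarded from EX/MEM
            ex = max(ex, ready)
        stalls += ex - earliest
        producers[ins.dest] = (ex, ins.op == "LW")
        prev_ex = ex
    return stalls


print("with forwarding:   ", count_stalls(PROGRAM, forwarding=True))
print("without forwarding:", count_stalls(PROGRAM, forwarding=False))
```

Under these assumptions the model reports 1 stall with forwarding and 8 without.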

Pipeline execution timeline with forwarding + 1 load-use stall:

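Cycle	1	2	3	4	5	6	7	8	9	10
I1	IF	ID	EX	MEM	WB
I2		IF	ID	EX	MEM	WB
I3			IF	ID	EX	MEM	WB
I4				IF	ID	EX	MEM	WB
I5					IF	ID	stall	EX	MEM	WB

stall = 1 bubble inserted for the load-use hazard. I5 waits in ID during cycle 7; its EX in cycle 8 then receives R8 through the MEM/WB → EX forwarding path. All five instructions complete in 10 cycles instead of the ideal 9.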

Part 5: Cache Calculation Exercise

Given: A 4-way set associative cache with:

Total size: 32 KB (32,768 bytes)
Cache line (block) size: 64 bytes
Associativity: 4-way
Address space: 64-bit

Step-by-Step Calculation

Step 1: Number of cache lines total

Total lines = Cache size / Line size = 32,768 / 64 = 512 lines

Step 2: Number of sets

Sets = Total lines / Associativity = 512 / 4 = 128 sets

Step 3: Offset bits (bits needed to address one byte within a 64-byte line)

Offset bits = log₂(64) = 6 bits

Step 4: Index bits (bits needed to select one of 128 sets)

Index bits = log₂(128) = 7 bits

Step 5: Tag bits (remaining bits identify which memory block)

Tag bits = 64 - 7 - 6 = 51 bits

Summary Table:

Parameter	Calculation	Result
Total cache lines	32,768 / 64	512 lines
Number of sets	512 / 4	128 sets
Offset bits	log₂(64)	6 bits
Index bits	log₂(128)	7 bits
Tag bits	64 − 7 − 6	51 bits
Tag storage overhead	512 lines × 51 bits	~3.2 KB
Valid + dirty bits	512 × 2 bits	128 bytes

Verification: 6 + 7 + 51 = 64 bits ✓
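The same steps can be written as a small helper that also splits a concrete address into its fields. This is a sketch that assumes power-of-two sizes; the function names and the example address 0x12345678 are illustrative.

```python
def cache_geometry(size_bytes, line_bytes, ways, addr_bits):
    """Derive set count and address field widths for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1   # exact log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return {"lines": lines, "sets": sets, "offset_bits": offset_bits,
            "index_bits": index_bits, "tag_bits": tag_bits}


def split_address(addr, offset_bits, index_bits):
    """Return (tag, set index, byte offset) for an address."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


geo = cache_geometry(32 * 1024, 64, 4, 64)
print(geo)
print("tag storage bytes:", geo["lines"] * geo["tag_bits"] // 8)

tag, index, offset = split_address(0x12345678, geo["offset_bits"], geo["index_bits"])
print(f"tag={tag:#x} set={index} offset={offset}")
```

For the example address, the low 6 bits select byte 56 within the line, the next 7 bits select set 89, and the remaining bits form the tag that is compared against the four ways of that set.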

What You Learned: Course Recap

This course covered the complete picture of how modern processors work:

Lesson	Core Concept	Key Number
What is Computer Architecture	von Neumann model, ISA abstraction	Harvard vs. von Neumann
CPU Pipeline Stages	5-stage RISC pipeline (IF/ID/EX/MEM/WB)	Ideal CPI = 1.0
Pipeline Hazards & Solutions	Structural, Data, Control hazards; forwarding; branch prediction	95–99% branch prediction accuracy
Cache Memory & Mapping	Locality, hierarchy, direct/set-associative/full mapping	300× CPU-to-RAM speed gap
Virtual Memory & Paging	Pages, frames, page tables, TLB, page faults	4-level page table on x86-64
I/O Systems, Interrupts & DMA	Polling, interrupt-driven, DMA; IRQ, ISR, IDT	DMA: near-zero CPU cycles per byte (CPU only sets up the transfer and handles the completion interrupt)
Parallel Processing & Multicore	Flynn taxonomy, Amdahl's Law, SIMD, cache coherence	10× max speedup with 10% serial code
Modern CPU Architectures	x86-64, ARM, RISC-V, process nodes, chiplets	3nm = 292M transistors/mm²
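The "10× max speedup with 10% serial code" entry in the table is Amdahl's Law. A brief sketch of the formula, evaluated for a few core counts (the core counts are arbitrary examples):

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup predicted by Amdahl's Law for a fixed-size workload."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


for cores in (4, 16, 1024):
    print(f"{cores:>5} cores: {amdahl_speedup(0.10, cores):.2f}x")

# As the core count grows, the speedup approaches 1 / serial_fraction.
print("limit:", 1.0 / 0.10)
```

With 10% serial code, 16 cores give only a 6.4× speedup, and no number of cores can exceed 10×.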

The unifying insight: Every architectural decision in a CPU — pipeline depth, cache size, branch predictor complexity, core count, ISA choice — is a trade-off. Performance vs. power. Throughput vs. latency. Complexity vs. reliability. The engineer's job is to understand which trade-offs serve the target use case.

A smartphone processor and a server processor can both be "great CPUs" while making almost entirely opposite architectural choices. The measure of mastery is knowing why those choices differ — and this course has given you exactly that foundation.
