AiTechWorlds
You've studied pipeline stages, hazards, cache mapping, virtual memory, I/O systems, parallel processing, and modern CPU architectures. But reading about concepts and applying them to real hardware specifications are two different cognitive experiences.
This capstone project bridges that gap. You will analyze two of the most significant real-world processors in current use, design your own comparative framework, apply pipeline and cache concepts to concrete problems, and verify that every number you write traces back to a concept from this course.
This project develops one skill in particular: reading a datasheet and immediately identifying which architectural choices produced which specification.
The Intel Core i7-13700K (Raptor Lake, Q4 2022) is a mainstream desktop processor that embodies Intel's hybrid architecture design.
| Specification | Value |
|---|---|
| Core Configuration | 8 Performance-cores (P-cores) + 8 Efficiency-cores (E-cores) |
| Thread Count | 24 threads (P-cores are hyperthreaded: 8×2 = 16; E-cores: 8×1 = 8) |
| P-core Base / Boost | 3.4 GHz base / 5.4 GHz max turbo |
| E-core Base / Boost | 2.5 GHz base / 4.2 GHz boost |
| L1 Cache (P-core) | 32 KB instruction + 48 KB data per core |
| L1 Cache (E-core) | 64 KB instruction + 32 KB data per core |
| L2 Cache | 2 MB per P-core; 4 MB per E-core cluster (4 cores share) |
| L3 Cache | 30 MB shared (Intel Smart Cache) |
| Process Node | Intel 7 (enhanced 10nm ESF) |
| Transistors | ~25 billion |
| TDP | 125W base; 253W max turbo power (MTP) |
| Memory Support | DDR4-3200 or DDR5-5600 dual-channel |
| PCIe | PCIe 5.0 ×16 (GPU) + PCIe 4.0 ×4 (NVMe) |
| Socket | LGA1700 |
Pipeline (Lesson: CPU Pipeline Stages): The i7-13700K uses a 14–19 stage pipeline in P-cores (Golden Cove / Raptor Cove microarchitecture). This depth enables the 5.4 GHz boost clock — shorter stages mean faster clocks. The E-cores (Gracemont) use a shorter pipeline (~12 stages) prioritizing power efficiency over peak clock.
Hazards (Lesson: Pipeline Hazards & Solutions): Intel's Raptor Cove uses a TAGE-style branch predictor (the TAGE-SC-L family; Intel does not publish the exact design), and predictors of this class achieve 95–99% accuracy on typical workloads. The out-of-order engine has a 512-entry reorder buffer (ROB), meaning up to 512 micro-ops can be "in flight" simultaneously while the scheduler finds independent work to hide latency.
Cache (Lesson: Cache Memory & Mapping):
Virtual Memory (Lesson: Virtual Memory & Paging): Supports 4-level and 5-level paging (LA57 for 57-bit virtual addresses — 128 PB virtual space). Includes a 2,048-entry L2 TLB for 4KB pages.
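The 57-bit and 128 PB figures follow from the x86-64 page table layout: each paging level indexes 512 entries (9 bits), and a 4 KB page needs a 12-bit offset. A minimal sketch of that arithmetic; the function names are illustrative, not part of any API:

```python
PAGE_OFFSET_BITS = 12      # 4 KB pages: 2**12 bytes per page
INDEX_BITS_PER_LEVEL = 9   # 512 eight-byte entries fill one 4 KB table


def virtual_address_bits(levels: int) -> int:
    """Virtual address width for a given number of paging levels."""
    return levels * INDEX_BITS_PER_LEVEL + PAGE_OFFSET_BITS


def address_space_bytes(levels: int) -> int:
    """Size of the virtual address space in bytes."""
    return 2 ** virtual_address_bits(levels)


for levels in (4, 5):
    bits = virtual_address_bits(levels)
    tib = address_space_bytes(levels) // 2**40
    print(f"{levels}-level paging: {bits}-bit virtual addresses, {tib:,} TiB")
```

Four levels give 48-bit addresses (256 TiB); the fifth level adds 9 bits, which multiplies the space by 512.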
I/O (Lesson: I/O Systems, Interrupts & DMA): The APIC handles 256 interrupt vectors with hardware priority levels. PCIe 5.0 ×16 provides 64 GB/s DMA bandwidth to the discrete GPU; PCIe 4.0 ×4 to NVMe SSDs provides up to 7 GB/s DMA bandwidth.
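The PCIe figures can be sanity-checked from the per-lane signalling rate. The sketch below assumes the standard rates of 32 GT/s (PCIe 5.0) and 16 GT/s (PCIe 4.0) per lane with 128b/130b line encoding, and reports theoretical bandwidth in one direction; real DMA throughput is lower because of packet and protocol overhead, which is why NVMe drives on a ×4 link top out near 7 GB/s. The function name is illustrative.

```python
def pcie_bandwidth_gb_s(gt_per_s: float, lanes: int) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s (Gen3 and later)."""
    encoding_efficiency = 128 / 130  # 128 payload bits per 130 bits on the wire
    bits_per_byte = 8
    return gt_per_s * lanes * encoding_efficiency / bits_per_byte


gen5_x16 = pcie_bandwidth_gb_s(32, 16)  # discrete GPU link
gen4_x4 = pcie_bandwidth_gb_s(16, 4)    # NVMe SSD link

print(f"PCIe 5.0 x16: {gen5_x16:.1f} GB/s")
print(f"PCIe 4.0 x4:  {gen4_x4:.1f} GB/s")
```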
Parallel Processing (Lesson: Parallel Processing & Multicore): The hybrid P+E core design is a MIMD machine in Flynn's taxonomy: each core executes its own instruction stream on its own data. Intel's Thread Director (hardware telemetry) assists the Windows 11/Linux 5.18+ scheduler in assigning tasks to appropriate core types. Hyperthreading on P-cores provides 2-way SMT.
The Apple M3 Pro (November 2023, TSMC 3nm N3B) represents the ARM architectural philosophy applied to mainstream laptop computing.
| Specification | Value |
|---|---|
| Core Configuration | 12 CPU cores: 6 P-cores + 6 E-cores |
| GPU Cores | 18-core Apple GPU |
| Neural Engine | 16-core, 18 TOPS |
| Memory | Unified: 18 GB or 36 GB LPDDR5 on-package |
| Memory Bandwidth | 150 GB/s |
| Process | TSMC 3nm (N3B) |
| Transistors | 37 billion |
| TDP | ~30W sustained (thermal envelope) |
| L1 Cache (P-core) | 192 KB instruction + 128 KB data per core |
| L2 Cache (P-cluster) | 24 MB shared L2 (6 P-cores) |
| System Level Cache | 24 MB SLC (L3 equivalent) |
| PCIe | PCIe 4.0 ×4 to NVMe |
Key trade-offs Apple made:
Unified Memory vs Discrete GPU VRAM: Apple's UMA means the 18 GB is shared between CPU and GPU. A desktop with a discrete GPU (for example, an RTX 4090) has 24 GB of dedicated VRAM plus 32 GB or more of system RAM. For AI inference and video editing, UMA is often faster because no copy over PCIe is needed. For high-end gaming, dedicated VRAM backed by a separate pool of system RAM wins on total capacity.
ARM ISA vs x86-64: Apple's M3 Pro delivers roughly 30% better performance-per-watt than the i7-13700K, though the exact margin depends on the workload. Native x86-64 compatibility still matters: x86-only macOS software runs through Rosetta 2 translation, and Windows-only applications (including older CAD tools) require virtualization or emulation, both of which add overhead.
150 GB/s vs ~90 GB/s bandwidth: The M3 Pro's on-package LPDDR5 memory provides about 1.7 times the peak bandwidth of the i7-13700K's off-package dual-channel DDR5-5600 (~89.6 GB/s). This benefits GPU rendering, video transcoding, and other memory-bound workloads.
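The ~89 GB/s figure for the Intel platform is the transfer rate multiplied by the bus width. A minimal sketch, assuming dual-channel DDR5 behaves as a 128-bit bus (two 64-bit channels); the 150 GB/s value for the M3 Pro is taken from the specification table rather than derived, and the function name is illustrative.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth = transfers per second x bytes moved per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# DDR5-5600 in dual-channel mode: 5600 MT/s over an assumed 128-bit bus.
ddr5_gb_s = peak_bandwidth_gb_s(5600, 128)

# M3 Pro figure as quoted in the specification table.
M3_PRO_GB_S = 150.0

print(f"DDR5-5600 dual-channel: {ddr5_gb_s:.1f} GB/s")
print(f"M3 Pro advantage: {M3_PRO_GB_S / ddr5_gb_s:.2f}x")
```

These are peak numbers; sustained bandwidth on both platforms is lower and depends on access pattern.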
Neural Engine: The 16-core Neural Engine handles Core ML inference workloads (image recognition, LLM inference) at 18 TOPS with vastly better power efficiency than running the same work on CPU or GPU. The i7-13700K has no dedicated NPU (Neural Processing Unit) — Intel's Meteor Lake added one in 2023.
Use this table as a starting point. Extend it with your own research:
| Attribute | Intel (i9-14900K) | AMD (Ryzen 9 7950X) | ARM (Cortex-X4) | Apple Silicon (M4) | RISC-V (SiFive P670) |
|---|---|---|---|---|---|
| ISA | x86-64 | x86-64 | ARMv9-A | ARMv9-A | RV64GC |
| Transistors | ~25B | ~13B (CCD) + ~6B (IOD) | ~300M (core) | 28B | ~150M |
| Process Node | Intel 7 (10nm) | TSMC 5nm / 6nm IOD | TSMC 4nm | TSMC 3nm | TSMC 7nm |
| TDP / Power | 125W–253W | 170W | 1–5W (mobile) | ~30W (M4 Pro) | ~0.5W |
| Perf / Watt | Moderate | Good | Excellent | Exceptional | Good |
| Memory | DDR5-5600 | DDR5-5200 | LPDDR5X | LPDDR5 unified | DDR4 |
| Primary Use Case | Desktop gaming/workstation | Desktop gaming/workstation | Smartphones (Google Pixel, Galaxy) | Mac laptops/desktops | Embedded/IoT/edge AI |
Consider these five sequential instructions on a 5-stage RISC pipeline (MIPS-like):
I1: ADD R1, R2, R3 # R1 = R2 + R3
I2: SUB R4, R1, R5 # R4 = R1 - R5 ← reads R1 (written by I1)
I3: AND R6, R4, R7 # R6 = R4 & R7 ← reads R4 (written by I2)
I4: LW R8, 0(R6) # R8 = Memory[R6] ← reads R6 (written by I3)
I5: ADD R9, R8, R1 # R9 = R8 + R1 ← reads R8 (written by I4), reads R1 (written by I1)
| Dependency | Instructions | Gap (cycles) | Resolvable by Forwarding? | Stalls Needed |
|---|---|---|---|---|
| R1: I1 → I2 | ADD writes R1; SUB reads R1 | 1 cycle apart | Yes (EX/MEM → EX forward) | 0 stalls |
| R4: I2 → I3 | SUB writes R4; AND reads R4 | 1 cycle apart | Yes (EX/MEM → EX forward) | 0 stalls |
| R6: I3 → I4 | AND writes R6; LW reads R6 | 1 cycle apart | Yes (EX/MEM → EX forward — address calc) | 0 stalls |
| R8: I4 → I5 | LW writes R8; ADD reads R8 | 1 cycle apart | Partial — Load-Use hazard | 1 stall required |
| R1: I1 → I5 | ADD writes R1; ADD reads R1 | 4 cycles apart | Not needed (R1 is already in the register file after I1's WB) | 0 stalls |
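Key finding: Only the I4 → I5 load-use hazard requires an unavoidable stall, because the loaded value is not available until the end of MEM. All other RAW hazards are resolved by forwarding (EX/MEM → EX path). Without forwarding, and assuming the register file is written in the first half of a cycle and read in the second half, each of the four back-to-back dependencies (I1→I2, I2→I3, I3→I4, I4→I5) would require 2 stall cycles: a total of 8 wasted cycles reduced to 1.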
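The stall counts can be checked with a small scheduling model. This is a sketch under stated assumptions, not a cycle-accurate simulator: with forwarding, an ALU result is usable by the next instruction's EX stage and a loaded value one cycle later; without forwarding, the consumer's ID stage must coincide with or follow the producer's WB stage (register file written in the first half of the cycle, read in the second). `Instr` and `count_stalls` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    is_load: bool = False


PROGRAM = [
    Instr("I1", "R1", ("R2", "R3")),
    Instr("I2", "R4", ("R1", "R5")),
    Instr("I3", "R6", ("R4", "R7")),
    Instr("I4", "R8", ("R6",), is_load=True),
    Instr("I5", "R9", ("R8", "R1")),
]


def count_stalls(program, forwarding: bool) -> int:
    """Count bubbles needed to satisfy every RAW dependency."""
    ex_cycle = []      # cycle in which each instruction occupies EX
    last_writer = {}   # register name -> index of its most recent producer
    stalls = 0
    for i, ins in enumerate(program):
        # Without hazards, EX follows the previous instruction's EX by one cycle.
        earliest = ex_cycle[-1] + 1 if ex_cycle else 3  # IF=1, ID=2, EX=3
        ready = earliest
        for reg in ins.srcs:
            if reg not in last_writer:
                continue
            producer = last_writer[reg]
            if forwarding:
                # ALU result: next cycle. Load result: after MEM, one cycle later.
                delay = 2 if program[producer].is_load else 1
            else:
                # Consumer ID (EX - 1) must be no earlier than producer WB (EX + 2).
                delay = 3
            ready = max(ready, ex_cycle[producer] + delay)
        stalls += ready - earliest
        ex_cycle.append(ready)
        last_writer[ins.dest] = i
    return stalls


print("stalls with forwarding:   ", count_stalls(PROGRAM, forwarding=True))
print("stalls without forwarding:", count_stalls(PROGRAM, forwarding=False))
```

Under these assumptions the model reports 1 stall with forwarding and 8 without. A register file that cannot write and read the same register in one cycle would need 3 stalls per adjacent dependency instead of 2.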
Pipeline execution timeline with forwarding + 1 load-use stall:
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| I1 | IF | ID | EX | MEM | WB | | | | | |
| I2 | | IF | ID | **EX** | MEM | WB | | | | |
| I3 | | | IF | ID | **EX** | MEM | WB | | | |
| I4 | | | | IF | ID | **EX** | MEM | WB | | |
| I5 | | | | | IF | ID | stall | **EX** | MEM | WB |
Bold = EX stage that receives a forwarded operand; stall = the single bubble inserted by the load-use hazard.
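The length of the timeline above follows from a standard relation: a k-stage pipeline needs k cycles for the first instruction, one more cycle for each further instruction, plus one cycle per stall. A minimal sketch with illustrative function names:

```python
def total_cycles(stages: int, instructions: int, stalls: int) -> int:
    """Cycles to drain a pipeline: fill time + one per extra instruction + stalls."""
    return stages + (instructions - 1) + stalls


def steady_state_cpi(instructions: int, stalls: int) -> float:
    """CPI ignoring pipeline fill: ideal CPI of 1 plus stall cycles per instruction."""
    return 1.0 + stalls / instructions


cycles = total_cycles(stages=5, instructions=5, stalls=1)
print(f"total cycles: {cycles}")                       # 5 + 4 + 1
print(f"steady-state CPI: {steady_state_cpi(5, 1):.1f}")
```

For this five-instruction sequence the result is 10 cycles, and the single load-use stall raises the steady-state CPI from the ideal 1.0 to 1.2.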
Given: A 4-way set-associative cache with 32 KB (32,768 bytes) of total capacity, 64-byte lines, and 64-bit addresses.
Step 1: Number of cache lines total
Total lines = Cache size / Line size = 32,768 / 64 = 512 lines
Step 2: Number of sets
Sets = Total lines / Associativity = 512 / 4 = 128 sets
Step 3: Offset bits (bits needed to address one byte within a 64-byte line)
Offset bits = log₂(64) = 6 bits
Step 4: Index bits (bits needed to select one of 128 sets)
Index bits = log₂(128) = 7 bits
Step 5: Tag bits (remaining bits identify which memory block)
Tag bits = 64 - 7 - 6 = 51 bits
Summary Table:
| Parameter | Calculation | Result |
|---|---|---|
| Total cache lines | 32,768 / 64 | 512 lines |
| Number of sets | 512 / 4 | 128 sets |
| Offset bits | log₂(64) | 6 bits |
| Index bits | log₂(128) | 7 bits |
| Tag bits | 64 − 7 − 6 | 51 bits |
| Tag storage overhead | 512 lines × 51 bits | ~3.2 KB |
| Valid + dirty bits | 512 × 2 bits | 128 bytes |
Verification: 6 + 7 + 51 = 64 bits ✓
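The five steps generalize to any power-of-two cache configuration. The sketch below reproduces the worked example; `cache_geometry` is an illustrative helper and assumes the cache size, line size, and set count are all powers of two.

```python
def cache_geometry(cache_bytes: int, line_bytes: int, ways: int, address_bits: int) -> dict:
    """Derive line count, set count, and address field widths for a set-associative cache."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    # For a power of two n, n.bit_length() - 1 equals log2(n) exactly.
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    tag_bits = address_bits - index_bits - offset_bits
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "tag_storage_bytes": lines * tag_bits // 8,
        "valid_dirty_bytes": lines * 2 // 8,
    }


geometry = cache_geometry(cache_bytes=32 * 1024, line_bytes=64, ways=4, address_bits=64)
for field, value in geometry.items():
    print(f"{field}: {value}")
```

Changing `ways` to 8 halves the number of sets, which moves one bit from the index to the tag (6 index bits, 52 tag bits) while the offset stays at 6 bits.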
This course covered the complete picture of how modern processors work:
| Lesson | Core Concept | Key Number |
|---|---|---|
| What is Computer Architecture | von Neumann model, ISA abstraction | Harvard vs. von Neumann |
| CPU Pipeline Stages | 5-stage RISC pipeline (IF/ID/EX/MEM/WB) | Ideal CPI = 1.0 |
| Pipeline Hazards & Solutions | Structural, Data, Control hazards; forwarding; branch prediction | 95–99% branch prediction accuracy |
| Cache Memory & Mapping | Locality, hierarchy, direct-mapped/set-associative/fully associative mapping | 300× CPU-to-RAM speed gap |
| Virtual Memory & Paging | Pages, frames, page tables, TLB, page faults | 4-level page table on x86-64 |
| I/O Systems, Interrupts & DMA | Polling, interrupt-driven, DMA; IRQ, ISR, IDT | DMA: 0 CPU cycles for bulk transfer |
| Parallel Processing & Multicore | Flynn taxonomy, Amdahl's Law, SIMD, cache coherence | 10× max speedup with 10% serial code |
| Modern CPU Architectures | x86-64, ARM, RISC-V, process nodes, chiplets | 3nm = 292M transistors/mm² |
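The Amdahl's Law entry in the table can be checked numerically: with a serial fraction s and n cores, speedup is 1 / (s + (1 − s) / n), which approaches 1 / s as n grows. A minimal sketch; the core counts below are illustrative inputs, not measurements.

```python
def amdahl_speedup(serial_fraction: float, cores: float) -> float:
    """Speedup predicted by Amdahl's Law for a given serial fraction and core count."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


for cores in (8, 16, 24):
    print(f"{cores:>2} cores, 10% serial: {amdahl_speedup(0.10, cores):.2f}x")

# As the core count grows without bound, the speedup approaches 1 / 0.10.
print(f"limit: {amdahl_speedup(0.10, float('inf')):.1f}x")
```

Even 24 hardware threads reach only about 7.3× with 10% serial code, which is why the 10× ceiling matters more than the core count.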
The unifying insight: Every architectural decision in a CPU — pipeline depth, cache size, branch predictor complexity, core count, ISA choice — is a trade-off. Performance vs. power. Throughput vs. latency. Complexity vs. reliability. The engineer's job is to understand which trade-offs serve the target use case.
A smartphone processor and a server processor can both be "great CPUs" while making almost entirely opposite architectural choices. The measure of mastery is knowing why those choices differ — and this course has given you exactly that foundation.