Picture a professional chef in a restaurant kitchen. The stove is right in front of them — that's the CPU. Behind them, within arm's reach, is a prep counter with salt, olive oil, garlic, and fresh herbs — the ingredients used in every dish. That counter is the cache. In the walk-in fridge across the room, there's everything else — that's RAM. And in the dry storage pantry down the hall? That's the hard drive.

The chef doesn't walk to the pantry every time they need olive oil. They keep the most-used ingredients on the counter. When the counter gets full, they put back what they haven't touched in a while and bring out what the next recipe needs.

This is precisely how CPU cache works — and it's not just an analogy. It's the only reason modern software runs at bearable speeds.

The Speed Gap That Made Cache Necessary

Modern CPUs execute instructions in under a nanosecond. The fastest LPDDR5X RAM responds to a request in about 100 nanoseconds. That's a 300× speed gap.

If a CPU running at 3 GHz had to wait for RAM on every operation:

Memory access time: ~100ns
Clock cycle time: ~0.33ns
Cycles wasted per access: ~300 cycles

A simple loop that adds up an array would spend 99.9% of its time waiting for RAM. The CPU — capable of completing billions of operations per second — would idle almost constantly. This is called the memory wall.

Cache memory bridges that gap:

Memory Level	Capacity	Latency	Location
CPU Registers	~1 KB	0 cycles	On-die, inside core
L1 Cache	32–64 KB	1–4 cycles	On-die, per core
L2 Cache	256 KB – 1 MB	5–12 cycles	On-die, per core
L3 Cache	8–32 MB	20–40 cycles	On-die, shared
Main RAM (DDR5)	8–128 GB	~100–200 cycles	Off-die, DIMM slots
NVMe SSD	512 GB – 4 TB	~100,000 cycles	PCIe slot

The Principle of Locality: Why Cache Works

Cache would be useless if programs accessed memory randomly. Fortunately, programs exhibit two powerful patterns called locality of reference:

Temporal Locality: If you access a memory location now, you'll probably access it again soon.

Example: A loop counter is read and written every iteration

Spatial Locality: If you access a memory location, you'll probably access nearby locations soon.

Example: Iterating through an array accesses consecutive addresses

Because of these patterns, a small cache can satisfy the vast majority of memory requests. A well-tuned program with good cache behavior might achieve a 95–99% hit rate on L1 cache — meaning only 1–5% of accesses need to go to L2 or beyond.

Cache Lines: The Unit of Transfer

When the cache fetches data from RAM, it doesn't grab just the one byte you requested. It fetches an entire cache line — typically 64 bytes in modern x86 CPUs.

Why? Because of spatial locality. If you're reading byte 400, you'll probably need bytes 401, 402, ... 463 very soon. Fetching 64 bytes at once amortizes the memory latency over many future hits.

Every cache line is stored with:

The data (64 bytes)
A tag (which memory address this line corresponds to)
Status bits: valid, dirty (has been written), accessed

The Three Cache Mapping Strategies

The central design question of cache architecture: when a memory address maps to the cache, which cache line does it go into?

1. Direct Mapped Cache

Each memory address maps to exactly one cache line. The cache line index is determined by bits from the middle of the address.

Address breakdown: | Tag | Index | Offset |

Offset (6 bits for 64B cache line): which byte within the line
Index (determines which cache line): log₂(number of lines) bits
Tag (remaining bits): which memory block is stored in this line

Advantage: Simple, fast lookup — just compute the index, check the tag.

Disadvantage: Conflict misses. If two frequently accessed addresses map to the same cache line, they'll evict each other repeatedly — even if the cache has room. A program that alternates between two arrays whose addresses are multiples of the cache size can get 0% hit rate despite having a large cache.

2. Fully Associative Cache

Any memory block can be stored in any cache line. On a lookup, all lines are searched simultaneously (using parallel comparators).

Advantage: No conflict misses. The cache is always as efficient as it can be.

Disadvantage: Hardware complexity scales linearly with cache size. Checking all ~32,000 entries in a 32KB L1 cache simultaneously requires enormous silicon area. Only practical for small caches like the TLB (Translation Lookaside Buffer, typically 32–128 entries).

3. N-Way Set Associative Cache (The Standard)

The practical compromise. The cache is divided into sets, each containing N cache lines (called ways). A memory address maps to exactly one set (via index bits), but can go into any of the N lines within that set.

Address breakdown: | Tag | Set Index | Offset |

For a 4-way set associative, 32KB cache with 64B lines:

Number of lines = 32,768 / 64 = 512 lines
Number of sets = 512 / 4 = 128 sets
Set index bits = log₂(128) = 7 bits
Offset bits = log₂(64) = 6 bits
Tag bits = 64 − 7 − 6 = 51 bits

Modern CPUs use:

L1 cache: 4-way or 8-way set associative
L2 cache: 8-way or 16-way set associative
L3 cache: 16-way or 32-way set associative

Replacement Policies

When a set is full and a new line must be loaded, which existing line gets evicted?

LRU (Least Recently Used): Evict the line not accessed for the longest time. Best performance but expensive to track perfectly for N>4 ways.
PLRU (Pseudo-LRU): An approximation of LRU using tree bits. Used in most real CPUs.
FIFO (First In, First Out): Evict the line that's been in the cache the longest. Simple, but worse than LRU for looping patterns.
Random: Pick a random victim. Surprisingly competitive with LRU in some workloads and trivial to implement.

Cache Miss Types

When data isn't found in cache (a cache miss), the reason matters for optimization:

Compulsory miss (cold miss): First access ever to a line — unavoidable
Capacity miss: Cache is too small to hold the working set — fix with larger cache or smaller working set
Conflict miss: Two lines compete for the same set despite available space — fix with higher associativity

Write Policies

When the CPU writes data, what happens to the cache and RAM?

Write-Through: Every write updates both cache and RAM immediately.

Pros: RAM is always consistent; simple coherence
Cons: Every write generates a memory bus transaction — slow

Write-Back: Writes go only to cache. The line is marked dirty. RAM is updated only when the line is evicted.

Pros: Much faster — consecutive writes to the same line cause only one memory write
Cons: More complex; risk of data loss on power failure; cache coherence complications

Modern CPUs use write-back for L1/L2/L3 with write-combining buffers to batch writes for uncacheable regions (framebuffers, DMA regions).

Average Memory Access Time (AMAT)

The key performance metric for cache systems:

$$\text{AMAT} = \text{Hit Time} + (\text{Miss Rate} \times \text{Miss Penalty})$$

Example (2-level cache hierarchy):

L1 hit time = 4 cycles, L1 miss rate = 5%
L2 hit time = 12 cycles, L2 miss rate = 20%
RAM penalty = 200 cycles

AMAT = 4 + (0.05 × (12 + (0.20 × 200))) AMAT = 4 + (0.05 × (12 + 40)) AMAT = 4 + (0.05 × 52) AMAT = 6.6 cycles

Without L2: AMAT = 4 + (0.05 × 200) = 14 cycles — more than 2× worse.

Cache Coherence in Multicore CPUs: MESI Protocol

In a multicore CPU, each core has its own L1 and L2 cache. If Core 1 writes to a variable and Core 2 reads it, which value does Core 2 see?

The MESI protocol manages this by assigning each cache line one of four states:

Modified (M): This core has the only copy; it's been modified (dirty)
Exclusive (E): This core has the only copy; it's clean
Shared (S): Multiple cores have clean copies
Invalid (I): This line is not valid in this cache

When one core writes to a Shared line, it broadcasts an invalidation on the cache bus, forcing all other cores to discard their copies. This prevents stale reads.

Summary

Cache memory exists because of the 300× speed gap between CPU and RAM. It works because of temporal and spatial locality. The three mapping strategies — direct mapped, fully associative, and N-way set associative — represent a trade-off between hardware simplicity and miss rate, with set associative being the universal standard. The MESI protocol ensures multicore systems maintain cache coherence without requiring expensive write-through policies.

Getting cache behavior right is often the difference between a program that runs in 1 second and one that runs in 10 seconds.

💬 DiscussionPowered by GitHub Discussions

📱

Get this course's notes on Telegram!

Free cheat sheets, summaries & practice exercises

Get Notes Free →

35 minLesson 11 of 16

Course Contents(16 lessons)

▾

Chapter 1: Foundations

What Is Computer Architecture? Von Neumann vs Harvard20 min

Number Systems: Binary, Octal, Hexadecimal28 min

Data Representation: Integers, Floats, and IEEE 75430 min

Chapter 2: Digital Logic

Boolean Algebra and Logic Gates32 min

Combinational Circuits: Adders, Multiplexers, Decoders28 min

Sequential Circuits: Flip-Flops, Registers, Counters30 min

Chapter 3: CPU Architecture

ALU, Registers, and the Datapath32 min

Instruction Set Architecture: RISC vs CISC35 min

CPU Pipeline: The 5-Stage Execution Engine35 min

Pipeline Hazards and Modern Solutions30 min

Chapter 4: Memory Systems

Cache Memory: Mapping, Associativity, Replacement35 min

Virtual Memory, Page Tables, and TLB32 min

Chapter 5: I/O and Advanced Topics

I/O Systems, Interrupts, and DMA28 min

Parallel Processing: Multicore and Flynn's Taxonomy30 min

Modern CPU Architectures: ARM, x86-64, Apple Silicon28 min

Chapter 6: Final Project

Final Project: Analyze and Compare CPU Architectures45 min