AiTechWorlds
AiTechWorlds
Picture a professional chef in a restaurant kitchen. The stove is right in front of them — that's the CPU. Behind them, within arm's reach, is a prep counter with salt, olive oil, garlic, and fresh herbs — the ingredients used in every dish. That counter is the cache. In the walk-in fridge across the room, there's everything else — that's RAM. And in the dry storage pantry down the hall? That's the hard drive.
The chef doesn't walk to the pantry every time they need olive oil. They keep the most-used ingredients on the counter. When the counter gets full, they put back what they haven't touched in a while and bring out what the next recipe needs.
This is precisely how CPU cache works — and it's not just an analogy. It's the only reason modern software runs at bearable speeds.
Modern CPUs execute instructions in under a nanosecond. The fastest LPDDR5X RAM responds to a request in about 100 nanoseconds. That's a 300× speed gap.
If a CPU running at 3 GHz had to wait for RAM on every operation:
A simple loop that adds up an array would spend 99.9% of its time waiting for RAM. The CPU — capable of completing billions of operations per second — would idle almost constantly. This is called the memory wall.
Cache memory bridges that gap:
| Memory Level | Capacity | Latency | Location |
|---|---|---|---|
| CPU Registers | ~1 KB | 0 cycles | On-die, inside core |
| L1 Cache | 32–64 KB | 1–4 cycles | On-die, per core |
| L2 Cache | 256 KB – 1 MB | 5–12 cycles | On-die, per core |
| L3 Cache | 8–32 MB | 20–40 cycles | On-die, shared |
| Main RAM (DDR5) | 8–128 GB | ~100–200 cycles | Off-die, DIMM slots |
| NVMe SSD | 512 GB – 4 TB | ~100,000 cycles | PCIe slot |
Cache would be useless if programs accessed memory randomly. Fortunately, programs exhibit two powerful patterns called locality of reference:
Temporal Locality: If you access a memory location now, you'll probably access it again soon.
Spatial Locality: If you access a memory location, you'll probably access nearby locations soon.
Because of these patterns, a small cache can satisfy the vast majority of memory requests. A well-tuned program with good cache behavior might achieve a 95–99% hit rate on L1 cache — meaning only 1–5% of accesses need to go to L2 or beyond.
When the cache fetches data from RAM, it doesn't grab just the one byte you requested. It fetches an entire cache line — typically 64 bytes in modern x86 CPUs.
Why? Because of spatial locality. If you're reading byte 400, you'll probably need bytes 401, 402, ... 463 very soon. Fetching 64 bytes at once amortizes the memory latency over many future hits.
Every cache line is stored with:
The central design question of cache architecture: when a memory address maps to the cache, which cache line does it go into?
Each memory address maps to exactly one cache line. The cache line index is determined by bits from the middle of the address.
Address breakdown:
| Tag | Index | Offset |
Advantage: Simple, fast lookup — just compute the index, check the tag.
Disadvantage: Conflict misses. If two frequently accessed addresses map to the same cache line, they'll evict each other repeatedly — even if the cache has room. A program that alternates between two arrays whose addresses are multiples of the cache size can get 0% hit rate despite having a large cache.
Any memory block can be stored in any cache line. On a lookup, all lines are searched simultaneously (using parallel comparators).
Advantage: No conflict misses. The cache is always as efficient as it can be.
Disadvantage: Hardware complexity scales linearly with cache size. Checking all ~32,000 entries in a 32KB L1 cache simultaneously requires enormous silicon area. Only practical for small caches like the TLB (Translation Lookaside Buffer, typically 32–128 entries).
The practical compromise. The cache is divided into sets, each containing N cache lines (called ways). A memory address maps to exactly one set (via index bits), but can go into any of the N lines within that set.
Address breakdown:
| Tag | Set Index | Offset |
For a 4-way set associative, 32KB cache with 64B lines:
Modern CPUs use:
When a set is full and a new line must be loaded, which existing line gets evicted?
When data isn't found in cache (a cache miss), the reason matters for optimization:
When the CPU writes data, what happens to the cache and RAM?
Write-Through: Every write updates both cache and RAM immediately.
Write-Back: Writes go only to cache. The line is marked dirty. RAM is updated only when the line is evicted.
Modern CPUs use write-back for L1/L2/L3 with write-combining buffers to batch writes for uncacheable regions (framebuffers, DMA regions).
The key performance metric for cache systems:
$$\text{AMAT} = \text{Hit Time} + (\text{Miss Rate} \times \text{Miss Penalty})$$
Example (2-level cache hierarchy):
AMAT = 4 + (0.05 × (12 + (0.20 × 200))) AMAT = 4 + (0.05 × (12 + 40)) AMAT = 4 + (0.05 × 52) AMAT = 6.6 cycles
Without L2: AMAT = 4 + (0.05 × 200) = 14 cycles — more than 2× worse.
In a multicore CPU, each core has its own L1 and L2 cache. If Core 1 writes to a variable and Core 2 reads it, which value does Core 2 see?
The MESI protocol manages this by assigning each cache line one of four states:
When one core writes to a Shared line, it broadcasts an invalidation on the cache bus, forcing all other cores to discard their copies. This prevents stale reads.
Cache memory exists because of the 300× speed gap between CPU and RAM. It works because of temporal and spatial locality. The three mapping strategies — direct mapped, fully associative, and N-way set associative — represent a trade-off between hardware simplicity and miss rate, with set associative being the universal standard. The MESI protocol ensures multicore systems maintain cache coherence without requiring expensive write-through policies.
Getting cache behavior right is often the difference between a program that runs in 1 second and one that runs in 10 seconds.
Get this course's notes on Telegram!
Free cheat sheets, summaries & practice exercises