You add items to an online shopping cart. The browser recalculates the subtotal. At some point, two numbers — let's say $29.99 and $49.99 — need to be added together. That addition does not happen in the browser's JavaScript engine. It doesn't happen in the operating system. It doesn't even happen in the processor's control logic.

It happens in a circuit called the Arithmetic Logic Unit — a collection of adders, logical operators, and multiplexers packed into a small region of silicon, operating in under 1 nanosecond. Every calculation your computer has ever performed passed through this circuit. Every if-statement was a comparison here. Every loop counter was incremented here.

The ALU is the computational heart of the CPU. The registers are its working memory. Together, connected by a web of wires called the datapath, they execute every instruction you've ever run.

Why the Datapath Architecture Matters

A CPU's performance depends not just on what it can compute, but on how fast data moves between computation and storage.

Too few registers → data must be fetched from slow RAM constantly
Poor ALU design → common operations take many cycles
Inefficient datapath routing → functional units sit idle waiting for data

Understanding the ALU-register-datapath triad reveals why modern CPUs are designed the way they are — and why x86 and ARM make different design choices.

The ALU: What It Does

The Arithmetic Logic Unit is a combinational circuit (no clock, no memory) that performs:

Arithmetic Operations

ADD: A + B (using carry lookahead adder)
SUB: A − B (implemented as A + two's complement of B)
MUL: A × B (usually a separate multiplier circuit; simple designs iterate with shift-and-add over several cycles)
DIV: A ÷ B (separate divider circuit; the most complex operation, taking many cycles)
NEG: −A (two's complement negation)
INC / DEC: A+1, A−1 (fast increment/decrement)
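
To see why SUB needs no dedicated subtractor, here is a minimal Python sketch of fixed-width addition and subtraction. The 64-bit width and the helper names (`twos_complement`, `alu_add`, `alu_sub`, `to_signed`) are assumptions chosen for illustration; real hardware does this with gates, not function calls.

```python
WIDTH = 64                    # assumed register width in bits
MASK = (1 << WIDTH) - 1       # keeps every result inside WIDTH bits


def twos_complement(b: int) -> int:
    """Invert all bits of b, add 1, and wrap to WIDTH bits."""
    return (~b + 1) & MASK


def alu_add(a: int, b: int) -> int:
    """Fixed-width addition: the carry out of the top bit is dropped."""
    return (a + b) & MASK


def alu_sub(a: int, b: int) -> int:
    """A - B computed as A + (two's complement of B), reusing the adder."""
    return alu_add(a, twos_complement(b))


def to_signed(x: int) -> int:
    """Reinterpret a WIDTH-bit pattern as a signed integer."""
    return x - (1 << WIDTH) if x >> (WIDTH - 1) else x


cart_total_cents = alu_add(2999, 4999)     # the $29.99 + $49.99 from the intro, in cents
difference = to_signed(alu_sub(20, 50))    # a negative result produced by wraparound
```

The same adder handles both operations; subtraction only changes what is fed into its second input.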

Logical Operations

AND: A AND B (bitwise; used for masking)
OR: A OR B (bitwise; used for setting bits)
XOR: A XOR B (bitwise; used for toggling bits, parity)
NOT: NOT A (bitwise inversion)
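
The uses named above (masking, setting, toggling, parity) can be sketched in a few lines of Python. The permission-bit names below are hypothetical and exist only to give the bits a meaning.

```python
# Hypothetical one-bit fields, used only for illustration.
FLAG_READ = 0b0001
FLAG_WRITE = 0b0010
FLAG_EXEC = 0b0100

perms = 0b0000
perms |= FLAG_READ | FLAG_WRITE          # OR sets bits
can_write = (perms & FLAG_WRITE) != 0    # AND masks out a single bit to test it
perms ^= FLAG_EXEC                       # XOR toggles a bit
perms &= ~FLAG_WRITE                     # AND with NOT clears a bit


def has_even_parity(x: int) -> bool:
    """XOR all bits of a non-negative x together; a final 0 means an even count of 1-bits."""
    parity = 0
    while x:
        parity ^= x & 1
        x >>= 1
    return parity == 0
```

After these four steps `perms` holds read and execute but not write.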

Shift Operations

SHL (Shift Left Logical): A × 2 per shift position
SHR (Shift Right Logical): A ÷ 2 per shift position (fills with 0)
SAR (Shift Right Arithmetic): signed A ÷ 2 per shift position (copies the sign bit into the vacated positions, so negative results round toward negative infinity)
ROL / ROR: Rotate left/right (used in cryptography, e.g., SHA-2, ChaCha20)
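
The difference between the logical and arithmetic right shifts is easiest to see on a negative number. The sketch below assumes 64-bit registers; the function names simply mirror the mnemonics above and are not a real library API.

```python
WIDTH = 64                    # assumed register width in bits
MASK = (1 << WIDTH) - 1


def shl(a: int, n: int) -> int:
    """Shift left; bits pushed past the top are lost."""
    return (a << n) & MASK


def shr(a: int, n: int) -> int:
    """Logical shift right; vacated high bits are filled with 0."""
    return (a & MASK) >> n


def sar(a: int, n: int) -> int:
    """Arithmetic shift right; vacated high bits copy the sign bit."""
    a &= MASK
    if a >> (WIDTH - 1):      # sign bit set: reinterpret as a negative Python int
        a -= 1 << WIDTH
    return (a >> n) & MASK    # Python's >> on negative ints is arithmetic


def rol(a: int, n: int) -> int:
    """Rotate left; bits leaving the top re-enter at the bottom."""
    a &= MASK
    n %= WIDTH
    return ((a << n) | (a >> (WIDTH - n))) & MASK


def ror(a: int, n: int) -> int:
    """Rotate right, expressed as a left rotation by the complementary amount."""
    return rol(a, WIDTH - (n % WIDTH))


minus_eight = -8 & MASK       # 64-bit two's complement pattern for -8
```

On `minus_eight`, `sar` by one position gives the pattern for -4, while `shr` gives a large positive number because a 0 is shifted into the sign bit.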

How the ALU Selects Operations

The CPU's control unit sends a multi-bit control signal (called ALUOp or ALUControl) to the ALU. The ALU uses a multiplexer tree to route inputs through the selected operation and send the result to the output.

On a MIPS-style ALU, a 3-bit control selects from up to 8 operations. Modern x86-64 cores support far more: the front end decodes each instruction into micro-operations, which carry the control signals for the execution units.
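
A software analogy for the control signal is a lookup table keyed by the control code. The 3-bit encoding below is hypothetical (real designs assign their own codes), and where hardware computes every operation in parallel and lets the multiplexer pick one output, this sketch only evaluates the selected one.

```python
MASK = (1 << 64) - 1          # assumed 64-bit datapath

# Hypothetical 3-bit ALUOp encoding, for illustration only.
ALU_OPS = {
    0b000: lambda a, b: a & b,            # AND
    0b001: lambda a, b: a | b,            # OR
    0b010: lambda a, b: a + b,            # ADD
    0b011: lambda a, b: a ^ b,            # XOR
    0b100: lambda a, b: ~(a | b),         # NOR
    0b101: lambda a, b: a << (b & 63),    # SHL, shift amount taken modulo 64
    0b110: lambda a, b: a - b,            # SUB
    0b111: lambda a, b: int(a < b),       # SLT (unsigned compare in this sketch)
}


def alu(alu_op: int, a: int, b: int) -> int:
    """Select one operation by its control code and wrap the result to 64 bits."""
    return ALU_OPS[alu_op](a & MASK, b & MASK) & MASK
```

For example, `alu(0b010, 2999, 4999)` performs the addition and `alu(0b110, 5, 5)` the subtraction, using the same inputs and output path.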

ALU Status Flags

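Arithmetic and logic instructions record flag bits (also called condition codes) in a flags register: RFLAGS on x86-64, NZCV on AArch64 (CPSR on 32-bit ARM). On x86 most ALU instructions update the flags; on ARM only flag-setting forms such as ADDS, SUBS, and CMP do. Branch instructions then read these flags.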

Flag	Name	Meaning	Set When
Z	Zero	Result is zero	`5 - 5 = 0`
N (or S)	Negative / Sign	Result is negative	MSB of result = 1
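C	Carry	Unsigned overflow	Carry out of the MSB on addition (on x86, a borrow on subtraction)
V (or O)	Overflow	Signed overflow	Result does not fit the signed range (e.g., positive + positive gives a negative result)
P	Parity	Even parity	Low 8 bits of the result contain an even number of 1-bits (x86 only, mostly legacy code)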

Example: After CMP RAX, RBX (which computes RAX − RBX and discards the result):

If RAX = RBX → Z=1, used by JE (jump if equal)
If RAX < RBX (signed) → N ≠ V, used by JL (jump if less)
If RAX < RBX (unsigned) → C=1, used by JB (jump if below)
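
The three branch conditions above can be checked with a small Python sketch that derives the flags from a 64-bit subtraction. It follows the x86 convention in which C is set on a borrow; the function names are illustrative, not part of any real toolchain.

```python
WIDTH = 64                    # assumed register width in bits
MASK = (1 << WIDTH) - 1
SIGN = 1 << (WIDTH - 1)       # position of the most significant (sign) bit


def cmp_flags(a: int, b: int) -> dict:
    """Flags produced by computing a - b on WIDTH-bit values (x86 convention)."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "Z": int(result == 0),
        "N": int(bool(result & SIGN)),
        "C": int(a < b),                                  # borrow out of the top bit
        # Signed overflow: operands differ in sign and the result's sign differs from a's.
        "V": int(bool((a ^ b) & (a ^ result) & SIGN)),
    }


def je(flags: dict) -> bool:
    return flags["Z"] == 1            # jump if equal


def jl(flags: dict) -> bool:
    return flags["N"] != flags["V"]   # jump if less (signed)


def jb(flags: dict) -> bool:
    return flags["C"] == 1            # jump if below (unsigned)


minus_one = -1 & MASK                 # all ones: -1 if signed, the largest value if unsigned
```

Comparing `minus_one` with 1 shows why two different jumps exist: `jl` is taken because -1 < 1 as signed values, while `jb` is not taken because the all-ones pattern is the largest unsigned value.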

CPU Registers

Registers are the fastest storage in the entire computer: built directly on the CPU die (conceptually from D flip-flops) and readable within a single clock cycle. That is several times faster than L1 cache, which typically takes around 4 to 5 cycles, and orders of magnitude faster than RAM.

General-Purpose Registers (x86-64)

Intel/AMD x86-64 has 16 general-purpose 64-bit registers:

Register	Historical Name	Common Use
RAX	Accumulator	Function return values, multiplication
RBX	Base	General use, sometimes base pointer
RCX	Counter	Loop counters, shift amounts
RDX	Data	I/O operations, division remainder
RSI	Source Index	String/memory source pointer
RDI	Destination Index	String/memory destination pointer; first function argument (Linux)
RSP	Stack Pointer	Top of current stack frame
RBP	Base Pointer	Stack frame base
R8–R15	Extended	Additional general-purpose

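Each 64-bit register can also be accessed through narrower sub-registers for backward compatibility: for RAX these are the low 32 bits (EAX), low 16 bits (AX), and low 8 bits (AL). Only RAX, RBX, RCX, and RDX additionally expose bits 8–15 as a high-byte register (AH, BH, CH, DH).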
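
The sub-registers are views onto the same bits, which a few masks and shifts make concrete. The value loaded into `rax` below is arbitrary, and the lowercase variable names are only a Python stand-in for the register names.

```python
rax = 0x1122334455667788      # arbitrary 64-bit value for illustration

eax = rax & 0xFFFFFFFF        # low 32 bits
ax = rax & 0xFFFF             # low 16 bits
al = rax & 0xFF               # low 8 bits
ah = (rax >> 8) & 0xFF        # bits 8-15, the high byte of AX
```

Reading `eax`, `ax`, `al`, and `ah` here yields 0x55667788, 0x7788, 0x88, and 0x77 respectively.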

Special-Purpose Registers

Register	x86-64 Name	ARM Name	Purpose	Size
Program Counter	RIP	PC	Address of next instruction	64-bit
Stack Pointer	RSP	SP	Top of stack	64-bit
Flags / Status	RFLAGS	CPSR (AArch32) / NZCV (AArch64)	Condition codes, mode bits	64-bit (x86-64)
Instruction Register	(internal)	(internal)	Currently executing instruction	Variable

ARM vs. x86 Register Philosophy

ARM (used in Apple Silicon, Android phones, Raspberry Pi) takes a different approach:

31 general-purpose 64-bit registers (X0–X30) vs. x86's 16
More registers = less spilling to memory = faster code
RISC design philosophy: simple, uniform, plenty of registers
These counts are architectural, so every AArch64 chip shares them: Apple M3 Pro, for example, exposes 31 GP registers + 32 floating-point/SIMD registers (V0–V31)

The Datapath: How Data Flows

The datapath is the collection of registers, functional units (ALU, multiplier, FPU), multiplexers, and buses that move data during instruction execution.

A simplified MIPS-style single-cycle datapath executes an instruction like ADD R1, R2, R3 (R1 = R2 + R3) in these steps:

Instruction Fetch (IF): PC → Instruction Memory → Instruction Register
Instruction Decode (ID): Decode opcode, read R2 and R3 from Register File
Execute (EX): Send R2 and R3 to ALU with ALUOp = ADD
Memory Access (MEM): (no memory access for register-to-register operations)
Write Back (WB): Write ALU result back to R1 in Register File; PC = PC + 4
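
The five steps above can be traced in a toy simulator. This is a minimal sketch under several assumptions: the `Instruction` and `Datapath` classes are hypothetical, instructions are stored as Python objects rather than encoded bits, and only register-to-register ADD and SUB are supported.

```python
from dataclasses import dataclass, field

MASK = (1 << 64) - 1          # assumed 64-bit registers


@dataclass
class Instruction:
    op: str                   # "ADD" or "SUB" in this sketch
    rd: int                   # destination register number
    rs: int                   # first source register number
    rt: int                   # second source register number


@dataclass
class Datapath:
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)
    imem: dict = field(default_factory=dict)    # address -> Instruction

    def step(self) -> None:
        instr = self.imem[self.pc]                           # IF: fetch at PC
        a, b = self.regs[instr.rs], self.regs[instr.rt]      # ID: read the register file
        if instr.op == "ADD":                                # EX: ALU operation
            result = (a + b) & MASK
        elif instr.op == "SUB":
            result = (a - b) & MASK
        else:
            raise ValueError(f"unsupported op: {instr.op}")
        # MEM: nothing to do for register-to-register instructions
        self.regs[instr.rd] = result                         # WB: write the result back
        self.pc += 4                                         # advance to the next instruction


dp = Datapath()
dp.regs[2], dp.regs[3] = 2999, 4999
dp.imem[0] = Instruction("ADD", rd=1, rs=2, rt=3)    # ADD R1, R2, R3
dp.step()
```

After the single `step()`, R1 holds the sum of R2 and R3 and the PC has moved on by 4 bytes.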

Single-Cycle vs. Multi-Cycle Datapath

Single-Cycle

Every instruction completes in exactly one clock cycle. The clock period must be long enough for the slowest instruction (typically LOAD, which must complete fetch → decode → ALU → memory → writeback).

Pro: Simple design, easy to reason about
Con: Fast instructions (ADD) waste time, because every cycle is as long as the slowest instruction (LOAD) needs
Used in: Simple embedded processors, educational MIPS implementations

Multi-Cycle

Different instructions take different numbers of cycles. A simple ADD takes 4 cycles; a LOAD takes 5 cycles; a MUL might take 7 cycles.

Pro: Clock period set by the slowest stage, not the slowest instruction
Con: Control logic is more complex
Used in: Early microprocessors, still common in low-power embedded designs
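
A quick calculation shows how the trade-off plays out. The cycle counts below come from the text above; the instruction mix and both clock periods are made-up numbers chosen only to make the arithmetic easy to follow.

```python
# Cycles per instruction type in the multi-cycle design (from the text above).
CYCLES = {"ADD": 4, "LOAD": 5, "MUL": 7}

# Hypothetical program: how many instructions of each type it executes.
mix = {"ADD": 60, "LOAD": 30, "MUL": 10}

# Hypothetical clock periods in picoseconds.
SINGLE_CYCLE_PERIOD_PS = 1000     # must fit the slowest whole instruction
MULTI_CYCLE_PERIOD_PS = 200       # must fit only the slowest step


def total_cycles(program_mix: dict) -> int:
    """Total cycles the multi-cycle design spends on the program."""
    return sum(CYCLES[op] * count for op, count in program_mix.items())


def average_cpi(program_mix: dict) -> float:
    """Average cycles per instruction for the program."""
    return total_cycles(program_mix) / sum(program_mix.values())


instructions = sum(mix.values())
single_cycle_time_ps = instructions * SINGLE_CYCLE_PERIOD_PS
multi_cycle_time_ps = total_cycles(mix) * MULTI_CYCLE_PERIOD_PS
```

With these assumed numbers the multi-cycle design averages 4.6 cycles per instruction yet still finishes sooner, because each of its cycles is much shorter. A different mix or different periods could reverse the result.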

Pipelined (Preview)

Modern CPUs overlap multiple instructions using a pipeline — while one instruction is in the Execute stage, the next is in the Decode stage, and the one after that is being Fetched. This dramatically improves throughput.

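Intel's Core pipelines run roughly 14 to 20+ stages deep. This is why a 5 GHz CPU can start a new instruction every 200 picoseconds (one clock cycle) even though each individual instruction takes a few nanoseconds to pass through all the stages. Superscalar cores go further and start several instructions per cycle.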
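
The gap between latency and throughput follows from simple arithmetic. The sketch below assumes an idealized pipeline with no stalls that starts one instruction per cycle; the function names are hypothetical.

```python
def cycle_time_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def pipelined_cycles(n_instructions: int, stages: int) -> int:
    """Ideal pipeline: the first instruction takes `stages` cycles to finish,
    then one more instruction completes every cycle after that."""
    return stages + n_instructions - 1


period = cycle_time_ps(5)                    # 5 GHz clock
latency_ps = 14 * period                     # one instruction through 14 stages
cycles_for_1000 = pipelined_cycles(1000, 14)
```

One instruction alone needs 14 cycles (2.8 ns at this clock), but a thousand of them need only 1013 cycles in total, so the average cost per instruction approaches a single 200-picosecond cycle.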

ALU and Register File Summary Table

Register Type	x86-64 Name	ARM (AArch64) Name	Purpose	Size
General Purpose	RAX–RDI, R8–R15	X0–X30	Integer computation, arguments	64-bit
Stack Pointer	RSP	SP	Stack management	64-bit
Frame Pointer	RBP	X29 (FP)	Stack frame base	64-bit
Program Counter	RIP	PC	Next instruction address	64-bit
Flags / Status	RFLAGS	NZCV	Condition codes	64-bit / 32-bit
FP / SIMD	XMM/YMM/ZMM0–15 (0–31 with AVX-512)	V0–V31 (128-bit NEON)	Floating-point and vector	128–512-bit
Link Register	(return address on stack)	X30 (LR)	Stores function return address	64-bit

Key Takeaways

The ALU is a combinational circuit performing all arithmetic (ADD, SUB, MUL, DIV) and logic (AND, OR, XOR, NOT, shifts)
The ALU produces status flags (Z, N, C, V); flag-setting instructions record them, and conditional branches read them
Registers are the fastest storage on the CPU — x86-64 has 16 GP 64-bit registers; ARM has 31
The Program Counter (RIP/PC) tracks the next instruction; the Stack Pointer (RSP/SP) tracks the top of the call stack; RFLAGS/NZCV holds condition codes
The datapath connects registers, ALU, and memory through multiplexers and buses; data flows: Register File → ALU → (optional Memory) → Register File
Single-cycle datapaths are simple but wasteful; multi-cycle and pipelined designs improve efficiency
Modern CPUs (Intel Core, Apple M-series) are deeply pipelined — multiple instructions execute simultaneously in different pipeline stages

