AiTechWorlds
You add items to an online shopping cart. The browser recalculates the subtotal. At some point, two numbers — let's say $29.99 and $49.99 — need to be added together. That addition does not happen in the browser's JavaScript engine. It doesn't happen in the operating system. It doesn't even happen in the processor's control logic.
It happens in a circuit called the Arithmetic Logic Unit (ALU): a collection of adders, logic gates, and multiplexers packed into a small region of silicon, producing a result in under 1 nanosecond. Every integer calculation your computer has ever performed passed through a circuit like this. Every if-statement was a comparison here. Every loop counter was incremented here.
The ALU is the computational heart of the CPU. The registers are its working memory. Together, connected by a web of wires called the datapath, they execute every instruction you've ever run.
A CPU's performance depends not just on what it can compute, but on how fast data moves between computation and storage.
Understanding the ALU-register-datapath triad reveals why modern CPUs are designed the way they are — and why x86 and ARM make different design choices.
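The Arithmetic Logic Unit is a combinational circuit (no clock, no memory) that performs arithmetic operations (addition, subtraction), bitwise logical operations (AND, OR, XOR, NOT), comparisons, and, in many designs, shifts.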
The CPU's control unit sends a multi-bit control signal (called ALUOp or ALUControl) to the ALU. The ALU uses a multiplexer tree to route inputs through the selected operation and send the result to the output.
On a MIPS-style ALU, a 3-bit control signal selects one of up to 8 operations. Modern x86-64 cores first translate instructions into micro-operations, which then select among far more ALU functions.
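The control-signal idea can be sketched in a few lines of Python. This is only an illustration: the 3-bit encodings, the 32-bit width, and the names `ALU_OPS`, `to_signed`, and `alu` are assumptions made for this sketch, not the encoding of any real processor. The dictionary lookup plays the role of the multiplexer tree that picks one result.

```python
MASK = 0xFFFFFFFF  # assume a 32-bit datapath for this sketch


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x


# Hypothetical 3-bit ALUOp encoding; real designs assign their own codes.
ALU_OPS = {
    0b000: lambda a, b: a & b,                               # AND
    0b001: lambda a, b: a | b,                               # OR
    0b010: lambda a, b: a + b,                               # ADD
    0b011: lambda a, b: a ^ b,                               # XOR
    0b100: lambda a, b: ~(a | b),                            # NOR
    0b101: lambda a, b: a << (b & 31),                       # shift left
    0b110: lambda a, b: a - b,                               # SUB
    0b111: lambda a, b: int(to_signed(a) < to_signed(b)),    # set-on-less-than
}


def alu(alu_op, a, b):
    """Select one operation with the control code and truncate to 32 bits."""
    return ALU_OPS[alu_op](a & MASK, b & MASK) & MASK
```

For example, `alu(0b010, 2, 3)` selects the adder, while `alu(0b110, 0, 1)` selects subtraction and wraps around to the all-ones pattern, as fixed-width hardware would.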
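After an ALU operation, flag bits (also called condition codes) are updated in the flags register (RFLAGS on x86-64; CPSR on 32-bit ARM, NZCV on AArch64). On x86, most arithmetic and logic instructions update the flags; on ARM, only flag-setting instructions such as ADDS, SUBS, and CMP do. These flags are read by conditional branch instructions.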
| Flag | Name | Meaning | Set When |
|---|---|---|---|
| Z | Zero | Result is zero | 5 - 5 = 0 |
| N (or S) | Negative / Sign | Result is negative | MSB of result = 1 |
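| C | Carry | Unsigned overflow | Addition produces a carry out of the MSB (on x86, also a borrow on subtraction) |
| V (or O) | Overflow | Signed overflow | Operands of an addition share a sign but the result's sign differs |
| P | Parity | Low byte of result has an even number of 1-bits | x86 only; kept mainly for legacy code |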
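
The flag definitions in the table can be made concrete with a small Python model. This is a sketch under stated assumptions: 64-bit operands, the x86 convention that C means "borrow" after a subtraction (ARM uses the inverted sense), and hypothetical helper names (`add_flags`, `cmp_flags`, `je`, `jl`, `jb`) chosen for illustration.

```python
BITS = 64
MASK = (1 << BITS) - 1
SIGN = 1 << (BITS - 1)


def _common_flags(result):
    """Flags that depend only on the result bits."""
    return {
        "Z": result == 0,                              # result is zero
        "N": bool(result & SIGN),                      # MSB of result is 1
        "P": bin(result & 0xFF).count("1") % 2 == 0,   # even parity, low byte
    }


def add_flags(a, b):
    """Return (result, flags) for a 64-bit addition."""
    a, b = a & MASK, b & MASK
    full = a + b
    result = full & MASK
    flags = _common_flags(result)
    flags["C"] = full > MASK                           # carry out of bit 63
    # Signed overflow: operands share a sign, result's sign differs.
    flags["V"] = bool(~(a ^ b) & (a ^ result) & SIGN)
    return result, flags


def cmp_flags(a, b):
    """Flags after computing a - b and discarding the result (like CMP)."""
    a, b = a & MASK, b & MASK
    result = (a - b) & MASK
    flags = _common_flags(result)
    flags["C"] = a < b                                 # x86 convention: borrow
    # Signed overflow: operands differ in sign, result's sign differs from a.
    flags["V"] = bool((a ^ b) & (a ^ result) & SIGN)
    return flags


def je(f):
    return f["Z"]                  # equal: result was zero


def jl(f):
    return f["N"] != f["V"]        # signed less-than


def jb(f):
    return f["C"]                  # unsigned below
```

Comparing the bit pattern for -1 (all ones) with 1 shows why signed and unsigned branches differ: `jl` is true because -1 < 1 as signed values, but `jb` is false because the all-ones pattern is the largest unsigned value.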
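Example: After CMP RAX, RBX (which computes RAX − RBX, updates the flags, and discards the result):
- JE (jump if equal) is taken when Z = 1.
- JL (jump if less, signed comparison) is taken when N ≠ V.
- JB (jump if below, unsigned comparison) is taken when C = 1.

Registers are the fastest storage in the entire computer. They are built from D flip-flops directly inside the CPU core and can be read within a single clock cycle. That makes them several times faster than even L1 cache, which typically takes around 4 to 5 cycles to access.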
Intel/AMD x86-64 has 16 general-purpose 64-bit registers:
| Register | Historical Name | Common Use |
|---|---|---|
| RAX | Accumulator | Function return values, multiplication |
| RBX | Base | General use, sometimes base pointer |
| RCX | Counter | Loop counters, shift amounts |
| RDX | Data | I/O operations, division remainder |
| RSI | Source Index | String/memory source pointer |
| RDI | Destination Index | String/memory destination pointer; first function argument (Linux) |
| RSP | Stack Pointer | Top of the stack |
| RBP | Base Pointer | Stack frame base |
| R8–R15 | Extended | Additional general-purpose |
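Each 64-bit register can also be accessed through narrower sub-registers for backward compatibility. For RAX these are the 32-bit EAX, the 16-bit AX, and the 8-bit AL (bits 0–7) and AH (bits 8–15). Only RAX, RBX, RCX, and RDX have a high-byte form such as AH; R8–R15 use names such as R8D, R8W, and R8B.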
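The sub-register views are simply bit fields of the same 64-bit value, which a few masks can illustrate. The function names below are hypothetical. The sketch also models one x86-64 rule worth knowing: writing a 32-bit sub-register such as EAX zeroes the upper 32 bits of the full register, whereas writing a 16-bit sub-register such as AX leaves the upper bits unchanged.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def sub_registers(rax):
    """Return the narrower views of a 64-bit RAX value."""
    rax &= MASK64
    return {
        "RAX": rax,
        "EAX": rax & 0xFFFFFFFF,     # low 32 bits
        "AX": rax & 0xFFFF,          # low 16 bits
        "AL": rax & 0xFF,            # bits 0-7
        "AH": (rax >> 8) & 0xFF,     # bits 8-15
    }


def write_eax(rax, value):
    """A 32-bit write zero-extends: the old upper 32 bits are discarded."""
    return value & 0xFFFFFFFF


def write_ax(rax, value):
    """A 16-bit write replaces only the low 16 bits."""
    return (rax & MASK64 & ~0xFFFF) | (value & 0xFFFF)
```

With `rax = 0x1122334455667788`, the AH view is `0x77` and the AL view is `0x88`; `write_ax(rax, 0xABCD)` keeps the upper 48 bits, while `write_eax(rax, 0xDEADBEEF)` clears the upper 32.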
| Register | x86-64 Name | ARM Name | Purpose | Size |
|---|---|---|---|---|
| Program Counter | RIP | PC | Address of next instruction | 64-bit |
| Stack Pointer | RSP | SP | Top of stack | 64-bit |
| Flags / Status | RFLAGS | CPSR/NZCV | Condition codes, mode bits | 64-bit |
| Instruction Register | (internal) | (internal) | Currently executing instruction | Variable |
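ARM (used in Apple Silicon, Android phones, Raspberry Pi) takes a different approach: 64-bit ARM (AArch64) provides 31 general-purpose registers (X0–X30), keeps a function's return address in a link register (X30) rather than pushing it on the stack, and is a load/store architecture in which ALU instructions operate only on registers.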
The datapath is the collection of registers, functional units (ALU, multiplier, FPU), multiplexers, and buses that move data during instruction execution.
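A simplified MIPS-style single-cycle datapath executes an instruction like ADD R1, R2, R3 (R1 = R2 + R3) in these steps:
1. Fetch: read the instruction from instruction memory at the address held in the PC.
2. Decode: the control unit decodes the instruction, and the register file reads R2 and R3.
3. Execute: the ALU adds the two operand values.
4. Writeback: the result is written to R1, and the PC advances to the next instruction.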
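
The path an ADD takes through the datapath can be sketched as a tiny simulator. Everything here is a simplification for illustration: the `Instruction` and `SingleCycleCPU` classes are hypothetical, instructions are stored pre-decoded rather than as binary words, registers are 32 bits wide, and only ADD and SUB are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    """Hypothetical pre-decoded R-type instruction: rd = rs op rt."""
    op: str
    rd: int
    rs: int
    rt: int


@dataclass
class SingleCycleCPU:
    program: list
    regs: list = field(default_factory=lambda: [0] * 32)
    pc: int = 0

    def step(self):
        """Execute one instruction; in hardware this is one clock cycle."""
        # Fetch: instructions are 4 bytes apart, so index by pc // 4.
        instr = self.program[self.pc // 4]
        # Decode / register read: both source operands are read together.
        a = self.regs[instr.rs]
        b = self.regs[instr.rt]
        # Execute: the ALU performs the selected operation.
        if instr.op == "ADD":
            result = (a + b) & 0xFFFFFFFF
        elif instr.op == "SUB":
            result = (a - b) & 0xFFFFFFFF
        else:
            raise ValueError(f"unsupported op: {instr.op}")
        # Writeback: store the result (register 0 is hardwired to 0 in MIPS).
        if instr.rd != 0:
            self.regs[instr.rd] = result
        # Advance the program counter to the next instruction.
        self.pc += 4
```

A usage example: load 7 and 35 into R2 and R3, run `ADD R1, R2, R3`, and R1 holds their sum while the PC has advanced by 4.

```python
cpu = SingleCycleCPU(program=[Instruction("ADD", rd=1, rs=2, rt=3)])
cpu.regs[2], cpu.regs[3] = 7, 35
cpu.step()
```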
In a single-cycle design, every instruction completes in exactly one clock cycle. The clock period must therefore be long enough for the slowest instruction (typically LOAD, which must complete fetch → decode → ALU → memory → writeback), so faster instructions waste part of every cycle.
In a multi-cycle design, the clock period covers only one step, and different instructions take different numbers of cycles. A simple ADD takes 4 cycles; a LOAD takes 5 cycles; a MUL might take 7 cycles.
Modern CPUs overlap multiple instructions using a pipeline — while one instruction is in the Execute stage, the next is in the Decode stage, and the one after that is being Fetched. This dramatically improves throughput.
Intel's Core pipelines are roughly 14–20+ stages deep. This is why a 5 GHz CPU can start a new instruction every clock cycle (every 200 picoseconds) even though each individual instruction takes nanoseconds to pass through all the stages: 14 stages × 200 ps is 2.8 ns.
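The arithmetic behind that claim can be checked with a short calculation. This sketch assumes an idealized pipeline with no stalls, hazards, or branch mispredictions and one instruction issued per cycle; the function names are hypothetical.

```python
def clock_period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def unpipelined_cycles(n_instructions, stages):
    """Each instruction runs through every stage before the next starts."""
    return n_instructions * stages


def pipelined_cycles(n_instructions, stages):
    """Ideal pipeline: fill once, then finish one instruction per cycle."""
    return stages + n_instructions - 1


period = clock_period_ps(5.0)      # 200.0 ps at 5 GHz
latency_ps = 14 * period           # one instruction through 14 stages
speedup = unpipelined_cycles(1000, 14) / pipelined_cycles(1000, 14)
```

For 1000 instructions on a 14-stage pipeline, the ideal pipelined run takes 1013 cycles instead of 14000, so throughput approaches one instruction per cycle even though each instruction's latency is still 14 cycles. Real pipelines fall short of this ideal whenever they stall.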
| Register Type | x86-64 Name | ARM (AArch64) Name | Purpose | Size |
|---|---|---|---|---|
| General Purpose | RAX–RDI, R8–R15 | X0–X30 | Integer computation, arguments | 64-bit |
| Stack Pointer | RSP | SP | Stack management | 64-bit |
| Frame Pointer | RBP | X29 (FP) | Stack frame base | 64-bit |
| Program Counter | RIP | PC | Next instruction address | 64-bit |
| Flags / Status | RFLAGS | NZCV | Condition codes | 64-bit / 32-bit |
| FP / SIMD | XMM0–XMM15 (128-bit); YMM (256-bit, AVX) and ZMM (512-bit, AVX-512, up to 32 registers) | V0–V31 (128-bit NEON) | Floating-point and vector | 128–512-bit |
| Link Register | (return address on stack) | X30 (LR) | Stores function return address | 64-bit |