Right now, your keyboard is electrically idle. The browser rendering your screen, the music player streaming audio, the antivirus scanning files in the background — none of them are thinking about your keyboard. They're doing their own work.

Then you press a key. Within microseconds, the CPU pauses what it is doing, saves the state it needs to resume later, runs the keyboard handler, and picks up exactly where it left off. The OS then delivers the character to the correct application, and no application ever had to keep asking "did the user press a key?"

This mechanism — the interrupt — is one of the most elegant inventions in computer architecture. Without it, every program would have to constantly poll every device, wasting CPU cycles on nothing. With it, the CPU can work productively until a device actually needs attention.

Understanding I/O architecture means understanding how a CPU that can execute billions of instructions per second efficiently manages hundreds of slow, asynchronous peripheral devices.

I/O Device Categories

Modern computers connect to an enormous variety of peripheral devices, grouped into three categories:

Character devices: transmit data one byte at a time — keyboard, mouse, serial ports, audio. Low bandwidth, latency-sensitive.
Block devices: transfer data in fixed-size blocks — HDD, SSD, USB drives. High bandwidth, random access.
Network devices: transmit variable-length packets — Ethernet, Wi-Fi, Bluetooth. Burst traffic, requires buffering.

Each category has different requirements that shape which I/O mechanism is most appropriate.

Three Mechanisms for I/O

Mechanism 1: Programmed I/O (Polling)

The CPU actively checks whether a device is ready by repeatedly reading a status register:

loop:
    read device status register
    if NOT_READY: goto loop
    read/write data register
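
The loop above can be simulated in Python to make the cost visible. This is a minimal sketch, not driver code: `FakeDevice`, its `read_status` and `read_data` methods, and the `ready_after` parameter are hypothetical stand-ins for a real status register and data register.

```python
class FakeDevice:
    """Hypothetical device whose status register reports READY after N reads."""
    READY = 0x01

    def __init__(self, ready_after, payload):
        self._reads_left = ready_after
        self._payload = payload

    def read_status(self):
        # Each call models one read of the status register.
        if self._reads_left > 0:
            self._reads_left -= 1
            return 0x00
        return self.READY

    def read_data(self):
        return self._payload


def poll_read(device):
    """Busy-wait on the status register, then read the data register."""
    wasted_polls = 0
    while not (device.read_status() & FakeDevice.READY):
        wasted_polls += 1          # the CPU did nothing useful this iteration
    return device.read_data(), wasted_polls


data, wasted = poll_read(FakeDevice(ready_after=1000, payload=0x41))
print(f"data={data:#x}, wasted polls={wasted}")
```

Every iteration counted in `wasted_polls` is time the CPU could have spent on other work.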

Real-world analogy: Calling a restaurant every 5 minutes to ask if your table is ready.

Pros: Simple to implement; predictable timing; no hardware interrupt circuitry needed.

Cons: Wastes CPU cycles when the device is slow. A 3 GHz CPU polling a printer that takes 30 seconds to process a page burns roughly 90 billion clock cycles (30 s × 3 × 10⁹ cycles/s) doing nothing but checking. Poorly suited to multitasking systems, where those cycles could run other work.
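
The arithmetic behind that figure is easy to check. The sketch below assumes a 3 GHz clock, which is what a 90-billion-cycle result for a 30-second wait implies; `wasted_cycles` is a hypothetical helper name.

```python
def wasted_cycles(wait_seconds, clock_hz):
    """Clock cycles that elapse while the CPU does nothing but poll."""
    return int(wait_seconds * clock_hz)


CLOCK_HZ = 3_000_000_000           # assumed 3 GHz clock
cycles = wasted_cycles(30, CLOCK_HZ)
print(f"{cycles:,} cycles spent polling")
```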

When used today: In extremely latency-sensitive scenarios where interrupt overhead is unacceptable, like high-frequency trading network stacks (kernel-bypass networking with DPDK), or in simple embedded microcontrollers where context switching doesn't exist.

Mechanism 2: Interrupt-Driven I/O

The device signals the CPU using a dedicated hardware line, the Interrupt Request (IRQ). The CPU works on other tasks and responds only when the interrupt fires.

The hardware mechanism in detail:

IRQ Lines: Physical wires connecting devices to the interrupt controller (Intel 8259A historically; APIC — Advanced Programmable Interrupt Controller — in modern systems)
IDT (Interrupt Descriptor Table): a 256-entry table in memory, each entry pointing to an ISR for that interrupt vector number. Set up by the OS during boot.
ISR (Interrupt Service Routine): kernel code that handles the specific device interrupt
IRET instruction: returns from the interrupt, restoring the saved instruction pointer, flags, and stack pointer so the interrupted code continues as if nothing had happened
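
How these pieces cooperate can be sketched with a toy model. This is an illustration under simplifying assumptions, not an x86 emulator: `ToyCPU`, `set_handler`, `keyboard_isr`, and the vector number `0x21` are all hypothetical names and values chosen for the example.

```python
class ToyCPU:
    """Toy model of vectored interrupt dispatch through a 256-entry table."""
    NUM_VECTORS = 256

    def __init__(self):
        self.idt = [None] * self.NUM_VECTORS   # vector number -> handler (ISR)
        self.pc = 0                            # program counter
        self.flags = 0x202                     # arbitrary flags value
        self.stack = []
        self.log = []

    def set_handler(self, vector, isr):
        self.idt[vector] = isr                 # what the OS does during boot

    def interrupt(self, vector):
        isr = self.idt[vector]
        if isr is None:
            raise RuntimeError(f"no handler installed for vector {vector}")
        # Hardware side: save the return state, then jump to the ISR.
        self.stack.append((self.pc, self.flags))
        isr(self)
        # IRET: pop the saved state so the interrupted code resumes.
        self.pc, self.flags = self.stack.pop()


def keyboard_isr(cpu):
    cpu.pc = 0xFFFF0000                        # the ISR executes elsewhere
    cpu.log.append("key handled")


KEYBOARD_VECTOR = 0x21                         # assumed vector, for illustration
cpu = ToyCPU()
cpu.set_handler(KEYBOARD_VECTOR, keyboard_isr)
cpu.pc = 0x4000                                # pretend we are mid-program
cpu.interrupt(KEYBOARD_VECTOR)
print(hex(cpu.pc), cpu.log)
```

After the handler returns, the program counter is back at `0x4000`, which is the property that lets the interrupted program remain unaware that anything happened.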

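Interrupt Latency: Time from device assertion to first ISR instruction. Typically on the order of 1–10 microseconds on a modern Linux system, depending on hardware, power states, and kernel configuration. Real-time kernels (PREEMPT_RT) mainly improve the worst case, keeping latency bounded and predictable rather than shrinking the typical figure.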

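Interrupt Priorities: On x86, the local APIC handles 256 interrupt vectors, and an interrupt's priority comes from its vector number: the upper four bits select one of 16 priority classes. A higher-priority interrupt can preempt a lower-priority ISR (nested interrupts), provided interrupts are enabled. The CPU's EFLAGS/RFLAGS register contains an interrupt enable bit (IF); the CLI and STI instructions clear and set it.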
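
The preemption rule can be sketched in a few lines. The model is deliberately simplified: priorities are plain integers where larger means more urgent, `interrupts_enabled` stands in for the IF bit, and `next_to_service` is a hypothetical helper, not a hardware or kernel API.

```python
def next_to_service(pending, running_priority, interrupts_enabled):
    """Return the pending priority to service next, or None to keep running.

    pending: priorities of interrupts waiting to be delivered
    running_priority: priority of the ISR currently executing
    """
    if not interrupts_enabled or not pending:
        return None                    # IF cleared (CLI) or nothing waiting
    best = max(pending)
    # Only a strictly higher priority may interrupt the running ISR.
    return best if best > running_priority else None


print(next_to_service([3, 9], running_priority=5, interrupts_enabled=True))
print(next_to_service([3, 9], running_priority=5, interrupts_enabled=False))
print(next_to_service([3, 5], running_priority=5, interrupts_enabled=True))
```

The first call selects priority 9, the second returns `None` because interrupts are masked, and the third returns `None` because nothing pending outranks the running handler.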

Pros: CPU-efficient; enables true multitasking; devices get a response within microseconds without dedicated polling.

Cons: Interrupt overhead (~1–10 μs) is problematic for very high-frequency events (e.g., 100 Gbps Ethernet generates millions of interrupts/sec — too many for per-packet interrupts). Solution: interrupt coalescing (batch multiple events into one interrupt).
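
A rough calculation shows why per-packet interrupts do not scale. The sketch assumes full-size 1,500-byte packets, ignores Ethernet framing overhead, takes 1 μs as the per-interrupt cost, and uses a batch size of 64 purely as an example; all function names are hypothetical.

```python
def packets_per_second(link_bps, packet_bytes):
    return link_bps / (packet_bytes * 8)


def interrupts_per_second(pps, packets_per_interrupt):
    return pps / packets_per_interrupt


def cpu_cores_consumed(interrupt_rate, overhead_seconds):
    """CPU-seconds of interrupt overhead per wall-clock second."""
    return interrupt_rate * overhead_seconds


pps = packets_per_second(100e9, 1500)            # 100 Gbps, assumed 1500 B packets
print(f"{pps:,.0f} packets/s")
for batch in (1, 64):                            # 64 is an assumed batch size
    rate = interrupts_per_second(pps, batch)
    cores = cpu_cores_consumed(rate, 1e-6)       # assumed 1 us per interrupt
    print(f"batch={batch}: {rate:,.0f} interrupts/s, about {cores:.2f} cores")
```

Under these assumptions, one interrupt per packet would cost more than eight cores' worth of overhead, while batching 64 packets per interrupt brings it down to a small fraction of one core.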

Mechanism 3: DMA — Direct Memory Access

For large data transfers (reading a 4 KB disk block, receiving a network packet), interrupt-driven I/O still requires the CPU to copy the data itself from the device's I/O registers to RAM. At 1 byte per copy instruction, transferring 4 KB takes 4,096 copy instructions plus loop overhead, all of it CPU time that could have gone to real work.

DMA solves this: the DMA controller (a dedicated hardware engine) performs the data transfer itself, so the CPU is involved only in setting it up and handling its completion.

DMA operation steps:

CPU programs the DMA controller: source address, destination address, byte count, transfer direction
CPU resumes normal work
DMA controller takes bus mastership (briefly pauses CPU's memory access — called cycle stealing)
DMA controller transfers data block directly between device and RAM
DMA controller fires a completion interrupt
CPU reads a small status value — data is already in RAM
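
The six steps above can be mirrored in a toy model. This is a sketch under stated assumptions: `ToyDMAController`, its `program` and `run` methods, and the callback standing in for the completion interrupt are hypothetical, and a `bytearray` plays the role of RAM.

```python
class ToyDMAController:
    """Toy DMA engine: moves a block into 'RAM' without a CPU copy loop."""

    def __init__(self, on_complete):
        self.on_complete = on_complete     # models the completion interrupt
        self.src = self.dst = None
        self.dst_offset = 0
        self.count = 0

    def program(self, src, dst, dst_offset, count):
        # Step 1: the CPU writes the transfer parameters.
        self.src, self.dst = src, dst
        self.dst_offset, self.count = dst_offset, count

    def run(self):
        # Steps 3-4: the engine moves the whole block by itself.
        end = self.dst_offset + self.count
        self.dst[self.dst_offset:end] = self.src[:self.count]
        # Step 5: signal completion.
        self.on_complete(self.count)


ram = bytearray(8192)                       # pretend main memory
device_buffer = bytes(range(256)) * 16      # 4 KB of device data
completed = []

dma = ToyDMAController(on_complete=completed.append)
dma.program(src=device_buffer, dst=ram, dst_offset=4096, count=4096)
# Step 2: the CPU would do unrelated work here.
dma.run()
# Step 6: the CPU only inspects a small status value.
print(completed, ram[4096:4100].hex())
```

The only per-transfer work on the "CPU" side is the call to `program` and reading `completed`; the 4 KB copy happens inside the controller.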

Modern DMA:

NVMe SSD: Uses PCIe DMA to transfer data directly to RAM at up to roughly 7 GB/s over PCIe 4.0 ×4, or roughly 14 GB/s over PCIe 5.0 ×4
GPU: PCIe DMA transfers textures and compute data between CPU RAM and GPU VRAM
Network card (NIC): Receives packets directly into pre-allocated RAM buffers via DMA (kernel networking stack, DPDK)

I/O Method	CPU Usage	Latency	Complexity	Best For
Polling	100% (busy wait)	Lowest (~ns)	Minimal	Real-time, embedded, kernel-bypass
Interrupt-Driven	Low (only during ISR)	Low (~μs)	Moderate	Keyboards, mice, low-bandwidth serial
DMA	Minimal (only setup + completion)	Low (~μs)	Higher	Disk, NIC, GPU, audio — any bulk transfer

I/O Port vs Memory-Mapped I/O

How does software talk to device registers?

Port-Mapped I/O (PMIO): Devices occupy a separate I/O address space accessed via special CPU instructions (IN/OUT on x86). Classic PC design — the keyboard controller is at I/O port 0x60.

Memory-Mapped I/O (MMIO): Device registers are mapped into the regular memory address space. Software accesses them via normal load/store instructions to specific physical addresses. Used by virtually all modern buses (PCIe, ARM's entire peripheral ecosystem).

MMIO advantages: No special instructions needed; C pointer dereferencing works; cache control attributes (write-combining, uncacheable) apply naturally via page table bits.
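
The idea that device registers are "just memory at a known address" can be imitated in user space. In this sketch a `bytearray` stands in for the physical address window, and `UartRegs` is a hypothetical register layout for an imaginary UART, not any real device.

```python
import ctypes


class UartRegs(ctypes.LittleEndianStructure):
    """Hypothetical register layout for an imaginary UART."""
    _fields_ = [
        ("data",    ctypes.c_uint32),   # offset 0x0
        ("status",  ctypes.c_uint32),   # offset 0x4
        ("control", ctypes.c_uint32),   # offset 0x8
    ]


# A bytearray stands in for the address range the device would occupy.
fake_mmio_window = bytearray(ctypes.sizeof(UartRegs))
regs = UartRegs.from_buffer(fake_mmio_window)   # overlay the layout, no copy

regs.control = 0x1           # an ordinary store "writes the register"
regs.data = ord("A")

print(fake_mmio_window.hex())
print(UartRegs.status.offset)
```

Real MMIO code additionally needs the mapping marked uncacheable and accesses the compiler will not reorder or elide (for example `volatile` in C); the sketch shows only the addressing model.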

PCIe: The Modern High-Speed I/O Bus

Peripheral Component Interconnect Express (PCIe) is the dominant high-speed I/O interconnect in modern PCs and servers, introduced in 2003 to replace the parallel PCI bus.

PCIe is a serial, point-to-point, lane-based protocol:

Each lane consists of two differential signal pairs (one pair to transmit, one pair to receive), so traffic flows in both directions at once
Slots are designated ×1, ×4, ×8, ×16 (number of parallel lanes)
PCIe 3.0: 8 GT/s per lane → ×16 = 128 GT/s (~16 GB/s)
PCIe 4.0: 16 GT/s per lane → ×16 = 256 GT/s (~32 GB/s)
PCIe 5.0: 32 GT/s per lane → ×16 = 512 GT/s (~64 GB/s)
PCIe 6.0 (2022): 64 GT/s per lane → ×16 = 1,024 GT/s (~128 GB/s)
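
The GB/s figures follow from the transfer rate, the lane count, and the line encoding. The sketch below covers PCIe 3.0 through 5.0, which use 128b/130b encoding, and gives bandwidth per direction. PCIe 6.0 is left out because it moves to PAM4 signaling with FLIT-based framing, whose overhead is not captured by this simple formula. `pcie_gbytes_per_s` is a hypothetical helper name.

```python
RATE_GT_PER_LANE = {3: 8, 4: 16, 5: 32}     # GT/s per lane, by generation
ENCODING_EFFICIENCY = 128 / 130             # 128b/130b, PCIe 3.0 through 5.0


def pcie_gbytes_per_s(gen, lanes):
    """Approximate one-direction bandwidth in GB/s, before protocol overhead."""
    raw_gbits = RATE_GT_PER_LANE[gen] * lanes
    return raw_gbits * ENCODING_EFFICIENCY / 8


for gen, lanes in [(3, 16), (4, 4), (4, 16), (5, 16)]:
    print(f"PCIe {gen}.0 x{lanes}: {pcie_gbytes_per_s(gen, lanes):.2f} GB/s")
```

Packet headers and flow control reduce the usable figure a little further, which is why measured device throughput sits somewhat below these numbers.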

PCIe devices act as bus masters and perform DMA themselves: GPUs, NVMe SSDs, and high-speed NICs move bulk data to and from RAM over PCIe without the CPU copying it.

USB: Universal Serial Bus

USB provides the standard external connectivity interface:

Version	Release	Max Bandwidth	Connector
USB 2.0	2000	480 Mbps	Type-A, Mini, Micro
USB 3.2 Gen 1 (originally USB 3.0)	2008	5 Gbps	Type-A, Type-C
USB 3.2 Gen 2 (originally USB 3.1)	2013	10 Gbps	Type-A, Type-C
USB 3.2 Gen 2×2	2017	20 Gbps	Type-C only
USB4 Gen 2×2	2019	20 Gbps	Type-C only
USB4 Gen 3×2	2019	40 Gbps	Type-C only
USB4 v2	2022	80 Gbps	Type-C only

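USB is a host-centric protocol: devices cannot initiate transfers; only the host controller (in the chipset or SoC) can. Despite the name, an interrupt endpoint does not raise a hardware interrupt. The host polls it at a guaranteed interval, and the device answers with data when it has something to report.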
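
The host-driven model can be sketched as a toy polling loop. Everything here is illustrative: `ToyUsbKeyboard`, `host_poll_loop`, and the frame-numbered event schedule are hypothetical, and returning `None` stands in for the device having nothing to send.

```python
from collections import deque


class ToyUsbKeyboard:
    """Toy device: it can queue reports but never starts a transfer itself."""

    def __init__(self):
        self._pending = deque()

    def key_pressed(self, key):
        self._pending.append(key)            # wait for the host to ask

    def poll_interrupt_endpoint(self):
        # None models "nothing to report" for this polling interval.
        return self._pending.popleft() if self._pending else None


def host_poll_loop(device, events_by_frame, num_frames):
    """The host asks the device once per frame; one report per poll."""
    received = []
    for frame in range(num_frames):
        for key in events_by_frame.get(frame, ()):
            device.key_pressed(key)
        report = device.poll_interrupt_endpoint()
        if report is not None:
            received.append((frame, report))
    return received


kbd = ToyUsbKeyboard()
print(host_poll_loop(kbd, {2: ["a"], 5: ["b", "c"]}, num_frames=8))
```

Key `c` is pressed in frame 5 but delivered in frame 6, because the device has to wait for the host's next poll.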

Putting It Together: A Disk Read

When your program calls read() on a file, here's what happens at the hardware level:

Program calls read() → system call → OS kernel takes control
Kernel checks page cache — if data is cached in RAM, return immediately (no I/O)
On a cache miss, the kernel queues a read command for the NVMe controller (physical RAM destination, LBA (logical block address), transfer length) and rings the controller's doorbell register
Kernel suspends the process and schedules another process — no busy waiting
NVMe SSD receives the command, reads NAND flash (takes ~100 μs for random read)
NVMe DMA engine transfers 4 KB directly to the kernel's buffer in RAM
NVMe fires a PCIe MSI-X interrupt (Message Signaled Interrupt — a write to a special MMIO address)
CPU's APIC receives the interrupt, triggers the NVMe ISR
ISR marks the I/O complete, wakes the suspended process
Process resumes, data is in RAM — the read() call returns

Total time: ~100 μs (SSD latency) + ~5 μs (OS overhead) = ~105 μs
CPU time spent on I/O: at most the few microseconds of OS overhead; the CPU copied none of the data, because DMA hardware moved all of it
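
The "suspend, let something else run, wake on completion" pattern in this sequence can be imitated with `asyncio`. This is an analogy rather than kernel code: `read_block` and `other_process` are hypothetical names, a future stands in for the completion interrupt, and a 10 ms timer stands in for the SSD's much shorter latency.

```python
import asyncio


async def read_block(loop, log):
    done = loop.create_future()              # completes when the "interrupt" fires
    log.append("kernel: command submitted, process suspended")
    # Assumed 10 ms delay standing in for the device's ~100 us latency.
    loop.call_later(0.01, done.set_result, b"\x00" * 4096)
    data = await done                        # the caller sleeps; no busy waiting
    log.append("kernel: interrupt handled, process woken")
    return data


async def other_process(log):
    log.append("other process: running while the read is in flight")


async def main():
    log = []
    loop = asyncio.get_running_loop()
    data, _ = await asyncio.gather(read_block(loop, log), other_process(log))
    return data, log


data, log = asyncio.run(main())
print(len(data))
print("\n".join(log))
```

The log shows the other task running between submission and wake-up, which is the whole point of not polling.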

Summary

I/O architecture solves the fundamental mismatch between fast CPUs and slow peripherals. Polling is simple but wasteful. Interrupts let the CPU work productively until a device needs attention. DMA takes the CPU out of the data path for bulk transfers, leaving it only setup and completion work, which enables multi-GB/s throughput without loading down the processor. Modern systems combine all three: polling for ultra-low-latency paths, interrupts for device signaling and completion notification, and DMA for all bulk data movement over PCIe.
