AiTechWorlds
Right now, your keyboard is electrically idle. The browser rendering your screen, the music player streaming audio, the antivirus scanning files in the background — none of them are thinking about your keyboard. They're doing their own work.
Then you press a key. Within microseconds, the CPU stops what it's doing, saves its entire state, handles the keypress, sends the character to the correct application, and resumes exactly where it left off — without any application having to continuously check "did the user press a key?"
This mechanism — the interrupt — is one of the most elegant inventions in computer architecture. Without it, every program would have to constantly poll every device, wasting CPU cycles on nothing. With it, the CPU can work productively until a device actually needs attention.
Understanding I/O architecture means understanding how a CPU that can execute billions of instructions per second efficiently manages hundreds of slow, asynchronous peripheral devices.
Modern computers connect to an enormous variety of peripheral devices, grouped into three categories:
Each category has different requirements that shape which I/O mechanism is most appropriate.
The CPU actively checks whether a device is ready by repeatedly reading a status register:
loop:
read device status register
if NOT_READY: goto loop
read/write data register
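The loop above can be tried out against a simulated device. This is a minimal sketch, not real driver code: `SimulatedDevice`, its `READY` bit, and the "ready after N status reads" behaviour are all invented for illustration.

```python
class SimulatedDevice:
    """Hypothetical device that becomes ready after a fixed number of status reads."""

    READY = 0x01  # assumed layout: bit 0 of the status register means "data ready"

    def __init__(self, reads_until_ready, data):
        self._remaining = reads_until_ready
        self._data = data

    def read_status(self):
        # Each status read brings the simulated device one step closer to ready.
        if self._remaining > 0:
            self._remaining -= 1
            return 0x00
        return self.READY

    def read_data(self):
        return self._data


def poll_and_read(device):
    """Busy-wait on the status register, then read the data register."""
    wasted_polls = 0
    while not (device.read_status() & SimulatedDevice.READY):
        wasted_polls += 1  # the CPU did nothing useful in this iteration
    return device.read_data(), wasted_polls


value, wasted = poll_and_read(SimulatedDevice(reads_until_ready=1000, data=0x41))
print(f"read 0x{value:02X} after {wasted} wasted polls")
```

The `wasted_polls` counter makes the cost visible: every iteration before the device is ready is CPU time that produced nothing.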
Real-world analogy: Calling a restaurant every 5 minutes to ask if your table is ready.
Pros: Simple to implement; predictable timing; no hardware interrupt circuitry needed.
Cons: Wastes CPU cycles when the device is slow. A 3 GHz CPU polling a printer that takes 30 seconds to process a page burns roughly 90 billion clock cycles (30 seconds × 3 billion cycles per second) doing nothing but checking. Impractical as the general-purpose I/O mechanism in a multitasking system.
When used today: In extremely latency-sensitive scenarios where interrupt overhead is unacceptable, like high-frequency trading network stacks (kernel-bypass networking with DPDK), or in simple embedded microcontrollers where context switching doesn't exist.
The device signals the CPU using a dedicated hardware line — the Interrupt Request (IRQ). The CPU works on other tasks and only responds when the interrupt fires:
The hardware mechanism in detail:
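Interrupt Latency: Time from device assertion to the first ISR instruction. Typically on the order of 1–10 microseconds on modern Linux, depending on hardware and kernel configuration. Real-time kernels (PREEMPT_RT) focus on bounding the worst case rather than shrinking the typical case.
Interrupt Priorities: On x86, the APIC handles 256 interrupt vectors and derives priority from the vector number: the upper four bits select one of 16 priority classes. Higher-priority interrupts can preempt lower-priority ISRs (nested interrupts). The CPU's EFLAGS/RFLAGS register contains an interrupt enable bit (IF); the CLI and STI instructions clear and set it, disabling and enabling maskable interrupts.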
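To make the interplay between pending interrupts, the IF bit, and priority concrete, here is a small simulation. It is a sketch under stated assumptions: `SimulatedCPU` and its methods are hypothetical, the vector numbers are arbitrary, and the rule "larger vector number is serviced first" is a simplification chosen for the example. Nested preemption is not modelled.

```python
import heapq


class SimulatedCPU:
    """Toy model of interrupt dispatch; not a model of any real controller."""

    def __init__(self):
        self.interrupts_enabled = True  # stands in for the IF bit
        self.vector_table = {}          # vector number -> handler (ISR)
        self._pending = []              # max-heap built from negated vector numbers
        self.log = []

    def register_isr(self, vector, handler):
        self.vector_table[vector] = handler

    def cli(self):
        self.interrupts_enabled = False  # like CLI: stop servicing interrupts

    def sti(self):
        self.interrupts_enabled = True   # like STI: allow servicing again

    def raise_irq(self, vector):
        heapq.heappush(self._pending, -vector)  # device asserts its interrupt

    def service_pending(self):
        """Run ISRs for pending interrupts, highest vector first, while IF is set."""
        serviced = 0
        while self.interrupts_enabled and self._pending:
            vector = -heapq.heappop(self._pending)
            self.vector_table[vector](self)
            serviced += 1
        return serviced


cpu = SimulatedCPU()
cpu.register_isr(0x21, lambda c: c.log.append("keyboard"))  # hypothetical vector
cpu.register_isr(0x80, lambda c: c.log.append("timer"))     # hypothetical vector

cpu.cli()
cpu.raise_irq(0x21)
cpu.raise_irq(0x80)
masked = cpu.service_pending()   # IF is clear, so both interrupts stay pending
cpu.sti()
handled = cpu.service_pending()  # IF is set, so both ISRs run
print(masked, handled, cpu.log)
```

While interrupts are disabled the requests are not lost; they stay pending and are serviced in priority order once interrupts are enabled again.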
Pros: CPU-efficient; enables true multitasking; device latency is excellent.
Cons: Interrupt overhead (~1–10 μs) is problematic for very high-frequency events (e.g., 100 Gbps Ethernet generates millions of interrupts/sec — too many for per-packet interrupts). Solution: interrupt coalescing (batch multiple events into one interrupt).
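A back-of-the-envelope calculation shows why per-packet interrupts break down at this speed and how much coalescing helps. The frame size (1500 bytes, framing overhead ignored), the batch size of 64, and the use of the low-end 1 μs overhead figure are assumptions made for this sketch.

```python
LINK_BPS = 100e9        # 100 Gbps link, as in the text
FRAME_BYTES = 1500      # assumed frame size; preamble and inter-frame gap ignored
PER_INTERRUPT_US = 1.0  # low end of the ~1-10 us overhead range quoted in the text


def interrupt_load(batch_size):
    """Return (packets/s, interrupts/s, fraction of one core spent on interrupt overhead)."""
    packets_per_sec = LINK_BPS / (FRAME_BYTES * 8)
    interrupts_per_sec = packets_per_sec / batch_size
    cpu_fraction = interrupts_per_sec * PER_INTERRUPT_US * 1e-6
    return packets_per_sec, interrupts_per_sec, cpu_fraction


for batch in (1, 64):
    pps, ips, cpu = interrupt_load(batch)
    print(f"batch={batch:>2}: {pps / 1e6:.2f} M packets/s, "
          f"{ips / 1e6:.2f} M interrupts/s, {cpu:.0%} of one core")
```

With one interrupt per packet, the overhead alone would need several cores. Batching 64 packets per interrupt brings it down to a fraction of one core, at the cost of slightly higher latency for the first packet in each batch.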
For large data transfers (reading a 4 KB disk sector, receiving a network packet), interrupt-driven I/O still requires the CPU to copy data byte-by-byte from the device's I/O registers to RAM. At 1 byte per copy instruction, transferring 4 KB takes ~4,000 CPU instructions — wasted.
DMA solves this: the DMA controller (a dedicated hardware engine) performs the data transfer entirely without CPU involvement.
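The difference between CPU-driven copying and DMA can be illustrated with a toy model. Everything here is hypothetical: `SimulatedDMAController`, `cpu_copy`, the addresses, and the callback that stands in for the completion interrupt. A real DMA engine is hardware, so the slice copy below only represents the work the CPU does not have to do.

```python
class SimulatedDMAController:
    """Hypothetical DMA engine: moves device data into RAM, then signals completion."""

    def __init__(self, ram):
        self.ram = ram

    def transfer(self, device_buffer, dest_addr, length, on_complete):
        # The CPU's only job is to supply source, destination, and length.
        # This block copy stands in for the hardware engine moving the data.
        self.ram[dest_addr:dest_addr + length] = device_buffer[:length]
        on_complete(dest_addr, length)  # models the completion interrupt


def cpu_copy(ram, device_buffer, dest_addr, length):
    """Programmed I/O for comparison: the CPU performs one copy step per byte."""
    steps = 0
    for i in range(length):
        ram[dest_addr + i] = device_buffer[i]
        steps += 1
    return steps


SECTOR = 4096
device_data = bytes(range(256)) * 16  # 4 KB of sample data
ram = bytearray(64 * 1024)
completions = []

pio_steps = cpu_copy(ram, device_data, 0, SECTOR)
dma = SimulatedDMAController(ram)
dma.transfer(device_data, 8192, SECTOR, lambda addr, n: completions.append((addr, n)))

print(f"CPU copy steps: {pio_steps}, DMA completions: {completions}")
```

Both paths end with the same bytes in RAM. The difference is that the programmed-I/O path costs the CPU one step per byte, while the DMA path costs it one setup call and one completion notification.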
DMA operation steps:
Modern DMA:
| I/O Method | CPU Usage | Latency | Complexity | Best For |
|---|---|---|---|---|
| Polling | 100% (busy wait) | Lowest (~ns) | Minimal | Real-time, embedded, kernel-bypass |
| Interrupt-Driven | Low (only during ISR) | Low (~μs) | Moderate | Keyboards, mice, low-bandwidth serial |
| DMA | Minimal (only setup + completion) | Low (~μs) | Higher | Disk, NIC, GPU, audio — any bulk transfer |
How does software talk to device registers?
Port-Mapped I/O (PMIO): Devices occupy a separate I/O address space accessed via special CPU instructions (IN/OUT on x86). Classic PC design — the keyboard controller is at I/O port 0x60.
Memory-Mapped I/O (MMIO): Device registers are mapped into the regular memory address space. Software accesses them via normal load/store instructions to specific physical addresses. Used by virtually all modern buses (PCIe, ARM's entire peripheral ecosystem).
MMIO advantages: No special instructions needed; C pointer dereferencing works; cache control attributes (write-combining, uncacheable) apply naturally via page table bits.
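The idea that device registers are just addresses you load from and store to can be sketched in Python with `ctypes`. This is only an illustration: the `UartRegisters` layout is invented, and a `bytearray` stands in for the physical address range that a real device would occupy.

```python
import ctypes


class UartRegisters(ctypes.LittleEndianStructure):
    """Hypothetical register block; the layout is invented for illustration."""
    _fields_ = [
        ("data", ctypes.c_uint32),     # offset 0x00
        ("status", ctypes.c_uint32),   # offset 0x04
        ("control", ctypes.c_uint32),  # offset 0x08
    ]


# A bytearray plays the role of the memory-mapped device region.
fake_device_memory = bytearray(ctypes.sizeof(UartRegisters))
regs = UartRegisters.from_buffer(fake_device_memory)

regs.control = 0x1    # an ordinary store "programs" the device
regs.data = ord("A")  # another store writes the data register

print(fake_device_memory.hex())
```

Assigning to `regs.control` changes bytes at a fixed offset in the underlying buffer, which is the same relationship a C struct pointer has with a mapped register block. On real hardware the mapping would also need to be marked uncacheable so that stores actually reach the device.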
Peripheral Component Interconnect Express (PCIe) is the dominant high-speed I/O interconnect in modern PCs and servers, introduced in 2003 to replace the parallel PCI bus.
PCIe is a serial, point-to-point, lane-based protocol:
PCIe uses DMA natively — all GPU, NVMe SSD, and 10+ GbE NIC transfers happen via PCIe DMA with no CPU involvement in the data path.
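Because PCIe bandwidth scales with lane count, the theoretical throughput of a link is easy to estimate. The per-lane transfer rates and line encodings below come from the published PCIe generations rather than from this text, and the result ignores packet and protocol overhead, so treat it as an upper bound.

```python
# Per-lane signalling rate (GT/s) and line-encoding efficiency per generation.
PCIE_GENERATIONS = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),
    5: (32.0, 128 / 130),
}


def pcie_bandwidth_gb_per_s(generation, lanes):
    """Approximate one-direction bandwidth in GB/s, ignoring packet overhead."""
    gt_per_s, efficiency = PCIE_GENERATIONS[generation]
    return gt_per_s * efficiency * lanes / 8  # 8 bits per byte


for gen, lanes in [(3, 4), (4, 4), (4, 16), (5, 16)]:
    print(f"PCIe {gen}.0 x{lanes}: ~{pcie_bandwidth_gb_per_s(gen, lanes):.1f} GB/s per direction")
```

An x4 link is the usual width for an NVMe SSD and x16 for a GPU, which is why those two configurations are common reference points.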
USB provides the standard external connectivity interface:
| Version | Release | Max Bandwidth | Connector |
|---|---|---|---|
| USB 2.0 | 2000 | 480 Mbps | Type-A, Mini, Micro |
| USB 3.2 Gen 1 (originally USB 3.0) | 2008 | 5 Gbps | Type-A, Type-C |
| USB 3.2 Gen 2 (originally USB 3.1) | 2013 | 10 Gbps | Type-A, Type-C |
| USB 3.2 Gen 2×2 | 2017 | 20 Gbps | Type-C only |
| USB4 Gen 2×2 | 2019 | 20 Gbps | Type-C only |
| USB4 Gen 3×2 | 2019 | 40 Gbps | Type-C only |
| USB4 v2 | 2022 | 80 Gbps | Type-C only |
USB is a host-centric protocol: devices cannot initiate transfers; only the host controller (in the chipset or SoC) can. A device that needs attention queues data on an interrupt endpoint, which the host polls at a regular interval. Despite the name, this is not a hardware interrupt line.
When your program calls read() on a file, here's what happens at the hardware level:
read() → system call → OS kernel takes control
read() call returns
Total time: ~100 μs (SSD latency) + ~5 μs (OS overhead) = ~105 μs
CPU time spent on I/O: < 1 μs; everything else was done by DMA hardware
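From the program's side, the whole sequence is hidden behind a single call. The sketch below uses Python's `os.read`, a thin wrapper over the `read()` system call, on a temporary 4 KB file created just for the example. The measured time will usually be far below the ~100 μs figure because a freshly written file is served from the kernel's page cache rather than from the SSD.

```python
import os
import tempfile
import time

# Create a 4 KB file so there is something to read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
    path = tmp.name

fd = os.open(path, os.O_RDONLY)  # open() system call
start = time.perf_counter()
data = os.read(fd, 4096)         # read() system call: the kernel takes control here
elapsed_us = (time.perf_counter() - start) * 1e6
os.close(fd)
os.remove(path)                  # clean up the temporary file

print(f"read {len(data)} bytes in {elapsed_us:.1f} us")
```

The program blocks inside `os.read` and sees only the returned bytes. Interrupts, DMA, and scheduling all happen below that line.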
I/O architecture solves the fundamental mismatch between fast CPUs and slow peripherals. Polling is simple but wasteful. Interrupts let the CPU work productively until a device needs attention. DMA eliminates CPU involvement in bulk data transfers entirely, enabling multi-GB/s throughput without loading down the processor. Modern systems combine all three: polling for ultra-low-latency paths, interrupts for device signaling and completion notification, and DMA for all bulk data movement over PCIe.