AiTechWorlds
Your keyboard generates an interrupt every time a key is pressed or released. Your network interface card can generate an interrupt for every arriving Ethernet frame. Your CPU's local timer fires periodically (up to 1,000 times per second, depending on the kernel's HZ setting) to give the scheduler a chance to preempt the current process. On a busy server, the interrupt rate reaches hundreds of thousands per second.
Without a mechanism to handle this efficiently, the CPU would need to poll every device constantly to check for activity, burning cycles asking "anything for me?" thousands of times per second. Early computers did exactly that, and the approach is called polling. It worked until device counts and event rates made it untenable.
Interrupts inverted the relationship: instead of the CPU asking devices if they need attention, devices tell the CPU when they do. This asynchronous notification mechanism is the foundation of responsive, efficient I/O. But building an interrupt handling system that is both fast and correct is one of the harder problems in systems programming — and Linux's solution, refined over three decades, is worth understanding in detail.
Hardware Interrupt Requests are asynchronous signals generated by physical devices. They arrive at the CPU with no relationship to what code is currently executing. There are two sub-categories:
Maskable interrupts can be temporarily disabled by the CPU. When the kernel executes cli (Clear Interrupt Flag) on x86-64, the CPU stops accepting maskable interrupts — they remain pending until sti (Set Interrupt Flag) re-enables them. This is used in critical sections of interrupt handlers where re-entrant interrupts would corrupt state.
Non-maskable interrupts (NMI) cannot be disabled with cli. They are reserved for genuinely critical hardware events: uncorrectable memory errors (ECC), hardware watchdog timeouts, and power failure signals. On modern x86 systems, the Machine Check Architecture (MCA) reports many hardware errors through a separate machine-check exception (#MC), which is handled under similarly strict constraints. The handler for an NMI must be extraordinarily careful: it cannot assume any kernel state is consistent.
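The difference between the two sub-categories can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical, and real hardware delivers pending interrupts by vector priority rather than in arrival order as assumed here.

```python
from collections import deque


class ToyCpu:
    """Toy model of maskable vs. non-maskable delivery (not real hardware)."""

    def __init__(self):
        self.interrupts_enabled = True   # stands in for the x86 interrupt flag
        self.pending = deque()           # maskable interrupts waiting for sti
        self.delivered = []              # order in which handlers would run

    def cli(self):
        self.interrupts_enabled = False

    def sti(self):
        self.interrupts_enabled = True
        # Anything that arrived while masked is delivered now (FIFO assumption).
        while self.pending:
            self.delivered.append(self.pending.popleft())

    def raise_irq(self, name):
        if self.interrupts_enabled:
            self.delivered.append(name)
        else:
            self.pending.append(name)

    def raise_nmi(self, name):
        # NMIs ignore the interrupt flag entirely.
        self.delivered.append(name)


cpu = ToyCpu()
cpu.cli()
cpu.raise_irq("keyboard")   # held pending
cpu.raise_nmi("watchdog")   # delivered immediately
cpu.raise_irq("nic")        # held pending
cpu.sti()
print(cpu.delivered)  # ['watchdog', 'keyboard', 'nic']
```

The NMI overtakes both maskable interrupts even though it arrived between them, which is why its handler cannot assume the interrupted code left kernel state consistent.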
Traps are synchronous — they occur at a specific point in the executing instruction stream. System calls (syscall instruction), breakpoints (int3), and explicit software interrupts (int 0x80, the legacy 32-bit syscall mechanism) are all traps. They behave like interrupts but are deliberately triggered by the running code.
Exceptions are synchronous signals from the CPU itself, generated when the processor cannot complete an instruction normally. Examples include a divide error (delivered to the process as SIGFPE), a general protection fault, and a page fault.
Page faults are the most frequent exception in a normal Linux system and are performance-critical: they occur every time a process accesses memory that has been swapped out or has not yet been faulted in (demand paging).
On modern x86-64 systems, interrupts are managed by the Advanced Programmable Interrupt Controller (APIC) subsystem.
Local APIC: each CPU core has its own Local APIC. It handles per-core interrupts: the local timer (generates scheduling ticks), performance monitoring interrupts, and Inter-Processor Interrupts (IPIs — how one core signals another to flush TLB entries or trigger a task migration).
I/O APIC: one or more I/O APICs sit on the motherboard and receive interrupt lines from devices (keyboard, USB controller, NVMe drive). The I/O APIC's redirection table maps each device interrupt to a specific vector number and routes it to one or more CPU Local APICs. cat /proc/interrupts shows the routing: which IRQ went to which CPU, how many times.
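To make the /proc/interrupts layout concrete, here is a minimal parsing sketch. The sample text is illustrative rather than captured from a real machine, the function names are hypothetical, and the parser assumes the usual layout of one header row of CPU names followed by one row per interrupt source.

```python
SAMPLE = """
           CPU0       CPU1
  0:         36          0   IO-APIC   2-edge      timer
  1:        120          3   IO-APIC   1-edge      i8042
 24:          0     901234   PCI-MSI 524288-edge      eth0
NMI:          2          1   Non-maskable interrupts
ERR:          0
"""


def parse_interrupts(text):
    """Return (cpu_names, {label: (per_cpu_counts, description)})."""
    lines = text.strip("\n").splitlines()
    cpus = lines[0].split()
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Leading numeric fields are per-CPU counters; some rows have fewer.
        for field in fields[:len(cpus)]:
            if not field.isdigit():
                break
            counts.append(int(field))
        description = " ".join(fields[len(counts):])
        table[label.strip()] = (counts, description)
    return cpus, table


def busiest_cpu(cpus, counts):
    """Name of the CPU that handled the most interrupts for one row."""
    return cpus[counts.index(max(counts))]


cpus, table = parse_interrupts(SAMPLE)
counts, description = table["24"]
print(description, "is mostly handled by", busiest_cpu(cpus, counts))
```

On a Linux machine the same function can be fed the contents of /proc/interrupts directly instead of the sample string.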
Linux's IRQ balancing daemon (irqbalance) dynamically adjusts interrupt routing to distribute interrupt load across cores. On a heavily loaded server, a single core receiving all network interrupts becomes a bottleneck, so irqbalance spreads them out.
The Interrupt Descriptor Table (IDT) is an array of up to 256 gate descriptors. Each entry describes how to handle one interrupt vector. The CPU's IDTR register holds the base address and limit of this table.
Each IDT entry contains: the handler address, the code segment selector, the privilege level required to trigger this gate (prevents user code from issuing arbitrary software interrupts), and an interrupt stack table index (IST — allows using a separate known-good stack for NMI and double faults).
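The fields listed above fit into a 16-byte descriptor in 64-bit mode. The sketch below packs and unpacks one such gate with Python's struct module, following the layout in the Intel manuals. The handler address and the selector value are made up for illustration, and the helper names are hypothetical.

```python
import struct

# offset[15:0], selector, IST, type/attributes, offset[31:16], offset[63:32], reserved
GATE_FORMAT = "<HHBBHII"
INTERRUPT_GATE = 0xE
TRAP_GATE = 0xF


def pack_gate(handler, selector, ist=0, dpl=0, gate_type=INTERRUPT_GATE, present=True):
    """Build a 16-byte x86-64 gate descriptor."""
    type_attr = (int(present) << 7) | ((dpl & 0x3) << 5) | (gate_type & 0xF)
    return struct.pack(
        GATE_FORMAT,
        handler & 0xFFFF,
        selector,
        ist & 0x7,
        type_attr,
        (handler >> 16) & 0xFFFF,
        (handler >> 32) & 0xFFFFFFFF,
        0,
    )


def unpack_gate(raw):
    """Decode a 16-byte gate descriptor back into its fields."""
    low, selector, ist, type_attr, mid, high, _reserved = struct.unpack(GATE_FORMAT, raw)
    return {
        "handler": low | (mid << 16) | (high << 32),
        "selector": selector,
        "ist": ist & 0x7,
        "dpl": (type_attr >> 5) & 0x3,
        "gate_type": type_attr & 0xF,
        "present": bool(type_attr >> 7),
    }


# Hypothetical handler address and code segment selector, for illustration only.
raw = pack_gate(handler=0xFFFFFFFF81A00C40, selector=0x10, ist=1, dpl=3)
gate = unpack_gate(raw)
print(len(raw), hex(gate["handler"]), gate["dpl"], gate["ist"])
```

A descriptor privilege level of 3, as in this example, is what lets user code trigger a gate deliberately (as with int3); gates for hardware interrupts keep it at 0.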
Linux divides interrupt processing into two distinct phases. This is the most important design decision in Linux's interrupt handling and the source of its scalability.
When a hardware interrupt fires, the CPU jumps to the IDT handler. Linux's common entry code disables the interrupt being processed (to prevent re-entrancy), saves registers, and calls the registered irq_handler_t.
The top half executes in interrupt context, a special state with strict rules: the handler cannot sleep or call anything that might block, and it cannot allocate memory with flags that may sleep (GFP_KERNEL), so it must use GFP_ATOMIC instead.
Because other interrupts may be masked during this window, a slow top half increases overall interrupt latency for the entire system. The canonical top half does exactly two things: it saves the data the hardware has ready (a network packet buffer, a keyboard scancode), and it signals the hardware that the interrupt has been received (the "ACK", without which the device will not generate future interrupts).
All actual processing — parsing network packets, updating disk I/O accounting, running timer callbacks — happens in the bottom half, where normal kernel context rules apply.
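The division of labour can be sketched in user space. This is an analogy, not kernel code: the function names, the queue, and the frame contents are all hypothetical, and nothing here runs in real interrupt context.

```python
from collections import deque

rx_queue = deque()   # data handed from the top half to the bottom half
acked = []           # which "interrupts" have been acknowledged


def top_half(irq, raw_frame):
    """Do the minimum: stash the data and acknowledge the device."""
    rx_queue.append(raw_frame)
    acked.append(irq)


def bottom_half():
    """Deferred work: the slower parsing happens here, later."""
    parsed = []
    while rx_queue:
        frame = rx_queue.popleft()
        # Ethernet header: 6-byte destination, 6-byte source, 2-byte EtherType.
        dst, src, payload = frame[:6], frame[6:12], frame[14:]
        parsed.append((dst.hex(":"), src.hex(":"), len(payload)))
    return parsed


frame = bytes.fromhex("ffffffffffff") + bytes.fromhex("020000000001") + b"\x08\x00" + b"hello"
top_half(24, frame)
top_half(24, frame)
results = bottom_half()
print(acked, results)
```

Both frames are acknowledged before any parsing starts, which is the property the split is designed to guarantee.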
Linux has three mechanisms for bottom-half processing:
Softirqs: the oldest and fastest mechanism. There are 10 fixed softirq types, compiled into the kernel. NET_RX_SOFTIRQ processes received network packets. BLOCK_SOFTIRQ handles block I/O completions. TIMER_SOFTIRQ runs expired timers. Softirqs can run on multiple CPUs simultaneously (they are designed for this) and run in ksoftirqd kernel threads or inline after the hardirq handler returns.
Tasklets: built on top of softirqs. Unlike softirqs, a given tasklet only runs on one CPU at a time (serialized), making them easier to write correctly. They are deprecated in newer kernels in favor of workqueues.
Workqueues: run in kernel thread context (a real schedulable task, with a task_struct). Work items can sleep, block on I/O, and allocate memory with GFP_KERNEL. The system_wq workqueue is the default. Drivers use schedule_work() to queue functions for deferred execution.
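The softirq mechanism boils down to a per-CPU pending bitmask that is scanned from the lowest bit upward. The toy model below follows that idea; the enum mirrors the ten softirq types in their kernel order, while the class and its methods are simplified stand-ins for illustration rather than the kernel's API.

```python
from enum import IntEnum


class Softirq(IntEnum):
    HI = 0
    TIMER = 1
    NET_TX = 2
    NET_RX = 3
    BLOCK = 4
    IRQ_POLL = 5
    TASKLET = 6
    SCHED = 7
    HRTIMER = 8
    RCU = 9


class ToyCpuSoftirqs:
    """One CPU's softirq state: a pending bitmask plus registered handlers."""

    def __init__(self):
        self.pending = 0
        self.handlers = {}

    def open_softirq(self, nr, handler):
        self.handlers[nr] = handler

    def raise_softirq(self, nr):
        # Raising twice before processing sets the same bit: requests coalesce.
        self.pending |= 1 << nr

    def do_softirq(self):
        """Run handlers for every pending bit, lowest number first."""
        ran = []
        pending, self.pending = self.pending, 0
        for nr in Softirq:
            if pending & (1 << nr):
                handler = self.handlers.get(nr)
                if handler is not None:
                    handler()
                ran.append(nr.name)
        return ran


log = []
cpu0 = ToyCpuSoftirqs()
cpu0.open_softirq(Softirq.NET_RX, lambda: log.append("rx"))
cpu0.open_softirq(Softirq.TIMER, lambda: log.append("timer"))
cpu0.raise_softirq(Softirq.NET_RX)
cpu0.raise_softirq(Softirq.NET_RX)   # coalesces with the previous raise
cpu0.raise_softirq(Softirq.TIMER)
order = cpu0.do_softirq()
print(order, log)
```

Because the state is per-CPU, two CPUs can process the same softirq type at the same time, which is what makes softirqs fast and also what makes them harder to write correctly than tasklets or workqueues.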
The split exists because interrupt latency is a cascading problem: if a top half takes 500µs, interrupts of that type (and often others) are delayed by 500µs systemwide. The two-half design keeps the critical path minimal.
Standard Linux typically keeps interrupt latency around 100µs or below under load, but spikes into the millisecond range are possible when the kernel holds spinlocks or executes non-preemptible code paths.
The PREEMPT_RT patch set (merged into mainline incrementally over many releases, with the final enablement landing in Linux 6.12) converts most spinlocks to RT-mutexes (which can be preempted), makes hardirq handlers run in kernel threads (allowing preemption), and runs most timer callbacks in preemptible context. On suitable hardware this typically brings worst-case latency well under 100µs.
PREEMPT_RT is used in:
The cyclictest tool measures scheduling latency by running a high-priority thread that measures the difference between requested and actual wakeup time. A well-configured RT Linux system shows maximum latency under 100µs; a standard kernel may show 1ms+ spikes.
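The measurement idea can be sketched in a few lines. This is not cyclictest: it runs at normal priority, it includes Python interpreter overhead, and the function name and parameters are hypothetical, so the numbers only show the method, not the real scheduling latency of the machine.

```python
import time


def measure_wakeup_latency(interval_s=0.001, loops=200):
    """Sleep until successive deadlines; record how late each wakeup is (in µs)."""
    latencies_us = []
    deadline = time.monotonic() + interval_s
    for _ in range(loops):
        remaining = deadline - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        # Latency is the gap between the requested and the actual wakeup time.
        latencies_us.append((time.monotonic() - deadline) * 1e6)
        deadline += interval_s
    return latencies_us


samples = measure_wakeup_latency(interval_s=0.001, loops=50)
print(f"min {min(samples):.0f} µs, max {max(samples):.0f} µs")
```

The maximum matters more than the average here: a real-time system is judged by its worst observed wakeup, not its typical one.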
| Interrupt Type | Source | Maskable? | Handler Context | Linux Mechanism | Typical Latency |
|---|---|---|---|---|---|
| Hardware IRQ (maskable) | NIC, keyboard, USB, timer | Yes (cli/sti) | Hardirq (interrupt context) | request_irq(), request_threaded_irq() | 1–100 µs |
| NMI | Hardware watchdog, ECC error | No | NMI context (strict) | register_nmi_handler() | < 1 µs (priority) |
| Page Fault (exception) | MMU on bad memory access | N/A (synchronous) | Exception context | do_page_fault() (exc_page_fault() on newer x86 kernels) → handle_mm_fault() | 1–100 µs (+ disk I/O if swap) |
| General Protection Fault | Privilege violation | N/A (synchronous) | Exception context | Deliver SIGSEGV to process | < 1 µs |
| Softirq | Raised by top half | Effectively (disabled per-CPU) | Softirq context (no sleep) | raise_softirq(), ksoftirqd | 10–200 µs |
| Workqueue | Kernel code | N/A (thread) | Process context (can sleep) | schedule_work() | Milliseconds (thread-scheduled) |
The interrupt subsystem is where hardware asynchrony meets software concurrency, and where the costs of "handling everything" become visible. Every interrupt that fires displaces whatever was executing, evicts useful lines from the caches, and forces the CPU to save and reload state. At 100,000 interrupts per second on a network-heavy server, this overhead is measurable.
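A back-of-the-envelope calculation shows why that rate is measurable. The per-interrupt cost of 2 µs used below is an assumption chosen for illustration; the real figure depends on the hardware, the handler, and the cache effects involved.

```python
def interrupt_overhead(rate_per_s, cost_us):
    """Fraction of one core's time consumed by interrupts."""
    return rate_per_s * cost_us / 1_000_000


# Assumed cost: 2 µs per interrupt, including entry, handler, and exit.
share = interrupt_overhead(rate_per_s=100_000, cost_us=2.0)
print(f"{share:.0%}")  # 20%
```

Under that assumption, a fifth of one core goes to interrupt handling before any useful work is done, which is the motivation for spreading IRQs across cores and for keeping top halves short.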
The top-half / bottom-half split is the architectural response to this: keep the latency-critical path minimal, defer everything else. The further distinction between softirqs (fast, no sleep), tasklets (serialized), and workqueues (full process context) gives driver authors a spectrum of options matched to their actual needs.
Knowing how to read /proc/interrupts, how to identify interrupt storms with watch -n 1 cat /proc/interrupts, and how to tune IRQ affinity with irqbalance or manual /proc/irq/N/smp_affinity settings is practical knowledge for any serious Linux systems work.
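The value in /proc/irq/N/smp_affinity is a hexadecimal CPU bitmask, with bit 0 standing for CPU 0. The helpers below convert between a list of CPU numbers and that mask. The function names are hypothetical, and the sketch only computes the string; writing it to the file requires root and is left out.

```python
def cpus_to_affinity_mask(cpus):
    """Turn CPU numbers into the hex bitmask format used by smp_affinity."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_mask_to_cpus(mask_hex):
    """Decode a mask; large systems print it as comma-separated 32-bit groups."""
    mask = int(mask_hex.replace(",", ""), 16)
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


print(cpus_to_affinity_mask([0, 2]))        # 5
print(cpus_to_affinity_mask([4, 5, 6, 7]))  # f0
print(affinity_mask_to_cpus("f0"))          # [4, 5, 6, 7]
```

Pinning a busy NIC queue to CPUs 4 through 7 would therefore mean writing f0 to that IRQ's smp_affinity file. Note that irqbalance may later overwrite a manual setting unless it is configured to leave that IRQ alone.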