ext4 is used on hundreds of millions of Linux systems. It descended from ext3 (2001), which descended from ext2 (1993), which descended from ext (1992). Understanding ext4 means understanding 30 years of filesystem engineering decisions — each one made to solve a real problem.

The original ext (1992) was a hack. Remy Card wrote it to get past the limits of the MINIX filesystem that early Linux relied on (64MB volumes and 14-character filenames). It had a fixed block size, a flat inode table, and could manage at most 2GB. It handled crashes badly and had no crash recovery. ext2 (1993) fixed the structural problems with a block group layout that is still visible in ext4 today. ext3 (2001) added journaling to survive crashes. ext4 (2008) added extents, 48-bit block numbers, and nanosecond timestamps, making the 30-year-old architecture scale to 1 exabyte.

Every ext4 improvement was driven by a specific production failure. The block count limit of ext2 hit real databases. The journaling overhead of ext3's ordered mode hit real mail servers. The double-indirect pointer overhead hit real video storage systems. To read ext4's feature list is to read a decade of Linux production operations, encoded in on-disk format bits.

On-Disk Layout: Block Groups

ext4 divides a disk into fixed-size block groups. Each block group is self-contained, containing its own metadata and data blocks. This design reduces fragmentation (related data stays close on disk) and improves parallelism (multiple processes can allocate in different block groups simultaneously).

Default block group size: 32768 blocks × 4096 bytes = 128MB.

Structure Within Each Block Group

Block Group N:
+------------------+------------------+------------------+------------------+------------------+------------------+
|   Superblock     |  Group Descriptor|  Block Bitmap    |  Inode Bitmap    |  Inode Table     |  Data Blocks     |
|   (copy 0/1/2)   |  Table           |  (1 block)       |  (1 block)       |  (N blocks)      |  (rest)          |
+------------------+------------------+------------------+------------------+------------------+------------------+

Superblock: Global filesystem metadata. Block size, inode count, block count, UUID, feature flags, last mount time, error count. Only group 0 has the primary. With the sparse_super feature (enabled by default), backup copies are kept only in group 1 and in groups whose number is a power of 3, 5, or 7; the newer sparse_super2 feature reduces this to at most two backup groups.
Group Descriptor: Per-group metadata: location of block bitmap, inode bitmap, inode table, and free counts.
Block Bitmap: One bit per block in this group. 1 = allocated, 0 = free. One 4KB block can track 32768 blocks (128MB).
Inode Bitmap: One bit per inode slot. 1 = inode in use.
Inode Table: Array of struct ext4_inode (256 bytes each in ext4). A 128MB block group with the default ratio of one inode per 16KB of space has 8192 inodes, requiring 512 blocks (2MB) for the inode table.
Data Blocks: The rest of the block group, available for file and directory data.
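
The numbers in this list follow from the block size, and the arithmetic can be checked with a short sketch. The function names are hypothetical, and the defaults (4KB blocks, one inode per 16KB, 256-byte inodes, the sparse_super backup rule) are assumptions rather than values read from a real filesystem.

```python
def block_group_geometry(block_size=4096, bytes_per_inode=16384, inode_size=256):
    """Derive per-group sizes from the block size (assumed defaults, not read from disk)."""
    blocks_per_group = block_size * 8            # one bitmap block, one bit per block
    group_bytes = blocks_per_group * block_size
    inodes_per_group = group_bytes // bytes_per_inode
    inode_table_blocks = inodes_per_group * inode_size // block_size
    return {
        "blocks_per_group": blocks_per_group,
        "group_bytes": group_bytes,
        "inodes_per_group": inodes_per_group,
        "inode_table_blocks": inode_table_blocks,
    }


def has_backup_superblock(group):
    """sparse_super rule: groups 0 and 1, plus powers of 3, 5 and 7."""
    if group in (0, 1):
        return True
    for base in (3, 5, 7):
        power = base
        while power < group:
            power *= base
        if power == group:
            return True
    return False


geo = block_group_geometry()
backup_groups = [g for g in range(100) if has_backup_superblock(g)]
print(geo)
print(backup_groups)
```

On a real filesystem, the same values appear in the output of dumpe2fs as "Blocks per group" and "Inodes per group".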

Flexible Block Groups

ext4's flex_bg feature (enabled by default) merges the metadata of multiple block groups into a single contiguous region. This clusters all bitmaps and inode tables together, improving sequential metadata I/O and reducing the overhead of small file creation.

dumpe2fs /dev/sda1 | grep -i "block group\|first block\|blocks per group"
# Shows the layout parameters
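
The layout parameters that dumpe2fs prints come from the superblock, which starts 1024 bytes into the device. As a rough sketch, a few of its fields can be decoded by hand. The helper names are hypothetical, only a handful of fields are read, and the demo runs on a synthetic buffer rather than a real device.

```python
import struct

EXT4_MAGIC = 0xEF53
SUPERBLOCK_OFFSET = 1024  # the primary superblock starts 1024 bytes into the device


def parse_superblock(raw):
    """Decode a few little-endian fields from a raw superblock buffer."""
    inodes_count, blocks_count_lo = struct.unpack_from("<II", raw, 0)
    (log_block_size,) = struct.unpack_from("<I", raw, 24)
    (blocks_per_group,) = struct.unpack_from("<I", raw, 32)
    (inodes_per_group,) = struct.unpack_from("<I", raw, 40)
    (magic,) = struct.unpack_from("<H", raw, 56)
    if magic != EXT4_MAGIC:
        raise ValueError("not an ext2/3/4 superblock (bad magic)")
    return {
        "inodes_count": inodes_count,
        "blocks_count_lo": blocks_count_lo,
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
    }


def read_superblock(path):
    """Read the primary superblock from an image file or device (needs read access)."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        return parse_superblock(f.read(1024))


# Synthetic superblock describing a single 128MB group (illustrative values only).
demo = bytearray(1024)
struct.pack_into("<II", demo, 0, 8192, 32768)
struct.pack_into("<I", demo, 24, 2)            # 1024 << 2 = 4096-byte blocks
struct.pack_into("<I", demo, 32, 32768)
struct.pack_into("<I", demo, 40, 8192)
struct.pack_into("<H", demo, 56, EXT4_MAGIC)
sb = parse_superblock(bytes(demo))
print(sb)
```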

debugfs /dev/sda1
> stats      # show filesystem statistics
> ls /       # list root directory
> stat <2>   # show inode 2 (root directory inode)

Inodes in ext4

An inode (index node) contains all metadata about a file except its name. The name lives in the directory entry. This separation enables hard links: multiple names (directory entries) pointing to the same inode.
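
The name/inode separation is easy to observe from user space. The sketch below creates a file in a temporary directory, adds a second name with os.link, and checks that both names report the same inode number. The file names are arbitrary, and it assumes the temporary directory lives on a filesystem that supports hard links.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "original.txt")
    alias = os.path.join(d, "alias.txt")

    with open(original, "w") as f:
        f.write("hello\n")

    os.link(original, alias)                 # second directory entry, same inode

    st_a, st_b = os.stat(original), os.stat(alias)
    same_inode = st_a.st_ino == st_b.st_ino
    link_count = st_a.st_nlink               # mirrors i_links_count in the inode

    os.unlink(original)                      # removes one name, not the inode
    with open(alias) as f:
        surviving = f.read()

print(same_inode, link_count, repr(surviving))
```

The data survives the unlink because the inode is only freed when its link count drops to zero.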

ext4 Inode Size: 256 Bytes

ext2 used 128-byte inodes. ext4 expanded to 256 bytes, using the extra 128 bytes for:

Nanosecond timestamps: i_atime_extra, i_mtime_extra, i_ctime_extra are 32-bit fields paired with the 32-bit second fields. The upper 30 bits hold the nanosecond fraction and the lower 2 bits extend the seconds counter to 34 bits. This eliminates the Y2038 problem for timestamps on ext4: a 32-bit second value alone wraps in 2038, while the 34-bit extended value lasts until 2446.
Extended attributes: Inline xattrs (SELinux labels, POSIX ACLs) stored directly in the inode, avoiding extra block reads for common small xattrs.
Version fields: For NFS and inode versioning.
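
To make the timestamp encoding concrete, here is a minimal decoding sketch. It assumes the layout in which the low 2 bits of each *_extra field extend the seconds value and the upper 30 bits carry nanoseconds; the function names are hypothetical.

```python
import struct
from datetime import datetime, timedelta

UNIX_EPOCH = datetime(1970, 1, 1)


def decode_timestamp(seconds_field, extra_field):
    """Combine a 32-bit seconds field with its 32-bit *_extra companion."""
    # The base seconds field is interpreted as a signed 32-bit integer.
    (base,) = struct.unpack("<i", struct.pack("<I", seconds_field))
    epoch_bits = extra_field & 0x3           # low 2 bits extend the seconds
    nanoseconds = extra_field >> 2           # upper 30 bits are the fraction
    return base + (epoch_bits << 32), nanoseconds


def to_datetime(seconds):
    return UNIX_EPOCH + timedelta(seconds=seconds)


# Largest 32-bit-only value versus largest extended value.
old_limit, _ = decode_timestamp(0x7FFFFFFF, 0)
new_limit, new_ns = decode_timestamp(0x7FFFFFFF, (999_999_999 << 2) | 0x3)
print(to_datetime(old_limit).year, to_datetime(new_limit).year, new_ns)
```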

What the Inode Contains

struct ext4_inode (256 bytes):
  i_mode:       16 bits  — file type + permission bits (rwxrwxrwx)
  i_uid/i_gid:  16+16    — owner/group (lower 16 bits; high 16 in i_uid_high)
  i_size_lo:    32 bits  — file size in bytes (lower 32; high 32 in i_size_high)
  i_atime:      32 bits  — last access time (seconds since epoch)
  i_ctime:      32 bits  — inode change time
  i_mtime:      32 bits  — data modification time
  i_dtime:      32 bits  — deletion time
  i_links_count:16 bits  — number of hard links
  i_blocks_lo:  32 bits  — 512-byte block count (legacy)
  i_flags:      32 bits  — EXT4_EXTENTS_FL, EXT4_INLINE_DATA_FL, etc.
  i_block[15]:  60 bytes — block pointers OR extent tree header
  ...extra fields for nanoseconds, version, xattrs...
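
The field list above maps onto fixed byte offsets, so a raw inode can be unpacked with a few struct calls. This is a sketch under assumptions: the offsets follow the classic little-endian ext2/ext4 inode layout, only some fields are decoded, parse_inode is a hypothetical helper, and the demo inode is synthetic.

```python
import stat
import struct

EXT4_EXTENTS_FL = 0x80000  # i_flags bit: i_block holds an extent tree


def parse_inode(raw):
    """Decode selected fields from a raw on-disk inode."""
    (mode, _uid_lo, size_lo, _atime, _ctime, _mtime, _dtime,
     _gid_lo, links, blocks_lo, flags) = struct.unpack_from("<HHIIIIIHHII", raw, 0)
    (size_high,) = struct.unpack_from("<I", raw, 108)
    return {
        "perms": stat.filemode(mode),
        "size": (size_high << 32) | size_lo,      # i_size_high : i_size_lo
        "links": links,
        "bytes_allocated": blocks_lo * 512,       # i_blocks_lo counts 512-byte units
        "uses_extents": bool(flags & EXT4_EXTENTS_FL),
        "i_block": raw[40:100],                   # 60 bytes of pointers or extents
    }


# Synthetic 256-byte inode for a 5 GiB regular file (illustrative values only).
size = 5 * 2**30
raw = bytearray(256)
struct.pack_into("<HHIIIIIHHII", raw, 0,
                 0o100644, 1000, size & 0xFFFFFFFF, 0, 0, 0, 0,
                 1000, 1, size // 512, EXT4_EXTENTS_FL)
struct.pack_into("<I", raw, 108, size >> 32)

info = parse_inode(bytes(raw))
print(info["perms"], info["size"], info["links"], info["uses_extents"])
```

The 5 GiB size is chosen so that the high 32 bits of the size are non-zero, which shows why the lo/high split exists.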

stat /etc/passwd
# Shows: inode number, size, blocks, permissions, all three timestamps
# "Blocks: 8" means 8 × 512-byte blocks = 4KB (one block allocated)

Block Addressing: Extents Replace Indirect Blocks

The ext2/ext3 Indirect Block Problem

ext2 stored file blocks using 15 pointers in the inode. With 4KB blocks and 4-byte block numbers, each pointer block holds 1024 entries:

12 direct pointers (→ data blocks directly)
1 single indirect pointer (→ a block of pointers → data blocks): adds 1024 blocks
1 double indirect (→ block of single-indirect blocks): adds 1024² blocks
1 triple indirect: adds 1024³ blocks

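With 4KB blocks, the 12 direct pointers cover 48KB, the single indirect block covers the next 4MB, and the double indirect tree covers the next 4GB. For a 1GB file, almost every block therefore sits behind the double indirect pointer: reaching data takes 2 extra block reads, and mapping the whole file consumes 257 pointer blocks (about 1MB of metadata). Files larger than about 4GB spill into the triple indirect tree, where each uncached lookup costs 3 extra reads. For a 100GB file with a fully random access pattern, that overhead is catastrophic.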
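
The cost of the indirect scheme can be worked out with a small model. The sketch below assumes 4KB blocks with 1024 pointers per block, and both function names are hypothetical; it counts how many pointer blocks sit between the inode and a given logical block, and how many pointer blocks a fully allocated file needs in total.

```python
def indirection_levels(logical_block, ptrs_per_block=1024, direct=12):
    """Number of pointer blocks read before reaching the data block."""
    n = logical_block
    if n < direct:
        return 0
    n -= direct
    for level in (1, 2, 3):
        span = ptrs_per_block ** level       # blocks reachable at this level
        if n < span:
            return level
        n -= span
    raise ValueError("block is beyond triple-indirect reach")


def pointer_blocks_needed(file_blocks, ptrs_per_block=1024, direct=12):
    """Total indirect blocks needed to map a fully allocated file."""
    remaining = file_blocks - direct
    total = 0
    for level in (1, 2, 3):
        if remaining <= 0:
            break
        covered = min(remaining, ptrs_per_block ** level)
        for tier in range(level):
            # Ceiling division: pointer blocks touched at each tier of the tree.
            total += -(-covered // ptrs_per_block ** (tier + 1))
        remaining -= covered
    return total


blocks_1gb = (1 << 30) // 4096
blocks_100gb = 100 * blocks_1gb
print(indirection_levels(blocks_1gb - 1))     # last block of a 1GB file
print(indirection_levels(blocks_100gb - 1))   # last block of a 100GB file
print(pointer_blocks_needed(blocks_1gb))
```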

ext4 Extents: Contiguous Runs

An extent describes a contiguous run of blocks:

struct ext4_extent {
    __le32 ee_block;    // first logical block number this extent covers
    __le16 ee_len;      // number of blocks in this extent (max 32768 = 128MB)
    __le16 ee_start_hi; // high 16 bits of physical block number
    __le32 ee_start_lo; // low 32 bits of physical block number
};

Inline extents: The 60 bytes of i_block in the inode that previously held indirect block pointers now hold an extent tree header plus up to 4 extent entries. A file whose data fits in 4 contiguous runs has its entire block map stored inline in the inode — zero additional block reads.

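For a 128MB contiguous file, one extent is enough: {ee_block=0, ee_len=32768, ee_start=12345}. The kernel reads this one record and knows exactly where every byte lives. A 1GB contiguous file needs at least 8 extents, because each extent covers at most 128MB; that exceeds the 4 inline slots, so the extents move to a single leaf block, costing one extra block read. Compare that with ext2, which (with 4KB blocks) needs 257 pointer blocks to map the same 1GB file.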
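
As an illustration, the 60-byte i_block area can be decoded as a 12-byte header followed by 12-byte extent records. This sketch assumes the standard extent header layout (magic 0xF30A, entry count, capacity, depth, generation) and handles only the depth-0 case where the inode holds extents directly; the helper names are hypothetical and the demo buffer is synthetic.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
MAX_EXTENT_BLOCKS = 32768      # largest initialized extent, 128MB with 4KB blocks


def parse_inline_extents(i_block):
    """Return [(logical_start, length, physical_start)] from a depth-0 i_block."""
    magic, entries, _max, depth, _gen = struct.unpack_from("<HHHHI", i_block, 0)
    if magic != EXT4_EXT_MAGIC:
        raise ValueError("i_block does not start with an extent header")
    if depth != 0:
        raise ValueError("root holds index entries, not extents")
    extents = []
    for i in range(entries):
        ee_block, ee_len, start_hi, start_lo = struct.unpack_from(
            "<IHHI", i_block, 12 + 12 * i)
        extents.append((ee_block, ee_len, (start_hi << 32) | start_lo))
    return extents


def min_extents(file_bytes, block_size=4096):
    """Fewest extents that can map a fully contiguous file."""
    blocks = -(-file_bytes // block_size)          # ceiling division
    return -(-blocks // MAX_EXTENT_BLOCKS)


# Synthetic i_block: one extent whose physical start needs the high 16 bits.
buf = bytearray(60)
struct.pack_into("<HHHHI", buf, 0, EXT4_EXT_MAGIC, 1, 4, 0, 0)
struct.pack_into("<IHHI", buf, 12, 0, 32768, 1, 12345)
extents = parse_inline_extents(bytes(buf))
print(extents, min_extents(128 << 20), min_extents(1 << 30))
```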

Extent tree for fragmented files: When a file has more than 4 extents, the inode's i_block holds an extent tree root, and additional extent index nodes are allocated from the block group. The tree is a B+ tree — leaf nodes hold extents, interior nodes hold indexes.
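
Within a tree node, entries are sorted by logical block number, so a lookup is a binary search. The following sketch models a single leaf as a sorted list of (logical_start, length, physical_start) tuples; it is a simplified stand-in for the kernel's lookup, not its actual code, and map_block is a hypothetical name.

```python
from bisect import bisect_right


def map_block(extents, logical_block):
    """Map a logical block to a physical block, or None if it falls in a hole."""
    starts = [e[0] for e in extents]
    i = bisect_right(starts, logical_block) - 1   # last extent starting at or before
    if i < 0:
        return None
    ee_block, ee_len, physical_start = extents[i]
    offset = logical_block - ee_block
    return physical_start + offset if offset < ee_len else None


# A fragmented file: three runs with a hole between blocks 150 and 199.
leaf = [(0, 100, 5000), (100, 50, 9000), (200, 10, 7000)]
print(map_block(leaf, 99), map_block(leaf, 100), map_block(leaf, 150), map_block(leaf, 205))
```

Interior index nodes work the same way, except that each entry points at a child node instead of at data blocks.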

Journaling: Surviving Crashes

The Consistency Problem

Without journaling: a crash while writing a new file (allocate inode + update block bitmap + write directory entry + write data) can leave the filesystem in an inconsistent state. The inode is allocated but the directory entry is missing. The bitmap says the block is used but nothing points to it. Without journaling, recovery requires fsck — a full filesystem scan that takes minutes to hours on large filesystems.

The Write-Ahead Journal

ext3/ext4 use a write-ahead journal: before making any metadata change to the filesystem, they write a record of the intended change to the journal (a circular log). After a crash, the journal is replayed, either by the kernel at mount time or by e2fsck. The filesystem then contains either the complete change (if the journal entry was committed) or none of it.
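
The all-or-nothing property comes from the commit record. The toy model below is only a sketch of the idea and not how the kernel's journal is implemented: the class, block names, and values are invented for illustration. Replay applies a transaction's writes only if its commit record reached the log.

```python
class ToyJournal:
    """Minimal write-ahead log: records go to the log first, disk is updated on replay."""

    def __init__(self):
        self.log = []       # append-only journal records
        self.disk = {}      # metadata blocks at their home locations

    def begin(self, txid):
        self.log.append(("begin", txid))

    def write(self, txid, block, value):
        self.log.append(("write", txid, block, value))

    def commit(self, txid):
        self.log.append(("commit", txid))

    def replay(self):
        """Apply writes from committed transactions only, then clear the log."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in self.log:
            if rec[0] == "write" and rec[1] in committed:
                _, _, block, value = rec
                self.disk[block] = value
        self.log.clear()


j = ToyJournal()
j.begin(1)
j.write(1, "inode_bitmap", "inode 12 used")
j.write(1, "dir_entry", "a.txt -> 12")
j.commit(1)

j.begin(2)
j.write(2, "block_bitmap", "block 900 used")
# Simulated crash here: transaction 2 never gets its commit record.

j.replay()
print(j.disk)
```

After replay, the file creation from transaction 1 is fully present and the half-finished allocation from transaction 2 has left no trace.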

The journal in ext4 is normally stored in inode 8. Its size is chosen by mke2fs based on the filesystem size; 128MB is a common value.

Three Journal Modes

# Mount with specific journal mode:
mount -o data=ordered /dev/sda1 /mnt
tune2fs -E mount_opts=data=journal /dev/sda1  # set default

Mode	What Goes in Journal	Crash Safety	Performance	Use Case
writeback	Metadata only; data written whenever	Metadata consistent; data may be stale	Fastest	Trusted local systems, best throughput
ordered (default)	Metadata only; data written BEFORE metadata	Metadata consistent; data consistent if written	Good balance	Default for most Linux deployments
journal	Both metadata AND data	Strongest: both always consistent	Slowest (double writes)	Workloads without their own write-ahead log that need data-level consistency

Ordered mode (default): The kernel ensures that file data is written to disk before the journal commit record for the metadata that references it. For a new file, this means the data blocks are on disk before the metadata pointing at them is committed. On crash, you may lose recent writes, but a newly written file will not expose stale garbage from previously freed blocks. Overwrites of existing data in place are not protected in the same way: they can be partially applied after a crash.

ext4 Feature Comparison

Feature	ext2	ext3	ext4	Description	Benefit
Journaling	No	Yes	Yes (improved)	Write-ahead log for crash recovery	No full fsck after a crash
Extents	No	No	Yes	Contiguous block runs	Less fragmentation, faster large file I/O
48-bit block numbers	No	No	Yes	Supports up to 1 exabyte volumes	Enterprise storage
Large inodes (256B)	No	No	Yes	Extra space for timestamps + xattrs	Nanosecond timestamps, inline xattrs
Flexible block groups	No	No	Yes	Cluster metadata across groups	Faster metadata ops
Persistent preallocation	No	No	Yes	fallocate() reserves space	Video recording, databases
Online defragmentation	No	No	Yes	e4defrag without unmounting	Maintenance without downtime
Directory HTree indexing	No	Optional	Yes	Hash tree for large dirs	O(log n) vs O(n) directory lookup
Nanosecond timestamps	No	No	Yes	crtime (creation time) added	Forensics, accurate build systems
Inline data	No	No	Yes (opt-in)	Files of up to about 60 bytes stored in the inode's i_block	Tiny file performance
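
The persistent preallocation row can be tried from user space. The sketch below uses os.posix_fallocate, Python's wrapper for posix_fallocate(), on a temporary file. It assumes a Unix system whose temporary directory supports preallocation; the size is arbitrary.

```python
import os
import tempfile

SIZE = 1 << 20  # reserve 1MB up front

with tempfile.NamedTemporaryFile() as tmp:
    # Ask the filesystem to allocate blocks for the whole range before writing.
    os.posix_fallocate(tmp.fileno(), 0, SIZE)
    st = os.fstat(tmp.fileno())
    apparent_size = st.st_size
    allocated_bytes = st.st_blocks * 512    # st_blocks is in 512-byte units

print(apparent_size, allocated_bytes)
```

On ext4, the reserved range is backed by extents marked as uninitialized, so later writes do not need to allocate blocks.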

Practical ext4 Diagnostics

# Filesystem-level info:
dumpe2fs /dev/sda1 | less              # complete on-disk layout info
tune2fs -l /dev/sda1                  # summary of superblock fields

# Inode-level exploration:
debugfs /dev/sda1                     # interactive debugger
> stat <1234>                         # show inode 1234 fields
> extents <1234>                      # show extent tree for inode 1234
> bmap <1234> 0                       # show physical block for logical block 0 of inode 1234

# Fragmentation analysis:
e4defrag -c /home                     # check fragmentation without defragging
filefrag -v /var/log/syslog           # show extents for a specific file

# Block group statistics:
dumpe2fs /dev/sda1 | grep "Block group" | head -20

Key Takeaways

ext4's design is the result of three decades of iterative problem-solving. Block groups solve physical locality. Extents solve metadata overhead for large files. Journaling solves crash consistency. The 256-byte inode solves timestamp precision and inline attribute storage. Each feature was added because the previous design hit a real wall in production.

The journaling mode decision has real consequences: ordered mode (the default) is the right choice for almost all workloads. Writeback mode offers a measurable throughput improvement but can leave stale or garbage data in recently written files after a crash. Journal mode is rarely justified because databases, the workloads that most need write ordering guarantees, typically manage their own consistency via fsync() and internal WAL mechanisms, and often bypass the page cache with direct I/O anyway.

Understanding ext4 at this level matters when you are diagnosing slow file creation (inode table layout, block group saturation), recovering from a corrupted filesystem (knowing which structures to reconstruct), or choosing filesystem parameters for a new deployment (inode density, block size, journal size). The tools are there: debugfs lets you walk the on-disk structures interactively, and dumpe2fs exposes every superblock and group descriptor field. The filesystem is not a black box — it is a precisely specified on-disk data structure that you can read and reason about.
