AiTechWorlds
ext4 is used on hundreds of millions of Linux systems. It descended from ext3 (2001), which descended from ext2 (1993), which descended from ext (1992). Understanding ext4 means understanding 30 years of filesystem engineering decisions — each one made to solve a real problem.
The original ext (1992) was a hack. Rémy Card wrote it in a few weeks to give Linux a writable filesystem faster than MINIX. It had a fixed block size, a flat inode table, and could manage at most 2GB. It crashed badly and had no crash recovery. ext2 (1993) fixed the structural problems with a block group layout that is still visible in ext4 today. ext3 (2001) added journaling to survive crashes. ext4 (2008) added extents, 48-bit block numbers, and nanosecond timestamps, letting a layout that dates back to 1993 scale to 1 exabyte.
Every ext4 improvement was driven by a specific production failure. The block count limit of ext2 hit real databases. The journaling overhead of ext3's ordered mode hit real mail servers. The double-indirect pointer overhead hit real video storage systems. To read ext4's feature list is to read a decade of Linux production operations, encoded in on-disk format bits.
ext4 divides a disk into fixed-size block groups. Each block group is self-contained, containing its own metadata and data blocks. This design reduces fragmentation (related data stays close on disk) and improves parallelism (multiple processes can allocate in different block groups simultaneously).
Default block group size: 32768 blocks × 4096 bytes = 128MB.
Block Group N:
+--------------+------------------+--------------+--------------+-------------+-------------+
| Superblock   | Group Descriptor | Block Bitmap | Inode Bitmap | Inode Table | Data Blocks |
| (copy 0/1/2) | Table            | (1 block)    | (1 block)    | (N blocks)  | (rest)      |
+--------------+------------------+--------------+--------------+-------------+-------------+
Superblock: the primary copy lives in block group 0; only some of the other groups (at most two when the sparse_super2 feature is enabled) hold backup copies at sparse intervals.
Inode Table: an array of struct ext4_inode (256 bytes each in ext4). A 128MB block group with the default ratio of one inode per 16KB has 8192 inodes, requiring 512 blocks (2MB) for the inode table.
ext4's flex_bg feature (enabled by default) merges the metadata of multiple block groups into a single contiguous region. This clusters all bitmaps and inode tables together, improving sequential metadata I/O and reducing the overhead of small file creation.
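The block group arithmetic can be checked with a few lines of Python. This is a minimal sketch, not a reading of any real device: the function name is hypothetical, and the `bytes_per_inode=16384` default is an assumption matching the usual mke2fs inode ratio (confirm your own values with dumpe2fs).

```python
def block_group_geometry(block_size=4096, bytes_per_inode=16384, inode_size=256):
    """Derive per-group sizes from the one-block-bitmap constraint."""
    # The block bitmap is a single block, and each bit tracks one block,
    # so a group can contain at most block_size * 8 blocks.
    blocks_per_group = block_size * 8
    group_bytes = blocks_per_group * block_size
    # Assumption: one inode is created per `bytes_per_inode` bytes of space.
    inodes_per_group = group_bytes // bytes_per_inode
    # The inode table is a packed array of fixed-size inodes.
    inode_table_blocks = inodes_per_group * inode_size // block_size
    return {
        "blocks_per_group": blocks_per_group,
        "group_bytes": group_bytes,
        "inodes_per_group": inodes_per_group,
        "inode_table_blocks": inode_table_blocks,
    }


if __name__ == "__main__":
    for name, value in block_group_geometry().items():
        print(f"{name}: {value}")
```

With 4096-byte blocks this gives 32768 blocks (128MB) per group, 8192 inodes per group, and a 512-block inode table.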
dumpe2fs /dev/sda1 | grep -i "block group\|first block\|blocks per group"
# Shows the layout parameters
debugfs /dev/sda1
> stats # show filesystem statistics
> ls / # list root directory
> stat <2> # show inode 2 (root directory inode)
An inode (index node) contains all metadata about a file except its name. The name lives in the directory entry. This separation enables hard links: multiple names (directory entries) pointing to the same inode.
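The name/inode separation is easy to observe from user space. The sketch below is illustrative only (the file names are made up, and it runs in a temporary directory): it creates a file, adds a second name with `os.link`, and shows that both names resolve to one inode whose link count is 2.

```python
import os
import tempfile


def hard_link_demo():
    """Return (same_inode, link_count, content_after_unlink)."""
    with tempfile.TemporaryDirectory() as workdir:
        original = os.path.join(workdir, "original.txt")  # hypothetical name
        alias = os.path.join(workdir, "alias.txt")        # hypothetical name
        with open(original, "w") as f:
            f.write("hello")
        # A hard link is just a second directory entry for the same inode.
        os.link(original, alias)
        st_original, st_alias = os.stat(original), os.stat(alias)
        same_inode = st_original.st_ino == st_alias.st_ino
        link_count = st_original.st_nlink
        # Removing one name leaves the inode alive while another name exists.
        os.unlink(original)
        with open(alias) as f:
            content = f.read()
        return same_inode, link_count, content


if __name__ == "__main__":
    print(hard_link_demo())
```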
ext2 used 128-byte inodes. ext4 expanded to 256 bytes, using the extra 128 bytes for:
i_atime_extra, i_mtime_extra, i_ctime_extra: 32-bit fields that extend the 32-bit second fields. The upper 30 bits hold the nanosecond fraction and the lower 2 bits extend the seconds count. This pushes the Y2038 problem far into the future for timestamps on ext4 (a 32-bit second value wraps in 2038, but the 34-bit extended value does not wrap until 2446).
struct ext4_inode (256 bytes):
i_mode:         16 bits   file type + permission bits (rwxrwxrwx)
i_uid/i_gid:    16+16     owner/group (lower 16 bits; high 16 in i_uid_high/i_gid_high)
i_size_lo:      32 bits   file size in bytes (lower 32; high 32 in i_size_high)
i_atime:        32 bits   last access time (seconds since epoch)
i_ctime:        32 bits   inode change time
i_mtime:        32 bits   data modification time
i_dtime:        32 bits   deletion time
i_links_count:  16 bits   number of hard links
i_blocks_lo:    32 bits   512-byte block count (legacy)
i_flags:        32 bits   EXT4_EXTENTS_FL, EXT4_INLINE_DATA_FL, etc.
i_block[15]:    60 bytes  block pointers OR extent tree header
...extra fields for nanoseconds, version, xattrs...
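The split between the legacy seconds field and the `*_extra` field can be decoded in a few lines. This is a sketch under stated assumptions: the low 2 bits of the extra field extend the signed 32-bit seconds value and the upper 30 bits hold nanoseconds, as the kernel's ext4 documentation describes. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH_BITS = 2                      # low bits of *_extra extend the seconds
EPOCH_MASK = (1 << EPOCH_BITS) - 1


def decode_ext4_time(seconds_raw, extra):
    """Combine a raw 32-bit seconds field with its 32-bit extra field."""
    # The legacy field is a signed 32-bit integer stored as raw bits.
    seconds = seconds_raw - (1 << 32) if seconds_raw & 0x80000000 else seconds_raw
    # The two epoch bits are added on top as bits 32 and 33.
    seconds += (extra & EPOCH_MASK) << 32
    nanoseconds = extra >> EPOCH_BITS
    return seconds, nanoseconds


def last_representable_year():
    """Year of the largest timestamp the 34-bit encoding can hold."""
    max_seconds, _ = decode_ext4_time(0x7FFFFFFF, EPOCH_MASK)
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return (epoch + timedelta(seconds=max_seconds)).year


if __name__ == "__main__":
    print(decode_ext4_time(0x80000000, 1))  # first second past the 2038 wrap
    print(last_representable_year())
```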
stat /etc/passwd
# Shows: inode number, size, blocks, permissions, all three timestamps
# "Blocks: 8" means 8 × 512-byte blocks = 4KB (one block allocated)
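ext2 stored file blocks using 15 pointers in the inode: 12 direct pointers, one single-indirect pointer, one double-indirect pointer, and one triple-indirect pointer.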
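For a 1GB file with 4KB blocks (1024 pointers per indirect block), almost all data sits behind the double-indirect pointer, so a cold lookup reads 2 extra blocks before reaching data; only files larger than about 4GB reach the triple-indirect chain and its 3 extra reads. For a 100GB file with a fully random access pattern: catastrophic overhead.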
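To see where the indirect levels begin, here is a minimal sketch of the classic mapping: 12 direct pointers followed by single-, double-, and triple-indirect pointers, with 4-byte block pointers. The function name and defaults are illustrative assumptions, not kernel code.

```python
DIRECT_POINTERS = 12


def indirection_depth(logical_block, block_size=4096, pointer_size=4):
    """Return how many indirect blocks must be read to map a logical block."""
    pointers_per_block = block_size // pointer_size
    if logical_block < DIRECT_POINTERS:
        return 0
    remaining = logical_block - DIRECT_POINTERS
    for depth in (1, 2, 3):
        # A depth-d indirect tree maps pointers_per_block ** d data blocks.
        span = pointers_per_block ** depth
        if remaining < span:
            return depth
        remaining -= span
    raise ValueError("logical block is beyond the triple-indirect range")


if __name__ == "__main__":
    last_block_of_1gib = 2**30 // 4096 - 1
    print(indirection_depth(last_block_of_1gib))
```

With 4KB blocks the last block of a 1GB file sits at depth 2; the same file on 1KB blocks reaches depth 3.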
An extent describes a contiguous run of blocks:
struct ext4_extent {
__le32 ee_block; // first logical block number this extent covers
__le16 ee_len; // number of blocks in this extent (max 32768 = 128MB)
__le16 ee_start_hi; // high 16 bits of physical block number
__le32 ee_start_lo; // low 32 bits of physical block number
};
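A 12-byte extent record like the one above can be decoded with Python's `struct` module. This is a sketch for illustration: the dictionary keys are made-up names, and it ignores the convention that an `ee_len` above 32768 marks an unwritten (preallocated) extent.

```python
import struct

# Little-endian: ee_block (u32), ee_len (u16), ee_start_hi (u16), ee_start_lo (u32)
EXTENT_FORMAT = "<IHHI"
EXTENT_SIZE = struct.calcsize(EXTENT_FORMAT)  # 12 bytes


def parse_extent(raw):
    """Decode one on-disk extent record into logical/physical block numbers."""
    ee_block, ee_len, start_hi, start_lo = struct.unpack(EXTENT_FORMAT, raw)
    # The 48-bit physical block number is split across two fields.
    physical_start = (start_hi << 32) | start_lo
    return {
        "logical_start": ee_block,
        "length": ee_len,
        "physical_start": physical_start,
    }


if __name__ == "__main__":
    sample = struct.pack(EXTENT_FORMAT, 0, 32768, 0x0001, 0x00000002)
    print(parse_extent(sample))
```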
Inline extents: The 60 bytes of i_block in the inode that previously held indirect block pointers now hold an extent tree header plus up to 4 extent entries. A file whose data fits in 4 contiguous runs has its entire block map stored inline in the inode — zero additional block reads.
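For a 128MB sequential file: one extent: {ee_block=0, ee_len=32768, ee_start=12345}. The kernel reads this one extent record and knows exactly where every byte lives. A contiguous 1GB file needs at least 8 such extents (each is capped at 128MB with 4KB blocks), which spills past the 4 inline slots into a single leaf block. Compare that with ext2, which needs hundreds of indirect blocks to map the same file.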
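The smallest number of extents a perfectly contiguous file can use follows from the 32768-block cap on `ee_len`. A minimal sketch, with hypothetical function names and a 4KB block size assumed:

```python
MAX_EXTENT_BLOCKS = 32768   # cap on ee_len for an initialized extent
INLINE_EXTENTS = 4          # extent slots available inside i_block


def min_extents(file_bytes, block_size=4096):
    """Lower bound on extents for a fully contiguous file."""
    blocks = -(-file_bytes // block_size)        # ceiling division
    return -(-blocks // MAX_EXTENT_BLOCKS)


def fits_inline(file_bytes, block_size=4096):
    """True if the best-case extent map fits in the inode itself."""
    return min_extents(file_bytes, block_size) <= INLINE_EXTENTS


if __name__ == "__main__":
    for size_mib in (128, 512, 1024):
        size = size_mib * 2**20
        print(size_mib, min_extents(size), fits_inline(size))
```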
Extent tree for fragmented files: When a file has more than 4 extents, the inode's i_block holds an extent tree root, and additional extent index nodes are allocated from the block group. The tree is a B+ tree — leaf nodes hold extents, interior nodes hold indexes.
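How far the tree scales can be estimated from the record sizes: a 12-byte header followed by 12-byte entries, with 4 entries in the inode root. The sketch below assumes those sizes and every node completely full, so it gives an upper bound rather than what a real file would reach.

```python
HEADER_BYTES = 12   # extent tree header at the start of every node
ENTRY_BYTES = 12    # one extent or one index entry
ROOT_ENTRIES = 4    # (60 - 12) // 12 entries fit in i_block


def entries_per_block(block_size=4096):
    """Entries that fit in one on-disk tree node."""
    return (block_size - HEADER_BYTES) // ENTRY_BYTES


def max_extents(depth, block_size=4096):
    """Upper bound on extents for a tree with `depth` levels below the root."""
    return ROOT_ENTRIES * entries_per_block(block_size) ** depth


if __name__ == "__main__":
    for depth in range(3):
        print(depth, max_extents(depth))
```

A depth of 0 is the inline case (4 extents); one level of leaf blocks already allows 1360 extents.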
Without journaling: a crash while writing a new file (allocate inode + update block bitmap + write directory entry + write data) can leave the filesystem in an inconsistent state. The inode is allocated but the directory entry is missing. The bitmap says the block is used but nothing points to it. Without journaling, recovery requires fsck — a full filesystem scan that takes minutes to hours on large filesystems.
ext3/ext4 uses a write-ahead journal: before making any metadata change to the filesystem, it writes a record of the intended change to the journal (a circular log). After a crash, the journal is replayed (by the kernel at mount time, or by e2fsck): the filesystem ends up with either the complete change (if the journal entry was committed) or none of it.
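The all-or-nothing property comes from the commit record. The toy model below is not how jbd2 is implemented; it is a deliberately simplified sketch (class and key names are invented) in which updates are logged first and only transactions with a commit record are applied on replay.

```python
class ToyJournaledDisk:
    """Toy write-ahead log: updates reach `disk` only via committed transactions."""

    def __init__(self):
        self.disk = {}       # stands in for the filesystem's home locations
        self.journal = []    # list of (txn_id, kind, payload) records

    def log_transaction(self, txn_id, updates, commit=True):
        # Log every intended change before touching the home locations.
        for key, value in updates.items():
            self.journal.append((txn_id, "update", (key, value)))
        # commit=False simulates a crash before the commit record is written.
        if commit:
            self.journal.append((txn_id, "commit", None))

    def replay(self):
        # Only transactions with a commit record are applied; the rest are dropped.
        committed = {txn for txn, kind, _ in self.journal if kind == "commit"}
        for txn, kind, payload in self.journal:
            if kind == "update" and txn in committed:
                key, value = payload
                self.disk[key] = value
        self.journal.clear()


if __name__ == "__main__":
    d = ToyJournaledDisk()
    d.log_transaction(1, {"inode_12": "allocated", "dirent_foo": 12})
    d.log_transaction(2, {"inode_13": "allocated"}, commit=False)
    d.replay()
    print(d.disk)
```

After replay the disk holds both changes from transaction 1 and nothing from transaction 2, which never committed.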
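The journal in ext4 is typically inode 8, a hidden file occupying a contiguous region. mke2fs sizes it from the filesystem size: 128MB is a common default, and recent e2fsprogs versions choose larger journals on big volumes.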
# Mount with specific journal mode:
mount -o data=ordered /dev/sda1 /mnt
tune2fs -E mount_opts=data=journal /dev/sda1 # set default
| Mode | What Goes in Journal | Crash Safety | Performance | Use Case |
|---|---|---|---|---|
| writeback | Metadata only; data written whenever | Metadata consistent; data may be stale | Fastest | Trusted local systems, best throughput |
| ordered (default) | Metadata only; data written BEFORE metadata | Metadata consistent; data consistent if written | Good balance | Default for most Linux deployments |
| journal | Both metadata AND data | Strongest: both always consistent | Slowest (double writes) | Databases needing strict consistency |
Ordered mode (default): The kernel ensures that file data is written to disk before the journal commit record for the metadata that references it. This means a new file's data blocks are on disk before the metadata pointing at them is committed. On a crash you may lose recent writes, but a newly written file will not expose stale garbage from previously freed blocks: either its data is there, or the metadata that would reference it was never committed.
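Because ordered mode can still lose recent writes, applications that need a file to survive a crash usually ask for durability explicitly. The following is a common write-then-rename pattern, sketched with a hypothetical helper name and a `.tmp` suffix chosen only for illustration. It is not specific to ext4.

```python
import os


def durable_replace(path, data):
    """Replace `path` with `data` so a crash leaves the old or the new file."""
    tmp_path = path + ".tmp"  # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        # Force the file's data and metadata to stable storage.
        os.fsync(f.fileno())
    # rename() within one filesystem swaps the directory entry atomically.
    os.replace(tmp_path, path)
    # fsync the directory so the new directory entry is durable too.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Calling `durable_replace("/tmp/example.conf", b"key=value\n")` leaves either the previous contents or the new ones at that path, never a partially written file.
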
| Feature | ext2 | ext3 | ext4 | Description | Benefit |
|---|---|---|---|---|---|
| Journaling | No | Yes | Yes (improved) | Write-ahead log for crash recovery | No full fsck after an unclean shutdown |
| Extents | No | No | Yes | Contiguous block runs | Less fragmentation, faster large file I/O |
| 48-bit block numbers | No | No | Yes | Supports up to 1 exabyte volumes | Enterprise storage |
| Large inodes (256B) | No | No | Yes | Extra space for timestamps + xattrs | Nanosecond timestamps, inline xattrs |
| Flexible block groups | No | No | Yes | Cluster metadata across groups | Faster metadata ops |
| Persistent preallocation | No | No | Yes | fallocate() reserves space | Video recording, databases |
| Online defragmentation | No | No | Yes | e4defrag without unmounting | Maintenance without downtime |
| Directory HTree indexing | No | Optional | Yes | Hash tree for large dirs | O(log n) vs O(n) directory lookup |
| Nanosecond timestamps | No | No | Yes | crtime (creation time) added | Forensics, accurate build systems |
| Inline data | No | No | Yes | Files up to about 60B stored in the inode (more when xattr space is free) | Tiny file performance |
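
The persistent preallocation row can be exercised from user space. This sketch uses `os.posix_fallocate`, the POSIX wrapper available in Python on Unix. The helper name is hypothetical, and whether the space is reserved as unwritten extents depends on the underlying filesystem.

```python
import os


def preallocate(path, size_bytes):
    """Reserve `size_bytes` for `path` up front and return the resulting file size."""
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        # Ask the filesystem to allocate blocks for the range [0, size_bytes).
        os.posix_fallocate(fd, 0, size_bytes)
    finally:
        os.close(fd)
    return os.stat(path).st_size
```

On ext4, running `filefrag -v` on the resulting file is one way to check how the reserved range was laid out.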
# Filesystem-level info:
dumpe2fs /dev/sda1 | less # complete on-disk layout info
tune2fs -l /dev/sda1 # summary of superblock fields
# Inode-level exploration:
debugfs /dev/sda1 # interactive debugger
> stat <1234> # show inode 1234 fields
> extents <1234> # show extent tree for inode 1234
> bmap <1234> 0 # show physical block for logical block 0 of inode 1234
# Fragmentation analysis:
e4defrag -c /home # check fragmentation without defragging
filefrag -v /var/log/syslog # show extents for a specific file
# Block group statistics:
dumpe2fs /dev/sda1 | grep "Block group" | head -20
ext4's design is the result of three decades of iterative problem-solving. Block groups solve physical locality. Extents solve metadata overhead for large files. Journaling solves crash consistency. The 256-byte inode solves timestamp precision and inline attribute storage. Each feature was added because the previous design hit a real wall in production.
The journaling mode decision has real consequences: ordered mode (the default) is the right choice for almost all workloads. Writeback mode can improve throughput, but after a crash recently written files may contain stale data. Journal mode is rarely justified because databases, the workloads that most need write ordering guarantees, typically manage their own consistency via fsync() (or O_DIRECT) and internal WAL mechanisms anyway.
Understanding ext4 at this level matters when you are diagnosing slow file creation (inode table layout, block group saturation), recovering from a corrupted filesystem (knowing which structures to reconstruct), or choosing filesystem parameters for a new deployment (inode density, block size, journal size). The tools are there: debugfs lets you walk the on-disk structures interactively, and dumpe2fs exposes every superblock and group descriptor field. The filesystem is not a black box — it is a precisely specified on-disk data structure that you can read and reason about.