AiTechWorlds
Your Linux system might have ext4 on the SSD, NTFS on a USB drive, tmpfs in /tmp, procfs in /proc, and NFS over the network — all accessible with the same open(), read(), write() calls. How? The Virtual Filesystem Switch (VFS).
In 1984, Sun Microsystems needed to mount NFS network filesystems on the same system as local Unix filesystems. Their engineers added an indirection layer between the system call interface and the individual filesystem implementations. Every filesystem operation would go through a set of function pointers — a virtual dispatch table — and each filesystem would implement the operations differently. The VFS was born.
Linux adopted and extended this design. Today, the Linux VFS is one of the most elegant pieces of systems engineering in any open-source project: a clean abstraction maintained across dozens of filesystem implementations, from the ancient ext2 (1993) to more recent designs such as btrfs and overlayfs. The Unix philosophy "everything is a file" is not just a metaphor. It is implemented in the VFS, which makes /proc/cpuinfo and /dev/sda addressable with the same open(), read(), and write() calls, and lets a connected socket be read and written through an ordinary file descriptor.
The VFS defines four core objects. Every filesystem must implement operations on these objects. The kernel interacts with all filesystems through these interfaces.
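To make the idea of a dispatch table concrete, here is a minimal user-space sketch in Python. It is not kernel code: FileOperations, diskfs_read, procfs_read, MOUNTS, and vfs_read are hypothetical names invented for this illustration, and only absolute paths are handled.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class FileOperations:
    """Toy stand-in for a table of function pointers (read only, for brevity)."""
    read: Callable[[str], bytes]


def diskfs_read(path: str) -> bytes:
    # Hypothetical on-disk filesystem: pretend the bytes came from a block device.
    return b"bytes stored on disk for " + path.encode()


def procfs_read(path: str) -> bytes:
    # Hypothetical pseudo-filesystem: the content is generated when it is read.
    return b"generated on demand for " + path.encode()


# Toy mount table: mount point -> operations of the filesystem mounted there.
MOUNTS: Dict[str, FileOperations] = {
    "/": FileOperations(read=diskfs_read),
    "/proc": FileOperations(read=procfs_read),
}


def vfs_read(path: str) -> bytes:
    """Find the longest mount point containing `path`, then dispatch to it."""
    candidates = [
        m for m in MOUNTS
        if path == m or path.startswith(m.rstrip("/") + "/")
    ]
    mount_point = max(candidates, key=len)
    return MOUNTS[mount_point].read(path)
```

The caller of vfs_read() never names a filesystem. The choice is made by looking up the operations table, which is the same shape of indirection the VFS uses.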
A struct super_block represents one mounted filesystem instance. It is the root of all metadata for that mount.
Key fields:
- s_blocksize: block size in bytes (4096 for most ext4/xfs filesystems)
- s_op: pointer to struct super_operations (sync_fs, statfs, put_super, and others)
- s_root: pointer to the root dentry of this mount
- s_type: pointer to struct file_system_type (the driver)

When you run mount /dev/sda1 /mnt/data, the kernel calls the filesystem's fill_super() function, which populates the super_block by reading the on-disk superblock. The super_block persists in memory for the lifetime of the mount.
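Some superblock-level information is visible from user space. The sketch below uses Python's os.statvfs(), which wraps the statvfs/statfs system call that the kernel answers through the filesystem's statfs operation. The helper name describe_mount and the dictionary keys are choices made for this example.

```python
import os


def describe_mount(path: str = "/") -> dict:
    """Report per-filesystem figures for the mount that contains `path`."""
    st = os.statvfs(path)
    return {
        "block_size": st.f_bsize,                  # preferred I/O block size
        "fragment_size": st.f_frsize,              # unit used by the block counts
        "total_bytes": st.f_frsize * st.f_blocks,
        "free_bytes": st.f_frsize * st.f_bfree,
        "total_inodes": st.f_files,
        "max_name_length": st.f_namemax,
    }
```

Calling describe_mount("/") and describe_mount("/proc") on the same machine usually returns very different figures, because each path belongs to a different mounted filesystem.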
A struct inode represents one file, directory, symlink, or special file (device node, socket, FIFO). It is the kernel's in-memory view of the file's metadata.
Key fields:
- i_ino: inode number (unique within a filesystem)
- i_mode: file type + permissions (e.g., 0100644 = regular file, rw-r--r--)
- i_uid, i_gid: owner and group
- i_size: file size in bytes
- i_atime, i_mtime, i_ctime: access, modification, and change timestamps (nanosecond resolution where the filesystem stores it)
- i_nlink: number of hard links
- i_op: pointer to struct inode_operations (create, link, unlink, mkdir, lookup, rename, and others)
- i_fop: pointer to struct file_operations for files of this type
- i_mapping: pointer to struct address_space, which connects the inode to the page cache

The inode is loaded from disk on demand and kept in the inode cache, where iget_locked() finds or creates an entry and iput() drops a reference. Unused inodes sit on an LRU list and are reclaimed under memory pressure, so inodes for popular files tend to stay cached for as long as memory allows.
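Most of these fields can be observed from user space through stat(). The following sketch creates a file in a temporary directory, adds a hard link, and reads the metadata back. The helper names inode_summary and hard_link_demo are made up for this example, and it assumes the temporary directory lives on a filesystem that supports hard links.

```python
import os
import stat
import tempfile


def inode_summary(path: str) -> dict:
    """Return the user-visible counterparts of a few inode fields."""
    st = os.lstat(path)
    return {
        "ino": st.st_ino,                        # i_ino
        "mode": stat.filemode(st.st_mode),       # i_mode, rendered like ls -l
        "is_regular": stat.S_ISREG(st.st_mode),
        "nlink": st.st_nlink,                    # i_nlink
        "size": st.st_size,                      # i_size
        "mtime_ns": st.st_mtime_ns,              # i_mtime in nanoseconds
    }


def hard_link_demo() -> tuple:
    """Create one file with two names and summarize both names."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "file.txt")
        alias = os.path.join(tmp, "alias.txt")
        with open(original, "w") as fh:
            fh.write("hello")
        os.chmod(original, 0o644)
        os.link(original, alias)                 # second name, same inode
        return inode_summary(original), inode_summary(alias)
```

Both summaries report the same inode number and a link count of 2, because the two names refer to a single inode.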
A struct dentry (directory entry) maps a file name to an inode. The separation of dentry from inode is deliberate: one inode can have multiple dentries (hard links), and directory traversal produces a chain of dentries without touching inode data until necessary.
Key fields:
- d_name: the filename component (e.g., "file.txt")
- d_inode: pointer to the inode this name resolves to (NULL for a negative dentry, meaning the name does not exist)
- d_parent: pointer to parent dentry
- d_op: pointer to struct dentry_operations (d_compare for case-sensitivity, d_hash)

The dcache (dentry cache) is a global hash table of recently resolved name-to-inode mappings. Path lookup is the most common VFS operation, and the dcache makes it fast: instead of re-reading directory blocks from disk for every open(), the kernel checks the dcache first. On a warm production server, the dcache holds millions of entries and satisfies almost all path lookups without touching disk.
cat /proc/sys/fs/dentry-state
# Fields: nr_dentry, nr_unused, age_limit, want_pages, nr_negative, dummy
# Typical: 5000000+ dentries on busy servers
sysctl fs.dentry-state # read-only; prints the same values
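For monitoring scripts it is handy to turn that line of numbers into named values. This is a small sketch, assuming the six-field layout used by recent kernels (older kernels report the fifth field as a dummy value). The function names are invented for this example.

```python
from typing import Dict

# Field order assumed from the kernel's sysctl documentation for fs.dentry-state.
FIELDS = ("nr_dentry", "nr_unused", "age_limit",
          "want_pages", "nr_negative", "dummy")


def parse_dentry_state(text: str) -> Dict[str, int]:
    """Map the whitespace-separated integers onto their field names."""
    values = [int(v) for v in text.split()]
    return dict(zip(FIELDS, values))


def read_dentry_state(path: str = "/proc/sys/fs/dentry-state") -> Dict[str, int]:
    """Read and parse the live counters (Linux only)."""
    with open(path) as fh:
        return parse_dentry_state(fh.read())
```

With a made-up input such as "5012345 4800000 45 0 120000 0", the parser reports nr_dentry as 5012345 and nr_unused as 4800000.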
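A struct file represents an open file instance. Unlike inodes (one per file on disk) and dentries (one per name), a struct file is created by each open() call. File descriptors produced by dup() or inherited across fork() refer to the same struct file and therefore share its position and flags.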
Key fields:
- f_pos: current file position (seek pointer)
- f_flags: O_RDONLY, O_WRONLY, O_NONBLOCK, etc.
- f_op: pointer to struct file_operations (read/read_iter, write/write_iter, llseek, mmap, unlocked_ioctl, poll, fsync)
- f_inode: back-pointer to the inode
- f_path: the dentry + mount point that resolved to this file

When read() is called, the VFS looks up the struct file for the given fd (via the process's files_struct → fd table), then calls the read method found in f_op. For ext4, this calls the ext4 read implementation; for a socket, it reads from the socket buffer; for /proc/cpuinfo, it runs a kernel function that generates the text on demand.
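The file position is a property of the open file instance, and this can be checked from user space. In the sketch below, a descriptor made with dup() shares its position with the original, while a second open() of the same path gets its own. The function name offset_sharing_demo is made up for this example.

```python
import os
import tempfile


def offset_sharing_demo() -> dict:
    """Compare file positions across dup() and a separate open()."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        with open(path, "wb") as fh:
            fh.write(b"0123456789")

        fd_a = os.open(path, os.O_RDONLY)
        fd_dup = os.dup(fd_a)                  # same open file instance as fd_a
        fd_b = os.open(path, os.O_RDONLY)      # independent open file instance
        try:
            os.read(fd_a, 4)                   # moves the shared position to 4
            return {
                "fd_a": os.lseek(fd_a, 0, os.SEEK_CUR),
                "fd_dup": os.lseek(fd_dup, 0, os.SEEK_CUR),
                "fd_b": os.lseek(fd_b, 0, os.SEEK_CUR),
            }
        finally:
            for fd in (fd_a, fd_dup, fd_b):
                os.close(fd)
```

The result is a position of 4 for both fd_a and fd_dup and 0 for fd_b, even though all three descriptors refer to the same inode.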
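For open("/home/user/file.txt", O_RDONLY), the kernel resolves the path one component at a time:
1. Start at the root dentry of the process.
2. Look up "home" in the dcache under the root. On a miss, call the directory inode's lookup operation, which reads the directory from disk.
3. Repeat for "user" and then for "file.txt".
4. Allocate a struct file that points at the final dentry and inode, take f_op from the inode's i_fop, and return a file descriptor.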
Each dcache hit avoids one directory block read from disk. On a warm system with hundreds of open() calls per second to the same directory tree, the dcache turns what would be multiple disk reads per open() into a sub-microsecond hash table lookup.
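The effect of the cache on repeated lookups can be illustrated with a toy model. This is a simplified sketch, not the kernel's algorithm: the directory tree, inode numbers, and class name are invented, the cache is a plain dictionary keyed by (parent inode, name), and negative entries and eviction are left out.

```python
from typing import Dict, Tuple

# Toy "on-disk" directories: directory inode number -> {name: child inode number}.
DISK_DIRS: Dict[int, Dict[str, int]] = {
    2: {"home": 10},
    10: {"user": 11},
    11: {"file.txt": 12},
}
ROOT_INO = 2


class ToyDcache:
    """Caches (parent inode, name) -> child inode and counts hits and misses."""

    def __init__(self) -> None:
        self.cache: Dict[Tuple[int, str], int] = {}
        self.hits = 0
        self.disk_reads = 0

    def lookup(self, parent_ino: int, name: str) -> int:
        key = (parent_ino, name)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        # Cache miss: simulate reading the directory block from disk.
        self.disk_reads += 1
        child = DISK_DIRS[parent_ino][name]   # KeyError if the name is missing
        self.cache[key] = child
        return child

    def resolve(self, path: str) -> int:
        """Walk an absolute path one component at a time."""
        ino = ROOT_INO
        for component in path.strip("/").split("/"):
            ino = self.lookup(ino, component)
        return ino
```

Resolving /home/user/file.txt once costs three simulated disk reads. Resolving it a second time costs none and records three cache hits.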
Each process has a mount namespace — its own view of which filesystems are mounted where. Processes in different mount namespaces can see different filesystem trees simultaneously.
ls -l /proc/self/ns/mnt # symbolic link to this process's mount namespace
readlink /proc/self/ns/mnt
# mnt:[4026531840] — the namespace inode number
# Create new mount namespace:
unshare --mount bash # new shell with private mount namespace
Docker containers run in separate mount namespaces. Each container gets its own /proc (procfs), /sys (sysfs), and /dev (typically a tmpfs) mounts. From the host, cat /proc/<container-pid>/mounts lists the mounts visible in the container's mount namespace.
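A script can compare namespace identities by reading the same symbolic links. The sketch below parses the mnt:[number] link format shown above. The function names are made up for this example, and reading another process's link may require matching ownership or extra privileges.

```python
import os
import re
from typing import Optional, Tuple

_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> Tuple[str, int]:
    """Split a link target such as 'mnt:[4026531840]' into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def mount_namespace_of(pid: str = "self") -> Optional[Tuple[str, int]]:
    """Return the mount namespace identity of a process, or None if unreadable."""
    try:
        return parse_ns_link(os.readlink(f"/proc/{pid}/ns/mnt"))
    except (OSError, ValueError):
        return None


def same_mount_namespace(pid_a: str, pid_b: str) -> bool:
    """True when both processes report the same, readable mount namespace."""
    ns_a = mount_namespace_of(pid_a)
    ns_b = mount_namespace_of(pid_b)
    return ns_a is not None and ns_a == ns_b
```

Two processes share a filesystem view when same_mount_namespace() returns True. A shell started with unshare --mount reports a different number from its parent.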
mount --bind /data/postgres /var/lib/postgresql
# /var/lib/postgresql now shows the same filesystem as /data/postgres
# Same inode numbers, same files: two paths into the same directory tree
Bind mounts create a second attachment point for a directory tree that is already mounted. They are heavily used in container runtimes to expose selected host directories inside a container's mount namespace without exposing the rest of the host filesystem.
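Bind mounts of a subdirectory can often be spotted in /proc/self/mountinfo, where the fourth field gives the directory inside the filesystem that forms the root of the mount. The parser below is a sketch based on the field layout documented in proc(5). The MountInfo type and function names are invented for this example, and a root other than / is only a hint, since btrfs subvolumes show one as well.

```python
from typing import NamedTuple


class MountInfo(NamedTuple):
    mount_id: int
    parent_id: int
    root: str          # directory within the filesystem that is mounted
    mount_point: str   # where it is attached in this namespace
    fs_type: str
    source: str


def parse_mountinfo_line(line: str) -> MountInfo:
    """Parse one line of /proc/<pid>/mountinfo."""
    # A lone "-" separates the variable-length left part from the right part.
    left, right = line.rstrip("\n").split(" - ", 1)
    head = left.split()
    tail = right.split()
    return MountInfo(
        mount_id=int(head[0]),
        parent_id=int(head[1]),
        root=head[3],
        mount_point=head[4],
        fs_type=tail[0],
        source=tail[1],
    )


def looks_like_subtree_bind(info: MountInfo) -> bool:
    """Heuristic: the mount exposes only part of the underlying filesystem."""
    return info.root != "/"
```

For a hypothetical line such as "412 29 8:17 /postgres /var/lib/postgresql rw,relatime - ext4 /dev/sdb1 rw", the parser reports root /postgres and mount point /var/lib/postgresql, and the heuristic returns True.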
| Filesystem | Primary Use | Key Feature | Typical Deployment / Mount Example |
|---|---|---|---|
| ext4 | General Linux root/data | Journaling, extents, very stable | Most / partitions |
| xfs | High-performance data, large files | 64-bit, online growth, excellent parallelism | RHEL default for data |
| btrfs | Modern Linux, NAS | Copy-on-write, snapshots, subvolumes, checksums | openSUSE default |
| tmpfs | /tmp, /run, /dev/shm | RAM-backed (pages can go to swap), no disk I/O otherwise, contents lost on unmount or reboot | mount -t tmpfs tmpfs /tmp |
| procfs | /proc | Kernel data structures exposed as files | Mounted at boot |
| sysfs | /sys | Device tree, kernel object attributes | Mounted at boot |
| NFS | Network storage | Stateless in NFSv3, stateful in NFSv4; close-to-open cache consistency; UID mapping | mount -t nfs server:/share /mnt |
| CIFS/SMB | Windows shares | SMB protocol, Windows ACL mapping | Windows interoperability |
| overlayfs | Container layers | Union mount: one writable upper layer over read-only lower layers | Docker image layers |
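To see which of these filesystem types the running kernel currently has registered, a script can read /proc/filesystems. Each line holds a type name, preceded by "nodev" when the type does not need a block device. The function names below are made up for this sketch.

```python
from typing import Dict


def parse_filesystems(text: str) -> Dict[str, bool]:
    """Map filesystem type -> True if it needs a block device."""
    result: Dict[str, bool] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        name = parts[-1]
        # Lines look like "nodev\ttmpfs" or "\text4".
        result[name] = parts[0] != "nodev" or len(parts) == 1
    return result


def registered_filesystems(path: str = "/proc/filesystems") -> Dict[str, bool]:
    """Read the live list from the running kernel (Linux only)."""
    with open(path) as fh:
        return parse_filesystems(fh.read())
```

Filesystems built as modules appear in this list only after the module has been loaded, so a missing entry does not always mean the kernel lacks support.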
When a process reads a file, the data goes into the page cache: physical memory pages that the kernel's memory-management code tracks per inode through the address_space object. Subsequent reads of the same file data are served from the page cache without disk I/O.
The page cache is not per-process. It is global. All processes reading the same file share the same physical pages. This is why the free command shows large "cache" values on Linux: the kernel aggressively uses free RAM for the page cache because a clean cached page can be dropped with no I/O when memory is needed (only dirty pages must be written back first), while having the data available saves disk I/O.
free -h
#       total   used   free   shared   buff/cache   available
# Mem:    62G    12G   2.3G     1.1G          48G         49G
# The 48G "buff/cache" is mostly page cache; the kernel can reclaim it on demand, which is why "available" is 49G
echo 3 > /proc/sys/vm/drop_caches # drops page cache, dentries, and inodes (NEVER do this in production)
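The same figures can be read programmatically from /proc/meminfo. The sketch below assumes that the buff/cache column is the sum of Buffers, Cached, and SReclaimable, which is how recent versions of free describe it. The function names and the sample numbers in the usage note are made up for this example.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse 'Key:   value kB' lines into {key: value}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            values[key.strip()] = int(fields[0])
    return values


def cache_summary(meminfo: Dict[str, int]) -> Dict[str, int]:
    """Approximate the 'buff/cache' and 'available' columns of free, in kB."""
    buff_cache = (meminfo.get("Buffers", 0)
                  + meminfo.get("Cached", 0)
                  + meminfo.get("SReclaimable", 0))
    return {
        "buff_cache_kb": buff_cache,
        "available_kb": meminfo.get("MemAvailable", 0),
    }


def read_cache_summary(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Summarize the live values (Linux only)."""
    with open(path) as fh:
        return cache_summary(parse_meminfo(fh.read()))
```

With sample values of 10 kB for Buffers, 200 kB for Cached, and 30 kB for SReclaimable, cache_summary() reports 240 kB of buff/cache.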
Write-back: Writes go into the page cache as "dirty" pages. The writeback threads (formerly pdflush) flush dirty pages to disk asynchronously, bounded by:
- vm.dirty_ratio (default 20%): once dirty pages reach this percentage of available memory, processes that write are throttled and have to wait for write-back
- vm.dirty_background_ratio (default 10%): the writeback threads start flushing in the background at this level

| VFS Object | Represents | Key Operations | Lives In | Created When |
|---|---|---|---|---|
| superblock | Mounted filesystem | sync_fs, statfs, put_super | Memory (1 per mount) | mount() syscall |
| inode | File/dir/device (on-disk entity) | create, lookup, mkdir, unlink, rename | Inode cache (LRU) | First access or create |
| dentry | Name-to-inode mapping | d_compare, d_hash, d_delete | dcache (hash table) | Path component lookup |
| file | Open file instance (one per open() call) | read, write, llseek, mmap, ioctl | Process file table | open() syscall |
The VFS is the proof that "everything is a file" is an engineering achievement, not just a philosophy. It works because the four VFS objects — superblock, inode, dentry, file — form a complete, composable model of any storage abstraction. A kernel developer writing a new filesystem only needs to implement the operations on these four objects; the entire syscall interface, path lookup, mount handling, and page cache integration come for free.
The dcache is the performance heart of the VFS. On a busy web server, a single HTTP request can trigger several open() and stat() calls, and each of those triggers a dcache lookup per path component; without the cache, that lookup traffic would saturate the disk. Monitor cat /proc/sys/fs/dentry-state and watch the unused (freeable) count: a count that stays near zero while the system is short of memory means there is little left for the kernel to reclaim from the dcache, and the system needs more memory or a smaller VFS workload. In practice, the dcache evicts cold entries gracefully under memory pressure. It keeps its own LRU lists and registers a shrinker, so the same reclaim pass that frees page cache pages also asks the dcache and the inode cache to release unused objects, which is another example of the kernel's unified approach to reclaimable memory.