Your Linux system might have ext4 on the SSD, NTFS on a USB drive, tmpfs in /tmp, procfs in /proc, and NFS over the network — all accessible with the same open(), read(), write() calls. How? The Virtual Filesystem Switch (VFS).

In the mid-1980s, Sun Microsystems needed to mount NFS network filesystems on the same system as local Unix filesystems. Their engineers added an indirection layer between the system call interface and the individual filesystem implementations. Every filesystem operation would go through a set of function pointers (a virtual dispatch table), and each filesystem would supply its own implementation of those operations. The VFS was born.

Linux adopted and extended this design. Today, the Linux VFS is one of the most elegant pieces of systems engineering in any open-source project: a clean abstraction maintained across dozens of in-tree filesystem implementations, from the ancient ext2 (1993) to newer designs such as btrfs and overlayfs. The Unix philosophy "everything is a file" is not just a metaphor. It is implemented in the VFS: once a file descriptor exists, /proc/cpuinfo, /dev/sda, and a TCP socket all respond to the same read(), write(), and close() calls. The main exception is naming, since a socket is created with socket() rather than opened by path.

VFS Data Structures

The VFS defines four core objects. Every filesystem must implement operations on these objects. The kernel interacts with all filesystems through these interfaces.

superblock

A struct super_block represents one mounted filesystem instance. It is the root of all metadata for that mount.

Key fields:

s_blocksize: block size (4096 for most ext4/xfs)
s_op: pointer to struct super_operations (alloc_inode, write_inode, evict_inode, sync_fs, statfs, put_super)
s_root: pointer to the root dentry of this mount
s_type: pointer to struct file_system_type (the driver)

When you run mount /dev/sda1 /mnt/data, the kernel calls into the filesystem driver registered for that type. The driver's mount path conventionally ends in a fill_super() callback, which reads the on-disk superblock and populates the struct super_block. The super_block stays in memory for the lifetime of the mount.
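From user space, the statfs operation listed above is reachable through statvfs(3). The following is a minimal sketch, assuming a POSIX system; describe_mount is a hypothetical helper name, not part of any kernel or library API.

```python
import os


def describe_mount(path="/"):
    """Return a few per-filesystem facts for the mount that contains `path`."""
    # os.statvfs() wraps statvfs(3); the kernel answers it through the
    # mounted filesystem's statfs hook in struct super_operations.
    st = os.statvfs(path)
    return {
        "block_size": st.f_bsize,
        "total_bytes": st.f_frsize * st.f_blocks,
        "free_bytes": st.f_frsize * st.f_bfree,
        "max_name_length": st.f_namemax,
    }


if __name__ == "__main__":
    for key, value in describe_mount("/").items():
        print(f"{key}: {value}")
```

Every path on the same mount reports the same values, because they describe the mounted filesystem rather than an individual file.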

inode

A struct inode represents one file, directory, symlink, or special file (device node, socket, FIFO). It is the kernel's in-memory view of the file's metadata.

Key fields:

i_ino: inode number (unique within a filesystem)
i_mode: file type + permissions (e.g., 0100644 = regular file, rw-r--r--)
i_uid, i_gid: owner and group
i_size: file size in bytes
i_atime, i_mtime, i_ctime: access, modification, and change timestamps (nanosecond resolution where the filesystem supports it; recent 6.x kernels read them through helpers such as inode_get_mtime())
i_nlink: number of hard links
i_op: pointer to struct inode_operations (create, link, unlink, mkdir, lookup, rename, readlink)
i_fop: pointer to struct file_operations for files of this type
i_mapping: pointer to struct address_space — connects the inode to the page cache

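The inode is loaded from disk on demand and kept in the inode cache. Filesystems obtain a reference with helpers such as iget_locked() and release it with iput(); unreferenced inodes sit on an LRU list until memory pressure reclaims them. Inodes for frequently used files therefore tend to stay cached for as long as memory allows, though never permanently.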
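Most of the inode fields listed above are visible from user space through stat(2). The sketch below, with the hypothetical helper names inode_summary and demo, creates a scratch file and prints the fields that correspond to i_ino, i_mode, i_nlink, i_size, and i_mtime.

```python
import os
import stat
import tempfile


def inode_summary(path):
    """Collect the stat(2) fields that mirror common struct inode members."""
    st = os.stat(path)
    return {
        "ino": st.st_ino,                         # i_ino
        "mode_octal": oct(st.st_mode),            # i_mode: type bits + permissions
        "mode_text": stat.filemode(st.st_mode),   # same value, ls-style
        "nlink": st.st_nlink,                     # i_nlink
        "size": st.st_size,                       # i_size
        "mtime_ns": st.st_mtime_ns,               # i_mtime in nanoseconds
    }


def demo():
    """Create a regular file with mode 0644 and summarize its inode."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.txt")
        with open(path, "w") as f:
            f.write("hello\n")
        os.chmod(path, 0o644)
        return inode_summary(path)


if __name__ == "__main__":
    print(demo())
```

The octal mode shows the split described above: the leading 0100000 marks a regular file, and the trailing 644 is the permission set.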

dentry

A struct dentry (directory entry) maps a file name to an inode. The separation of dentry from inode is deliberate: one inode can have multiple dentries (hard links), and directory traversal produces a chain of dentries without touching inode data until necessary.

Key fields:

d_name: the filename component (e.g., "file.txt")
d_inode: pointer to the inode this name resolves to (NULL if negative — name doesn't exist)
d_parent: pointer to parent dentry
d_op: pointer to struct dentry_operations (d_compare for case-sensitivity, d_hash)
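The one-inode, many-names relationship can be observed from user space with a hard link. This is a small sketch, assuming the temporary directory sits on a filesystem that supports hard links; link_demo is a hypothetical name.

```python
import os
import tempfile


def link_demo():
    """Show that two names can resolve to one inode, and that unlink removes a name."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "file.txt")
        second = os.path.join(tmp, "alias.txt")
        with open(first, "w") as f:
            f.write("data")

        os.link(first, second)  # second directory entry, same inode
        a, b = os.stat(first), os.stat(second)
        same_inode = (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
        links_before = a.st_nlink

        os.unlink(first)  # drops one name; the inode survives
        links_after = os.stat(second).st_nlink
        return same_inode, links_before, links_after


if __name__ == "__main__":
    print(link_demo())
```

The link count falls from 2 to 1 after the unlink, and the data stays reachable through the remaining name.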

The dcache (dentry cache) is a global hash table of recently resolved name-to-inode mappings. Path lookup is the most common VFS operation, and the dcache makes it fast: instead of re-reading directory blocks from disk for every open(), the kernel checks the dcache first. On a warm production server, the dcache holds millions of entries and satisfies almost all path lookups without touching disk.

cat /proc/sys/fs/dentry-state
# Fields: nr_dentry, nr_unused, age_limit, want_pages, nr_negative, dummy
# Typical: 5000000+ dentries on busy servers

sysctl fs.dentry-state   # same values via sysctl; the file is read-only
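For monitoring scripts it helps to give the six numbers names. The sketch below follows the field order in the kernel's sysctl documentation (nr_dentry, nr_unused, age_limit, want_pages, nr_negative, dummy); older kernels report the fifth field as a reserved zero. DentryState, parse_dentry_state, and read_dentry_state are hypothetical names chosen for this example.

```python
from typing import NamedTuple


class DentryState(NamedTuple):
    nr_dentry: int    # total allocated dentries
    nr_unused: int    # dentries not in active use (reclaimable)
    age_limit: int    # age in seconds
    want_pages: int   # pages requested by the system
    nr_negative: int  # unused negative dentries (newer kernels)
    dummy: int        # reserved


def parse_dentry_state(text):
    """Parse the whitespace-separated contents of /proc/sys/fs/dentry-state."""
    values = [int(v) for v in text.split()]
    if len(values) != 6:
        raise ValueError(f"expected 6 fields, got {len(values)}")
    return DentryState(*values)


def read_dentry_state(path="/proc/sys/fs/dentry-state"):
    """Read and parse the live values (Linux only)."""
    with open(path) as f:
        return parse_dentry_state(f.read())


if __name__ == "__main__":
    import os
    if os.path.exists("/proc/sys/fs/dentry-state"):
        print(read_dentry_state())
```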

file

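A struct file represents an open file instance (an "open file description" in POSIX terms). Unlike inodes (one per file) and dentries (one per name), a struct file is created by each open() call. Descriptors duplicated with dup() or inherited across fork() share the same struct file, and therefore the same file position.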

Key fields:

f_pos: current file position (seek pointer)
f_flags: O_RDONLY, O_WRONLY, O_NONBLOCK, etc.
f_op: pointer to struct file_operations (read_iter, write_iter, llseek, mmap, unlocked_ioctl, poll, fsync)
f_inode: back-pointer to the inode
f_path: the dentry + mount point that resolved to this file
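Because f_pos lives in the struct file rather than in the descriptor, two descriptors that refer to the same open file share one position, while a second open() gets its own. A short sketch (offset_sharing_demo is a hypothetical name):

```python
import os
import tempfile


def offset_sharing_demo():
    """Return (offset seen via dup'd fd, offset seen via independent open)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.txt")
        with open(path, "wb") as f:
            f.write(b"0123456789")

        fd1 = os.open(path, os.O_RDONLY)
        fd2 = os.dup(fd1)                 # new descriptor, same open file
        fd3 = os.open(path, os.O_RDONLY)  # separate open file, own position
        try:
            os.read(fd1, 4)  # advances the position shared by fd1 and fd2
            dup_offset = os.lseek(fd2, 0, os.SEEK_CUR)
            independent_offset = os.lseek(fd3, 0, os.SEEK_CUR)
        finally:
            for fd in (fd1, fd2, fd3):
                os.close(fd)
        return dup_offset, independent_offset


if __name__ == "__main__":
    print(offset_sharing_demo())
```

Reading four bytes through fd1 moves the position seen through fd2, but not the one seen through fd3.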

When read() is called, the VFS looks up the struct file for the given fd (via the process's files_struct → fd table), then calls f_op->read() or, for most modern filesystems, f_op->read_iter(). For ext4, this calls the ext4 read implementation; for a socket, it reads from the socket buffer; for /proc/cpuinfo, it runs a kernel function that generates the text on demand.
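The dispatch is invisible to the caller, which is the point. In this sketch the same read(2) call, issued through os.read(), is served by two different kernel implementations: a regular file and a pipe. read_via_fd and dispatch_demo are hypothetical names.

```python
import os
import tempfile


def read_via_fd(fd, count=64):
    """One code path for every descriptor type: a plain read(2)."""
    return os.read(fd, count)


def dispatch_demo():
    """Read the same bytes from a regular file and from a pipe."""
    results = {}

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.txt")
        with open(path, "wb") as f:
            f.write(b"hello")
        fd = os.open(path, os.O_RDONLY)
        try:
            results["regular file"] = read_via_fd(fd)
        finally:
            os.close(fd)

    read_end, write_end = os.pipe()
    try:
        os.write(write_end, b"hello")
        results["pipe"] = read_via_fd(read_end)
    finally:
        os.close(read_end)
        os.close(write_end)

    return results


if __name__ == "__main__":
    print(dispatch_demo())
```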

File Lookup Path

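For open("/home/user/file.txt", O_RDONLY), the path walk proceeds roughly as follows:

1. Start at the process's root dentry (or at the current directory for a relative path).
2. Look up "home" in the dcache under the root dentry. On a miss, call the parent inode's i_op->lookup(), which reads the directory from disk and adds a new dentry.
3. Repeat for "user" and then "file.txt", checking permissions and crossing mount points along the way.
4. Allocate a struct file, set f_op from the inode's i_fop, call f_op->open() if the filesystem defines one, and install the file in the process's fd table.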

Each dcache hit avoids one directory block read from disk. On a warm system with hundreds of open() calls per second to the same directory tree, the dcache turns what would be multiple disk reads per open() into a sub-microsecond hash table lookup.
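The effect of the dcache on repeated lookups can be illustrated with a toy model. This is a deliberately simplified sketch, not kernel code: ToyDcache, the DISK dictionary, and the inode numbers are all invented for the example, and the "disk read" counter stands in for a call to i_op->lookup().

```python
class ToyDcache:
    """Toy model of a dentry cache: (parent inode, name) -> child inode."""

    def __init__(self, disk):
        self.disk = disk      # {directory inode: {name: child inode}}
        self.cache = {}       # resolved entries, including negative ones
        self.disk_reads = 0   # slow-path lookups performed so far

    def lookup(self, parent, name):
        key = (parent, name)
        if key in self.cache:
            return self.cache[key]  # fast path: cache hit
        self.disk_reads += 1        # slow path: read the directory
        child = self.disk.get(parent, {}).get(name)
        self.cache[key] = child     # None is kept as a negative entry
        return child

    def walk(self, path, root=2):
        """Resolve an absolute path one component at a time."""
        current = root
        for component in filter(None, path.split("/")):
            current = self.lookup(current, component)
            if current is None:
                return None
        return current


# Hypothetical directory tree: / (2) -> home (11) -> user (12) -> file.txt (13)
DISK = {2: {"home": 11}, 11: {"user": 12}, 12: {"file.txt": 13}}

if __name__ == "__main__":
    dcache = ToyDcache(DISK)
    for _ in range(2):
        print(dcache.walk("/home/user/file.txt"), dcache.disk_reads)
```

The first walk performs one slow lookup per component; the second performs none. Caching the miss as well models a negative dentry, so repeated lookups of a nonexistent name are also cheap.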

Mount Namespaces and Bind Mounts

Mount Namespaces

Each process belongs to a mount namespace, which defines its view of which filesystems are mounted where. Most processes share the initial namespace; processes in different mount namespaces can see different filesystem trees simultaneously.

ls -l /proc/self/ns/mnt    # symbolic link to this process's mount namespace
readlink /proc/self/ns/mnt
# mnt:[4026531840]   (the number is the namespace's inode number)

# Create new mount namespace (needs root or CAP_SYS_ADMIN):
unshare --mount bash    # new shell with private mount namespace

Docker containers run in separate mount namespaces. Each container gets its own procfs mount at /proc, its own sysfs mount at /sys, and typically a small tmpfs at /dev. From the host, cat /proc/<container-pid>/mounts shows the mounts visible inside the container's namespace.
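Comparing the namespace link of two processes is a quick way to tell whether they share a mount view. The sketch below parses the mnt:[inode] form shown above; parse_ns_link, mount_namespace_of, and same_mount_namespace are hypothetical helper names, and reading another process's link may require elevated privileges.

```python
import os
import re


def parse_ns_link(target):
    """Split a namespace link such as 'mnt:[4026531840]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def mount_namespace_of(pid="self"):
    """Return the mount namespace identity of a process (Linux only)."""
    return parse_ns_link(os.readlink(f"/proc/{pid}/ns/mnt"))


def same_mount_namespace(pid_a, pid_b):
    """True when both processes see the same mount tree."""
    return mount_namespace_of(pid_a) == mount_namespace_of(pid_b)


if __name__ == "__main__":
    if os.path.exists("/proc/self/ns/mnt"):
        print(mount_namespace_of())
```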

Bind Mounts

mount --bind /data/postgres /var/lib/postgresql
# /var/lib/postgresql now shows the same directory tree as /data/postgres
# Same inode numbers, same files: two paths into one filesystem

Bind mounts attach an existing directory tree (or a single file) at a second location. Container runtimes use them heavily to expose selected host directories inside a container's mount namespace without exposing the rest of the host filesystem.
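Bind mounts show up in /proc/self/mountinfo as several entries that share one major:minor device number, often with a root field other than "/". The following sketch groups entries by device. The helper names and the sample lines are made up for illustration, and a shared device is only a hint: btrfs subvolumes, for instance, produce the same pattern.

```python
from collections import defaultdict


def parse_mountinfo_line(line):
    """Extract the fields of interest from one /proc/<pid>/mountinfo line."""
    # Layout: id parent major:minor root mount_point options [optional...] - fstype source ...
    left, _, right = line.partition(" - ")
    fields = left.split()
    fstype, source = right.split()[:2]
    return {
        "device": fields[2],
        "root": fields[3],         # path inside the filesystem that is attached
        "mount_point": fields[4],  # where it is attached in this namespace
        "fstype": fstype,
        "source": source,
    }


def shared_devices(lines):
    """Map each device mounted more than once to its (root, mount_point) pairs."""
    by_device = defaultdict(list)
    for line in lines:
        if line.strip():
            entry = parse_mountinfo_line(line)
            by_device[entry["device"]].append((entry["root"], entry["mount_point"]))
    return {dev: mounts for dev, mounts in by_device.items() if len(mounts) > 1}


# Hypothetical sample resembling the bind mount above.
SAMPLE = [
    "36 25 8:1 / /data rw,relatime shared:1 - ext4 /dev/sda1 rw",
    "40 25 8:1 /postgres /var/lib/postgresql rw,relatime shared:1 - ext4 /dev/sda1 rw",
    "22 25 0:21 / /proc rw,nosuid - proc proc rw",
]

if __name__ == "__main__":
    print(shared_devices(SAMPLE))
```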

Filesystem Types in Linux 6.x

Filesystem	Primary Use	Key Feature	Typical Deployment or Mount Command
ext4	General Linux root/data	Journaling, extents, very stable	Most / partitions
xfs	High-performance data, large files	64-bit, online growth, excellent parallelism	RHEL default
btrfs	Modern Linux, NAS	Copy-on-write, snapshots, subvolumes, checksums	openSUSE default
tmpfs	/tmp, /run, /dev/shm	Memory-backed (pages may be swapped out); contents lost on unmount or reboot	`mount -t tmpfs tmpfs /tmp`
procfs	/proc	Kernel data structures exposed as files	Mounted at boot
sysfs	/sys	Device hierarchy, kernel object attributes	Mounted at boot
NFS	Network storage	Stateless in v3, stateful in v4; close-to-open cache consistency; UID mapping	`mount -t nfs server:/share /mnt`
CIFS/SMB	Windows shares	SMB protocol, Windows ACL mapping	Windows interoperability
overlayfs	Container layers	Union mount: writable upper layer over read-only lower layers	Docker image layers
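To see which of these filesystem types the running kernel has registered, read /proc/filesystems. Each line is an optional "nodev" flag, a tab, and the type name; "nodev" marks types that need no backing block device. A minimal parsing sketch (parse_proc_filesystems is a hypothetical name):

```python
def parse_proc_filesystems(text):
    """Map each registered filesystem type to True if it needs a block device."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        flag, _, name = line.rpartition("\t")
        result[name.strip()] = flag.strip() != "nodev"
    return result


if __name__ == "__main__":
    import os
    if os.path.exists("/proc/filesystems"):
        with open("/proc/filesystems") as f:
            for name, needs_device in sorted(parse_proc_filesystems(f.read()).items()):
                print(f"{name:12} {'block device' if needs_device else 'nodev'}")
```

Types built as modules appear only after the module is loaded, so an absent entry does not necessarily mean the kernel lacks support.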

The Page Cache

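When a process reads a file, the data is brought into the page cache: physical memory pages, managed by the kernel's memory-management subsystem, that hold file contents and are indexed per inode through i_mapping. Subsequent reads of the same file data are served from the page cache without disk I/O.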

The page cache is not per-process. It is global. All processes reading the same file share the same physical pages. This is why the free command shows large "cache" values on Linux: the kernel aggressively uses free RAM for the page cache, because a clean cache page is only a copy of data already on disk and can be discarded at almost no cost when memory is needed, while keeping it around saves disk I/O. Dirty pages are the exception: they must be written back before they can be reclaimed.

free -h
#               total        used        free      shared  buff/cache   available
# Mem:            62G         12G        2.3G        1.1G         48G         49G
# The 48G "buff/cache" is mostly page cache; "available" estimates what applications can claim without swapping

echo 3 > /proc/sys/vm/drop_caches  # drops clean page cache plus dentries and inodes (avoid in production: later reads go back to disk)
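The numbers that free summarizes come from /proc/meminfo, which also exposes the amount of dirty data awaiting write-back. The sketch below parses that file; parse_meminfo and cache_summary are hypothetical names, and the sum of Buffers and Cached only approximates the buff/cache column, which also counts reclaimable slab memory.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo into {field: value}; most values are in kB."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])
    return values


def cache_summary(meminfo):
    """Pick out the fields most relevant to the page cache."""
    return {
        "buffers_plus_cached_kb": meminfo["Buffers"] + meminfo["Cached"],
        "dirty_kb": meminfo["Dirty"],
        "available_kb": meminfo["MemAvailable"],
    }


if __name__ == "__main__":
    import os
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            print(cache_summary(parse_meminfo(f.read())))
```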

Write-back: Writes go into the page cache as "dirty" pages. Kernel flusher threads (per-device writeback workers, which replaced pdflush) write dirty pages to disk asynchronously, governed by:

vm.dirty_ratio (default 20%): once dirty pages reach this percentage of available memory, processes that write are throttled and must wait for write-back
vm.dirty_background_ratio (default 10%): background write-back starts once dirty pages reach this percentage
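As a rough worked example, the two percentages translate into byte thresholds as follows. This is an approximation under a stated assumption: the kernel applies the ratios to its own estimate of dirtyable memory, not to a figure you pass in, so treat the result as an order-of-magnitude guide. dirty_thresholds is a hypothetical helper.

```python
def dirty_thresholds(available_bytes, dirty_ratio=20, dirty_background_ratio=10):
    """Estimate the write-back thresholds for a given amount of available memory."""
    return {
        # Flusher threads start writing in the background above this level.
        "background_bytes": available_bytes * dirty_background_ratio // 100,
        # Writers are throttled above this level.
        "throttle_bytes": available_bytes * dirty_ratio // 100,
    }


if __name__ == "__main__":
    GIB = 1024 ** 3
    result = dirty_thresholds(50 * GIB)
    print({name: value / GIB for name, value in result.items()})
```

With 50 GiB available and the default ratios, background write-back would begin near 5 GiB of dirty data and writers would be throttled near 10 GiB.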

VFS Object Summary

VFS Object	Represents	Key Operations	Lives In	Created When
superblock	Mounted filesystem	sync_fs, statfs, put_super	Memory (1 per mount)	`mount()` syscall
inode	File/dir/device (on-disk entity)	create, lookup, mkdir, unlink, rename	Inode cache (LRU)	First access or create
dentry	Name-to-inode mapping	d_compare, d_hash, d_delete	dcache (hash table)	Path component lookup
file	Open file instance (per open() call)	read, write, llseek, mmap, ioctl	Referenced from process fd tables	`open()` syscall

VFS Architecture Diagram

Key Takeaways

The VFS is the proof that "everything is a file" is an engineering achievement, not just a philosophy. It works because the four VFS objects — superblock, inode, dentry, file — form a complete, composable model of any storage abstraction. A kernel developer writing a new filesystem only needs to implement the operations on these four objects; the entire syscall interface, path lookup, mount handling, and page cache integration come for free.

The dcache is the performance heart of the VFS. On a busy web server, every HTTP request can trigger several open() calls, each of which triggers several dcache lookups; without the cache, that path-lookup traffic would saturate the disk. Monitor /proc/sys/fs/dentry-state and watch the unused (freeable) count: a large value means the kernel can shed dentries easily, while a value near zero under memory pressure suggests the system needs more memory or a smaller VFS workload. In practice, the dcache evicts cold entries gracefully. Dentries and inodes are slab objects with their own LRU lists, and the kernel's shrinker callbacks trim them during the same memory reclaim that shrinks the page cache.


