Linux – How do pdflush, kjournald, swapd, etc interoperate

Tags: kernel, kjournald, linux, storage

Recently saw a question that sparked this thought. Couldn't really find an answer here or via the Google machine. Basically, I'm interested in knowing how the kernel I/O architecture is layered. For example, does kjournald dispatch to pdflush or the other way around? My assumption is that pdflush (being more generic to mass storage I/O) would sit at a lower level and trigger the SCSI/ATA/whatever commands necessary to actually perform the writes, and kjournald handles higher level filesystem data structures before writing. I could see it the other way around as well, though, with kjournald directly interfacing with the filesystem data structures and pdflush waking up every now and then to write dirty pagecache pages to the device through kjournald. It's also possible that the two don't interact at all for some other reason.

Basically: I need some way to visualize (graph or just an explanation) the basic architecture used for dispatching I/O to mass storage within the Linux kernel.

Best Answer

Before we discuss the specifics of pdflush, kjournald, and kswapd, let's first get a little background on what exactly we're talking about in terms of the Linux kernel.

The GNU/Linux architecture

The architecture of GNU/Linux can be thought of as 2 spaces:

  • User
  • Kernel

Between the User Space and Kernel Space sits the GNU C Library (glibc). This provides the system call interface that connects the kernel to the user-space applications.

The Kernel Space can be further subdivided into 3 levels:

  • System Call Interface
  • Architectural Independent Kernel Code
  • Architectural Dependent Code

The System Call Interface, as its name implies, provides an interface between glibc and the kernel. The Architectural Independent Kernel Code comprises the logical units such as the VFS (Virtual File System) and the VMM (Virtual Memory Management). The Architectural Dependent Code consists of the components that are processor- and platform-specific for a given hardware architecture.

Diagram of GNU/Linux Architecture

[Figure: GNU/Linux architecture]

For the rest of this article, we'll be focusing our attention on the VFS and VMM logical units within the Kernel Space.

Subsystems of the GNU/Linux Kernel

[Figure: subsystems of the GNU/Linux kernel]

VFS Subsystem

With a high-level concept of how the GNU/Linux kernel is structured, we can delve a little deeper into the VFS subsystem. This component is responsible for providing access to the various block storage devices, which ultimately map down to a filesystem (ext3/ext4/etc.) on a physical device (HDD/etc.).

Diagram of VFS

[Figure: VFS]

This diagram shows how a write() from a user's process traverses the VFS and ultimately works its way down to the device driver, where it's written to the physical storage medium. This is the first place where we encounter pdflush, a daemon responsible for flushing dirty data and metadata buffer blocks to the storage medium in the background. The diagram doesn't show this, but there is another daemon, kjournald, which sits alongside pdflush and performs a similar task: writing dirty journal blocks to disk. NOTE: Journal blocks are how filesystems like ext4 and JFS keep track of changes to a file on disk before those changes take place.

The above details are discussed further in this paper.

Overview of write() steps

To provide a simple overview of the I/O subsystem operations, we'll use an example where the function write() is called by a User Space application.

  1. A process requests to write a file through the write() system call.
  2. The kernel updates the page cache mapped to the file.
  3. A pdflush kernel thread takes care of flushing the page cache to disk.
  4. The file system layer puts each block buffer together into a bio struct (refer to 1.4.3, “Block layer” on page 23) and submits a write request to the block device layer.
  5. The block device layer gets requests from upper layers and performs an I/O elevator operation and puts the requests into the I/O request queue.
  6. A device driver, such as SCSI or another device-specific driver, takes care of the write operation.
  7. The disk device firmware performs hardware operations such as head seek, rotation, and data transfer to the sector on the platter.
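As a rough user-space illustration of steps 1 through 3, the sketch below writes to a file and then calls fsync() rather than waiting for the background flusher. The helper name `write_and_sync` and the file name are made up for this example; only the `os.open`, `os.write`, and `os.fsync` calls from Python's standard library are assumed.

```python
import os
import tempfile


def write_and_sync(path: str, payload: bytes) -> int:
    """Write payload to path, then ask the kernel to flush it to storage."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # Steps 1-2: write() copies the data into the page cache and returns.
        # The pages are now dirty, but may not have reached the disk yet.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # Step 3 would normally happen later in the background. fsync()
        # requests the flush of this file's dirty pages right away and
        # blocks until the kernel reports the write-out as complete.
        os.fsync(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.txt")  # hypothetical file name
        count = write_and_sync(target, b"hello, page cache\n")
        print(f"wrote {count} bytes to {target}")
```

Without the fsync() call the write() would still succeed; the data would simply sit in the page cache until the kernel decided to write it back.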

VMM Subsystem

Continuing our deeper dive, we can now look into the VMM subsystem. This component is responsible for maintaining consistency between main memory (RAM), swap, and the physical storage medium. The primary mechanism for maintaining consistency is bdflush. As pages of memory are deemed dirty they need to be synchronized with the data that's on the storage medium. bdflush will coordinate with pdflush daemons to synchronize this data with the storage medium.
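One way to watch dirty pages accumulate and drain from user space is to read the Dirty and Writeback counters in /proc/meminfo. The following is only a small sketch: the helper names `parse_meminfo` and `dirty_and_writeback` are invented for this example, and it falls back to a message on systems that have no /proc/meminfo.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field name to its numeric value.

    Most fields (including Dirty and Writeback) are reported in kB;
    a few, such as the HugePages counters, are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_and_writeback(text: str) -> tuple:
    """Return (Dirty, Writeback) in kB, using 0 for a missing field."""
    info = parse_meminfo(text)
    return info.get("Dirty", 0), info.get("Writeback", 0)


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        dirty, writeback = dirty_and_writeback(meminfo.read_text())
        print(f"Dirty: {dirty} kB, Writeback: {writeback} kB")
    else:
        print("/proc/meminfo is not available on this system")
```

Running this repeatedly while copying a large file should show Dirty grow and then shrink as the data is written back.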

Diagram of VMM

[Figure: VMM]

Swap

When system memory becomes scarce or the kernel swap timer expires, the kswapd daemon will attempt to free up pages. So long as the number of free pages remains above free_pages_high, kswapd will do nothing. However, if the number of free pages drops below that threshold, kswapd will start the page reclaiming process. After kswapd has marked pages for relocation, bdflush will take care of synchronizing any outstanding changes to the storage medium, through the pdflush daemons.
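The threshold behaviour described above can be pictured with a toy model. This is not kernel code: the function names, the `batch` size, and the decision to treat a free-page count exactly equal to the threshold as "no reclaim needed" are all assumptions made for illustration.

```python
def kswapd_should_reclaim(free_pages: int, free_pages_high: int) -> bool:
    """Toy check: reclaim only once free pages fall below the threshold.

    Assumption: a count exactly equal to free_pages_high does not
    trigger reclaim.
    """
    return free_pages < free_pages_high


def simulate_reclaim(free_pages: int, free_pages_high: int, batch: int = 32):
    """Free `batch` pages per pass until the threshold is satisfied.

    Returns (number_of_passes, final_free_pages). The batch size is a
    made-up parameter and does not mirror any real kernel tunable.
    """
    if batch <= 0:
        raise ValueError("batch must be positive")
    passes = 0
    while kswapd_should_reclaim(free_pages, free_pages_high):
        free_pages += batch  # pretend one pass freed `batch` pages
        passes += 1
    return passes, free_pages


if __name__ == "__main__":
    print(simulate_reclaim(free_pages=100, free_pages_high=200))
    print(simulate_reclaim(free_pages=300, free_pages_high=200))
```

With plenty of free memory the model does nothing at all, which matches the description of kswapd staying idle while free pages remain above the threshold.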

