Recently saw a question that sparked this thought. Couldn't really find an answer here or via the Google machine. Basically, I'm interested in knowing how the kernel I/O architecture is layered. For example, does `kjournald` dispatch to `pdflush`, or the other way around? My assumption is that `pdflush` (being more generic to mass storage I/O) would sit at a lower level and trigger the SCSI/ATA/whatever commands necessary to actually perform the writes, and `kjournald` handles higher-level filesystem data structures before writing. I could see it the other way around as well, though, with `kjournald` directly interfacing with the filesystem data structures and `pdflush` waking up every now and then to write dirty pagecache pages to the device through `kjournald`. It's also possible that the two don't interact at all for some other reason.
Basically: I need some way to visualize (graph or just an explanation) the basic architecture used for dispatching I/O to mass storage within the Linux kernel.
Best Answer
Before we discuss the specifics regarding `pdflush`, `kjournald`, and `kswapd`, let's first get a little background on the context of what exactly we're talking about in terms of the Linux kernel.
The GNU/Linux architecture
The architecture of GNU/Linux can be thought of as 2 spaces: User Space and Kernel Space.
Between the User Space and Kernel Space sits the GNU C Library (`glibc`). This provides the system call interface that connects the kernel to the user-space applications.
The Kernel Space can be further subdivided into 3 levels: the System Call Interface, the Architectural Independent Kernel Code, and the Architectural Dependent Code.
The System Call Interface, as its name implies, provides an interface between `glibc` and the kernel. The Architectural Independent Kernel Code comprises logical units such as the VFS (Virtual File System) and the VMM (Virtual Memory Management). The Architectural Dependent Code consists of the processor- and platform-specific components for a given hardware architecture.
Diagram of GNU/Linux Architecture
For the rest of this article, we'll be focusing our attention on the VFS and VMM logical units within the Kernel Space.
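To make the role of the C library concrete, here is a minimal sketch that calls the C library's `write()` wrapper directly through `ctypes` instead of going through Python's own I/O layer. It assumes a Linux or other POSIX system where the C library can be loaded this way; `load_libc` and `write_via_libc` are hypothetical helper names used only for illustration.

```python
import ctypes
import ctypes.util
import os


def load_libc():
    """Load the C library. If find_library() returns None, CDLL(None)
    falls back to the symbols already loaded into this process (POSIX)."""
    name = ctypes.util.find_library("c")
    return ctypes.CDLL(name, use_errno=True)


def write_via_libc(fd: int, data: bytes) -> int:
    """Call the C library's write() wrapper, which issues the write system call."""
    libc = load_libc()
    # ssize_t write(int fd, const void *buf, size_t count);
    libc.write.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_size_t]
    libc.write.restype = ctypes.c_ssize_t
    written = libc.write(fd, data, len(data))
    if written < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return written
```

The wrapper does little more than marshal the arguments and trap into the kernel, which is the boundary the layering above describes. Everything below that point (VFS, VMM, drivers) runs in Kernel Space.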
Subsystems of the GNU/Linux Kernel
VFS Subsystem
With a high-level concept of how the GNU/Linux kernel is structured, we can delve a little deeper into the VFS subsystem. This component is responsible for providing access to the various block storage devices which ultimately map down to a filesystem (ext3/ext4/etc.) on a physical device (HDD/etc.).
Diagram of VFS
This diagram shows how a `write()` from a user's process traverses the VFS and ultimately works its way down to the device driver, where it's written to the physical storage medium. This is the first place where we encounter `pdflush`. This is a daemon which is responsible for flushing dirty data and metadata buffer blocks to the storage medium in the background. The diagram doesn't show this, but there is another daemon, `kjournald`, which sits alongside `pdflush`, performing a similar task: writing dirty journal blocks to disk.
NOTE: Journal blocks are how filesystems like ext4 & JFS keep track of changes to the disk in a file, prior to those changes taking place.
The above details are discussed further in this paper.
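From user space, the background flushing described above shows up as the difference between `write()` and `fsync()`. The following is a minimal sketch, assuming an ordinary buffered file on a local filesystem; `write_then_sync` is a hypothetical helper name, not something from the kernel or the paper.

```python
import os


def write_then_sync(path: str, payload: bytes) -> int:
    """Write payload to path, then force it out to the storage device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # A buffered write() normally returns once the data has been copied
        # into the page cache; the pages are merely marked dirty.
        written = os.write(fd, payload)
        # fsync() blocks until the kernel reports that this file's dirty
        # data has been written out, instead of waiting for the background
        # flusher to get to it.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written
```

Without the `fsync()` call the data would still reach the disk eventually, just at a time chosen by the kernel's background write-back rather than by the application.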
Overview of `write()` steps
To provide a simple overview of the I/O subsystem operations, we'll use an example where the function `write()` is called by a User Space application.
[...] `write()` system call.
[...] `bio` struct (refer to 1.4.3, “Block layer” on page 23) and submits a write request to the block device layer.
VMM Subsystem
Continuing our deeper dive, we can now look into the VMM subsystem. This component is responsible for maintaining consistency between main memory (RAM), swap, and the physical storage medium. The primary mechanism for maintaining consistency is `bdflush`. As pages of memory are deemed dirty, they need to be synchronized with the data that's on the storage medium. `bdflush` will coordinate with `pdflush` daemons to synchronize this data with the storage medium.
Diagram of VMM
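The amount of dirty memory still waiting to be synchronized can be observed from user space. This is a minimal sketch, assuming a Linux system whose `/proc/meminfo` exposes the `Dirty` and `Writeback` fields; `parse_meminfo` and `dirty_summary` are hypothetical helper names used only for illustration.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style 'Key:   value kB' lines into {key: int}.
    Most fields are in kB; a few (e.g. huge page counts) are plain counts."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Key:' prefix
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def dirty_summary(text: str) -> tuple:
    """Return (dirty_kb, writeback_kb): memory waiting to be written back,
    and memory currently being written back."""
    fields = parse_meminfo(text)
    return fields.get("Dirty", 0), fields.get("Writeback", 0)
```

On a Linux machine, passing the contents of `/proc/meminfo` to `dirty_summary()` before and after a large write should show `Dirty` rising and then falling again as the background flushing catches up.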
Swap
When system memory becomes scarce or the kernel swap timer expires, the `kswapd` daemon will attempt to free up pages. So long as the number of free pages remains above `free_pages_high`, `kswapd` will do nothing. However, if the number of free pages drops below that threshold, then `kswapd` will start the page reclaiming process. After `kswapd` has marked pages for relocation, `bdflush` will take care to synchronize any outstanding changes to the storage medium, through the `pdflush` daemons.
References & Further Readings