Block Device Cache vs Filesystem – Understanding the Differences

block-device buffer cache linux

Block devices provide buffering. This means that write() on a block device can return success, before the kernel has written the data to the device. A program can wait for all the buffered writes by calling fsync().

I have used dd (or cat) to write a filesystem image to a device. These commands do not call fsync() by default.

Next, suppose that I want to mount the written block device as a filesystem.

I suppose it is safest to e.g. use the sync command before mounting it. But what if I do not sync the block device? Is it possible that the filesystem might try to read some blocks, which have not yet been written to the device? Then could it read the old contents of the device, and not the correct data from the filesystem image?

My primary interest is in Linux behaviour. (And StackExchange encourages me to ask one specific question. I can upvote any alternative or historical behaviour as well though :-).

Best Answer

When the program closes the block device file, Linux flushes the associated cache, forcing the program to wait. This applies only to the last close(), however. It does not happen if something else still has the block device open, and that includes any partition of the same block device.

So in the general case, it is best to sync the device somehow.

And to be safe, the way you should sync the device is to run your dd command with the option conv=fsync. Without it, the kernel does not return write errors to dd, because write() has already reported success by the time the cached data is actually written out. You would only notice such an error if you looked in the kernel log (dmesg).
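As a minimal sketch of the advice above, the dd call can be wrapped in a small shell function so that the exit status is always available to check. The name write_image and the 4M block size are my own choices for illustration; only conv=fsync comes from the answer.

```shell
# write_image SRC DST  (hypothetical helper name)
# Copy SRC to DST with dd and fsync() the output before dd exits,
# so that a failed flush shows up as a non-zero exit status.
write_image() {
    src=$1
    dst=$2
    # bs=4M is an arbitrary choice; conv=fsync is the important part.
    dd if="$src" of="$dst" bs=4M conv=fsync
}
```

It could then be used as `write_image filesystem.img /dev/sdX && mount /dev/sdX /mnt`, where /dev/sdX and /mnt are placeholders for your own device and mount point.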

As well as waiting for all the cached writes, the last close() also drops all of the cache (kill_bdev()). I have verified this for myself, by watching the output of the free command.
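If you want to repeat that observation without reading the free output by eye, something like the following may help. It is only a sketch: it assumes the "buffers" figure reported by free corresponds to the Buffers line in /proc/meminfo, and buffers_kb is a name I made up.

```shell
# buffers_kb  (hypothetical helper name)
# Print the current size of the kernel's block device buffer cache in KiB,
# as reported on the "Buffers:" line of /proc/meminfo.
buffers_kb() {
    awk '/^Buffers:/ { print $2 }' /proc/meminfo
}
```

Run `buffers_kb` once after writing to the device while it is still held open, and once more after the last close(). The second number should be noticeably smaller if the cache was dropped, although other activity on the system also changes it.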

linux-4.20/fs/block_dev.c:1778

static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
{
    struct gendisk *disk = bdev->bd_disk;
    struct block_device *victim = NULL;

    mutex_lock_nested(&bdev->bd_mutex, for_part);
    if (for_part)
        bdev->bd_part_count--;

    if (!--bdev->bd_openers) {
        WARN_ON_ONCE(bdev->bd_holders);
        sync_blockdev(bdev);
        kill_bdev(bdev);

In case you are not familiar with C code, the last block above is equivalent to this:

    bdev->bd_openers = bdev->bd_openers - 1;
    if (bdev->bd_openers == 0) {
        WARN_ON_ONCE(bdev->bd_holders);
        sync_blockdev(bdev);
        kill_bdev(bdev);

Related Solutions

GNU/Linux – Overlay Block Device and Stackable Block Device

You can do that with the device mapper and its snapshot target.

Basically, you'd do the same as what LVM does when you create a writable snapshot.

dev=/dev/read-only-device
ovl=/path/to/overlay.file
newdevname=newdevice
size=$(blockdev --getsz "$dev")

loop=$(losetup -f --show -- "$ovl")
printf '%s\n' "0 $size snapshot $dev $loop P 8" |
  dmsetup create "$newdevname"
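To make the table line passed to dmsetup easier to read, here is the same string built by a small function with each field commented. The function name snapshot_table is hypothetical; the field meanings are as I understand them from the kernel's device-mapper snapshot documentation.

```shell
# snapshot_table SIZE ORIGIN COW  (hypothetical helper name)
# Print a one-line device-mapper table for a snapshot target.
snapshot_table() {
    size=$1     # length of the new device, in 512-byte sectors
    origin=$2   # device to be overlaid; the snapshot only reads from it
    cow=$3      # device that receives the copy-on-write data
    # Fields: start length target origin cow persistence chunk-size
    #   0  first sector of the mapping
    #   P  persistent: the snapshot survives a reboot
    #   8  chunk size in sectors (8 * 512 bytes = 4 KiB)
    printf '%s\n' "0 $size snapshot $origin $cow P 8"
}
```

With the variables from the commands above, `snapshot_table "$size" "$dev" "$loop" | dmsetup create "$newdevname"` should be equivalent to the printf pipeline.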

Then you can access the overlaid device as /dev/mapper/newdevice.

If you also need access to the original device at the same time, you can do:

printf '%s\n' "0 $size snapshot-origin $dev" |
  dmsetup create originaldevice

And access it over /dev/mapper/originaldevice.

You can write to that device too. In that case, in addition to the chunks written to the snapshot device, the overlay file will contain a copy of the chunks that have been overwritten when writing to the snapshot-origin.

The overlay file can be a sparse file (for instance, create it with truncate -s10G the-file), and it doesn't have to be as large as the original device. You can tell how full it is with dmsetup status "$newdevname".
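The two steps in that paragraph can be sketched as shell functions. Both names are hypothetical. The parser assumes the snapshot status line has the form "start length snapshot used/total metadata", which is what I understand the kernel documentation to describe; when the snapshot is reported as Invalid or Overflow there is no fraction to parse and the function returns non-zero.

```shell
# make_overlay PATH SIZE  (hypothetical helper name)
# Create (or resize) a sparse overlay file, then print its apparent size
# in bytes followed by the number of blocks actually allocated.
make_overlay() {
    path=$1
    size=$2
    truncate -s "$size" -- "$path" || return 1
    stat -c '%s %b' -- "$path"
}

# snapshot_percent  (hypothetical helper name)
# Read one "dmsetup status" line for a snapshot target on stdin and
# print how full the overlay is, as a whole-number percentage.
snapshot_percent() {
    awk '$3 == "snapshot" && split($4, u, "/") == 2 && u[2] > 0 {
             printf "%d\n", u[1] * 100 / u[2]
             found = 1
         }
         END { exit !found }'
}
```

For example, `make_overlay /path/to/overlay.file 10G` followed later by `dmsetup status "$newdevname" | snapshot_percent`.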

Note: There are size and contents requirements on a snapshot device.

Linux – Create a write-cache loop device for a much larger block device

You can use either dm-snapshot or NBD in copy-on-write mode.

The dm-snapshot solution is provided here (sorry for not repeating it):

https://raid.wiki.kernel.org/index.php/Recovering_a_failed_software_RAID#Making_the_harddisks_read-only_using_an_overlay_file

As for NBD, you can install nbd-server and nbd-client, and then use it like this:

mount /mnt/storage # something with some free space
losetup --read-only /dev/loop0 /dev/sda1 # loop device first, then the backing device; to ensure it's readonly
ln -s /dev/loop0 /mnt/storage/loop0
nbd-server 127.0.0.1@4242 /mnt/storage/loop0 -c

The symlink is necessary because nbd-server insists on storing the temporary write cache file in the same location as the file it is serving. So without the link it would end up in /dev/, which is not useful at all.

Finally connect to it with the client:

nbd-client 127.0.0.1 4242 /dev/nbd0

The only problem with this NBD solution is that it uses quite a lot of RAM (depending on your device size), regardless of temporary storage being available. Since fsck itself is also quite RAM hungry at times, it's possible to run out if you don't have a lot of RAM installed.
