Linux – Do Filesystems Inform Block Devices When Blocks are No Longer Required?

block-device linux-kernel

Traditionally, storage devices (hard drives) were assumed to have no mechanism to "delete" data beyond simply overwriting it. I can see a few scenarios where it would be useful for a block device to be informed that part of its underlying storage is no longer required, but I don't see any mechanism to do so.

Use cases:

  • Almost all modern SSDs use wear leveling to extend their life. This is achievable by simply having more blocks of internal storage than the reported size and cycling between them. But if the SSD were told that blocks were no longer required, it would have a much larger pool to cycle through.
  • File systems created in RAM (NOT including tmpfs). When files are deleted, the underlying ramdisk cannot return the allocated space to free RAM if the file system can't report that the space is no longer required.
  • Swap solutions such as those using zram need to inform the block device when pages are no longer used in swap space, or they would leave a significant amount of "junk" sitting in RAM.

This looks like a similar concept to FALLOC_FL_PUNCH_HOLE, but from what I can read there, that is purely for deallocating space from a file in a file system. That is to say, a user-space application can inform a file system that space is not needed. But that's not the same as a file system informing a block device that space isn't needed, or is it?

So is it the case that each scenario needs a workaround, or is there a mechanism that allows file systems and swap to inform block devices when blocks are no longer needed?

Best Answer

On Linux, file systems can inform the block layer that one or more blocks are no longer required, using blkdev_issue_discard. In practice file systems use this to discard blocks when the corresponding behaviour is requested, typically by mounting a file system with a “discard” option. Intermediate layers also use this request to propagate discards, e.g. in the MD layer.

This isn’t done by default; the ext4 manpage says “it is off by default until sufficient testing has been done”, but as TooTea reminded me, many SSDs don’t cope well with constant discards so the recommended approach is to periodically run fstrim instead. Most file systems’ default behaviour is to internally mark blocks as unused when the corresponding content is deleted, without informing underlying layers of this fact. This is what allows file contents to be recovered after accidental deletion, whether by using file system-specific “undelete” utilities, or block device exploration tools such as PhotoRec. Marking unused blocks without further processing also allows file deletions to be performed quickly.

The fact that, absent explicit discards, file systems don’t do much processing when blocks are no longer necessary has meant that thin provisioning involves more work than might have been hoped. Thus, Xen includes specific support for thin provisioning of Ext3 file systems — the block layer there “knows” about the file system it’s storing, and exploits that to identify blocks which aren’t needed, without the file system explicitly informing it of anything. On VMware, thin provisioning, or rather, identifying unused blocks to reduce a thin provisioned block device’s storage requirements, requires zeroing out unused blocks and running an analysis tool. SAN-based thin provisioning systems have similar support. (With discard support, thin provisioning becomes much easier — thin provisioned volumes advertise support for trimming, and the file systems do the rest.)

FALLOC_FL_PUNCH_HOLE is, as you describe, a file system-level operation, but when the underlying file system supports it and is mounted with the appropriate discard option, it will result in blocks being discarded.
