Linux – Do Filesystems Inform Block Devices When Blocks are No Longer Required?

block-device linux-kernel

Traditionally, storage devices (hard drives) were assumed to have no mechanism to "delete" data beyond simply overwriting it. I can see a few scenarios where it would be useful for a block device to be informed that part of its underlying storage is no longer required, but I don't see any mechanism to do so.

Use cases:

  • Almost all modern SSDs use wear leveling to extend their life. This is achievable by simply having more blocks of internal storage than the reported size and cycling between them. But if the SSD were told that blocks were no longer required, it would have a much larger pool to cycle through.
  • File systems created in RAM (NOT including tmpfs). When files are deleted, the underlying ramdisk cannot return the allocated space to free RAM if the file system can't report that the space is no longer required.
  • Swap solutions such as those using zram need to inform the block device when pages are no longer used in swap space, or they would leave a significant amount of "junk" sitting in RAM.

This looks like a similar concept to FALLOC_FL_PUNCH_HOLE, but from what I can read there, that is purely for deallocating space from a file in a file system. That is to say, a user-space application can inform a file system that space is not needed. But that's not the same as a file system informing a block device that space isn't needed, or is it?

So is it the case that each scenario needs a workaround, or is there a mechanism that allows file systems and swap to inform block devices when blocks are no longer needed?

Best Answer

On Linux, file systems can inform the block layer that one or more blocks are no longer required, using blkdev_issue_discard. In practice file systems use this to discard blocks when the corresponding behaviour is requested, typically by mounting a file system with a “discard” option. Intermediate layers also use this request to propagate discards, e.g. in the MD layer.

This isn’t done by default; the ext4 manpage says “it is off by default until sufficient testing has been done”, but as TooTea reminded me, many SSDs don’t cope well with constant discards so the recommended approach is to periodically run fstrim instead. Most file systems’ default behaviour is to internally mark blocks as unused when the corresponding content is deleted, without informing underlying layers of this fact. This is what allows file contents to be recovered after accidental deletion, whether by using file system-specific “undelete” utilities, or block device exploration tools such as PhotoRec. Marking unused blocks without further processing also allows file deletions to be performed quickly.

The fact that, absent explicit discards, file systems don’t do much processing when blocks are no longer necessary has meant that thin provisioning involves more work than might have been hoped. Thus, Xen includes specific support for thin provisioning of Ext3 file systems — the block layer there “knows” about the file system it’s storing, and exploits that to identify blocks which aren’t needed, without the file system explicitly informing it of anything. On VMware, thin provisioning, or rather, identifying unused blocks to reduce a thin provisioned block device’s storage requirements, requires zeroing out unused blocks and running an analysis tool. SAN-based thin provisioning systems have similar support. (With discard support, thin provisioning becomes much easier — thin provisioned volumes advertise support for trimming, and the file systems do the rest.)

FALLOC_FL_PUNCH_HOLE is, as you describe, a file system-level operation, but when the underlying file system supports it and is mounted with the appropriate discard option, it will result in blocks being discarded.
