Linux – Silent disk errors and reliability of Linux swap

error handling, linux, linux-kernel, swap

My understanding is that hard drives and SSDs implement some basic error correction inside the drive, and most RAID configurations (e.g. mdadm) depend on this to decide when a drive has failed to correct an error and needs to be taken offline. However, this depends on the storage being 100% accurate in its error diagnosis. That's not so, and a common configuration like a two-drive RAID-1 mirror will be vulnerable: suppose some bits on one drive are silently corrupted and the drive does not report a read error. Thus, file systems like btrfs and ZFS implement their own checksums, so as not to trust buggy drive firmware, glitchy SATA cables, and so on.

Similarly, RAM can also have reliability problems and thus we have ECC RAM to solve this problem.

My question is this: what's the canonical way to protect the Linux swap file from silent corruption / bit rot not caught by drive firmware on a two-disk configuration (i.e. using mainline kernel drivers)? It seems to me that a configuration that lacks end-to-end protection here (such as that provided by btrfs) somewhat negates the peace of mind brought by ECC RAM. Yet I cannot think of a good way:

  • btrfs does not support swapfiles at all. You could set up a loop device from a btrfs file and make a swap on that. But that has problems:
  • ZFS on Linux allows using a ZVOL as swap, which I guess could work: http://zfsonlinux.org/faq.html#CanIUseaZVOLforSwap – however, from my reading, ZFS is normally demanding on memory, and getting it working in a swap-only application sounds like it would take some work to figure out. I think this is not my first choice. Why you would have to use some out-of-tree kernel module just to have a reliable swap is beyond me – surely there is a way to accomplish this with most modern Linux distributions / kernels in this day & age?
  • There was actually a thread on a Linux kernel mailing list with patches to enable checksums within the memory manager itself, for exactly the reasons I discuss in this question: http://thread.gmane.org/gmane.linux.kernel/989246 – unfortunately, as far as I can tell, the patch died and never made it upstream for reasons unknown to me. Too bad, it sounded like a nice feature. On the other hand, if you put swap on a RAID-1 and the corruption is beyond the ability of the checksum to repair, you'd want the memory manager to try reading from the other drive before panicking or whatever, which is probably outside the scope of what a memory manager should do.

In summary:

  • RAM has ECC to correct errors
  • Files on permanent storage have btrfs to correct errors
  • Swap has ??? <— this is my question

Best Answer

We trust the integrity of the data retrieved from swap because the storage hardware has checksums, CRCs, and such.

In one of the comments above, you say:

"true, but it won't protect against bit flips outside of the disk itself"

"It" meaning the disk's checksums here.

That is true, but SATA uses 32-bit CRCs for commands and data. Thus, a corrupted transfer has roughly a 1 in 4 billion (1 in 2^32) chance of passing the CRC check undetected between the disk and the SATA controller. That means that a continuous error source could introduce an error as often as every 125 MiB transferred, but a rare, random error source like cosmic rays would cause undetectable errors at a vanishingly small rate.

Realize also that if you've got a source that causes an undetected error at a rate anywhere near one per 125 MiB transferred, performance will be terrible because of the high number of detected errors requiring re-transfer. Monitoring and logging will probably alert you to the problem in time to avoid undetected corruption.
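To make the CRC argument concrete, here is a minimal sketch of how a 32-bit check value catches a bit flip in transit. It is only an illustration: Python's zlib.crc32 stands in for the link-level CRC (it has the same 32-bit width, but I am not claiming it matches SATA's exact polynomial conventions or seed), and the 8 KiB payload and the flip_bit helper are hypothetical.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical 8 KiB payload standing in for one transferred frame.
payload = bytes(range(256)) * 32

# The sender computes a 32-bit check value and transmits it with the data.
sent_crc = zlib.crc32(payload)

# Simulate a single bit flip on the cable.
damaged = flip_bit(payload, 12345)

# The receiver recomputes the CRC; a mismatch means "ask for a re-transfer".
detected = zlib.crc32(damaged) != sent_crc

# Odds that a randomly corrupted frame still matches a 32-bit check value.
undetected_odds = 1 / 2**32

print("bit flip detected:", detected)
print(f"chance a random corruption slips through: {undetected_odds:.2e}")
```

A CRC always catches a single flipped bit; the 1 in 2^32 figure applies to corruption that scrambles a frame more or less at random, which is the case the answer is worried about.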

As for the storage medium's checksums, every SATA (and before it, PATA) disk uses per-sector checksums of some kind. One of the characteristic features of "enterprise" hard disks is larger sectors protected by additional data integrity features, greatly reducing the chance of an undetected error.

Without such measures, there would be no point to the spare sector pool in every hard drive: the drive itself could not detect a bad sector, so it could never swap fresh sectors in.

In another comment, you ask:

"if SATA is so trustworthy, why are there checksummed file systems like ZFS, btrfs, ReFS?"

Generally speaking, we aren't asking swap to store data long-term. The limit on swap storage is the system's uptime, and most data in swap doesn't last nearly that long, since most data that goes through your system's virtual memory system belongs to much shorter-lived processes.

On top of that, uptimes have generally gotten shorter over the years, what with the increased frequency of kernel and libc updates, virtualization, cloud architectures, etc.

Furthermore, most data in swap is inherently disused in a well-managed system, being one that doesn't run itself out of main RAM. In such a system, the only things that end up in swap are pages that the program doesn't use often, if ever. This is more common than you might guess. Most dynamic libraries that your programs link to have routines in them that your program doesn't use, but they had to be loaded into RAM by the dynamic linker. When the OS sees that you aren't using all of the program text in the library, it swaps it out, making room for code and data that your programs are using. If such swapped-out memory pages are corrupted, who would ever know?

Contrast this with the likes of ZFS where we expect the data to be durably and persistently stored, so that it lasts not only beyond the system's current uptime, but also beyond the life of the individual storage devices that comprise the storage system. ZFS and such are solving a problem with a time scale roughly two orders of magnitude longer than the problem solved by swap. We therefore have much higher corruption detection requirements for ZFS than for Linux swap.

ZFS and such differ from swap in another key way here: we don't RAID swap devices together. When multiple swap devices are in use on a single machine, it's a JBOD scheme, not like RAID-0 or higher (e.g. macOS's chained swap files scheme, Linux's swapon, etc.). Since the swap devices are independent, rather than interdependent as with RAID, we don't need extensive checksumming, because replacing a swap device doesn't involve looking at other interdependent swap devices for the data that should go on the replacement device. In ZFS terms, we don't resilver swap devices from redundant copies on other storage devices.
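On Linux you can see this arrangement of independent swap areas for yourself in /proc/swaps, which lists each active area with its own size, usage, and priority. The following is a rough sketch of reading that table; parse_swaps and the SAMPLE contents are made up for illustration, and on a real machine you would pass in the text of /proc/swaps instead.

```python
def parse_swaps(text: str) -> list[dict]:
    """Parse the table format used by /proc/swaps into one dict per swap area."""
    areas = []
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, kind, size_kib, used_kib, priority = fields[:5]
        areas.append(
            {
                "name": name,
                "type": kind,  # "partition" or "file"
                "size_kib": int(size_kib),
                "used_kib": int(used_kib),
                "priority": int(priority),
            }
        )
    return areas


# Made-up sample in the /proc/swaps layout: two independent swap areas.
SAMPLE = """\
Filename                Type        Size      Used   Priority
/dev/sda2               partition   8388604   0      -2
/swapfile               file        2097148   1024   -3
"""

for area in parse_swaps(SAMPLE):
    print(area["name"], area["type"], area["size_kib"], area["priority"])
```

Each row is a separate device or file that the kernel fills on its own; nothing in the table describes a relationship between the areas, which is the point being made above.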

All of this does mean that you must use a reliable swap device. I once used a $20 external USB HDD enclosure to rescue an ailing ZFS pool, only to discover that the enclosure was itself unreliable, introducing errors of its own into the process. ZFS's strong checksumming saved me here. You can't get away with such cavalier treatment of storage media with a swap file. If the swap device is dying, and is thus approaching that worst case where it could inject an undetectable error every 125 MiB transferred, you simply have to replace it, ASAP.

The overall sense of paranoia in this question devolves to an instance of the Byzantine generals problem. Read up on that, ponder the 1982 date on the academic paper describing the problem to the computer science world, and then decide whether you, in 2019, have fresh thoughts to add to this problem. And if not, then perhaps you will just use the technology designed by three decades of CS graduates who all know about the Byzantine Generals Problem.

This is well-trod ground. You probably can't come up with an idea, objection, or solution that hasn't already been discussed to death in the computer science journals.

SATA is certainly not utterly reliable, but unless you are going to join academia or one of the kernel development teams, you are not going to be in a position to add materially to the state of the art here. These problems are already well in hand, as you've already noted: ZFS, btrfs, ReFS... As an OS user, you simply have to trust that the OS's creators are taking care of these problems for you, because they also know about the Byzantine Generals.

It is currently not practical to put your swap file on top of ZFS or Btrfs, but if the above doesn't reassure you, you could at least put it atop xfs or ext4. That would be better than using a dedicated swap partition.