Linux – How to Replace a Failing Disk in RAID1 Btrfs

Tags: btrfs, hard-disk, linux

I have a two-disk btrfs filesystem, with both data and metadata in RAID1 (via the btrfs feature, not mdraid). The disks are USB3 drives, with dm-crypt on top. One of the disks is failing: it has several thousand bad sectors, and writes often time out. I have obtained a third USB drive to replace the failing one. How can I do the replacement?

Best Answer

This has turned out to be a royal PITA. Before anything else, it's important to note that btrfs now has a proper replace command, which is much better than the old approach of adding the new device and then removing the failing one.

Start by partitioning the new disk and setting up dm-crypt on it. Go ahead and unlock it.
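
A minimal sketch of that setup, assuming the new drive shows up as /dev/sdk (a hypothetical name) and you want a single partition with LUKS on top; adjust the device and mapping names to your own scheme:

# one GPT partition spanning the new disk (device name is an assumption)
parted --script /dev/sdk mklabel gpt mkpart primary 0% 100%
# set up LUKS on the new partition, then unlock it
cryptsetup luksFormat /dev/sdk1
cryptsetup open /dev/sdk1 luks-NEW-disk-uuid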

If your disk weren't having writes time out (which apparently take 360 s each!), you could do a simple:

btrfs replace start -r /dev/mapper/luks-BAD-disk-uuid              \
                       /dev/mapper/luks-NEW-disk-uuid /mount/path

However, that will wind up doing some fairly routine writes to the bad disk, and if those hit a timeout you'll see around 30 s of fast copying, followed by 6–12 minutes of idle waiting for the timeout to expire.

In order to avoid writes to it, it's possible to set up a snapshot using device-mapper. Reads will go to the underlying bad device (which is mostly OK with reads); writes will go to the copy-on-write (COW) storage. First, you need a suitably large block device for the COW storage. I created a new logical volume for it (Watt-sdj1_dmsnap). Any block device should work; even a loop device should be fine. I personally suggest a persistent one, just in case something goes wrong, but if you're living dangerously and have sufficient RAM, a RAM disk would work.
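
If you go the LVM route, creating the COW volume is a one-liner; here Watt is assumed to be the volume group name (inferred from the mapper name above), and the 8 GiB size is an arbitrary generous choice:

# carve out a COW volume in the Watt volume group (name and size are assumptions)
lvcreate --size 8G --name sdj1_dmsnap Watt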

Mine wound up requiring ~1.7 GB of COW space (to move 2.24 TiB off a 3 TB drive). I'd recommend being generous with the COW space; running out would probably be a bad thing, and you get to free it all up once you're done.

Next, you need to unmount the btrfs filesystem if it was mounted, and lock (stop) the dm-crypt device. I'm putting the snapshot below the encryption because I don't want unencrypted data written to disk.
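
For example, with the placeholder names used elsewhere in this answer:

# stop using the filesystem, then close the encrypted mapping on the bad disk
umount /mount/path
cryptsetup close luks-BAD-disk-uuid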

In my case, the partition is /dev/sdj1. To avoid any mistakes, set it read-only:

blockdev --setro /dev/sdj1
blockdev --setro /dev/sdj

(You can later set it back with --setrw.) Now, actually set up the snapshot:

dmsetup create sdj_divert --table \
    "0 $(blockdev --getsz /dev/sdj1) snapshot /dev/sdj1 /dev/mapper/Watt-sdj1_dmsnap PO 8"

To quickly explain what that means: a device-mapper table has the format start-sector number-of-sectors target-type target-arguments. The start sector is 0; the number of sectors is the same as the size of sdj1 (we want to cover the entire thing, after all); the target type is snapshot. The snapshot target takes several arguments: source-dev cow-dev mode chunk-size. We're giving a source device of /dev/sdj1; the COW device is that logical volume I created; the mode PO means persistent (metadata is written to disk, so the snapshot can be set back up after a reboot) and overflow (if we write too much to the snapshot, it signals overflow rather than silently failing, so recovery is possible). The chunk size is how granular the snapshot is: if we write even one byte, the entire chunk around that byte is copied over (and consumes space in the snapshot). 8 sectors is 4 KiB, so there won't be alignment issues.
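
You can sanity-check the result with dmsetup table, which prints the loaded table (with the devices shown as major:minor numbers rather than paths):

dmsetup table sdj_divert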

Now, finally, unlock the device again, but instead of unlocking /dev/sdj1, unlock /dev/mapper/sdj_divert. Then go ahead and mount the btrfs filesystem again.
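
In concrete terms (the mapping name and mount point are the same placeholders as above):

# unlock through the snapshot device instead of the raw partition
cryptsetup open /dev/mapper/sdj_divert luks-BAD-disk-uuid
# make sure the kernel sees all member devices, then mount via any member
btrfs device scan
mount /dev/mapper/luks-BAD-disk-uuid /mount/path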

You can check the snapshot usage with dmsetup status sdj_divert; that should give something like (but with a much lower number before the slash):

0 5860524928 snapshot 914216/545259520 3568

The first three fields are the start-sector, number-of-sectors, and target-type. The next field gives the number of COW sectors used (before the slash) and the total number available (after the slash), so used divided by total is the fraction of the snapshot space consumed. The final number is the number of sectors used for metadata, which is already included in the used count.
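
If you'd like to keep an eye on it while the replace runs, something like this works (the 60-second interval is arbitrary):

watch -n 60 dmsetup status sdj_divert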

At last, you can use that simple btrfs replace start command from the top of the answer. It returns immediately; watch the progress by running btrfs replace status /mount/path.

When the replace is done, confirm the bad device has been dropped from the filesystem (e.g., btrfs fi show /mount/path); then you can lock (close) the failing drive and remove the snapshot (dmsetup remove sdj_divert). Then you can free up the COW space (after wiping it, if you're paranoid).
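
Putting that teardown together, again with the placeholder names from above:

# close the encrypted mapping on the old disk, then drop the snapshot
cryptsetup close luks-BAD-disk-uuid
dmsetup remove sdj_divert
# optional paranoid wipe of the COW volume, then reclaim the space
shred -n 1 /dev/mapper/Watt-sdj1_dmsnap
lvremove Watt/sdj1_dmsnap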

There is one final, technically optional, step: my replacement device is larger, but btrfs isn't yet using the extra space. To make it available to btrfs, find the devid in the btrfs fi show output, and then run

btrfs fi resize DEVID:max /mount/path

That should be nearly instant.
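
For example, if btrfs fi show lists the new device as devid 3 (a hypothetical number; check your own output):

# grow the filesystem on the new device to fill the whole drive
btrfs fi resize 3:max /mount/path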
