Yes, the Linux implementation of RAID1 speeds up disk read operations by a factor of up to two, as long as two separate disk read operations are performed at the same time. That means reading one 10 GB file won't be any faster on RAID1 than on a single disk, but reading two distinct 10 GB files *will* be faster.
To demonstrate it, just read some data with dd. Before each measurement, clear the disk read cache with sync && echo 3 > /proc/sys/vm/drop_caches; otherwise dd will report unrealistically fast reads served from the page cache.
Single file:
# COUNT=1000; dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT &
(...)
10485760000 bytes (10 GB) copied, 65,9659 s, 159 MB/s
Two files:
# COUNT=1000; dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT & dd if=/dev/md127 of=/dev/null bs=10M count=$COUNT skip=$COUNT &
(...)
10485760000 bytes (10 GB) copied, 64,9794 s, 161 MB/s
10485760000 bytes (10 GB) copied, 68,6484 s, 153 MB/s
Reading 10 GB of data took 66.0 seconds, whereas reading 10 GB + 10 GB = 20 GB of data took 68.6 seconds in total, which means concurrent disk reads benefit greatly from RAID1 on Linux. The skip=$COUNT part is very important: it makes the second process read 10 GB of data starting from a 10 GB offset, so the two processes don't read the same blocks.
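The two-reader test above can be wrapped in a small script. This is a sketch under assumptions: DEV defaults to a scratch file so the script can be tried safely without an array; set DEV=/dev/md127 (and a much larger COUNT) to benchmark a real md device.

```shell
#!/bin/sh
# Sketch of the parallel-read test above. DEV and COUNT are assumptions:
# DEV defaults to a scratch file for a safe dry run; point it at your
# array (e.g. /dev/md127) for a real measurement.
set -eu
DEV="${DEV:-/tmp/raid1-read-test.bin}"
COUNT="${COUNT:-2}"                     # 10 MiB blocks per reader

# Create a scratch file big enough for both readers when DEV is not a
# block device.
[ -b "$DEV" ] || dd if=/dev/zero of="$DEV" bs=10M count=$((COUNT * 2)) 2>/dev/null

# Drop the page cache so reads hit the disk; needs root, so ignore failure
# in a dry run.
sync && { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true

start=$(date +%s)
dd if="$DEV" of=/dev/null bs=10M count="$COUNT" 2>/dev/null &
dd if="$DEV" of=/dev/null bs=10M count="$COUNT" skip="$COUNT" 2>/dev/null &
wait                                    # both readers run concurrently
end=$(date +%s)
echo "parallel read of $((COUNT * 20)) MiB took $((end - start)) s"
```

The skip=$COUNT on the second reader keeps the two streams on disjoint regions, exactly as in the manual test.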
Jared's answer and ssh's comments referring to http://www.unicom.com/node/459 are wrong. The benchmark there appears to show that disk reads don't benefit from RAID1. However, the test was performed with the bonnie++ benchmarking tool, which doesn't perform two separate reads at one time. The author explicitly states that bonnie++ is not suitable for benchmarking RAID arrays (refer to its readme).
Alright, I figured it out with the help of this Trello link. In case anyone else wants to do this, here's the procedure.
Procedure
From a RAID1 array of two disks, one /dev/sda which is faulty and another /dev/sdc which is known-good:
- Disable auto-mounting of this array in /etc/fstab, then reboot. Basically, we want btrfs to forget this array exists, as there's a bug where it'll still try to use one of the drives if it's unplugged.
- Now that your array is unmounted, execute echo 1 | sudo tee /sys/block/sda/device/delete, replacing sda with the faulty device name. This causes the disk to spin down (you should verify this in dmesg) and become inaccessible to the kernel.
Alternatively: just take the drive out of the computer before booting! I chose not to opt for this method, as the above works fine for me.
- Mount your array with the -o degraded option.
- Begin a rebalancing operation with sudo btrfs balance start -f -mconvert=single -dconvert=single /mountpoint. This will reorganise the extents on the known-good drive, converting them to the single (non-RAID) profile. This can take the better part of a day, depending on the speed of your drive and the size of your array (mine had ~700 GiB and rebalanced at a rate of one 1 GiB chunk per minute). Luckily, this operation can be paused, and the array stays online while it runs.
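Since the balance runs for hours, it helps to be able to watch or pause it. These are standard btrfs balance subcommands, shown here with the same assumed /mountpoint as above:

```shell
sudo btrfs balance status /mountpoint   # show progress (chunks relocated so far)
sudo btrfs balance pause /mountpoint    # pause the running balance
sudo btrfs balance resume /mountpoint   # pick up where it left off
```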
- Once this is done, you can issue sudo btrfs device remove missing /mountpoint to remove the 'missing' faulty device.
- Begin a second rebalance with sudo btrfs balance start -mconvert=dup /mountpoint to restore metadata redundancy. This took a few minutes on my system.
- You're done! Your data is now in the single profile, with the RAID1 redundancy removed (metadata remains duplicated on the one drive thanks to the previous step).
- Take your faulty drive outside, and beat it with a hammer.
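For reference, the whole procedure above can be recapped as a dry-run script: run() only prints each command, so nothing touches your disks. The device names (sda faulty, /dev/sdc1 known-good) and /mountpoint are assumptions carried over from the steps above; substitute your own and remove the wrapper to execute for real.

```shell
#!/bin/sh
# Dry-run recap of the procedure above. run() prints instead of executing;
# sda (faulty), /dev/sdc1 (known-good) and /mountpoint are assumptions.
run() { echo "+ $*"; }

run tee /sys/block/sda/device/delete                  # spin down the faulty disk (feed it '1')
run mount -o degraded /dev/sdc1 /mountpoint           # mount the survivor degraded
run btrfs balance start -f -mconvert=single -dconvert=single /mountpoint  # convert to single
run btrfs device remove missing /mountpoint           # drop the missing device
run btrfs balance start -mconvert=dup /mountpoint     # re-duplicate metadata
```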
Troubleshooting
- Help, btrfs tried to write to my faulty disk, errored out, and forced it readonly!
- Did you follow step 1, and reboot before continuing? It's likely that btrfs still thinks the drive you spun down is present. Rebooting will cause btrfs to forget any errors, and will let you continue.
Best Answer
Let Btrfs do everything.
For one thing, Btrfs has its own integrated mirroring code, which can be smarter than mdadm's.
Of course, if a disk fails hard in a mirrored pair in an mdadm RAID10, you can replace the bad disk and move on with your life (albeit after a distressingly complex set of shell commands). The problem is when your disk fails a bit more softly: if a few blocks give back the wrong bits instead of returning the appropriate error codes for a bad block, then reading the data will randomly return bad data. Btrfs is smarter than that: it checksums every bit of data. To be honest I don't know if it's more correct to say "every B-tree node" or "every block", but the point is that when it reads some data from a mirrored array, it checks the checksum before giving it back to your userland process. If the checksum doesn't match, it consults the other mirror in the array, and if that copy has the correct checksum, it will alert you that your disk has started to silently fail.
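If you want to trigger that checksum verification across the whole array instead of waiting for a bad read, a scrub does exactly this. These are standard btrfs subcommands; /mnt is an assumed mountpoint:

```shell
sudo btrfs scrub start -B /mnt     # -B: run in the foreground, report when done
sudo btrfs device stats /mnt       # per-device read/write/checksum error counters
```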
The Btrfs wiki specifically addresses your question.
Finally, even without this substantial advantage, the command-line workflow for dealing with removed or added Btrfs devices is super simple. I'm not even sure I could get the degraded-mount-then-fix-your-filesystem shell commands right, but for Btrfs it's very clearly documented on the multiple devices page. At this point, if you have enough space on your remaining disks, you can always just btrfs balance and be done with it; there's no need to replace the mirror, as you would absolutely have to do with mdadm! And if you want to replace it, you can do btrfs device add first.
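As a concrete sketch of that last point (assumed replacement device /dev/sdd and mountpoint /mnt):

```shell
# Option A: enough free space on the remaining disks, just rebalance
sudo btrfs balance start /mnt

# Option B: add a replacement disk first, then rebalance onto it
sudo btrfs device add /dev/sdd /mnt
sudo btrfs balance start /mnt
```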