A ZFS or LVM or MD redundant heterogeneous storage proposal

Tags: hard-drive, lvm, raid-1, software-raid, zfs

I have the same problem most people have: how to create a reliable personal storage solution, given that:

  1. Hard drives fail with alarming regularity. Losing files is unacceptable.
  2. I will buy a new HDD from time to time. Inevitably, the best price/GB is a different size than the last HDD purchase.
  3. Point 2 means that over time I have a heterogeneous collection of disks. I want to use them all, and failed disks will generally be replaced by larger disks.
  4. Data integrity and reliability are more important to me than speed.

So after banging my head against this problem for a few days (and in the back of my head for years), I propose the following solution. I will describe a solution I have tested based on native Linux ZFS, which is available in an Ubuntu PPA, but LVM, MD, and btrfs can be used to achieve the same thing. For this I will use RAID1 (ZFS mirror vdevs).

  1. Given your set of drives, group them into two sets of disks, such that the capacity of each set is as near to the other as possible.
  2. Partition the larger disks such that each has a partition exactly the same size as one of the smaller disks in the other group.
  3. Create mirror vdevs such that each disk has its mirror on another disk.

For example, consider a disk set of a new 2TB drive, an older 750GB drive, two older 400GB drives, and an older 500GB drive. The optimal mirrored partitioning has 2TB of usable space and is described in the following diagram, where ':' separates partitions and '|' separates disks:

+------------------------------------------------------------------+
| 2TB (sda1)        : (sda2)       : (sda3)       : (sda4)         |
+------------------------------------------------------------------+--+
| 750 GB (sdb)      | 400 GB (sdc) | 400 GB (sdd) | 500 GB (sde1)  :XX|
+---------------------------------------------------------------------+
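
Partitioning the 2TB disk to line up with the smaller disks could look roughly like the following (a sketch using sgdisk; the sizes are illustrative and in practice should match the exact sizes of the smaller disks):

# rough sketch only: one partition on sda for each smaller disk it will mirror
sgdisk -n 1:0:+750G /dev/sda   # sda1 pairs with the 750GB sdb
sgdisk -n 2:0:+400G /dev/sda   # sda2 pairs with the 400GB sdc
sgdisk -n 3:0:+400G /dev/sda   # sda3 pairs with the 400GB sdd
sgdisk -n 4:0:0     /dev/sda   # sda4 (remaining ~450GB) pairs with sde1
sgdisk -n 1:0:+450G /dev/sde   # sde1, leaving the small XX remainder unused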

Create your zpool as

zpool create archive mirror /dev/sda1 /dev/sdb mirror /dev/sda2 /dev/sdc mirror /dev/sda3 /dev/sdd mirror /dev/sda4 /dev/sde1

This creates 4 mirrored vdevs. If any one of the disks fails, it can be replaced (with a disk of any size) and partitioned to recreate the missing partitions. It's important to note that ZFS vdevs can be added to a pool but not removed. So, if at all possible, when you purchase a new drive you want to rearrange the existing vdevs rather than add new ones. Let's say the next purchase is a 3TB drive. The optimal configuration is then 3.5TB usable, as described in the following diagram, and is now 5 mirrored vdevs. This can be achieved by appropriate partitioning and by successively failing and repartitioning the drives.

+--------------------------------------------------------------+-------------+
| 3 TB (sdf1)       : (sdf2)      : (sdf3)      : (sdf4)       | 500GB (sde) |
+--------------------------------------------------------------+-------------+-+
| 2TB (sda1)        | 400GB (sdb) | 400GB (sdc) | 750GB (sdd1) : (sdd2)      :X| 
+------------------------------------------------------------------------------+
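
The failing and re-attaching described above might look roughly like this (a sketch using the device names from the first layout; the replacement device sdX is hypothetical):

# if, say, the 400GB sdc dies, partition the replacement and resilver just that mirror:
zpool replace archive /dev/sdc /dev/sdX1

# rearranging around the new 3TB drive is done one mirror at a time:
zpool detach archive /dev/sdd              # temporarily drop one half of an existing mirror
# ...repartition the freed disk for the new layout, then re-attach it to resilver:
zpool attach archive /dev/sda3 /dev/sdd1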

Maintaining this pairing of mirrored drives could also be done with LVM or with MD RAID, the idea being to make sure each drive always has a mirror drive or partition. Because everything is mirrored, we are free to fail drives and rearrange partitions when drives are added or removed. Using LVM or MD it would also be possible to remove drives and shrink the array, if desired, at the expense of the less sophisticated recovery tools compared to ZFS or BTRFS.
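
With MD plus LVM, for example, the equivalent pairing could be built roughly like this (a sketch; the md device numbers are hypothetical):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdc
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdd
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sda4 /dev/sde1
# then glue the four mirrors into one volume with LVM
pvcreate /dev/md0 /dev/md1 /dev/md2 /dev/md3
vgcreate archive /dev/md0 /dev/md1 /dev/md2 /dev/md3
lvcreate -l 100%FREE -n data archive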

Any comments on this procedure? A good script could handle the lossless allocation and rearrangement of drives. Any comments on LVM vs. MD vs. ZFS? Any comments on performance of the resulting weirdly partitioned array? Will data arrangement across multiple partitions on the same drive cause excessive head seeking and early failure?

BTRFS devs: everyone wants this, and LVM or MD are not technically necessary (and, in my opinion, sub-optimal). Making it easy to maintain a redundant heterogeneous array would be a killer feature for btrfs. As it is, this is a hack on top of LVM/MD/ZFS. Minimizing resilver/resync time is massively desirable.

Yes, this is obviously a poor-man's Drobo. One shouldn't need dedicated hardware for that…

Best Answer

I have tested this with ZFS, and write performance is about half of what it should be, because ZFS distributes reads and writes over all vdevs (and therefore splits I/O between several places on the same disk). Thus the speed is limited by the speed of the disk with the most partitions. Read speed seems to be equal to the disk bandwidth. Note that a mirrored pair of ZFS partitions on two different disks has roughly double the read speed of either single disk, because ZFS can read from both in parallel.

Using MD LINEAR arrays or LVM to create the two halves gives twice the write performance of the ZFS proposal above, but has the disadvantage that LVM and MD have no idea where data is actually stored. In the event of a disk failure or upgrade, one side of the array must be entirely destroyed and resynced/resilvered, followed by the other side (i.e. the resync/resilver has to copy 2× the size of the array).

It therefore seems that the optimal solution is to create a single ZFS mirror vdev across two LVM or MD LINEAR devices which combine the disks into equal-sized "halves". This has roughly twice the read bandwidth of any single disk, and write bandwidth equal to that of the individual disks.
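
For the original example disks, that layout could be built roughly as follows (a sketch; device names are from the first example above):

# concatenate the smaller disks into one ~2TB "half" with MD LINEAR
mdadm --create /dev/md0 --level=linear --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde1
# mirror the concatenation against the single 2TB disk with a single ZFS vdev
zpool create archive mirror /dev/sda /dev/md0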

Using BTRFS raid1 instead of ZFS also works, but gives half the read bandwidth, because ZFS distributes its reads across both halves to double the bandwidth while it appears BTRFS does not (according to my tests). BTRFS has the advantage that partitions can be shrunk, while they cannot be with ZFS (so if, after a failure, you have lots of empty space, with BTRFS it is possible to rebuild a smaller redundant array by shrinking the filesystem and then rearranging the disks).
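
The BTRFS equivalent, and the shrink operation, could look roughly like this (a sketch; the mount point and sizes are hypothetical):

mkfs.btrfs -m raid1 -d raid1 /dev/sda /dev/md0
# after a failure, a mounted BTRFS filesystem can be shrunk before rearranging disks, e.g.:
btrfs filesystem resize 1:-500g /mnt/archive   # shrink device 1 by 500GB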

This is tedious to do by hand but easy with some good scripts.
