First: remember to take the new drive offline and be sure that it's not mounted or in use in any way.
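One quick way to double-check (the device name ada3 is assumed from the steps below) is to confirm the drive appears in neither the mount table nor any pool:
% mount | grep ada3
% zpool status | grep ada3
Both commands should print nothing.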
Copy the partition table from the old disk ada0 to the new disk ada3:
% doas gpart backup ada0 | doas gpart restore -F ada3
Now ada3 has the same three partitions as ada0:
% doas gpart show ada3
=>        40  3907029088  ada3  GPT  (1.8T)
          40        1024     1  freebsd-boot  (512K)
        1064         984        - free -  (492K)
        2048     4194304     2  freebsd-swap  (2.0G)
     4196352  3902832640     3  freebsd-zfs  (1.8T)
  3907028992         136        - free -  (68K)
Remove old ZFS metadata (notice partition p3):
% doas dd if=/dev/zero of=/dev/ada3p3
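If you would rather not zero the whole 1.8 TB partition (the plain dd above will take hours), zpool labelclear can remove just the ZFS labels; a quicker alternative sketch:
% doas zpool labelclear -f /dev/ada3p3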
Replace drive (notice partition p3):
% doas zpool replace -f zroot 15120424524672854601 /dev/ada3p3
Make sure to wait until resilver is done before rebooting.
If you boot from pool 'zroot', you may need to update
boot code on newly attached disk '/dev/ada3p3'.
Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
Run the suggested command, substituting the actual new disk ada3 for da0, to update the boot code:
% doas gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada3
partcode written to ada3p1
bootcode written to ada3
UUIDs are now different:
% gpart list ada0 | grep uuid | sort
rawuuid: 7f842536-bcd0-11e8-b271-00259014958c
rawuuid: 7fbe27a9-bcd0-11e8-b271-00259014958c
rawuuid: 7fe24f3e-bcd0-11e8-b271-00259014958c
% gpart list ada3 | grep uuid | sort
rawuuid: 9c629875-c369-11e8-a2b0-00259014958c
rawuuid: 9c63d063-c369-11e8-a2b0-00259014958c
rawuuid: 9c66f76e-c369-11e8-a2b0-00259014958c
% gpart list ada0 | grep efimedia | sort
efimedia: HD(1,GPT,7f842536-bcd0-11e8-b271-00259014958c,0x28,0x400)
efimedia: HD(2,GPT,7fbe27a9-bcd0-11e8-b271-00259014958c,0x800,0x400000)
efimedia: HD(3,GPT,7fe24f3e-bcd0-11e8-b271-00259014958c,0x400800,0xe8a08000)
% gpart list ada3 | grep efimedia | sort
efimedia: HD(1,GPT,9c629875-c369-11e8-a2b0-00259014958c,0x28,0x400)
efimedia: HD(2,GPT,9c63d063-c369-11e8-a2b0-00259014958c,0x800,0x400000)
efimedia: HD(3,GPT,9c66f76e-c369-11e8-a2b0-00259014958c,0x400800,0xe8a08000)
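The changed UUIDs only matter if something references the old partitions by ID; for example, a gptid-based swap entry in /etc/fstab would need updating. A quick check:
% grep gptid /etc/fstab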
The drive is now resilvering:
% zpool status zroot
  pool: zroot
 state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Sep 29 01:01:24 2018
        64.7G scanned out of 76.8G at 162M/s, 0h1m to go
        15.7G resilvered, 84.22% done
config:

        NAME                        STATE     READ WRITE CKSUM
        zroot                       DEGRADED     0     0     0
          raidz2-0                  DEGRADED     0     0     0
            ada0p3                  ONLINE       0     0     0
            ada1p3                  ONLINE       0     0     0
            ada2p3                  ONLINE       0     0     0
            replacing-3             OFFLINE      0     0     0
              15120424524672854601  OFFLINE      0     0     0  was /dev/ada3p3/old
              ada3p3                ONLINE       0     0     0
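If you want to wait for the resilver from a script rather than watching it, one simple sketch (Bourne-shell syntax) is to poll the status output:
while zpool status zroot | grep -q 'resilver in progress'; do sleep 60; done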
After resilver:
% zpool status zroot
  pool: zroot
 state: ONLINE
  scan: resilvered 18.6G in 0h7m with 0 errors on Sat Sep 29 01:09:22 2018
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
            ada2p3  ONLINE       0     0     0
            ada3p3  ONLINE       0     0     0
errors: No known data errors
There are basically two ways of growing a ZFS pool.
Add more vdevs
This is what user1133275 is suggesting in their answer. It's done with zpool add (which has basically the same syntax for specifying storage as zpool create), and it works well for what it does.
ZFS won't rebalance your stored data automatically, but it will favor the new vdev for new writes until it has about the same usage as the existing one(s).
Once you've added a vdev to a pool, you basically cannot remove it without recreating the pool from scratch.
All vdevs in a pool need to be above their respective redundancy thresholds for the pool to be importable. In other words, every vdev needs to be at least DEGRADED for the pool to function.
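As a sketch, adding a second six-disk raidz2 vdev to a hypothetical pool tank (device names made up here) looks like:
% doas zpool add tank raidz2 da6 da7 da8 da9 da10 da11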
Replace disks with larger ones
This is what you're discussing in your question. It's the normal way of growing a ZFS pool when you have a pool layout that you are happy with.
To replace a device with a new one, the new device needs to be at least as large as the old one.
Operationally, you'd hook up the new disk along with the old, and then zpool replace the old disk with the new one. (This creates a temporary replacing device which becomes a parent to the old and new disk; when the resilver completes, the replacing device is removed from the device tree and it looks like the new device was there all along.) Once the resilver completes, the old disk can be removed from the system.
Once all disks in a vdev are replaced by larger ones, you can expand the pool by running zpool online -e on each device, or by having the autoexpand property set to on (though I wouldn't really recommend the latter; pool expansion should be a conscious decision).
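A sketch of the whole cycle on a hypothetical pool tank (device names made up; repeat the replace-and-wait step for each disk in the vdev):
% doas zpool replace tank da1 da9
% zpool status tank
% doas zpool online -e tank da9
The status check is there to make sure the resilver has completed before you touch the next disk; zpool online -e then expands the vdev once every member is larger.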
So which way is better?
That basically depends on your pool. As mentioned, the downside to having multiple vdevs is that they all need to be functional, so by adding vdevs you are actually, in a sense, reducing your safety margin. The upside, though, is that it's much easier to grow the pool piecemeal. Replacing devices in-place is basically the opposite; you don't need to keep as many vdevs functioning, but it isn't as easy to grow a pool piecemeal.
For me, frankly, assuming for a second that you're using rotational hard disks (since this seems like bulk storage), 20 TB is still well within reason for a single-vdev pool. My suggestion in your situation would be to get six 8 TB drives and set them up in a single raidz2 vdev. That gives you a net storage capacity of around 32 TB, leaving you about 35% free initially, plus the ability to lose any two drives before any of your data is at significant risk. You could also consider eight 6 TB drives for a net capacity of around 36 TB, starting out at 45% free. (I'd consider 6-8 drives to be slightly on the large end for raidz1, but fine for raidz2.)
Then plan to replace those drives either on a 4-5 year schedule (due to wear) or whenever the pool goes above about 80% full (because ZFS is much, much happier when it has good headroom). If your figures are accurate, you should be replacing those drives due to wear well before your pool starts getting full, while still allowing for a reasonable amount of unexpected growth in storage needs. When you replace the drives, you can decide whether you're happy with the pool size you've got based on then-current usage, or whether you want larger drives to expand the pool.
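For reference, the capacity figures above follow from the usual raidz2 rule of thumb: with N drives of size S, usable space is roughly (N - 2) × S. Six 8 TB drives give (6 - 2) × 8 = 32 TB, and eight 6 TB drives give (8 - 2) × 6 = 36 TB, before ZFS overhead.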
Best Answer
Don't know if things were that different in '13, but 'zpool replace' works on non-redundant pools. You just run the one command instead of detaching first.
d1 is 1G, d2 is 2G, both are empty files in /tmp: