What’s an effective offsite backup strategy for a ZFS mirrored pool?

backup zfs

I use a ZFS pool consisting of two mirrored disks. To get offsite backups going, I've purchased two more disks.

My initial plan was to create the offsite backup by attaching a third disk to the mirror, waiting for ZFS to resilver, then detaching the drive and carrying it offsite. This works well enough, but I've been surprised that it appears to perform a full resilver every time a disk is attached (I'd read, and possibly misunderstood, that each attach would result in an incremental or delta resilver). As a result, backups take longer than is acceptable.

My requirements are to have an offsite copy of my zpool and all its snapshots that I can rotate daily. This means the resilvering needs to take at most 24 hours. Currently it's close to that, but our plans to grow the pool will push it beyond that window.
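To make the 24-hour constraint concrete, here is a rough back-of-the-envelope sketch. It assumes resilver time scales roughly with allocated data divided by sustained throughput, which is only an approximation; the function name and the example figures are hypothetical, not measurements from this pool.

```shell
#!/bin/sh
# Rough resilver-window estimate. Both inputs below are hypothetical examples.
# hours = allocated data (GiB) * 1024 (MiB per GiB) / throughput (MiB/s) / 3600 (s per hour)
resilver_hours() {
    awk -v gib="$1" -v mibps="$2" \
        'BEGIN { printf "%.1f\n", gib * 1024 / mibps / 3600 }'
}

# Example: 4 TiB allocated, resilvering at a sustained 60 MiB/s.
resilver_hours 4096 60
```

If the estimate for the planned pool size lands above 24, a daily rotation based on a full resilver will not fit, whatever the exact numbers turn out to be.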

How can I keep offsite backups that don't require a full resilvering operation? Should I be using a different filesystem on my backup drives (e.g. exporting an image rather than having them be part of the ZFS pool)? Should I have the backups in a separate pool and send new snapshots to it as they are created?

Best Answer

After much tinkering and experimentation I've found a solution, albeit with a fairly large tradeoff.

First off, the options I had to rule out:

  • Having a second offsite ZFS server with a mirrored pool wasn't an option due to cost. Had it been an option this would by far have been the best approach, utilizing ZFS send / receive to ship snapshots to the remote pool.

  • Having a second onsite ZFS mirrored pool, which I could remove disks from to take home. This is more feasible than the first option, but I would need the second pool to always have two disks onsite (or to use two data-copies on a single onsite disk). At present I have four disks, and no more space for a fifth in the server. This would be a fair approach but still not ideal.

  • Using zpool attach and zpool detach to rotate the backup disk into and out of the mirrored pool. This works well, but has to perform a full resilver every time the disk is added. This takes unacceptably long, and so I couldn't rely on this.
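For reference, the send / receive approach from the first two options could look roughly like the sketch below. The pool names (tank, backup), the snapshot names, and the replication_plan helper are all assumptions for illustration. The function only prints the commands, so nothing touches a pool until you run them yourself.

```shell
#!/bin/sh
# Sketch: incremental replication from the live pool to a separate backup pool.
# SRC_POOL and DST_POOL are hypothetical names, not taken from the post.
SRC_POOL="${SRC_POOL:-tank}"
DST_POOL="${DST_POOL:-backup}"

# Print the commands for one incremental run.
#   $1 = previous snapshot name (already present on both pools)
#   $2 = new snapshot name
# The very first run needs a full stream instead (zfs send -R without -i).
replication_plan() {
    prev="$1"
    new="$2"
    # Recursive snapshot of every dataset in the source pool.
    echo "zfs snapshot -r ${SRC_POOL}@${new}"
    # -R sends descendants and their snapshots, -i sends only the delta since prev.
    # -F rolls the target back to its latest snapshot first, -d maps names onto DST_POOL.
    echo "zfs send -R -i ${SRC_POOL}@${prev} ${SRC_POOL}@${new} | zfs receive -F -d ${DST_POOL}"
}

replication_plan daily-1 daily-2
```

Only blocks changed since the previous snapshot are transferred, which is why this route avoids the full-resilver cost. It does require the backup disk to be its own pool rather than a member of the mirror.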

My solution is similar to using attach and detach, but it uses zpool online and zpool offline instead. This has the advantage of performing a delta resilver rather than a full one, but the drawback that the pool always reports a DEGRADED state (the pool always has two disks; the rotating offsite disks are marked offline while they are in remote storage, then resilver and come online when they are back onsite).

So, a quick recap and overview of my setup:

I have one ZFS server and four identical disks. ZFS is set up to use a mirrored pool. Two of the four disks are permanent members of this pool. The other two disks rotate; one is always in offsite storage, the other is part of the pool to act as a ready-to-go backup.

When it comes time to rotate the backups:

  • I wait for a zpool scrub to complete to reasonably assure the backup disk is error free

  • I zpool offline the disk which will be taken offsite. After it's offlined I run hdparm -Y /dev/id to spin it down. After a minute I partially remove the disk sled (just enough to ensure it has lost power) and then give it another minute before fully pulling the drive to guarantee it has stopped spinning. The disk goes in a static bag and then a protective case and goes offsite.

  • I bring in the other offsite disk. It gets installed in the hotswap tray and spins up. I use zpool online to restore the disk to the pool and kick off a partial resilvering to bring it up to date.
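The rotation steps above can be sketched as a small script. The pool name, the vdev names, and the device path are placeholders (the post does not give them), and the run, rotate_out and rotate_in helpers are hypothetical. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Rotation sketch. POOL and the disk names below are placeholders, not values from the post.
POOL="${POOL:-tank}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take the outgoing disk out of service before pulling it.
#   $1 = vdev name as shown by zpool status, $2 = block device path for hdparm
rotate_out() {
    run zpool status "$POOL"          # check that the last scrub finished without errors
    run zpool offline "$POOL" "$1"    # mark the disk offline; the pool reports DEGRADED
    run hdparm -Y "$2"                # put the drive into sleep mode before removal
}

# Bring the returning disk back into the mirror.
#   $1 = vdev name as shown by zpool status
rotate_in() {
    run zpool online "$POOL" "$1"     # ZFS starts resilvering the disk on its own
    run zpool status "$POOL"          # shows resilver progress
}

rotate_out OUTGOING_VDEV /dev/OUTGOING_DISK
rotate_in RETURNING_VDEV
```

The physical steps (waiting, pulling the sled, packing the disk) still happen by hand between the two calls.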

This system guarantees that at any given time I have two ONLINE mirror disks and one OFFLINE remote disk (which has been scrubbed). The fourth disk is either being resilvered or online, which has the benefit that if a running drive fails, the pool will probably still consist of two online disks.

It's worked well for the past couple weeks, but I'd still consider this a hackish approach. I'll follow up if I run into any major issues.


Update: After running with this for a couple months I've found that in my real-world use the resilvering takes the same time for detach/attach as for offline/online. In my testing I don't think I was running a scrub. My hunch is that if a drive is offline for a scrub then it requires a full resilver.
