How to one-way mirror an entire zfs pool to another zfs pool

Tags: replication, zfs

I have one zfs pool containing several zvols and datasets of which some are also nested.
All datasets and zvols are periodically snapshotted by zfs-auto-snapshot.
All datasets and zvols also have some manually created snapshots.

I have set up a remote pool, but due to lack of time the initial copy over the local high-speed network via zfs send -R did not complete (some datasets are missing, and some datasets have outdated or missing snapshots).

Now the pool is physically remote, behind a slow connection, and I need to periodically sync it with the local pool. That means: data present in the local pool must be copied to the remote pool, and data that is gone from the local pool, or that is present in the remote pool but not in the local pool, must be deleted from the remote pool. By 'data' I mean zvols, datasets or snapshots.

If I was doing this between two regular filesystems using rsync, it would be "-axPHAX --delete" (that's what I actually do to back up some systems).

How do I set up a synchronization task so that the remote pool's zvols and datasets (including their snapshots) stay in sync with the local zvols, datasets and snapshots?

I would like to avoid transferring over ssh, because of low throughput performance of ssh; I'd prefer mbuffer or iscsi instead.

Best Answer

Disclaimer: As I've never used zvols, I cannot say if they are any different in replication than normal filesystems or snapshots. I assume they are, but do not take my word for it.

Your question is actually several questions; I will try to answer them separately:

How to replicate/mirror complete pool to remote location

You need to split the task into two parts: first, the initial replication has to complete; afterwards, incremental replication is possible, as long as you do not mess with your replication snapshots. To enable incremental replication, you need to preserve the most recent replication snapshot; everything before that can be deleted. If you delete that snapshot, zfs recv will complain and abort the replication. In this case you have to start all over again, so try not to do this.
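Since incremental replication depends on a snapshot that exists on both sides, it can help to check which one that is before sending anything. The following is only a minimal sketch: the function name `latest_common_snapshot` and the two list files are hypothetical, and it assumes each file holds one snapshot name per line (the part after the `@`) for a single dataset, with the local list sorted oldest first.

```sh
#!/bin/sh
# latest_common_snapshot LOCAL_LIST REMOTE_LIST   (hypothetical helper)
# LOCAL_LIST:  snapshot names of one dataset on the source, oldest first.
# REMOTE_LIST: snapshot names of the same dataset on the destination.
# Prints the newest name that appears in both lists, or nothing at all.
latest_common_snapshot() {
    # -F fixed strings, -x whole-line matches, -f read patterns from a file
    grep -Fx -f "$2" "$1" | tail -n 1
}

# The lists could be produced like this (assumed dataset name tank/data):
#   zfs list -H -o name -t snapshot -s creation -d 1 tank/data \
#       | sed 's/.*@//' > local.txt
#   ssh remotehostname zfs list -H -o name -t snapshot -d 1 tank/data \
#       | sed 's/.*@//' > remote.txt
#   latest_common_snapshot local.txt remote.txt
```

If the function prints nothing, the two sides share no snapshot for that dataset and a full send is needed for it.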

If you just need the correct options, they are:

zfs send:
- -R: send everything under the given pool or dataset (recursive replication, needed all the time, includes -p). Also, when the stream is received with zfs recv -F, all deleted source snapshots are deleted on the destination.
- -I: include all intermediate snapshots between the last replication snapshot and the current replication snapshot (needed only with incremental sends)
zfs recv:
- -F: force a rollback of the destination to its most recent snapshot before receiving; combined with a -R stream, this also deletes snapshots and datasets on the destination that no longer exist on the source
- -d: discard the name of the source pool and replace it with the destination pool name (the rest of the filesystem path is preserved and, if needed, created)
- -u: do not mount filesystem on destination

If you prefer a complete example, here is a small script:

#!/bin/sh

# Setup/variables:

# Each snapshot name must be unique, timestamp is a good choice.
# You can also use Solaris date, but I don't know the correct syntax.
snapshot_string=DO_NOT_DELETE_remote_replication_
timestamp=$(/usr/gnu/bin/date '+%Y%m%d%H%M%S')
source_pool=tank
destination_pool=tank
new_snap="$source_pool"@"$snapshot_string""$timestamp"
destination_host=remotehostname

# Initial send (first run only):

# Create first recursive snapshot of the whole pool.
zfs snapshot -r "$new_snap"
# Initial replication via SSH.
zfs send -R "$new_snap" | ssh "$destination_host" zfs recv -Fdu "$destination_pool"

# Incremental sends (every later run, instead of the initial send):

# Get old snapshot name (last matching snapshot of the top-level dataset).
old_snap=$(zfs list -H -o name -t snapshot -r "$source_pool" | grep "^$source_pool@$snapshot_string" | tail -n 1)
# Create new recursive snapshot of the whole pool.
zfs snapshot -r "$new_snap"
# Incremental replication via SSH.
zfs send -R -I "$old_snap" "$new_snap" | ssh "$destination_host" zfs recv -Fdu "$destination_pool"
# Delete older snaps on the local source (grep -v inverts the selection)
delete_from=$(zfs list -H -o name -t snapshot -r "$source_pool" | grep "$snapshot_string" | grep -v "$timestamp")
for snap in $delete_from; do
    zfs destroy "$snap"
done
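The script above lists the initial and the incremental send one after the other, but only one of them applies on any given run. One possible way to choose between them is sketched below. The helper name `send_args` is hypothetical, and the sketch assumes snapshot names without whitespace, because its output is meant to be expanded unquoted.

```sh
#!/bin/sh
# send_args OLD_SNAP NEW_SNAP   (hypothetical helper)
# Prints the arguments for "zfs send": a full recursive stream when
# OLD_SNAP is empty, otherwise an incremental stream from OLD_SNAP.
send_args() {
    old_snap=$1
    new_snap=$2
    if [ -z "$old_snap" ]; then
        printf '%s\n' "-R $new_snap"
    else
        printf '%s\n' "-R -I $old_snap $new_snap"
    fi
}

# Possible use inside the script (left unquoted on purpose so the
# arguments are split; assumes names contain no whitespace):
#   zfs send $(send_args "$old_snap" "$new_snap") \
#       | ssh "$destination_host" zfs recv -Fdu "$destination_pool"
```

With this, `old_snap` would have to be looked up before the new snapshot is created, as in the incremental part of the script.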

Use something faster than SSH

If you have a sufficiently secured connection, for example an IPsec or OpenVPN tunnel and a separate VLAN that only exists between sender and receiver, you may switch from SSH to unencrypted alternatives like mbuffer as described here, or you could use SSH with weak/no encryption and disabled compression, which is detailed here. There was also a website about recompiling SSH to be much faster, but unfortunately I don't remember the URL - I'll edit it in later if I find it.
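As a rough illustration of the mbuffer variant, the two sides could look like the command lines printed by the sketch below. The port 9090, the 128k block size and the 1G buffer are assumed values, not recommendations, and the function names are hypothetical. The functions only print the commands so they can be reviewed first; the receiver has to be started before the sender.

```sh
#!/bin/sh
# Assumed values, adjust to your network and memory.
mbuffer_port=9090
mbuffer_opts="-s 128k -m 1G"

# receiver_cmd DEST_POOL: command line to run on the destination host.
# mbuffer -I listens on the given TCP port and writes to stdout.
receiver_cmd() {
    printf 'mbuffer %s -I %s | zfs recv -Fdu %s\n' \
        "$mbuffer_opts" "$mbuffer_port" "$1"
}

# sender_cmd DEST_HOST OLD_SNAP NEW_SNAP: command line for the source host.
# mbuffer -O sends its stdin to host:port over plain, unencrypted TCP.
sender_cmd() {
    printf 'zfs send -R -I %s %s | mbuffer %s -O %s:%s\n' \
        "$2" "$3" "$mbuffer_opts" "$1" "$mbuffer_port"
}

# Example:
#   receiver_cmd tank
#   sender_cmd remotehostname tank@old tank@new
```

Because nothing is encrypted or authenticated here, this only makes sense inside the kind of secured tunnel described above.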

For very large datasets and slow connections, it may also be useful to do the first transmission via hard disk (use an encrypted disk to store the zpool and ship it in a sealed package via courier, mail or in person). As the method of transmission does not matter for send/recv, you can pipe everything to the disk, export the pool, send the disk to its destination, import the pool and then transmit all incremental sends via SSH.

The problem with messed up snapshots

As stated earlier, if you delete/modify your replication snapshots, you will receive the error message

cannot send 'pool/fs@name': not an earlier snapshot from the same fs

which means either your command was wrong or you are in an inconsistent state where you must remove the snapshots and start all over.

This has several negative implications:

- You cannot delete a replication snapshot until the new replication snapshot was successfully transferred. As these replication snapshots include the state of all other (older) snapshots, empty space of deleted files and snapshots will only be reclaimed if the replication finishes. This may lead to temporary or permanent space problems on your pool which you can only fix by restarting or finishing the complete replication procedure.
- You will have many additional snapshots, which slows down the list command (except on Oracle Solaris 11, where this was fixed).
- You may need to protect the snapshots against (accidental) removal, except by the script itself.
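One way to approach the last point might be a user hold, since zfs destroy refuses to remove a snapshot while a hold exists on it. This is only a sketch that I have not run against a real replication setup: the tag name `replication` is an assumption, the function names are hypothetical, and the functions merely print the commands.

```sh
#!/bin/sh
# Assumed tag name for the hold.
hold_tag=replication

# hold_cmd SNAPSHOT: place a hold on SNAPSHOT and, because of -r, on the
# snapshots of the same name in all descendant datasets.
hold_cmd() {
    printf 'zfs hold -r %s %s\n' "$hold_tag" "$1"
}

# release_cmd SNAPSHOT: remove the hold again so the snapshot can be
# destroyed once a newer replication snapshot has been transferred.
release_cmd() {
    printf 'zfs release -r %s %s\n' "$hold_tag" "$1"
}

# Example:
#   hold_cmd tank@DO_NOT_DELETE_remote_replication_20240101000000
#   release_cmd tank@DO_NOT_DELETE_remote_replication_20240101000000
```

A replication script that destroys old replication snapshots would then have to release the hold before each destroy.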

There exists a possible solution to those problems, but I have not tried it myself. You could use zfs bookmark, a new feature in OpenSolaris/illumos created specifically for this task. This would free you of snapshot management. The only downside is that at present, it only works for single datasets, not recursively. You would have to save a list of all your old and new datasets and then loop over them, bookmarking, sending and receiving them, and then updating the list (or small database, if you prefer).
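The per-dataset loop described above could be sketched roughly as follows. This is untested and rests on several assumptions: the function name `bookmark_plan` is hypothetical, dataset names are identical on both hosts, the destination already holds the snapshot that the previous bookmark was created from, and `zfs send -i` accepts a bookmark as the incremental source on your platform. The function only prints the commands it would run.

```sh
#!/bin/sh
# bookmark_plan PREV NEW DEST_HOST   (hypothetical helper)
# Reads dataset names on stdin, one per line. PREV is the name of the
# previous bookmark, NEW the name of the new snapshot and bookmark.
bookmark_plan() {
    prev=$1
    new=$2
    host=$3
    while IFS= read -r ds; do
        if [ -n "$ds" ]; then
            printf 'zfs snapshot %s@%s\n' "$ds" "$new"
            printf 'zfs send -i %s#%s %s@%s | ssh %s zfs recv -Fu %s\n' \
                "$ds" "$prev" "$ds" "$new" "$host" "$ds"
            printf 'zfs bookmark %s@%s %s#%s\n' "$ds" "$new" "$ds" "$new"
        fi
    done
}

# Example (dataset list taken from the source pool):
#   zfs list -H -o name -t filesystem,volume -r tank \
#       | bookmark_plan repl1 repl2 remotehostname
```

Once the bookmark exists, the source snapshot is no longer needed as the base for the next incremental send, which is the point of the approach; the destination still has to keep its copy of the snapshot.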

If you try the bookmark route, I would be interested to hear how it worked out for you!

Data at rest / on-disk consistency

I don't care about errors arising to other software or remote connections by doing this, or aborting any long file transfers prematurely, I just want it to offline the pool in the fastest way that's consistent with retaining its consistency and possibly giving it a few seconds for any pending writes to complete and the pool to be in a consistent state for data purposes.

First, the good news: as ZFS uses CoW and atomic transactions, your already existing data will be safe even in case of sudden power loss. This includes the pool layout and metadata. As old data is never moved before new data has been completely written (in fact, it is never moved at all, just reallocated), this data cannot be in danger in any way if the write is suddenly interrupted.

Additionally, checksums (Merkle hash trees) help to certify that nothing bad has happened during the reboot, which you may check by scrubbing the pool. If you have redundant vdevs, ZFS will automatically correct any errors it finds from a known good copy. If some blocks would have been corrupted in any way (for example by a rogue disk controller that does not write but says it does), their checksums would not match with the ones from other vdevs and errors would show.

Data in flight / write modes and loss of last n seconds

Sync and async writes

Normally, ZFS collects multiple transactions to speed up the costly writes to rotating drives - positioning the write head of the HDD takes much more time than actually writing, so you will want to queue as much as possible and then write it out in sequential (faster!) order (remember, we have CoW, this works quite naturally here).

The downside to this is that the longer you collect, the longer your applications would have to wait for a "write successful" message - which means your system would lock for several seconds, which is unacceptable. Even worse - you will lose all the data that is to be written to disk but has not yet been written in the event of power failure. If your applications cannot cope with this, corruption on the application layer may occur.

To combat this, the ZIL (ZFS intent log) was added. All sync transactions are collected in this log (which is stored by default on the slow pool disk, but can be stored on faster, mirrored SSDs, which are named SLOG devices) and after they are stored, "write successful" is returned to the application which may carry on with its tasks (no long locks anymore). Additionally, all async transactions are done without the ZIL, so they can be faster - provided the application calls the correct write operations for its data (sync vs async).

ZFS properties

Now for the more interesting part - what happens to your writes? There we have to discern the operation mode for the file system (it is a ZFS property and can be set individually for each file system). The three possible modes are (from the manpages):

sync=standard
  This is the default option. Synchronous file system transactions
  (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
  and then secondly all devices written are flushed to ensure
  the data is stable (not cached by device controllers).

sync=always
  For the ultra-cautious, every file system transaction is
  written and flushed to stable storage by a system call return.
  This obviously has a big performance penalty.

sync=disabled
  Synchronous requests are disabled.  File system transactions
  only commit to stable storage on the next DMU transaction group
  commit which can be many seconds.  This option gives the
  highest performance.  However, it is very dangerous as ZFS
  is ignoring the synchronous transaction demands of
  applications such as databases or NFS.
  Setting sync=disabled on the currently active root or /var
  file system may result in out-of-spec behavior, application data
  loss and increased vulnerability to replay attacks.
  This option does *NOT* affect ZFS on-disk consistency.
  Administrators should only use this when these risks are understood.

You'll notice that even if disabled is chosen, your pool layout/internal consistency is not affected - you will just lose the writes that had not yet been committed in a transaction group (roughly the last 5 seconds with default settings), and this may put your files into an incorrect state (for example, because you have a VM on top that expects sync writes but you only supplied an async zvol as a backing datastore).

On the other hand, if you don't want to lose anything at all, set all your file systems to always and switch to high-performance SSDs, at least for the SLOG device (or suffer the wait times).

standard is a compromise and the most flexible - the application itself decides which write mode it needs. If your applications are bad, you may experience data loss. If they behave, you will have the best possible performance with a given baseline of safety.
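Because sync is an ordinary per-dataset property, switching modes is a single zfs set call. The small sketch below (the function name `sync_cmd` and the dataset `tank/vm` are hypothetical) only prints the command after checking that the mode is one of the three values quoted above.

```sh
#!/bin/sh
# sync_cmd MODE DATASET   (hypothetical helper)
# Prints the command that sets the sync property, or fails with exit
# status 1 if MODE is not one of the three documented values.
sync_cmd() {
    case $1 in
        standard|always|disabled)
            printf 'zfs set sync=%s %s\n' "$1" "$2"
            ;;
        *)
            echo "unknown sync mode: $1" >&2
            return 1
            ;;
    esac
}

# Example:
#   sync_cmd always tank/vm     # prints: zfs set sync=always tank/vm
#   zfs get -r sync tank        # shows the current value per dataset
```

Child datasets inherit the value unless it is set on them explicitly, so setting it on a parent is often enough.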

Pool export/import:

From the documentation about zpool export:

The command attempts to unmount any mounted file systems within the pool before continuing. If any of the file systems fail to unmount, you can forcefully unmount them by using the -f option.

If devices are unavailable at the time of export, the devices cannot be identified as cleanly exported. If one of these devices is later attached to a system without any of the working devices, it appears as “potentially active.”

If ZFS volumes are in use in the pool, the pool cannot be exported, even with the -f option. To export a pool with a ZFS volume, first ensure that all consumers of the volume are no longer active.

This means roughly three things:

- -f forces the pool to be exported by force-unmounting all file systems, even if they are active (with disregard for locks or applications writing there)
- This does not work with zvols
- You should not split pools and use them on different systems (be careful with failover situations)

Summary:

- If all you care about is on-disk consistency, you are good to go with export -f or a complete shutdown
- If you care about all data, use sync=always and fast SSDs
- Regarding iSCSI/NFS as datastores for VMs, this overview may also be helpful (excerpt: use NFS or disable iSCSI writeback cache on the guest/VM Host; quiesce VM before taking a ZFS snapshot, ZFS will be fine anyway, but guest VM will only be crash-consistent)

In reply to follow-up questions from the comments (left out questions where I do not have any useful answers):

(1) "good news/COW" - what if top level blocks were about to update - will it always find a usable top level block (even if pointing to slightly old versions of the metadata tree)? How bad can that get?

Worst case would be that the uberblock (the one at the top of all others) is damaged on all redundant devices. Because there is no block above it, you cannot reconstruct it from above, so there exist several copies of each uberblock (IIRC it was about 3 or 4), so one can be lost and a replacement copy is still there.

(2) I'm familiar with TXGs and use ESXi. Using APC UPS + good PSU/hw + P3700 NVMe ZIL so it's decent power + fast ZIL. But unlikely that current writes will all be sync and as you say, sync=always is slow. But your reply does raise a thought, I might do some performance testing. I'm using dedup (4x saving, worth it), so write=slow anyway (has to lookup DDT). Reason being sync=always only affects write which is slow anyway due to DDT. But setting sync=always forces ZIL, ZIL is very fast and that makes long TXGs safe which might mean disk access is more efficient. Or it might kill latency. No idea which! Might have to try!

I have no real experience with dedup, so I cannot say anything useful here, except that you already have made good choices in hardware (low latency, high random 64k write IOPS, NVMe interface). It could only be faster if you invest in some really expensive permanent RAM drive (ZeusRAM et al.).

(6) By "on disk consistency" you mean ZFS is happy and pool is self-consistent? Not worried if some files/dirs. end up with invalid content or not moved/deleted if pool suddenly vanishes, or file system such as NTFS/VMFS on a zvol gets internally corrupted (ie as a ZFS zvol it's fine but from client perspective it needs fsck/chkdsk), provided pool is safe/consistent as ZFS sees it

Yes. Essentially "my pool is not fucked up, yay!" in a multi-user setup - even if one user has problems with his files, the others do not suffer.

(7) By "crash consistent" do you mean what I mean (I think you do) - that ZFS will be fine, the pool as far as ZFS sees it will be fine, but the remote client's data may be mangled from that client's perspective similarly to as if the client had hit a sudden disk IO failure and writes had been lost? == pool will be fine, client may have lost/inconsistent data and may need help to recover, as with any other disk IO failure or system crash?

Yes, essentially a hard power off of the VM instead of a clean shutdown and THEN taking a snapshot - if you power it on afterwards, fsck or similar depending on file system will run and it may complain about unclean shutdown. This is in contrast to ESXi snapshots, which resume at the exact point in time as if nothing happened, but they need interaction with the guest system (guest additions installed) and include the virtual memory of the VM.

You can combine both to your advantage: first take an ESXi snapshot, then afterwards a ZFS snapshot of the datastore (ESXi stores its snapshots alongside the VM). Then delete your ESXi snapshot, but keep the ZFS one (takes much less space because of block-level copies). When restoring, first restore your ZFS snapshot and then revert to your (saved) ESXi snapshot and you will resume where you left off. napp-it (excellent ZFS management system with Web interface) has this concept built-in (at least for NFS datastores, I did not check iSCSI but assume it's similar).