Disclaimer: I do not have many links and references at hand to back up everything below, and I did not test it extensively. This is just a summary of things I have read over the last five to seven years about ZFS and how it works, plus some limited testing of my own (not coordinated, mostly random reboots).
Also, everything below disregards catastrophic events (server burns down completely), software bugs (bugs in ZFS and the main operating system as well as hardware controllers), and active malice (rogue admin, administration errors). For all those cases you still need regular and restorable backups!
Data at rest / on-disk consistency
I don't care about errors arising to other software or remote connections by doing this, or aborting any long file transfers prematurely, I just want it to offline the pool in the fastest way that's consistent with retaining its consistency and possibly giving it a few seconds for any pending writes to complete and the pool to be in a consistent state for data purposes.
First, the good news: as ZFS uses CoW and atomic transactions, your already existing data will be safe even in case of sudden power loss. This includes the pool layout and metadata. As old data is never moved before new data has been completely written (in fact, it is never moved at all, just reallocated), this data cannot be in danger in any way if the write is suddenly interrupted.
Additionally, checksums (Merkle hash trees) help to verify that nothing bad has happened during the reboot, which you can check by scrubbing the pool. If you have redundant vdevs, ZFS will automatically correct any errors it finds from a known good copy. If some blocks had been corrupted in any way (for example by a rogue disk controller that does not write but says it does), their checksums would not match the ones stored in their parent blocks, and the errors would show.
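As a rough illustration of the scrub check mentioned above, here is a minimal sketch. The pool name tank is only a placeholder, and the function name is made up for this example.

```shell
#!/bin/sh
# Minimal sketch: verify a pool after an unclean reboot.
# "tank" in the usage line is a placeholder pool name.

verify_pool() {
    pool=$1
    # Start a scrub: every allocated block is read and checked
    # against its checksum. The command returns immediately.
    zpool scrub "$pool" || return 1
    # Show scrub progress and any read/write/checksum errors found so far.
    zpool status -v "$pool"
}

# Usage: verify_pool tank
```

The scrub runs in the background, so the status command has to be repeated until it reports the scrub as finished.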
Data in flight / write modes and loss of last n seconds
Sync and async writes
Normally, ZFS collects multiple transactions to speed up the costly writes to rotating drives - positioning the write head of the HDD takes much more time than actually writing, so you will want to queue as much as possible and then write it out in sequential (faster!) order (remember, we have CoW, this works quite naturally here).
The downside to this is that the longer you collect, the longer your applications would have to wait for a "write successful" message - which means your system would lock for several seconds, which is unacceptable. Even worse - you will lose all the data that is to be written to disk but has not yet been written in the event of power failure. If your applications cannot cope with this, corruption on the application layer may occur.
To combat this, the ZIL (ZFS intent log) was added. All sync transactions are collected in this log (which is stored by default on the slow pool disks, but can be placed on faster, mirrored SSDs, which are called SLOG devices). Once they are stored, "write successful" is returned to the application, which may carry on with its tasks (no long locks anymore). Additionally, all async transactions are done without the ZIL, so they can be faster - provided the application calls the correct write operations for its data (sync vs async).
ZFS properties
Now for the more interesting part - what happens to your writes? Here we have to distinguish the operation mode of the file system (it is a ZFS property and can be set individually for each file system). The three possible modes are (from the manpages):
sync=standard
This is the default option. Synchronous file system transactions
(fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
and then secondly all devices written are flushed to ensure
the data is stable (not cached by device controllers).
sync=always
For the ultra-cautious, every file system transaction is
written and flushed to stable storage by a system call return.
This obviously has a big performance penalty.
sync=disabled
Synchronous requests are disabled. File system transactions
only commit to stable storage on the next DMU transaction group
commit which can be many seconds. This option gives the
highest performance. However, it is very dangerous as ZFS
is ignoring the synchronous transaction demands of
applications such as databases or NFS.
Setting sync=disabled on the currently active root or /var
file system may result in out-of-spec behavior, application data
loss and increased vulnerability to replay attacks.
This option does *NOT* affect ZFS on-disk consistency.
Administrators should only use this when these risks are understood.
You'll notice that even if sync=disabled is chosen, your pool layout/internal consistency is not affected - you will just lose your last 5 seconds of data and this may put your files into an incorrect state (for example, because you have a VM on top that expects sync writes but you only supplied an async zvol as a backing datastore).
On the other hand, if you don't want to lose anything at all, set all your file systems to sync=always and switch to high-performance SSDs, at least for the SLOG device (or suffer the wait times).
sync=standard is a compromise and the most flexible - the application itself decides which write mode it needs. If your applications are bad, you may experience data loss. If they behave, you will have the best possible performance with a given baseline of safety.
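Since the mode is a per-dataset property, it can be mixed within one pool. A minimal sketch, assuming hypothetical dataset names (tank/vms, tank/scratch); the helper function is made up for illustration and only wraps zfs set:

```shell
#!/bin/sh
# Sketch: set the sync mode of a single dataset.
# Dataset names in the usage lines are placeholders.

set_sync() {
    dataset=$1
    mode=$2
    # Accept only the three values the sync property knows.
    case $mode in
        standard|always|disabled) ;;
        *) echo "unknown sync mode: $mode" >&2; return 1 ;;
    esac
    zfs set "sync=$mode" "$dataset"
}

# Usage:
#   set_sync tank/vms always        # VM datastore: never lose a write
#   set_sync tank/scratch disabled  # throwaway data: speed over safety
#   zfs get -r sync tank            # review the result for the whole pool
```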
Pool export/import:
From the documentation about zpool export:
The command attempts to unmount any mounted file systems within the pool before continuing. If any of the file systems fail to unmount, you can forcefully unmount them by using the -f option.
If devices are unavailable at the time of export, the devices cannot be identified as cleanly exported. If one of these devices is later attached to a system without any of the working devices, it appears as “potentially active.”
If ZFS volumes are in use in the pool, the pool cannot be exported, even with the -f option. To export a pool with a ZFS volume, first ensure that all consumers of the volume are no longer active.
This means roughly three things:
- -f forces the pool to be exported by force-unmounting all file systems, even if they are active (with disregard for locks or applications writing there)
- This does not work with zvols
- You should not split pools and use them on different systems (be careful with failover situations)
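Put together, taking a pool offline quickly could be sketched like this. The function name and the pool name tank are assumptions for the example; the only real commands used are zpool export and zpool export -f.

```shell
#!/bin/sh
# Sketch: take a pool offline, gently first, then by force.

offline_pool() {
    pool=$1
    # A plain export unmounts the file systems and marks the pool
    # as cleanly exported.
    if zpool export "$pool"; then
        return 0
    fi
    # Something is still busy: force-unmount and export.
    # As quoted above, this still fails while zvols are in use.
    zpool export -f "$pool"
}

# Usage: offline_pool tank
```

If the forced export fails as well, the remaining consumers (for example iSCSI targets backed by zvols) have to be stopped first.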
Summary:
- If all you care about is on-disk consistency, you are good to go with export -f or a complete shutdown
- If you care about all data, use sync=always and fast SSDs
- Regarding iSCSI/NFS as datastores for VMs, this overview may also be helpful (excerpt: use NFS or disable iSCSI writeback cache on the guest/VM Host; quiesce VM before taking a ZFS snapshot, ZFS will be fine anyway, but guest VM will only be crash-consistent)
In reply to follow-up questions from the comments (left out questions where I do not have any useful answers):
(1) "good news/COW" - what if top level blocks were about to update - will it always find a usable top level block (even if pointing to slightly old versions of the metadata tree)? How bad can that get?
The worst case would be that the uberblock (the one at the top of all others) is damaged on all redundant devices. Because there is no block above it, it cannot be reconstructed from above. For that reason every device carries four copies of its label, and each label holds a ring of recent uberblocks, so if the newest one is lost, a replacement copy or a slightly older uberblock is still there.
(2) I'm familiar with TXGs and use ESXi. Using APC UPS + good PSU/hw + P3700 NVMe ZIL so it's decent power + fast ZIL. But unlikely that current writes will all be sync and as you say, sync=always is slow. But your reply does raise a thought, I might do some performance testing. I'm using dedup (4x saving, worth it), so write=slow anyway (has to lookup DDT). Reason being sync=always only affects write which is slow anyway due to DDT. But setting sync=always forces ZIL, ZIL is very fast and and that makes long TXGs safe which might mean disk access is more efficient. Or it might kill latency. No idea which! Might have to try!
I have no real experience with dedup, so I cannot say anything useful here, except that you already have made good choices in hardware (low latency, high random 64k write IOPS, NVMe interface). It could only be faster if you invest in some really expensive permanent RAM drive (ZeusRAM et al.).
(6) By "on disk consistency" you mean ZFS is happy and pool is self-consistent? Not worried if some files/dirs. end up with invalid content or not moved/deleted is pool suddenly vanishes, or file system such as NTFWS/VMFS on a zvol gets internally corrupted (ie as a ZFS zvol it's fine but from client perspective it needs fsck/chkdsk), provided pool is safe/consistent as ZFS sees it
Yes. Essentially "my pool is not fucked up, yay!" in a multi-user setup - even if one user has problems with his files, the others do not suffer.
(7) By "crash consistent" do you mean what I mean (I think you do) - that ZFS will be fine, the pool as far as ZFS sees it will be fine, but the remote client's data may be mangled from that client's perspective similarly to as if the client had hit a sudden disk IO failure and writes had been lost? == pool will be fine, client may have lost/inconsistent data and may need help to recover, as with any other disk IO failure or system crash?
Yes, essentially a hard power off of the VM instead of a clean shutdown and THEN taking a snapshot - if you power it on afterwards, fsck or similar (depending on the file system) will run and it may complain about an unclean shutdown. This is in contrast to ESXi snapshots, which resume at the exact point in time as if nothing happened, but they need interaction with the guest system (guest additions installed) and include the virtual memory of the VM.
You can combine both to your advantage: first take an ESXi snapshot, then afterwards a ZFS snapshot of the datastore (ESXi stores its snapshots alongside the VM). Then delete your ESXi snapshot, but keep the ZFS one (takes much less space because of block-level copies). When restoring, first restore your ZFS snapshot and then revert to your (saved) ESXi snapshot and you will resume where you left off. napp-it (excellent ZFS management system with Web interface) has this concept built-in (at least for NFS datastores, I did not check iSCSI but assume it's similar).
Best Answer
Disclaimer: As I've never used zvols, I cannot say if they are any different in replication than normal filesystems or snapshots. I assume they are, but do not take my word for it.
Your question is actually multiple questions, so I will try to answer them separately:
How to replicate/mirror complete pool to remote location
You need to split the task into two parts: first, the initial replication has to be complete, afterwards incremental replication is possible, as long as you do not mess with your replication snapshots. To enable incremental replication, you need to preserve the last replication snapshots, everything before that can be deleted. If you delete the previous snapshot, zfs recv will complain and abort the replication. In this case you have to start all over again, so try not to do this.

If you just need the correct options, they are:

zfs send:
- -R: send everything under the given pool or dataset (recursive replication, needed all the time, includes -p). Also, when receiving, all deleted source snapshots are deleted on the destination.
- -I: include all intermediate snapshots between the last replication snapshot and the current replication snapshot (needed only with incremental sends)

zfs recv:
- -F: expand target pool, including deletion of existing datasets that are deleted on the source
- -d: discard the name of the source pool and replace it with the destination pool name (the rest of the filesystem paths will be preserved, and if needed also created)
- -u: do not mount filesystem on destination

If you prefer a complete example, here is a small script:
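What follows is a minimal sketch of such a replication script, written for illustration with the options listed above. The pool names tank and backup, the host name backuphost, the repl- snapshot prefix and both function names are placeholders, so adjust them to your setup.

```shell
#!/bin/sh
# Sketch of snapshot-based pool replication over SSH.
# All names below are placeholders: adjust pools, host and prefix.
SRC_POOL=${SRC_POOL:-tank}
DST_POOL=${DST_POOL:-backup}
DST_HOST=${DST_HOST:-backuphost}

# Write the replication stream to stdout.
#   $1: name of the previous replication snapshot (empty on the first run)
#   $2: name of the new replication snapshot
send_stream() {
    if [ -z "$1" ]; then
        # Initial run: full recursive stream.
        zfs send -R "${SRC_POOL}@$2"
    else
        # Later runs: everything between the two replication snapshots.
        zfs send -R -I "${SRC_POOL}@$1" "${SRC_POOL}@$2"
    fi
}

# Take a new recursive snapshot, send it, then drop the old one.
#   $1: name of the previous replication snapshot (empty on the first run)
replicate() {
    prev=$1
    new="repl-$(date +%Y%m%d-%H%M%S)"
    zfs snapshot -r "${SRC_POOL}@${new}" || return 1
    send_stream "$prev" "$new" |
        ssh "$DST_HOST" zfs recv -Fdu "$DST_POOL" || return 1
    # Only the newest replication snapshot is needed for the next run.
    if [ -n "$prev" ]; then
        zfs destroy -r "${SRC_POOL}@${prev}"
    fi
    echo "$new"    # remember this name and pass it to the next call
}

# Usage:
#   last=$(replicate "")        # initial replication
#   last=$(replicate "$last")   # every following run
```

Note that in plain sh only the exit status of the receiving side of the pipe is checked, so a failing zfs send shows up only indirectly, as an aborted receive.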
Use something faster than SSH
If you have a sufficiently secured connection, for example IPSec or OpenVPN tunnel and a separate VLAN that only exists between sender and receiver, you may switch from SSH to unencrypted alternatives like mbuffer as described here, or you could use SSH with weak/no encryption and disabled compression, which is detailed here. There also was a website about recompiling SSH to be much faster, but unfortunately I don't remember the URL - I'll edit it later if I find it.
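For the mbuffer variant, a rough sketch could look like this. The port number 9090, the buffer sizes, the names tank, backup and backuphost and both function names are assumptions for the example. The stream travels unencrypted, so this only makes sense on a link secured as described above.

```shell
#!/bin/sh
# Sketch: send a replication stream through mbuffer instead of SSH.
MBUF_PORT=${MBUF_PORT:-9090}    # arbitrary free TCP port (assumption)

# Run on the receiver FIRST: listen on the port and feed zfs recv.
#   $1: destination pool
recv_via_mbuffer() {
    mbuffer -s 128k -m 1G -I "$MBUF_PORT" | zfs recv -Fdu "$1"
}

# Run on the sender: forward the stream to the waiting receiver.
#   $1: snapshot to send, $2: receiver host name
send_via_mbuffer() {
    zfs send -R "$1" | mbuffer -s 128k -m 1G -O "$2:$MBUF_PORT"
}

# Usage:
#   receiver$ recv_via_mbuffer backup
#   sender$   send_via_mbuffer tank@repl-1 backuphost
```

Here -s sets the block size and -m the total buffer memory; the buffer smooths out the bursts of zfs send so the network link stays busy.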
For very large datasets and slow connections, it may also be useful to do the first transmission via hard disk (use an encrypted disk to store the zpool and transmit it in a sealed package via courier, mail or in person). As the method of transmission does not matter for send/recv, you can pipe everything to the disk, export the pool, send the disk to its destination, import the pool and then transmit all incremental sends via SSH.
The problem with messed up snapshots
As stated earlier, if you delete/modify your replication snapshots, you will receive the error message
which means either your command was wrong or you are in an inconsistent state where you must remove the snapshots and start all over.
This has several negative implications:
There exists a possible solution to those problems, but I have not tried it myself. You could use zfs bookmark, a new feature in OpenSolaris/illumos created specifically for this task. This would free you of snapshot management. The only downside is that at present, it only works for single datasets, not recursively. You would have to save a list of all your old and new datasets and then loop over them, bookmarking, sending and receiving them, and then updating the list (or small database, if you prefer).

If you try the bookmark route, I would be interested to hear how it worked out for you!
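The bookmark loop described above could look roughly like the following. As said, I have not tried this route myself, so treat it as an untested sketch: the function name, the dataset names and the datasets.txt list are all made up for illustration, and the receive side is kept local for brevity (the pipe can just as well go through SSH).

```shell
#!/bin/sh
# Sketch: incremental replication of ONE dataset based on a bookmark.
#   $1: source dataset          $2: name of the previous bookmark
#   $3: name of the new snapshot  $4: destination dataset
# The destination must still hold the snapshot the bookmark was made from.
send_from_bookmark() {
    ds=$1 prev=$2 new=$3 dst=$4
    zfs snapshot "${ds}@${new}" || return 1
    # The bookmark is the incremental source, so the old snapshot
    # no longer has to exist on the sending side.
    zfs send -i "${ds}#${prev}" "${ds}@${new}" | zfs recv -u "$dst" || return 1
    # Bookmark the new snapshot for the next run, then drop the snapshot.
    zfs bookmark "${ds}@${new}" "${ds}#${new}" && zfs destroy "${ds}@${new}"
}

# Usage, looping over a saved dataset list (one name per line):
#   while read -r ds; do
#       send_from_bookmark "$ds" repl-1 repl-2 "backup/${ds#*/}"
#   done < datasets.txt
```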