FreeBSD – Offlining a ZFS pool speedily and safely as a monolithic whole

freebsd freenas unmounting zfs

Much as the question says.

Suppose I want to have the equivalent of a scripted "emergency button" for my FreeNAS pool – something that I can click to run from a GUI or execute in console/SSH, which very quickly closes everything that might be reading or writing to it, unmounts the file system, and – ideally – quiesces the disks or partitions it's using.

I don't care about errors arising to other software or remote connections by doing this, or aborting any long file transfers prematurely, I just want it to offline the pool in the fastest way that's consistent with retaining its consistency and possibly giving it a few seconds for any pending writes to complete and the pool to be in a consistent state for data purposes.

The options suggested by ZFS commands don't look promising: zpool offline only works on individual devices so one might have a race condition if writing happens while disks are removed one at a time; zpool export requires the -f option if in use and carries a warning that -f can lose data as well. One could check all open file descriptors using the pool or its devices (thousands or hundreds of thousands of them?) and manually force-close each but that could hit race conditions as it doesn't stop new fd's being created at the same time. I also should not assume all ZFS activity is mediated by a list of remote file serving daemons to be sent exit signals, because some file activity is likely to be local (cron/CLI/detached sessions).

So looking at how best to offline an entire pool safely and quickly, it looks like umount might be my best bet – it works at a file system level and can offline an entire file system speedily and as a monolithic unit, after which zpool export looks like it would then be able to actually finish and quiesce any internal activity in a safe manner without the -f option, keeping the data itself in a guaranteed consistent state. If there's raw disk activity going on (resilver or scrub) then I guess that would resume or restart when the pool was later brought back online.

But even umount doesn't seem to do it completely, because there could be iSCSI zvol targets in use as well. The data within those inherently can't be kept consistent as the server doesn't know its structure, so the remote initiators will have to do data repair as best they can when they reconnect. I'm fine with that, but I'm not sure if some kind of command to force-terminate or offline the targets is needed or best practice. (Note: force-terminating connections has the same issues as closing individual fd's would.)

I'm aware that there is bound to be some kind of data loss or issue if the pool is abruptly kicked out of RW state when writes are happening. But as long as it doesn't lose consistency (at a ZFS pool and file system level) then that's fine – any in-use files/iSCSI targets being updated will have to take their chances on files/blocks being in a ZFS-consistent but data-invalid state due to going offline partway through data being written. That's unavoidable and not an issue for the question.

So what steps do I actually need to do, to offline an in-use pool as fast as possible consistent with guaranteed pool safety and consistency – and would manually umounting an in-use ZFS file system (as part of a solution) be safe or carry any risk of data damage?

Update: Mentioning here in case someone else finds this useful. The accepted answer states that export -f may have issues with zvols (iSCSI etc). Based on this hint, I found that the iSCSI handler used by FreeNAS can forcibly logout/terminate sessions, and has other useful subcommands which could be issued beforehand – see man ctladm. Whatever your zvols are used for, there's likely to be some command to end sessions on them.

Best Answer

Disclaimer: I do not have many links and references at hand to back up everything below, and I did not test it extensively. This is just a summary of things I have read over the last five to seven years about ZFS and how it works, plus some limited testing of my own (not coordinated, mostly random reboots).

Also, everything below is said without regard to catastrophic events (server burns down completely), software bugs (bugs in ZFS and the main operating system as well as hardware controllers), and active malice (rogue admin, administration errors). For all those cases you still need to have regular and restorable backups!

Data at rest / on-disk consistency

I don't care about errors arising to other software or remote connections by doing this, or aborting any long file transfers prematurely, I just want it to offline the pool in the fastest way that's consistent with retaining its consistency and possibly giving it a few seconds for any pending writes to complete and the pool to be in a consistent state for data purposes.

First, the good news: as ZFS uses CoW and atomic transactions, your already existing data will be safe even in case of sudden power loss. This includes the pool layout and metadata. As old data is never moved before new data has been completely written (in fact, it is never moved at all, just reallocated), this data cannot be in danger in any way if the write is suddenly interrupted.

Additionally, checksums (Merkle hash trees) help to certify that nothing bad has happened during the reboot, which you may check by scrubbing the pool. If you have redundant vdevs, ZFS will automatically correct any errors it finds from a known good copy. If some blocks would have been corrupted in any way (for example by a rogue disk controller that does not write but says it does), their checksums would not match with the ones from other vdevs and errors would show.
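The check after an unclean shutdown could be scripted roughly as follows. This is a minimal sketch, not a tested procedure: `tank` is a placeholder pool name, and the `run` helper and `DRY_RUN` variable are made up for illustration so that nothing touches a pool unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch: verify a pool after an unclean shutdown.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

verify_pool() {
    pool="$1"
    # Start a scrub: every allocated block is read and checked against its checksum.
    run zpool scrub "$pool"
    # Show scrub progress and any errors found (repaired if redundancy allows).
    run zpool status -v "$pool"
}
```

Calling `verify_pool tank` in dry-run mode prints the two commands in order. A scrub runs in the background, so `zpool status` has to be repeated later to see the final result.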

Data in flight / write modes and loss of last n seconds

Sync and async writes

Normally, ZFS collects multiple transactions to speed up the costly writes to rotating drives - positioning the write head of the HDD takes much more time than actually writing, so you will want to queue as much as possible and then write it out in sequential (faster!) order (remember, we have CoW, this works quite naturally here).

The downside to this is that the longer you collect, the longer your applications would have to wait for a "write successful" message - which means your system would lock for several seconds, which is unacceptable. Even worse - you will lose all the data that is to be written to disk but has not yet been written in the event of power failure. If your applications cannot cope with this, corruption on the application layer may occur.

To combat this, the ZIL (ZFS intent log) was added. All sync transactions are collected in this log (which is stored by default on the slow pool disks, but can be stored on faster, mirrored SSDs, which are called SLOG devices), and once they are stored, "write successful" is returned to the application, which may carry on with its tasks (no more long locks). Additionally, all async transactions are done without the ZIL, so they can be faster, provided the application calls the correct write operations for its data (sync vs async).

ZFS properties

Now for the more interesting part - what happens to your writes? Here we have to distinguish the operation mode of the file system (it is a ZFS property and can be set individually for each file system). The three possible modes are (from the manpages):

sync=standard
  This is the default option. Synchronous file system transactions
  (fsync, O_DSYNC, O_SYNC, etc) are written out (to the intent log)
  and then secondly all devices written are flushed to ensure
  the data is stable (not cached by device controllers).

sync=always
  For the ultra-cautious, every file system transaction is
  written and flushed to stable storage by a system call return.
  This obviously has a big performance penalty.

sync=disabled
  Synchronous requests are disabled.  File system transactions
  only commit to stable storage on the next DMU transaction group
  commit which can be many seconds.  This option gives the
  highest performance.  However, it is very dangerous as ZFS
  is ignoring the synchronous transaction demands of
  applications such as databases or NFS.
  Setting sync=disabled on the currently active root or /var
  file system may result in out-of-spec behavior, application data
  loss and increased vulnerability to replay attacks.
  This option does *NOT* affect ZFS on-disk consistency.
  Administrators should only use this when these risks are understood.

You'll notice that even if disabled is chosen, your pool layout/internal consistency is not affected - you will just lose your last 5 seconds of data and this may put your files into an incorrect state (for example, because you have a VM on top that expects sync writes but you only supplied an async zvol as a backing datastore).

On the other hand, if you don't want to lose anything at all, set all your file systems to always and switch to high-performance SSDs, at least for the SLOG device (or suffer the wait times).

standard is a compromise and the most flexible - the application itself decides which write mode it needs. If your applications are bad, you may experience dataloss. If they behave, you will have the best possible performance with a given baseline of safety.
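Since sync is an ordinary per-dataset property, switching modes is a single `zfs set` call. The sketch below wraps it in a small helper that rejects anything other than the three documented values. The dataset name `tank/vms`, the `run` helper and the `DRY_RUN` variable are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: set and display the sync property of one dataset.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

set_sync_mode() {
    dataset="$1"
    mode="$2"
    # Only the three values from the manpage are accepted.
    case "$mode" in
        standard|always|disabled) ;;
        *) echo "invalid sync mode: $mode" >&2; return 1 ;;
    esac
    run zfs set "sync=$mode" "$dataset"
    # Print the value now in effect and where it comes from (local/inherited).
    run zfs get sync "$dataset"
}
```

For example, `set_sync_mode tank/vms always` would apply the cautious mode to a dataset holding VM images. Child datasets inherit the value unless they set their own.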

Pool export/import:

From the documentation about zpool export:

The command attempts to unmount any mounted file systems within the pool before continuing. If any of the file systems fail to unmount, you can forcefully unmount them by using the -f option.

If devices are unavailable at the time of export, the devices cannot be identified as cleanly exported. If one of these devices is later attached to a system without any of the working devices, it appears as “potentially active.”

If ZFS volumes are in use in the pool, the pool cannot be exported, even with the -f option. To export a pool with a ZFS volume, first ensure that all consumers of the volume are no longer active.

This means roughly three things:

-f forces the pool to be exported by force-unmounting all file systems, even if they are active (with disregard for locks or applications writing there)
This does not work with zvols
You should not split pools and use them on different systems (be careful with failover situations)
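Put together with the `ctladm` hint from the question's update, an "emergency button" might look like the sketch below. It is untested and rests on several assumptions: `tank` is a placeholder pool name, the `run` helper and `DRY_RUN` variable are invented for illustration, and `ctladm isterminate -a` (see ctladm(8)) is assumed to be enough to release the zvols. Initiators may reconnect unless the iSCSI target daemon is also stopped.

```shell
#!/bin/sh
# Sketch: take a whole pool offline as fast as possible.
# DRY_RUN=1 (the default) only prints the commands; DRY_RUN=0 executes them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

emergency_export() {
    pool="$1"
    # 1. Forcibly terminate all iSCSI sessions so no zvol consumer stays active
    #    (export cannot proceed while zvols are in use, even with -f).
    run ctladm isterminate -a
    # 2. Ask the kernel to flush dirty buffers to disk.
    run sync
    # 3. Force-unmount every file system in the pool and export it as one unit.
    run zpool export -f "$pool"
}
```

Running `emergency_export tank` in dry-run mode lists the three steps. Whether step 1 is sufficient depends on what else holds the zvols open, so the export can still fail and should be checked.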

Summary:

If all you care about is on-disk consistency, you are good to go with export -f (provided no zvols are still in use) or a complete shutdown
If you care about all data, use sync=always and fast SSDs
Regarding iSCSI/NFS as datastores for VMs, this overview may also be helpful (excerpt: use NFS or disable iSCSI writeback cache on the guest/VM Host; quiesce VM before taking a ZFS snapshot, ZFS will be fine anyway, but guest VM will only be crash-consistent)

In reply to follow-up questions from the comments (left out questions where I do not have any useful answers):

(1) "good news/COW" - what if top level blocks were about to update - will it always find a usable top level block (even if pointing to slightly old versions of the metadata tree)? How bad can that get?

Worst case would be that the uberblock (the one at the top of all others) is damaged on all redundant devices. Because there is no block above it, it cannot be reconstructed from above. For that reason ZFS keeps several copies of each uberblock (IIRC about 3 or 4), so one can be lost and a replacement copy is still available.

(2) I'm familiar with TXGs and use ESXi. Using APC UPS + good PSU/hw + P3700 NVMe ZIL so it's decent power + fast ZIL. But unlikely that current writes will all be sync and as you say, sync=always is slow. But your reply does raise a thought, I might do some performance testing. I'm using dedup (4x saving, worth it), so write=slow anyway (has to lookup DDT). Reason being sync=always only affects write which is slow anyway due to DDT. But setting sync=always forces ZIL, ZIL is very fast and that makes long TXGs safe which might mean disk access is more efficient. Or it might kill latency. No idea which! Might have to try!

I have no real experience with dedup, so I cannot say anything useful here, except that you already have made good choices in hardware (low latency, high random 64k write IOPS, NVMe interface). It could only be faster if you invest in some really expensive permanent RAM drive (ZeusRAM et al.).

(6) By "on disk consistency" you mean ZFS is happy and pool is self-consistent? Not worried if some files/dirs end up with invalid content or not moved/deleted if pool suddenly vanishes, or file system such as NTFS/VMFS on a zvol gets internally corrupted (ie as a ZFS zvol it's fine but from client perspective it needs fsck/chkdsk), provided pool is safe/consistent as ZFS sees it

Yes. Essentially "my pool is not fucked up, yay!" in a multi-user setup - even if one user has problems with his files, the others do not suffer.

(7) By "crash consistent" do you mean what I mean (I think you do) - that ZFS will be fine, the pool as far as ZFS sees it will be fine, but the remote client's data may be mangled from that client's perspective similarly to as if the client had hit a sudden disk IO failure and writes had been lost? == pool will be fine, client may have lost/inconsistent data and may need help to recover, as with any other disk IO failure or system crash?

Yes, essentially a hard power off of the VM instead of a clean shutdown and THEN taking a snapshot - if you power it on afterwards, fsck or similar depending on file system will run and it may complain about unclean shutdown. This is in contrast to ESXi snapshots, which resume at the exact point in time as if nothing happened, but they need interaction with the guest system (guest additions installed) and include the virtual memory of the VM.

You can combine both to your advantage: first take an ESXi snapshot, then afterwards a ZFS snapshot of the datastore (ESXi stores its snapshots alongside the VM). Then delete your ESXi snapshot, but keep the ZFS one (takes much less space because of block-level copies). When restoring, first restore your ZFS snapshot and then revert to your (saved) ESXi snapshot and you will resume where you left off. napp-it (excellent ZFS management system with Web interface) has this concept built-in (at least for NFS datastores, I did not check iSCSI but assume it's similar).

ZFS on the receiving side

If you use ZFS on the sending and on the receiving side you can avoid having to transfer the whole snapshot and only transfer the differences of the snapshot compared to the previous one:

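# backuppool/dataset is a placeholder for the target dataset on the receiving side
ssh myserver 'zfs send -i pool/dataset@2014-02-04 pool/dataset@2014-02-05' | \
  zfs receive backuppool/dataset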

ZFS knows about the snapshots and stores mutual blocks only once. Having the file system understand the snapshots enables you to delete the old ones without problems.

Other file system on the receiving side

In your case you store the snapshots in individual files, and your file system is unaware of the snapshots. As you already noticed, this breaks rotation. You can either transmit entire snapshots, which wastes bandwidth and storage space but lets you delete individual snapshots because they don't depend on each other, or you can send incremental snapshots like this:

ssh myserver 'zfs send -i pool/dataset@2014-02-04 pool/dataset@2014-02-05' \
  > incremental-2014-02-04:05

To restore an incremental snapshot you need the previous snapshots as well. This means you can't delete the old incrementals.

Possible solutions

You could do incrementals as shown in my last example and do a new non-incremental every month. The new incrementals depend on this non-incremental and you're free to delete the old snapshots.
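The monthly rotation could be decided by a small helper like the one below. It is a sketch under assumptions: the function name `send_command` is made up, snapshots are assumed to be named by date in YYYY-MM-DD form as in the examples above, and the function only prints the `zfs send` command rather than running it.

```shell
#!/bin/sh
# Sketch: choose between a full and an incremental send.
# Arguments: dataset, date of the previous snapshot (may be empty),
# date of today's snapshot. Dates are YYYY-MM-DD. Nothing is executed.
send_command() {
    dataset="$1"
    prev="$2"
    today="$3"
    day="${today##*-}"    # day of month, e.g. "05"
    if [ "$day" = "01" ] || [ -z "$prev" ]; then
        # First of the month, or no earlier snapshot: send a full stream.
        echo "zfs send $dataset@$today"
    else
        # Otherwise send only the changes since the previous snapshot.
        echo "zfs send -i $dataset@$prev $dataset@$today"
    fi
}
```

For instance, `send_command pool/dataset 2014-02-04 2014-02-05` yields the incremental command, while a date ending in `-01` yields a full send. Once a new full stream has been stored, the files from the previous month no longer have anything depending on them.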

Or you could look into other backup solutions. There is rsnapshot, which uses rsync and hard links. It does a very good job at rotation and is very bandwidth efficient, since it requires a full backup only once.

Then there is bareos. It does incrementals, which save bandwidth and space. It has a very nice feature: it can calculate a full backup from a set of incrementals. This enables you to delete old incrementals. But it's a rather complex system and intended for larger setups.

The best solution, however, is to use ZFS on the receiving side. It will be bandwidth efficient, storage efficient and much faster than the other solutions. The only real drawback I can think of is that you should have a minimum of 8 GiB ECC memory on that box (you might be fine with 4 GiB if you don't run any services and only use it to zfs receive).