Linux – Attempt to test corruption of ZFS filesystem using dd fails

dd · linux · zfs

I'm trying to test error detection and recovery on a system with recently installed ZFS. I deliberately overwrite the start of one of the disks with dd but can't force any errors to be detected.

Can I not use dd to do this? Or am I just not hitting any data?

I've created mirrored zfs pool and copied some data to it:

$ zpool status
  pool: zfspool
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Sun Dec  1 11:53:12 2013
config:

    NAME                                          STATE     READ WRITE CKSUM
    zfspool                                       ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-WDC_WD10EFRX-68JCSN0_WD-WCC1U4257356  ONLINE       0     0     0
        ata-WDC_WD10EFRX-68JCSN0_WD-WCC1U4299344  ONLINE       0     0     0

errors: No known data errors

I then attempt to corrupt one of the disks:

$ dd of=/dev/sdb if=/dev/zero bs=512 count=10000
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB) copied, 0.103375 s, 49.5 MB/s

and force a read of all data:

$ diff -qr /home/archive/ /zfspool/archive/

I would have thought this would find the corruption on one of the drives, flag it as invalid and set the pool status to degraded, but:

$ zpool status
  pool: zfspool
 state: ONLINE
  scan: scrub repaired 0 in 0h6m with 0 errors on Sun Dec  1 11:53:12 2013
config:

    NAME                                          STATE     READ WRITE CKSUM
    zfspool                                       ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-WDC_WD10EFRX-68JCSN0_WD-WCC1U4257356  ONLINE       0     0     0
        ata-WDC_WD10EFRX-68JCSN0_WD-WCC1U4299344  ONLINE       0     0     0

errors: No known data errors

OK, let's test it with a scrub:

$ zpool scrub zfspool
$ zpool status
      pool: zfspool
     state: ONLINE
      scan: scrub repaired 0 in 0h6m with 0 errors on Sun Dec  1 12:46:34 2013
    config:

        NAME                                          STATE     READ WRITE CKSUM
        zfspool                                       ONLINE       0     0     0
          mirror-0                                    ONLINE       0     0     0
            ata-WDC_WD10EFRX-68JCSN0_WD-WCC1U4257356  ONLINE       0     0     0
            ata-WDC_WD10EFRX-68JCSN0_WD-WCC1U4299344  ONLINE       0     0     0

    errors: No known data errors

Additional information requested by @rickhg12hs:

$ fdisk -l /dev/sda

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sda: 1000.2 GB, 1000204886016 bytes
256 heads, 63 sectors/track, 121126 cylinders
Units = cylinders of 16128 * 512 = 8257536 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1               1      121127   976762583+  ee  GPT
Partition 1 does not start on physical sector boundary.
[17:37:26][root@zserver2:~]$ fdisk -l /dev/sdb

WARNING: GPT (GUID Partition Table) detected on '/dev/sdb'! The util fdisk doesn't support GPT. Use GNU Parted.


Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000

[17:38:11][root@zserver2:~]$ mount
...
zfspool on /zfspool type zfs (rw,xattr)
zfspool/archive on /zfspool/archive type zfs (rw,xattr)

Best Answer

Errors were detected and fixed with the scrub.

Before that, you hadn't attempted any writes, just reads, so everything was served from the ARC (i.e. cached in RAM) and the on-disk corruption went undetected.
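If you want ordinary reads (rather than a scrub) to catch the corruption, you first have to make sure the reads actually hit the disks. A minimal sketch, assuming the pool name from the question; exporting and re-importing the pool evicts its data from the ARC, so the next reads come from disk:

$ zpool export zfspool
$ zpool import zfspool
$ diff -qr /home/archive/ /zfspool/archive/   # reads now come from disk, not the ARC
$ zpool status zfspool                        # detected (and repaired) errors show up in the CKSUM column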

Edit: I overlooked the "0 errors" reported by the scrub. Here is a corrected explanation of what likely happened:

You overwrote ~5 MB at the beginning of the disk with zeroes.

  • The first 3.5 MB were harmless: ZFS reserves that area for non-ZFS use and never reads or writes anything there.
  • The next 0.5 MB overwrote two of the four vdev labels (see the zdb sketch after this list).
  • The remaining ~1 MB was written in an area that may not have contained any data or metadata.
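To see which labels actually survived on the overwritten disk, you can dump them with zdb. A minimal sketch; the device path is an assumption (when given a whole disk, ZFS on Linux normally puts the data on partition 1, so the matching by-id ...-part1 path should work as well):

$ zdb -l /dev/sdb1    # prints the vdev labels stored on that device; the overwritten ones should fail to read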

The corruption of the vdev labels went unnoticed because of their high redundancy (six of the eight labels in the mirror were still healthy) and because the labels are overwritten atomically anyway.
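So, to answer the original question: dd is perfectly usable for this kind of test, you just have to hit allocated data instead of the reserved front of the disk, and then make ZFS read it back. A minimal sketch, assuming the same /dev/sdb as above; the seek offset and count are arbitrary and simply need to land somewhere inside the space holding your copied data:

$ dd if=/dev/urandom of=/dev/sdb bs=1M seek=100 count=100   # clobber ~100 MB well past the reserved area
$ zpool scrub zfspool                                       # wait for the scrub to finish
$ zpool status -v zfspool                                   # the damaged disk should now show CKSUM errors and repairs

If the scrub still reports nothing, increase count or pick another seek offset: only blocks that are actually allocated carry checksums that can fail.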
