Understanding the error reporting of ZFS (on Linux)


I have successfully set up Debian stretch on ZFS, including the root file system. Things are working as expected, and I thought I had understood the basic concepts – until I re-read Sun's ZFS documentation.

My scenario is:

  • I'd like to prevent (more precisely: detect) silent bit rot

  • For the moment, I have set up a root pool with one vdev which is a mirror of two identical disks

  • Of course, I did turn on (i.e. did not turn off) checksums

Now I have come across this document. At the end of the page, they show the output of the zpool status command for their example configuration,

[...]
NAME        STATE     READ WRITE CKSUM
tank        DEGRADED     0     0     0
  mirror-0  DEGRADED     0     0     0
    c1t0d0  ONLINE       0     0     0
    c1t1d0  OFFLINE      0     0     0  48K resilvered
[...]

followed by the statement:

The READ and WRITE columns provide a count of I/O errors that occurred
on the device, while the CKSUM column provides a count of
uncorrectable checksum errors that occurred on the device.

First, what does "device" mean in this context? Are they talking about a physical device, the vdev or even something else? My assumption is that they are talking about every "device" in the hierarchy. The vdev error counter then probably is the sum of the error counters of its physical devices, and the pool error counter probably is the sum of the error counters of its vdevs. Is this correct?

Second, what do they mean by uncorrectable checksum errors? This is a term I thought was usually used when talking about physical disks, relating either to data transfer from the platter into the disk's electronics, to checksums of physical sectors on the disk, or to data transfer from the disk's port (SATA, SAS, …) to the mainboard (or controller).

But what I am really interested in is whether there have been checksum errors at ZFS level (and not hardware level). I am currently convinced that CKSUM is showing the latter (otherwise, it wouldn't make much sense), but I'd like to know for sure.

Third, assuming the checksum errors they are talking about are indeed checksum errors at the ZFS level (and not the hardware level), why do they only show the count of uncorrectable errors? This does not make any sense. We would like to see every checksum error, whether correctable or not, wouldn't we? After all, a checksum error means that there has been some sort of data corruption on the disk which has not been detected by the hardware, so we probably want to replace that disk as soon as there is any error (even if the mirror disk can still act as a "backup"). So possibly I have not yet understood what exactly they mean by "uncorrectable errors".

Then I have come across this document which is even harder to understand. Near the end of the page, it states

[…] ZFS maintains a persistent log of all data errors associated with a pool. […]

and then states

Data corruption errors are always fatal. Their presence indicates that
at least one application experienced an I/O error due to corrupt data
within the pool. Device errors within a redundant pool do not result
in data corruption and are not recorded as part of this log. […]

I am seriously worried about the third sentence. According to that paragraph, there could be two sorts of errors: data corruption errors and device errors. A mirror configuration of two disks is undoubtedly redundant, so (according to that paragraph) it is not a data corruption error if ZFS encounters a checksum error on one of the disks (at the ZFS checksum level, not the hardware level). That means (once more according to that paragraph) that this error will not be recorded as part of the persistent error log.

This would not make any sense, so I must have got something wrong. For me, the main reason for switching to ZFS was its ability to detect silent bit rot on its own, i.e. to detect and report errors on devices even if those errors did not lead to I/O failures at the hardware / driver level. But not including such errors in the persistent log would mean losing them upon reboot, and that would be fatal (IMHO).

So either Sun has chosen worrying wording here, or I have misunderstood some concepts (not being a native English speaker).

Best Answer

For a general overview, see Resolving Problems with ZFS, most interesting part:

The second section of the configuration output displays error statistics. These errors are divided into three categories:

  • READ – I/O errors that occurred while issuing a read request
  • WRITE – I/O errors that occurred while issuing a write request
  • CKSUM – Checksum errors, meaning that the device returned corrupted data as the result of a read request

These errors can be used to determine if the damage is permanent. A small number of I/O errors might indicate a temporary outage, while a large number might indicate a permanent problem with the device. These errors do not necessarily correspond to data corruption as interpreted by applications. If the device is in a redundant configuration, the devices might show uncorrectable errors, while no errors appear at the mirror or RAID-Z device level. In such cases, ZFS successfully retrieved the good data and attempted to heal the damaged data from existing replicas.
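As a practical aside, these per-device counters are what zpool status prints; once the cause has been investigated (or a transient problem such as a loose cable has been fixed), they can be reset. A minimal sketch, assuming a pool named tank:

# show the pool layout together with the per-device READ/WRITE/CKSUM counters
zpool status tank

# additionally list any files affected by permanent (uncorrectable) errors
zpool status -v tank

# reset the counters after the cause has been dealt with
zpool clear tank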


Now, for your questions:

First, what does "device" mean in this context? Are they talking about a physical device, the vdev or even something else? My assumption is that they are talking about every "device" in the hierarchy. The vdev error count then probably is the sum of the error counts of its physical devices, and the pool error count probably is the sum of the error counts of its vdevs. Is this correct?

Each device is checked independently and its own errors are summed up. Such an error only propagates upwards if it is present on both sides of the mirror, or if the vdev is not redundant itself. So, in other words, each line shows the number of errors affecting that particular device or vdev (which is also in line with the logic of displaying each line separately).
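To illustrate (hypothetical output, device names made up): if bit rot is found on one half of a mirror and the other half delivers good data, the CKSUM count appears only on the affected leaf device, while the mirror vdev and the pool stay at zero because ZFS could repair the affected blocks:

NAME        STATE     READ WRITE CKSUM
tank        ONLINE       0     0     0
  mirror-0  ONLINE       0     0     0
    sda2    ONLINE       0     0     3
    sdb2    ONLINE       0     0     0

errors: No known data errors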

But what I am really interested in is whether there have been checksum errors at ZFS level (and not hardware level). I am currently convinced that CKSUM is showing the latter (otherwise, it wouldn't make much sense), but I'd like to know for sure.

Yes, it is the hardware side (non-permanent stuff like faulty cables, suddenly removed disks, power loss, etc.). I think that is also a matter of perspective: faults on the "software side" would mean bugs in ZFS itself, i.e. unwanted behavior that has not been checked for (assuming all normal user interactions are deemed correct) and that is not recognizable by ZFS itself. Fortunately, such bugs are quite rare nowadays. Unfortunately, they are also quite severe much of the time.

Third, assuming the checksum errors they are talking about are indeed the checksum errors at the ZFS level (and not hardware level), why on earth do they only show the count of uncorrectable errors? This does not make any sense. We would like to see every checksum error, whether correctable or not, wouldn't we? After all, a checksum error means that there has been some sort of data corruption on the disk which has not been detected by hardware, so we probably want to change that disk as soon as there is any error (even if the mirror disk can still act as "backup"). So possibly I have not yet understood what exactly they mean by "uncorrectable".

Faulty disks are already indicated by read/write errors (for example, an URE from a disk). Checksum errors are what you are describing: a block was read, but its contents did not match the checksum stored in its parent block in the tree, so instead of being returned it was discarded and noted as an error. "Uncorrectable" is more or less a matter of definition: if you get garbage and know that it is garbage, you cannot correct it, but you can ignore it and not use it (or try again). The wording might be unnecessarily confusing, though.
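If you want to watch this happen, a disposable file-backed pool makes a safe playground. A rough sketch (pool name, paths and sizes are arbitrary; only ever do this with scratch files, never with a disk you care about):

# create a throwaway mirror pool backed by two sparse files
truncate -s 256M /tmp/d1 /tmp/d2
zpool create testpool mirror /tmp/d1 /tmp/d2

# put some data on it
dd if=/dev/urandom of=/testpool/testfile bs=1M count=100

# silently damage one side of the mirror behind ZFS's back
# (offset chosen to stay clear of the vdev labels at the start and end)
dd if=/dev/urandom of=/tmp/d1 bs=1M count=16 seek=64 conv=notrunc

# force all data to be re-read and verified against the checksums
zpool scrub testpool
zpool status testpool    # CKSUM rises on /tmp/d1 only; data is healed from /tmp/d2

zpool destroy testpool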

According to that paragraph, there could be two sorts of errors: Data corruption errors and device errors. A mirror configuration of two disks is undoubtedly redundant, so (according to that paragraph) it is no data corruption error if ZFS encounters a checksum error on one of the disks (at the ZFS checksum level, not the hardware level). That means (once more according to that paragraph) that this error will not be recorded as part of the persistent error log.

Data corruption in this paragraph means that some of your files are partly or completely destroyed and unreadable, and that you need to get your last backup as soon as possible to replace them. It is the point where all of ZFS's precautions have already failed and it cannot help you anymore (but at least it informs you about this now, and not at the next server boot's checkdisk run).
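That persistent log is what zpool status -v reports at the bottom of its output; it only lists blocks for which no good copy could be found on any device. Illustrative output (file name made up):

errors: Permanent errors have been detected in the following files:

        /tank/data/somefile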

For me, the main reason for switching to ZFS was its ability to detect silent bit rot on its own, i.e. to detect and report errors on devices even if those errors did not lead to I/O failures at the hardware / driver level. But not including such errors in the persistent log would mean losing them upon reboot, and that would be fatal (IMHO).

The idea behind ZFS systems is that they do not need to be taken down to find such errors, because the file system can be checked while online. Remember, 10 years ago this was a feature that was absent from most small-scale systems. So the idea is that (on a redundant configuration, of course) you can catch read and write errors from the hardware and correct them using known-good copies. Additionally, you can scrub each month to read all data (because data that is never read cannot be known to be good) and correct any errors you find.
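A scrub runs online and in the background; assuming your root pool is named rpool, a manual run looks like this (Debian's zfsutils-linux package typically installs a monthly scrub cron job as well):

# start a scrub of the whole pool; reads and verifies every allocated block
zpool scrub rpool

# check progress and any READ/WRITE/CKSUM counters the scrub has turned up
zpool status rpool

# stop a running scrub if it gets in the way
zpool scrub -s rpool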

It is like a big archive/library of old books: you have valuable and not so valuable books, and some might decay over time, so you need a person who goes around every week or month, looks at every page of every book for mold, bugs, etc., and tells you if he finds anything. If you have two identical libraries, he can go over to the other building, look at the same page of the same book, and replace the destroyed page in the first library with a copy. If he never checked any book, you might be in for a nasty surprise 20 years later.
