CentOS – How to debug/resolve serious ZFS issues

centos hard-disk zfs

I have had an ongoing saga with a home data server I set up, and have switched out almost every part other than the drives themselves.

Starting with software RAID in CentOS, I've had one set of 5 drives operating, literally flawlessly, for two years in RAID 0. Totally the most dangerous way to run a RAID. The other five drives, identical and from the same batch as the first 5, have always been in some form of RAID 5 configuration, first using software RAID, then later ZFS after a complete rebuild. This set has always, periodically, after months of bulletproof service, just given up and gone offline in more or less spectacular ways.

The drives have lived in external enclosures connected initially by multiplexed eSATA and now by multiplexed USB3.

At first I thought the issue might be with the cheap enclosure/multiplexer, so I swapped the 5 drives of the RAID 0 and the RAID 5 arrays between the two enclosures I had. The RAID 0 continued flawlessly, the RAID 5 continued to have these periodic blackouts.

I had the first hint that the issue was with a drive in the set, but no single one of the five drives has ever had more issues than any other. So I wondered if maybe RAID 5 had some strange power requirement that was tripping the enclosure, and invested in another enclosure, this time a USB3-connected box – the USB3 connection is much more positive than the eSATA one.

So that has worked solidly for six months, until today. On the terminal I received 5 sequential messages:

WARNING: Your hard drive is failing
Device: /dev/sda [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdb [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdc [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdd [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sde [SAT], unable to open device

I have eliminated the box, the connection, the multiplexer, and the PCIe eSATA extension boards, so the problem must surely be with the drives; but short of throwing them all away, I can't think how to debug this. When it first happened, zpool status showed near-equal errors for all drives, and it is curious that they all went out in alphabetical order.
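
One check worth running here (a sketch, assuming systemd's journal is available; plain dmesg is the fallback) is whether the kernel log shows USB bus resets or link drops around the time the array goes away:

journalctl -k --since "1 hour ago" | grep -iE 'usb|uas|reset|sd[a-e]'
# or, without the journal:
dmesg | grep -iE 'usb|uas|reset|sd[a-e]' | tail -n 50

If the enclosure's USB-SATA bridge is resetting, all five disks drop and re-enumerate together, which would fit with them going out in alphabetical order.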

I ran zpool clear, it resilvered, everything was good for a while, then it stopped responding. Now zpool status literally hangs the terminal and is immune to Ctrl+C.
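
For the record, the clear-and-resilver above was just the standard sequence, roughly (the pool is named tank):

zpool clear tank     # reset the error counters; a resilver starts if the vdevs need one
zpool status tank    # watch the resilver – this is the command that now hangs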

New information:

/dev/sda through /dev/sde have spontaneously renamed themselves to /dev/sda1 through /dev/sde1. Since there was no read or write activity, I power cycled the drive box. The devices disappeared and then reappeared as expected, but still with the 1 suffixes on their names.
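
To compare what the kernel is enumerating with what the pool was built on, something like this should show whether ZFS imported the whole-disk nodes or the partitions behind those 1 suffixes (a sketch; adjust the grep to match the drive model):

lsblk -o NAME,SIZE,TYPE,MOUNTPOINT
ls -l /dev/disk/by-id/ | grep -i WDC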

Update (06/03/2017):

Using the Oracle documentation, I tried setting failmode to continue:

zpool set failmode=continue tank
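
To confirm the property took (the other failmode values are wait, the default, and panic):

zpool get failmode tank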

In this mode I continue to periodically get

WARNING: Your hard drive is failing
Device: /dev/sda [SAT], unable to open device

and the drives in the array all accrue write errors:

    NAME                        STATE     READ WRITE CKSUM
    tank                        ONLINE       0    16    59
      raidz1-0                  ONLINE       0    32   118
        ata-WDC_WDC_WD10-68...  ONLINE       0    14     0
        ata-WDC_WDC_WD10-68...  ONLINE       0    12     0
        sda                     ONLINE       0    12     0
        ata-WDC_WDC_WD10-68...  ONLINE       0    12     0
        ata-WDC_WDC_WD10-68...  ONLINE       0    14     0

errors: 67 data errors, use '-v' for a list

However, at this point at least zpool stays alive and does not hang the terminal indefinitely or hang other pools.

It is interesting that only writes are accruing errors, on all of the drives and in roughly equal numbers.
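
At this point the most useful per-drive evidence is probably the SMART data pulled through the USB bridge, plus the list of files behind the 67 data errors. A sketch, assuming the bridge passes ATA commands through (the -d sat flag tells smartctl to tunnel them):

smartctl -d sat -a /dev/sda    # repeat for sdb–sde; look at reallocated/pending sector counts
zpool status -v tank           # list the files affected by the data errors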

Best Answer

As the message is generated by smartd's notification mechanism and the system is genuinely having trouble accessing the devices, I'd recommend investigating the drive issues first, as this looks like a hardware problem.

And there's nothing ZFS can do about this. Once the faulty hard drive (or cable, or controller) has been replaced, ZFS may be able to restore the pool.
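
A minimal sketch of that replacement step, assuming the pool is tank and using placeholder names for the old and new disks (take the real names from zpool status and /dev/disk/by-id):

zpool offline tank OLD-DISK-ID               # optionally take the suspect member out first
zpool replace tank OLD-DISK-ID NEW-DISK-ID   # resilver onto the replacement drive
zpool status tank                            # watch the resilver complete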
