Externally attached ZFS pool hangs up, no sign of errors on drives

Tags: enclosures, sata, zfs

I have an array of five 1 TB WD Red drives in an external enclosure behind a SATA multiplexer. The enclosure feeds into a desktop machine through a SATA multiplexer controller card.

After about a year of service (this has now happened twice) the array starts to reset itself, as shown in this video. There is no indication that any particular drive is at fault; the enclosure simply shuts down and every drive in the array disconnects at once.

I have two such enclosures, and the fault always follows the array of drives when I move them from one enclosure to the other. The enclosures have remained constant for years, as have the interface cards, yet installing new drives has each time solved the issue for roughly another year.

It could be dozens of things, from a noisy power supply slowly killing the drives' power circuitry to a bad ZFS implementation in the OS, and it is hard to know where to even start. What strategy would let me find out what the problem actually is?

  • OS: CentOS 7.0, kernel: 3.10.0

  • Enclosure: SiI 3726 multiplexer

  • Interface card: SiI 3132 demultiplexer

  • Hard Drives: WD10EFRX

Messages:

When the resets are occurring:

[ttt.tttt] ata4.03: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
[ttt.tttt] ata4.03: failed command: WRITE DMA EXT
[ttt.tttt] ata4.03: cmd 35/00:.. ...:00/e0 tag 3 dma 144688 out
[ttt.tttt] ata4.03: status: { Busy }
[ttt.tttt] ata4.03: error: { ICRC UNC AMNF IDNF ABRT }

Once zpool has stopped completely:

[ttt.tttt] INFO: task txg_sync:xxxxx blocked for more than 120 seconds
[ttt.tttt] INFO: task zpool:xxxxx blocked for more than 120 seconds

Once the second message has appeared in response to a terminal command like

$ zpool status

the system is essentially useless and requires a full reboot.

The problem does not correlate with a drop in the voltages supplied to the drives, as can be seen in the latest video. I think it is a key piece of information that the box itself is resetting: all of its lights, even its own power light, go out and come back.

The messages to dmesg are vast, much too long to attach.
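
The relevant lines can be pulled out of dmesg with a rough filter along these lines (only an illustration, not the exact command used here):

$ dmesg | grep -E 'ata[0-9]+(\.[0-9]+)?: (exception|failed command|SError|hard resetting link)'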

Errors seen while running badblocks:

$ badblocks -vn /dev/sdp1
irq_stat 0x00060002, device error via SDB FIS
SError: { Handshk }
failed command: WRITE FPDMA QUEUED
cmd 61/...
res 41/... ...Emask 0x410 (ATA bus error) <F>
status: { DRDY ERR }
error: { ICRC ABRT }

This occurs equally for all five drives in the array. It is as if the box is being overloaded and resetting itself.
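
For anyone repeating the test, the whole set can be checked with a loop of roughly this shape (the device names are placeholders for the five members; -n is badblocks' non-destructive read-write mode):

$ for d in /dev/sd[l-p]1; do badblocks -vn "$d" 2>&1 | tee "badblocks-$(basename "$d").log"; done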

Update: 06/12/2017

All drives were moved to a second enclosure, connected over USB 3.0 rather than eSATA.

  • Enclosure: ICY BOX IB-3810U3
    • Multiplexer chip: ASMedia ASM1074L
  • Server motherboard USB3 host: Gigabyte GA-B85-HD3 SKT 1150
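
As a sanity check (not part of the original notes), something like the following confirms the pool members now enumerate over USB rather than SATA:

$ lsblk -S -o NAME,TRAN,VENDOR,MODEL
# TRAN should read "usb" for the five drives behind the IB-3810U3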

With all drives moved to the new enclosure, badblocks was run on each drive without a single error. The pool was then imported and a scrub started; no errors were found and the scrub completed successfully. Today, however, the following message was logged for all five drives (it was impossible to tell whether they were the drives of this pool/tank/array):

WARNING: Your hard drive is failing
Device: /dev/sdk [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdl [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdm [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdn [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdo [SAT], unable to open device

After this, an attempt to list the contents of the drive locked up the terminal, and a new terminal locked up on any zpool command. top lists txg_sync and a horde of z_rd_int_x processes, all showing some CPU usage. Two other pools are successfully serving files over Samba, with one apparently continuing to resilver itself (evidenced only by the HDD activity lights), since zpool status hangs.
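
Since zpool status itself hangs, the stuck tasks can only be inspected indirectly; one way (an assumption, not something done above) is to list the processes in uninterruptible sleep, which is where txg_sync and the z_rd_int_x workers sit when the pool stalls:

$ ps axo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/'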

smartctl data: 12/12/2017

As suggested by a commenter, the following is the smartctl data for UDMA_CRC_Error_Count.
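
For the drives behind the USB bridge the attribute is read via SAT passthrough, roughly as below; the device name is a placeholder, and for the internally hosted drives the -d sat option is unnecessary:

$ smartctl -d sat -A /dev/sdk | grep UDMA_CRC_Error_Count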

For the second iteration of array currently failing:

4193, 4030, 3939, 2869, 3977

For the original array (with drive three having been swapped out):

3003, 3666,    0, 4536, 5309

For a RAID 0 stripe in the same enclosure with the same connectivity:

 523,  504,  526,  553,  476

For a ZFS mirror with hot spare hosted inside the host machine:

   0,    0,    0

On a Seagate Archive drive, the reported values appear to be nonsense:

Temperature_Celsius     UDMA_CRC_Error_Count    Head_Flying_Hours
40 (0 16 0 0 0)         0                       57501022168585

This potentially just goes to show that eSATA and USB 3.0 are inherently noisy and that data corruption over them is inevitable.

Best Answer

The SMART statistics indicate that the hard drives have seen CRC errors on their links. (To verify that it is not an issue that has since been resolved, monitor the value of UDMA_CRC_Error_Count over time; it is a running total of errors over the disk's lifetime.)
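
A minimal way to take those periodic readings (the device list, the -d sat option, and the log path are all placeholders, not from the answer itself):

$ for d in /dev/sd[k-o]; do echo "$(date '+%F %T') $d $(smartctl -d sat -A "$d" | awk '/UDMA_CRC_Error_Count/ {print $NF}')"; done >> /root/udma_crc_trend.log

Run from cron (hourly, say): if the logged raw value keeps climbing, the link is still taking CRC hits, whereas a constant value means the damage is historical.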

The cases where I have previously seen this involved bad SATA cables, and swapping the cable resolved the issue (the counters keep their accumulated values, but they stop increasing). This is quite a complex setup, though, so the problem might be in a cable, in the mux/demux, or somewhere in the enclosure.
