Linux – How does Linux md-RAID handle disk read errors

disk · error-handling · linux-kernel · md · software-raid

There are 2 cases:

  • the read command times out at the kernel level (30 seconds by default),
  • the drive reports its inability to read a given sector before the kernel loses patience (the case I'm interested in).

Kernel timeout

As drive access usually goes through the Linux SCSI layer, I think the timeout case is handled entirely by that layer. According to this documentation, it retries the command several times after resetting the drive, then the bus, then the host, etc. If none of this works, the SCSI layer will offline the device. At this point, I think the md layer just "discovers" that one drive is gone and marks it as missing (failed). Is this correct?
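For reference, that per-command timeout is exposed per device in sysfs. A minimal sketch to inspect or change it (the device name sda is just a placeholder; writing requires root):

    # Sketch: inspect/adjust the per-device SCSI command timeout via sysfs.
    # The value in /sys/block/<dev>/device/timeout is in seconds.
    from pathlib import Path

    def scsi_timeout_path(dev: str) -> Path:
        # dev is a kernel block device name, e.g. "sda" (placeholder)
        return Path(f"/sys/block/{dev}/device/timeout")

    def get_timeout(dev: str) -> int:
        return int(scsi_timeout_path(dev).read_text().strip())

    def set_timeout(dev: str, seconds: int) -> None:
        scsi_timeout_path(dev).write_text(f"{seconds}\n")

    if __name__ == "__main__":
        print("current timeout:", get_timeout("sda"), "seconds")
        # set_timeout("sda", 180)  # e.g. raise it for drives without ERC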

Drive reported error

Some drives can be configured to report a read error after a certain timeout is reached, aborting internal recovery attempts. This is called ERC (or TLER, or CCTL). The drive timeout is usually configured to trigger before the OS timeout (or the hardware RAID controller's), so that the latter knows what really happened instead of just waiting and then aborting.
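On drives that support it, ERC can typically be queried and set through smartctl's SCT ERC interface. A small sketch, assuming smartmontools is installed and /dev/sda (a placeholder) is the drive in question:

    # Sketch: query/set SCT Error Recovery Control via smartctl.
    # Requires root and a drive that supports SCT ERC.
    import subprocess

    DEV = "/dev/sda"  # placeholder device path

    # Show the current ERC read/write timeouts
    subprocess.run(["smartctl", "-l", "scterc", DEV], check=True)

    # Set both timeouts to 7.0 seconds (values are in tenths of a second)
    subprocess.run(["smartctl", "-l", "scterc,70,70", DEV], check=True)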

My question is: how does Linux (and md) handle drive-reported read errors?

Will it try again, do something clever, or just offline the drive without going through all the attempts described in "Kernel timeout" above? Is md even aware when such a thing happens?

Some people suggest that ERC is dangerous on Linux as it will not give the drive enough time to try to recover. They also say that ZFS RAID is nice because, if a read error occurs, it will reconstruct the unreadable sector's data from RAID redundancy and write it back to the drive. The drive should then stop trying to read the nasty sector, automatically mark it as bad (not to be used anymore), and remap it to a healthy spare sector.

Is md also capable of doing this?

Best Answer

This is described in some detail in the md(4) man page, section RECOVERY.

[...] a read-error will instead cause md to attempt a recovery by overwriting the bad block. i.e. it will find the correct data from elsewhere, write it over the block that failed, and then try to read it back again. If either the write or the re-read fail, md will treat the error the same way that a write error is treated, and will fail the whole device.
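In practice, this rewrite-from-redundancy path can also be exercised deliberately with a scrub, which md exposes through sysfs. A minimal sketch, assuming an array at /dev/md0 (the array name is a placeholder) and root privileges:

    # Sketch: trigger an md scrub, which reads every sector and lets md
    # rewrite unreadable blocks from redundancy, as described above.
    from pathlib import Path

    MD = Path("/sys/block/md0/md")  # placeholder array

    # "check" only reads and counts inconsistencies; "repair" also rewrites them.
    (MD / "sync_action").write_text("repair\n")

    # Progress and results can be monitored via the same sysfs directory:
    print("current action:", (MD / "sync_action").read_text().strip())
    print("mismatch count:", (MD / "mismatch_cnt").read_text().strip())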

As for timeouts, while there are reports of drives getting kicked out if they were in standby, it has never actually happened to me. I have 7 HDDs which usually spin down (as the main system runs off an SSD and can get by without HDD access for long periods of time), and it works without a problem (except that md wakes the drives one after the other instead of all at once).

I guess it depends on what the other layers report to md.