Linux – How does Linux md-RAID handle disk read errors

disk · error-handling · linux-kernel · md · software-raid

There are 2 cases:

  • the read command times out at the kernel level (30 seconds by default),
  • the drive reports its inability to read a given sector before the kernel loses patience (the case I'm interested in).

Kernel timeout

As drive access usually goes through the Linux SCSI layer, I think the timeout case is handled entirely by that layer. According to this documentation, it retries the command several times after resetting the drive, then the bus, then the host, etc. If none of this works, the SCSI layer will offline the device. At this point, I think the md layer just "discovers" that one drive is gone and marks it as missing (failed). Is this correct?
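For reference, that per-command timeout is exposed per device in sysfs. A minimal sketch to inspect or change it (the device name sda is just a placeholder; writing requires root):

    # Sketch: inspect/adjust the per-device SCSI command timeout via sysfs.
    # The value in /sys/block/<dev>/device/timeout is in seconds.
    from pathlib import Path

    def scsi_timeout_path(dev: str) -> Path:
        # dev is a kernel block device name, e.g. "sda" (placeholder)
        return Path(f"/sys/block/{dev}/device/timeout")

    def get_timeout(dev: str) -> int:
        return int(scsi_timeout_path(dev).read_text().strip())

    def set_timeout(dev: str, seconds: int) -> None:
        scsi_timeout_path(dev).write_text(f"{seconds}\n")

    if __name__ == "__main__":
        print("current timeout:", get_timeout("sda"), "seconds")
        # set_timeout("sda", 180)  # e.g. raise it for drives without ERC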

Drive reported error

Some drives can be configured to report a read error after a certain timeout is reached, aborting internal recovery attempts. This is called ERC (or TLER, or CCTL). The drive timeout is usually configured to trigger before the OS timeout (or the hardware RAID controller's), so that the latter knows what really happened instead of just waiting and then aborting.
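On drives that support it, ERC can typically be queried and set through smartctl's SCT ERC interface. A small sketch, assuming smartmontools is installed and /dev/sda (a placeholder) is the drive in question:

    # Sketch: query/set SCT Error Recovery Control via smartctl.
    # Requires root and a drive that supports SCT ERC.
    import subprocess

    DEV = "/dev/sda"  # placeholder device path

    # Show the current ERC read/write timeouts
    subprocess.run(["smartctl", "-l", "scterc", DEV], check=True)

    # Set both timeouts to 7.0 seconds (values are in tenths of a second)
    subprocess.run(["smartctl", "-l", "scterc,70,70", DEV], check=True)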

My question is: how does Linux (and md) handle drive-reported read errors?

Will it try again, do something clever, or just offline the drive without going through all the attempts described in "Kernel timeout" above? Is md even aware when such a thing happens?

Some people suggest that ERC is dangerous on Linux as it will not give the drive enough time to try to recover. They also say that ZFS RAID is nice because, if a read error occurs, it will reconstruct the unreadable sector's data from RAID redundancy and write it back to the drive. The drive should then stop trying to read the nasty sector, automatically mark it as bad (not to be used anymore), and remap it to a healthy spare sector.

Is md also capable of doing this?

Best Answer

This is described in some detail in the md(4) man page, section RECOVERY.

[...] a read-error will instead cause md to attempt a recovery by overwriting the bad block. i.e. it will find the correct data from elsewhere, write it over the block that failed, and then try to read it back again. If either the write or the re-read fail, md will treat the error the same way that a write error is treated, and will fail the whole device.
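In practice, this rewrite-from-redundancy path can also be exercised deliberately with a scrub, which md exposes through sysfs. A minimal sketch, assuming an array at /dev/md0 (the array name is a placeholder) and root privileges:

    # Sketch: trigger an md scrub, which reads every sector and lets md
    # rewrite unreadable blocks from redundancy, as described above.
    from pathlib import Path

    MD = Path("/sys/block/md0/md")  # placeholder array

    # "check" only reads and counts inconsistencies; "repair" also rewrites them.
    (MD / "sync_action").write_text("repair\n")

    # Progress and results can be monitored via the same sysfs directory:
    print("current action:", (MD / "sync_action").read_text().strip())
    print("mismatch count:", (MD / "mismatch_cnt").read_text().strip())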

As for timeouts, while there are reports of drives getting kicked out if they were in standby, it has never actually happened to me. I have 7 HDDs which usually spin down (as the main system runs off an SSD and can get by without HDD access for long periods of time), and it works without a problem (except that md wakes the drives one after the other instead of all at once).

I guess it depends on what the other layers report to md.