Deltik, you've misunderstood how Linux Software RAID (md) works.

md makes a virtual block device out of multiple devices or partitions and has no awareness of what data you are transferring to and from the virtual device. You hoped that it could do things that it wasn't designed to do.
Answers
1. Why doesn't md fail the unresponsive drive/partition?

This is because md has no idea whether
- the drive is busy with I/O from something that md itself requested, or
- the drive was blocked due to some external circumstance like the drive's own error recovery or, in your case, an ATA Secure Erase,

so md will wait to see what the drive returns. The drive eventually didn't return any read or write errors. If there had been a read error, md would have automatically fixed it from parity, and if there had been a write error, md would have failed the device (see the "Recovery" section of the md man page).

Since there was neither a read error nor a write error, md continued using the device after the kernel waited for it to respond.
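For reference, you can see how md views each member from /proc/mdstat, mdadm --detail, and sysfs. The paths below assume the array is /dev/md0 with member /dev/sdb1; the errors attribute, where your kernel exposes it, counts read errors that md corrected without evicting the device:

cat /proc/mdstat                          # array status at a glance
mdadm --detail /dev/md0                   # per-member state (active, faulty, spare, ...)
cat /sys/block/md0/md/dev-sdb1/errors     # corrected read errors recorded for this member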
2. Can I drop the drive/partition from the array while the drive is blocked?

No. The /dev/md0 RAID device is blocked and can't be modified until the block is cleared.

You passed the -f or --fail flag to the mdadm "Manage" mode. Here's a walkthrough of what that flag actually does. This is the source code that handles it:
case 'f': /* set faulty */
/* FIXME check current member */
if ((sysfd >= 0 && write(sysfd, "faulty", 6) != 6) ||
(sysfd < 0 && ioctl(fd, SET_DISK_FAULTY,
rdev))) {
if (errno == EBUSY)
busy = 1;
pr_err("set device faulty failed for %s: %s\n",
dv->devname, strerror(errno));
if (sysfd >= 0)
close(sysfd);
goto abort;
}
if (sysfd >= 0)
close(sysfd);
sysfd = -1;
count++;
if (verbose >= 0)
pr_err("set %s faulty in %s\n",
dv->devname, devname);
break;
Notice the call write(sysfd, "faulty", 6). sysfd is a variable set earlier in the file:
sysfd = sysfs_open(fd2devnm(fd), dname, "block/dev");
sysfs_open() is a function from this file:
int sysfs_open(char *devnm, char *devname, char *attr)
{
char fname[50];
int fd;
sprintf(fname, "/sys/block/%s/md/", devnm);
if (devname) {
strcat(fname, devname);
strcat(fname, "/");
}
strcat(fname, attr);
fd = open(fname, O_RDWR);
if (fd < 0 && errno == EACCES)
fd = open(fname, O_RDONLY);
return fd;
}
If you follow the functions, you'll find that mdadm /dev/md0 -f /dev/sdb1 essentially does this:
echo "faulty" > /sys/block/md0/md/dev-sdb1/block/dev
This request will be waiting and won't go through immediately because /dev/md0 is blocked.
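You can confirm the blockage from another shell: while the array is blocked, the hung mdadm call sits in uninterruptible sleep. A quick illustration (D in the STAT column means the process is stuck waiting on the kernel):

ps -o pid,stat,cmd -C mdadm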
3. Can a timeout be configured so that md automatically fails a drive that isn't responding to ATA commands?
Yes. In fact, by default, the timeout is 30 seconds:
root@node51 [~]# cat /sys/block/sdb/device/timeout
30
The problem with your assumption was that your drive was actually busy running an ATA command (for 188 minutes), so it wasn't timing out.
For details about this, see the Linux kernel SCSI error handling documentation.
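If you need a longer or shorter window for a particular disk, you can write a new value (in seconds) to the same sysfs attribute. Note that this is not persistent across reboots, so you would need a udev rule or boot script to make it stick; the 60 below is just an example value:

echo 60 > /sys/block/sdb/device/timeout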
4. Why does md continue to use a device with invalid data?

When the ATA Secure Erase finished, the drive did not report any issues, like an aborted command, so md had no reason to suspect that there was an issue.

Furthermore, in your case of using partitions as the RAID devices instead of whole disks, the kernel's in-memory partition table wasn't informed that the partition on the wiped drive was gone, so md would continue to access your /dev/sdb1 like nothing was wrong.
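For completeness, either of the commands below asks the kernel to re-read a disk's partition table (shown here only as an illustration; they may be refused while md still holds the stale partition open):

blockdev --rereadpt /dev/sdb
partprobe /dev/sdb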
This is from the md man page:
Scrubbing and Mismatches
As storage devices can develop bad blocks at any time it is valuable to regularly read all blocks on all devices in an array so as to catch such bad blocks early. This process is called scrubbing.
md arrays can be scrubbed by writing either check or repair to the file md/sync_action in the sysfs directory for the device.
Requesting a scrub will cause md to read every block on every device in the array, and check that the data is consistent. For RAID1 and RAID10, this means checking that the copies are identical. For RAID4, RAID5, RAID6 this means checking that the parity block is (or blocks are) correct.
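In practice, triggering a scrub through that sysfs interface looks like this (a minimal illustration, assuming the array is /dev/md0):

echo check > /sys/block/md0/md/sync_action     # read-only consistency check
cat /proc/mdstat                               # watch the progress
cat /sys/block/md0/md/mismatch_cnt             # sectors found inconsistent by the last check
echo repair > /sys/block/md0/md/sync_action    # repair: also correct the inconsistencies it finds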
We can infer from this that parity is not normally checked on every disk read. (Checking parity on every read would also be very taxing on performance, since every read would require extra transactions to fetch the parity and compare it against the data.)

Under normal operation, md just assumes that the data it is reading is valid, leaving it vulnerable to silent data corruption. In your case, you had an entire drive of silently corrupted data because you wiped the drive.
Your filesystem wasn't aware of the corruption. You saw input/output errors at the filesystem level because the filesystem couldn't understand why it had bad data.
To avoid silent data corruption, first, don't ever do what you did again. Second, consider using ZFS, a filesystem that focuses on data integrity and detects and corrects silent data corruption.
Best Answer
In addition to the regular logging system, BTRFS does have a stats command, which keeps track of errors (including read, write and corruption/checksum errors) per drive:
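For example (the mount point /data is an assumption; substitute your own):

/sbin/btrfs device stats /data

Each device in the filesystem gets its own set of counters (write_io_errs, read_io_errs, flush_io_errs, corruption_errs, generation_errs), all of which should stay at zero on a healthy array.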
So you could create a simple root cronjob:
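The following is only a sketch, assuming the filesystem is mounted at /data and with admin@example.com standing in for your real address; because cron mails any output a job produces, the grep that hides the zero counters means you only get mail when a counter is non-zero:

MAILTO=admin@example.com
@hourly /sbin/btrfs device stats /data | grep -vE ' 0$'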
This will check for positive error counts every hour and send you an email. Obviously, you would test such a scenario (for example by causing corruption or removing the grep) to verify that the email notification works.
In addition, with advanced filesystems like BTRFS (that have checksumming) it's often recommended to schedule a scrub every couple of weeks to detect silent corruption caused by a bad drive.
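One way to schedule that from the same root crontab (again assuming the /data mount point; pick whatever interval suits you):

@monthly /sbin/btrfs scrub start -B /data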
The -B option will keep the scrub in the foreground, so that you will see the results in the email cron sends you. Otherwise, it'll run in the background and you would have to remember to check the results manually, as they would not be in the email.

Update: Improved grep as suggested by Michael Kjörling, thanks.
Update 2: Additional notes on scrubbing vs. regular read operations (this doesn't apply only to BTRFS):
As pointed out by Ioan, a scrub can take many hours, depending on the size and type of the array (and other factors), even more than a day in some cases. It is also an active scan: it won't detect future errors; the goal of a scrub is to find and fix errors on your drives at that point in time. As with other RAID systems, it is recommended to schedule periodic scrubs.

It's true that a typical I/O operation, like reading a file, does check whether the data that was read is actually correct. But consider a simple mirror: if the first copy of the file is damaged, maybe by a drive that's about to die, but the second copy, which is correct, is the one actually read by BTRFS, then BTRFS won't know that there is corruption on one of the drives. This is simply because the requested data has been received and it matches the checksum BTRFS has stored for this file, so there's no need for BTRFS to read the other copy. This means that even if you specifically read a file that you know is corrupted on one drive, there is no guarantee that the corruption will be detected by this read operation.
Now, let's assume that BTRFS only ever reads from the good drive, no scrub is run that would detect the damage on the bad drive, and then the good drive goes bad as well - the result would be data loss (at least BTRFS would know which files are still correct and will still allow you to read those). Of course, this is a simplified example; in reality, BTRFS won't always read from one drive and ignore the other.
But the point is that periodic scrubs are important because they will find (and fix) errors that regular read operations won't necessarily detect.
Faulted drives: Since this question is quite popular, I'd like to point out that this "monitoring solution" is for detecting problems with possibly bad drives (e.g., dying drive causing errors but still accessible).
On the other hand, if a drive is suddenly gone (disconnected or completely dead rather than dying and producing errors), it would be a faulted drive (ZFS would mark such a drive as FAULTED). Unfortunately, BTRFS may not realize that a drive is gone while the filesystem is mounted, as pointed out in this mailing list entry from 09/2015 (it's possible that this has been patched):
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg46598.html
There'd be tons of error messages in dmesg by that time, so grepping dmesg might not be reliable.
For a server using BTRFS, it might be an idea to have a custom check (cron job) that sends an alert if at least one of the drives in the RAID array is gone, i.e., not accessible anymore...
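A minimal sketch of such a check, assuming the filesystem is mounted at /data, that local mail delivery works, and that btrfs filesystem show prints a warning containing the word "missing" when a member device has disappeared (verify that wording on your version before relying on it):

#!/bin/sh
# Hypothetical cron script: alert if btrfs reports a missing device on /data.
if btrfs filesystem show /data 2>&1 | grep -qi 'missing'; then
    echo "BTRFS device missing on $(hostname)" | mail -s "BTRFS alert" admin@example.com
fi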