I had created five 1TB HDD partitions (/dev/sda1, /dev/sdb1, /dev/sdc1, /dev/sde1, and /dev/sdf1) in a RAID 6 array called /dev/md0 using mdadm on Ubuntu 14.04 LTS Trusty Tahr.
The command sudo mdadm --detail /dev/md0 used to show all drives in active sync.
Then, for testing, I simulated long I/O blocking on /dev/sdb by running these commands while /dev/sdb1 was still active in the array:
hdparm --user-master u --security-set-pass deltik /dev/sdb
hdparm --user-master u --security-erase-enhanced deltik /dev/sdb
WARNING: DON'T TRY THIS ON DATA YOU CARE ABOUT!
I ended up corrupting 455681 inodes as a result of this ATA operation. I admit my negligence.
The ATA command for secure erase was expected to run for 188 minutes, blocking all other commands for at least that long.
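(As an aside, a drive's advertised erase duration can be read ahead of time from the ATA Security section of hdparm -I output; /dev/sdX below is a placeholder, not one of the drives above.)
# Show the identify-data lines that mention the erase feature, including
# the estimated minutes for SECURITY ERASE UNIT / ENHANCED SECURITY ERASE UNIT.
hdparm -I /dev/sdX | grep -i erase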
I expected md to drop the unresponsive drive like a proper RAID controller, but to my surprise, /dev/md0 became blocked as well.
mdadm --detail /dev/md0 queries the blocked device, so it freezes and won't output.
Here's the layout from /proc/mdstat while I can't use mdadm --detail /dev/md0:
root@node51 [~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [linear] [multipath] [raid0] [raid1] [raid10]
md0 : active raid6 sdf1[5] sda1[0] sdb1[4] sdc1[2] sde1[1]
2929887744 blocks super 1.2 level 6, 512k chunk, algorithm 2 [5/5] [UUUUU]
unused devices: <none>
I tried mdadm /dev/md0 -f /dev/sdb1 to forcefully fail /dev/sdb1, but that was also blocked:
root@node51 [~]# ps aux | awk '{if($8~"D"||$8=="STAT"){print $0}}'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 3334 1.2 0.0 42564 1800 ? D 03:21 3:37 parted -l
root 4957 0.0 0.0 13272 900 ? D 06:19 0:00 mdadm /dev/md0 -f /dev/sdb1
root 5706 0.0 0.0 13388 1028 ? D 06:19 0:00 mdadm --detail /dev/md0
root 7541 0.5 0.0 0 0 ? D Jul19 6:12 [kworker/u16:2]
root 22420 0.0 0.0 11480 808 ? D 07:48 0:00 lsblk
root 22796 0.0 0.0 4424 360 pts/13 D+ 05:51 0:00 hdparm --user-master u --security-erase-enhanced deltik /dev/sdb
root 23312 0.0 0.0 4292 360 ? D 05:51 0:00 hdparm -I /dev/sdb
root 23594 0.1 0.0 0 0 ? D 06:11 0:07 [kworker/u16:1]
root 25205 0.0 0.0 17980 556 ? D 05:52 0:00 ls --color=auto
root 26008 0.0 0.0 13388 1032 pts/23 D+ 06:32 0:00 mdadm --detail /dev/md0
dtkms 29271 0.0 0.2 58336 10412 ? DN 05:55 0:00 python /usr/share/backintime/common/backintime.py --backup-job
root 32303 0.0 0.0 0 0 ? D 06:16 0:00 [kworker/u16:0]
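(As a side note, the kernel wait channel shows roughly where each uninterruptible process is stuck; this is a generic sketch, not a capture from the machine above.)
# List processes in uninterruptible sleep (state D) along with the kernel
# function they are currently waiting in (wchan).
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'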
UPDATE (21 July 2015): After I waited the full 188 minutes for the I/O block to be cleared, surprise turned to horror when I saw that md treated the completely blanked-out /dev/sdb as if it were completely intact.
I thought that md would have at least seen that the parity was mismatched and then would have dropped /dev/sdb1.
Panicking, I ran mdadm /dev/md0 -f /dev/sdb1 again, and since the I/O block had been lifted, the command completed quickly.
Filesystem corruption was already happening as input/output errors cropped up. Still panicking, I did a lazy unmount of the data partition in the RAID array and a reboot -nf, since I figured it couldn't get any worse.
After a nail-biting e2fsck on the partition, 455681 inodes made it into lost+found.
I've since reassembled the array, and the array itself looks fine now:
root@node51 [~]# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Mon Feb 16 14:34:26 2015
Raid Level : raid6
Array Size : 2929887744 (2794.16 GiB 3000.21 GB)
Used Dev Size : 976629248 (931.39 GiB 1000.07 GB)
Raid Devices : 5
Total Devices : 5
Persistence : Superblock is persistent
Update Time : Tue Jul 21 00:00:30 2015
State : active
Active Devices : 5
Working Devices : 5
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : box51:0
UUID : 6b8a654d:59deede9:c66bd472:0ceffc61
Events : 643541
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 97 1 active sync /dev/sdg1
2 8 33 2 active sync /dev/sdc1
6 8 17 3 active sync /dev/sdb1
5 8 113 4 active sync /dev/sdh1
It's still quite a shock to me that md doesn't have the two lines of protection that I expected:
- Failing a device when it locks up
- Failing a device when the data it returns are garbage
Questions
1. Why doesn't md fail the unresponsive drive/partition?
2. Can I drop the drive/partition from the array while the drive is blocked?
3. Can a timeout be configured so that md automatically fails a drive that isn't responding to ATA commands?
4. Why does md continue to use a device with invalid data?
Best Answer
Deltik, you've misunderstood how Linux Software RAID (md) works.
md makes a virtual block device out of multiple devices or partitions and has no awareness of what data you are transferring to and from the virtual device. You hoped that it could do things that it wasn't designed to do.
Answers
1. Why doesn't md fail the unresponsive drive/partition?
This is because md has no idea whether the drive is hanging on I/O that md itself requested or on something else entirely, so md will wait to see what the drive returns. The drive eventually didn't return any read or write errors. If there had been a read error, md would have automatically fixed it from parity, and if there had been a write error, md would have failed the device (see the "Recovery" section of the md man page).
Since there was neither a read error nor a write error, md continued using the device after the kernel waited for it to respond.
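(To illustrate, md exposes each member's status through sysfs; this is a sketch using this array's device names, not a capture from the machine. The member stayed in_sync because no read or write error was ever reported.)
# How md currently classifies the member; "in_sync" means md still trusts it.
cat /sys/block/md0/md/dev-sdb1/state
# Approximate count of read errors seen on this member that did not get it evicted.
cat /sys/block/md0/md/dev-sdb1/errors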
2. Can I drop the drive/partition from the array while the drive is blocked?
No. The /dev/md0 RAID device is blocked and can't be modified until the block is cleared.
You passed the -f or --fail flag to the mdadm "Manage" mode. Here's a walkthrough of what that actually does:
The flag is handled in the mdadm source code (Manage.c). Notice the call write(sysfd, "faulty", 6). sysfd is a variable set earlier in the file:
sysfd = sysfs_open(fd2devnm(fd), dname, "block/dev");
sysfs_open() is a function defined in sysfs.c of the mdadm source. If you follow the functions, you'll find that mdadm /dev/md0 -f /dev/sdb1 essentially comes down to one sysfs write:
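(A sketch of that write, assuming the standard md sysfs layout; dev-sdb1 is the member name used in this array.)
# Marking the member faulty via sysfs is what mdadm's --fail ends up doing.
echo faulty > /sys/block/md0/md/dev-sdb1/state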
This request will be waiting and won't go through immediately because /dev/md0 is blocked.
3. Can a timeout be configured so that md automatically fails a drive that isn't responding to ATA commands?
Yes. In fact, by default, the timeout is 30 seconds:
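(A sketch of where that timeout lives, assuming the usual SCSI disk sysfs layout; sdb is just the example device.)
# Per-device SCSI command timeout in seconds; the default is 30.
cat /sys/block/sdb/device/timeout
# It can be adjusted, e.g. raised to 60 seconds:
echo 60 > /sys/block/sdb/device/timeout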
The problem with your assumption was that your drive was actually busy running an ATA command (for 188 minutes), so it wasn't timing out.
For details about this, see the Linux kernel SCSI error handling documentation.
4. Why does md continue to use a device with invalid data?
When the ATA Secure Erase finished, the drive did not report any issues, like an aborted command, so md had no reason to suspect that there was an issue.
Furthermore, in your case of using partitions as the RAID devices instead of whole disks, the kernel's in-memory partition table wasn't informed that the partition on the wiped drive was gone, so md would continue to access your /dev/sdb1 like nothing was wrong.
The md man page, in its section on scrubbing and mismatches, describes parity checking as something that has to be requested explicitly. We can infer from this that parity is not normally checked on every disk read. (Besides, checking parity on every read would be very taxing on performance, increasing the transactions necessary just to complete a read and adding the comparison of the parity against the data read.)
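(A sketch of how such a check is requested explicitly through the md sysfs interface; md0 matches this question's array.)
# Ask md to read every block and verify redundancy in the background;
# progress shows up in /proc/mdstat.
echo check > /sys/block/md0/md/sync_action
# After the check finishes, see how many mismatches were found.
cat /sys/block/md0/md/mismatch_cnt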
Under normal operation, md just assumes that the data it is reading are valid, leaving it vulnerable to silent data corruption. In your case, you had an entire drive of silently corrupted data because you wiped the drive.
Your filesystem wasn't aware of the corruption. You saw input/output errors at the filesystem level because the filesystem couldn't understand why it had bad data.
To avoid silent data corruption, first, don't ever do what you did again. Second, consider using ZFS, a filesystem that focuses on data integrity and detects and corrects silent data corruption.
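(For illustration only: standard ZFS commands, with the pool name tank as a placeholder. A scrub makes ZFS re-read every block, verify its checksum, and repair it from redundancy where possible.)
# Verify checksums of all data in the pool, repairing from redundancy if needed.
zpool scrub tank
# Show per-device read/write/checksum error counts and any damaged files.
zpool status -v tank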