Linux – RAID 5 with 4 disks fails to operate with one failed disk

linux, mdadm, raid

I found a question about mdadm spare disks which almost answers my question, but it isn't clear to me what is happening.

We have a RAID5 array set up with 4 disks, and in normal operation all of them are labeled active/sync:

    Update Time : Sun Sep 29 03:44:01 2013
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

Number   Major   Minor   RaidDevice State
   0     202       32        0      active sync   /dev/sdc
   1     202       48        1      active sync   /dev/sdd
   2     202       64        2      active sync   /dev/sde 
   4     202       80        3      active sync   /dev/sdf

But then when one of the disks failed, the RAID stopped working:

    Update Time : Sun Sep 29 01:00:01 2013
          State : clean, FAILED 
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1

Number   Major   Minor   RaidDevice State
   0     202       32        0      active sync   /dev/sdc
   1     202       48        1      active sync   /dev/sdd
   2       0        0        2      removed
   3       0        0        3      removed

   2     202       64        -      faulty spare   /dev/sde
   4     202       80        -      spare   /dev/sdf

What is really going on here??

The fix was to rebuild the RAID from scratch; luckily I could do that. Next time it'll probably have some serious data on it. I need to understand this so I can have a RAID that won't die because of a single disk failure.

I realized I didn't list what I expected vs. what happened.

I expect that a RAID5 with 3 good disks and 1 bad will operate in a degraded mode – 3 active/sync and 1 faulty.

What happened was a spare was created out of thin air and declared faulty – then a new spare was also created out of thin air and declared sound – after which the RAID was declared inoperative.
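
For reference, this is roughly what I'd expect a degraded-but-still-working RAID5 to look like in /proc/mdstat (illustrative output only; the chunk size, device ordering, and block count are assumptions, not taken from this system):

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 xvdf[4] xvde[2](F) xvdd[1] xvdc[0]
      2641817856 blocks level 5, 512k chunk, algorithm 2 [4/3] [UU_U]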

This is the output from blkid:

$ blkid
/dev/xvda1: LABEL="/" UUID="4797c72d-85bd-421a-9c01-52243aa28f6c" TYPE="ext4" 
/dev/xvdc: UUID="feb2c515-6003-478b-beb0-089fed71b33f" TYPE="ext3" 
/dev/xvdd: UUID="feb2c515-6003-478b-beb0-089fed71b33f" SEC_TYPE="ext2" TYPE="ext3" 
/dev/xvde: UUID="feb2c515-6003-478b-beb0-089fed71b33f" SEC_TYPE="ext2" TYPE="ext3" 
/dev/xvdf: UUID="feb2c515-6003-478b-beb0-089fed71b33f" SEC_TYPE="ext2" TYPE="ext3" 

The TYPE and SEC_TYPE values are interesting, since the RAID holds XFS, not ext3…
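
If it matters, the md superblocks on the members can also be read directly with mdadm --examine (just a diagnostic sketch; device names as above):

$ sudo mdadm --examine /dev/xvdc    # repeat for xvdd, xvde, xvdf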

The logs for a mount attempt on this array (which ended in the failed state shown earlier, as every other mount attempt did) contain these entries:

Oct  2 15:08:51 it kernel: [1686185.573233] md/raid:md0: device xvdc operational as raid disk 0
Oct  2 15:08:51 it kernel: [1686185.580020] md/raid:md0: device xvde operational as raid disk 2
Oct  2 15:08:51 it kernel: [1686185.588307] md/raid:md0: device xvdd operational as raid disk 1
Oct  2 15:08:51 it kernel: [1686185.595745] md/raid:md0: allocated 4312kB
Oct  2 15:08:51 it kernel: [1686185.600729] md/raid:md0: raid level 5 active with 3 out of 4 devices, algorithm 2
Oct  2 15:08:51 it kernel: [1686185.608928] md0: detected capacity change from 0 to 2705221484544
Oct  2 15:08:51 it kernel: [1686185.615772] md: recovery of RAID array md0
Oct  2 15:08:51 it kernel: [1686185.621150] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct  2 15:08:51 it kernel: [1686185.627626] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  2 15:08:51 it kernel: [1686185.634024]  md0: unknown partition table
Oct  2 15:08:51 it kernel: [1686185.645882] md: using 128k window, over a total of 880605952k.
Oct  2 15:22:25 it kernel: [1686999.697076] XFS (md0): Mounting Filesystem
Oct  2 15:22:26 it kernel: [1686999.889961] XFS (md0): Ending clean mount
Oct  2 15:24:19 it kernel: [1687112.817845] end_request: I/O error, dev xvde, sector 881423360
Oct  2 15:24:19 it kernel: [1687112.820517] raid5_end_read_request: 1 callbacks suppressed
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423360 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Disk failure on xvde, disabling device.
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Operation continuing on 2 devices.
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423368 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423376 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423384 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423392 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423400 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423408 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423416 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423424 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423432 on xvde).
Oct  2 15:24:19 it kernel: [1687113.432129] md: md0: recovery done.
Oct  2 15:24:19 it kernel: [1687113.685151] Buffer I/O error on device md0, logical block 96
Oct  2 15:24:19 it kernel: [1687113.691386] Buffer I/O error on device md0, logical block 96
Oct  2 15:24:19 it kernel: [1687113.697529] Buffer I/O error on device md0, logical block 64
Oct  2 15:24:20 it kernel: [1687113.703589] Buffer I/O error on device md0, logical block 64
Oct  2 15:25:51 it kernel: [1687205.682022] Buffer I/O error on device md0, logical block 96
Oct  2 15:25:51 it kernel: [1687205.688477] Buffer I/O error on device md0, logical block 96
Oct  2 15:25:51 it kernel: [1687205.694591] Buffer I/O error on device md0, logical block 64
Oct  2 15:25:52 it kernel: [1687205.700728] Buffer I/O error on device md0, logical block 64
Oct  2 15:25:52 it kernel: [1687205.748751] XFS (md0): last sector read failed

I don't see xvdf listed there…
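
If it helps, a quick way to check what md currently thinks xvdf is doing (diagnostic sketch; output will vary):

$ cat /proc/mdstat
$ dmesg | grep xvdf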

Best Answer

This is a fundamental problem with RAID5—bad blocks on rebuild are a killer.

Oct  2 15:08:51 it kernel: [1686185.573233] md/raid:md0: device xvdc operational as raid disk 0
Oct  2 15:08:51 it kernel: [1686185.580020] md/raid:md0: device xvde operational as raid disk 2
Oct  2 15:08:51 it kernel: [1686185.588307] md/raid:md0: device xvdd operational as raid disk 1
Oct  2 15:08:51 it kernel: [1686185.595745] md/raid:md0: allocated 4312kB
Oct  2 15:08:51 it kernel: [1686185.600729] md/raid:md0: raid level 5 active with 3 out of 4 devices, algorithm 2
Oct  2 15:08:51 it kernel: [1686185.608928] md0: detected capacity change from 0 to 2705221484544
⋮

The array has been assembled, degraded, using xvdc, xvde, and xvdd. Apparently there is also a hot spare:

Oct  2 15:08:51 it kernel: [1686185.615772] md: recovery of RAID array md0
Oct  2 15:08:51 it kernel: [1686185.621150] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct  2 15:08:51 it kernel: [1686185.627626] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  2 15:08:51 it kernel: [1686185.634024]  md0: unknown partition table
Oct  2 15:08:51 it kernel: [1686185.645882] md: using 128k window, over a total of 880605952k.

The 'partition table' message is unrelated. The other messages are telling you that md is attempting a recovery, probably onto a hot spare (which might be the device that failed out before, if you've attempted to remove and re-add it).
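
If you want to confirm which device md is rebuilding onto, both /proc/mdstat and mdadm --detail will show it (a sketch, not output from this system):

# cat /proc/mdstat           # the rebuild target shows a "recovery =" progress line
# mdadm --detail /dev/md0    # the target is listed with the state "spare rebuilding"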

⋮
Oct  2 15:24:19 it kernel: [1687112.817845] end_request: I/O error, dev xvde, sector 881423360
Oct  2 15:24:19 it kernel: [1687112.820517] raid5_end_read_request: 1 callbacks suppressed
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423360 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Disk failure on xvde, disabling device.
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Operation continuing on 2 devices.

And here md is attempting to read a sector from xvde (one of the three remaining devices). That read fails [a bad sector, probably], and since the array is already degraded, md cannot reconstruct the data from parity. It therefore kicks the disk out of the array, and with a double-disk failure, your RAID5 is dead.

I'm not sure why it's being labeled as a spare—that's weird (though I guess I normally look at /proc/mdstat, so maybe that's just how mdadm labels it). Also, I thought newer kernels were much more hesitant to kick a disk out for bad blocks—but maybe you're running something older?

What can you do about this?

Good backups. That's always an important part of any strategy to keep data alive.

Make sure the array gets scrubbed for bad blocks routinely. Your OS may already include a cron job for this. You trigger a scrub by echoing either repair or check to /sys/block/md0/md/sync_action. "Repair" will also repair any discovered parity errors (i.e., cases where the parity does not match the data on the disks).

# echo repair > /sys/block/md0/md/sync_action
#

Progress can be watched with cat /proc/mdstat, or via the various files in that sysfs directory. (You can find somewhat up-to-date documentation in the Linux Raid Wiki mdstat article.)
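
For example (a sketch; mismatch_cnt is where the parity-mismatch count ends up on reasonably recent kernels):

# echo check > /sys/block/md0/md/sync_action    # read-only scrub, as opposed to repair
# cat /proc/mdstat                              # shows scrub progress
# cat /sys/block/md0/md/sync_action             # "check", "repair", or "idle"
# cat /sys/block/md0/md/mismatch_cnt            # non-zero means parity mismatches were found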

NOTE: On older kernels—I'm not sure of the exact version—check may not fix bad blocks.

One final option is to switch to RAID6. This will require another disk (while you can run a four- or even three-disk RAID6, you probably don't want to). With new enough kernels, bad blocks are fixed on the fly whenever possible. RAID6 can survive two disk failures, so when one disk has failed it can still survive a bad block—it will map out the bad block and continue the rebuild.
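
If you go that route, mdadm can reshape an existing RAID5 into RAID6 in place once the extra disk has been added (a sketch with an assumed device name for the new disk; the reshape is slow and you want a verified backup before starting):

# mdadm /dev/md0 --add /dev/xvdg     # xvdg is the assumed new disk
# mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-reshape-backup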
