This is a fundamental problem with RAID5—bad blocks on rebuild are a killer.
Oct 2 15:08:51 it kernel: [1686185.573233] md/raid:md0: device xvdc operational as raid disk 0
Oct 2 15:08:51 it kernel: [1686185.580020] md/raid:md0: device xvde operational as raid disk 2
Oct 2 15:08:51 it kernel: [1686185.588307] md/raid:md0: device xvdd operational as raid disk 1
Oct 2 15:08:51 it kernel: [1686185.595745] md/raid:md0: allocated 4312kB
Oct 2 15:08:51 it kernel: [1686185.600729] md/raid:md0: raid level 5 active with 3 out of 4 devices, algorithm 2
Oct 2 15:08:51 it kernel: [1686185.608928] md0: detected capacity change from 0 to 2705221484544
⋮
The array has been assembled, degraded, with xvdc, xvde, and xvdd. Apparently, there is also a hot spare:
Oct 2 15:08:51 it kernel: [1686185.615772] md: recovery of RAID array md0
Oct 2 15:08:51 it kernel: [1686185.621150] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Oct 2 15:08:51 it kernel: [1686185.627626] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct 2 15:08:51 it kernel: [1686185.634024] md0: unknown partition table
Oct 2 15:08:51 it kernel: [1686185.645882] md: using 128k window, over a total of 880605952k.
The 'partition table' message is unrelated. The other messages are telling you that md is attempting a recovery, probably onto a hot spare (which might be the device that failed out before, if you've attempted to remove/re-add it). You can confirm each member's role, as shown below.
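A quick way to check, assuming your array is /dev/md0; mdadm --detail lists each device's state (active, spare, rebuilding, faulty):

# mdadm --detail /dev/md0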
⋮
Oct 2 15:24:19 it kernel: [1687112.817845] end_request: I/O error, dev xvde, sector 881423360
Oct 2 15:24:19 it kernel: [1687112.820517] raid5_end_read_request: 1 callbacks suppressed
Oct 2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423360 on xvde).
Oct 2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Disk failure on xvde, disabling device.
Oct 2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Operation continuing on 2 devices.
Here md is attempting to read a sector from xvde (one of the three remaining devices). That read fails (a bad sector, probably), and since the array is already degraded, md cannot reconstruct the data from parity. It therefore kicks the disk out of the array, and with a double-disk failure, your RAID5 is dead.
I'm not sure why it's being labeled as a spare; that's weird (though I normally look at /proc/mdstat, so maybe that's just how mdadm labels it). Also, I thought newer kernels were much more hesitant to kick a disk out over bad blocks, but maybe you're running something older?
What can you do about this?
Good backups. That's always an important part of any strategy to keep data alive.
Make sure that the array gets scrubbed for bad blocks routinely. Your OS may already include a cron job for this; a sketch of one follows the example below. You trigger a scrub by echoing either repair or check to /sys/block/md0/md/sync_action. "Repair" will also repair any discovered parity errors (i.e., cases where the stored parity doesn't match the data on the disks).
# echo repair > /sys/block/md0/md/sync_action
#
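A minimal sketch of such a cron job, assuming the array is md0 (the file path and schedule here are my own; Debian-based systems ship a similar job that runs a checkarray script):

# /etc/cron.d/mdadm-scrub (hypothetical file): start a read-only scrub monthly
0 4 1 * * root echo check > /sys/block/md0/md/sync_action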
Progress can be watched with cat /proc/mdstat, or via the various files in that sysfs directory. (You can find somewhat up-to-date documentation at the Linux RAID Wiki's mdstat article.)
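For example, both of these are read-only and safe to poll while a scrub or rebuild runs (md0 assumed; on the kernels I've used, sync_completed reports sectors done versus total):

# watch cat /proc/mdstat
# cat /sys/block/md0/md/sync_completed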
NOTE: On older kernels (I'm not sure of the exact version), check may not fix bad blocks.
One final option is to switch to RAID6. This will require another disk (you can run a four- or even three-disk RAID6, though you probably don't want to). With new enough kernels, bad blocks are fixed on the fly when possible. RAID6 can survive two disk failures, so when one disk has failed it can still survive a bad block; it'll both map out the bad block and continue the rebuild.
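mdadm can do that reshape on a running array. A sketch, assuming the array is /dev/md0, the extra disk is /dev/sdf1, and you end up with five members; the backup-file path is my own choice, and it must not live on the array itself (the reshape will take a long time):

# mdadm --manage /dev/md0 --add /dev/sdf1
# mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-grow.bak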
So uh... I guess... well... the disks... shrank?
The area mdadm reserves for metadata by default probably grew... I've had some cases recently where mdadm wasted a whopping 128MiB for no apparent reason. You want to check mdadm --examine /dev/device* for the data offset entry. Ideally it should be no more than 2048 sectors.
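For example (a sketch; substitute your actual member devices for the glob):

# mdadm --examine /dev/sd[abcd]1 | grep -i 'data offset'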
If that is indeed the problem, you could use mdadm --create along with the --data-offset= parameter to make mdadm waste less space for metadata.
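A sketch of what that could look like; the device names, level, and member count are assumptions, and re-creating an array overwrites its metadata, so only do this with the exact original layout and drive order, and with backups. On the mdadm versions I've used, the value is in kibibytes unless suffixed, so 1024 here means 2048 sectors; check your man page:

# mdadm --create /dev/md0 --metadata=1.2 --level=5 --raid-devices=4 --data-offset=1024 /dev/sd[abcd]1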
If that's still not sufficient, you'd have to either try your luck with the old 0.90 metadata (which might be the most space-efficient, as it uses no such offset), or shrink the other side of the RAID a little (remember to shrink the LV / filesystem first; a sketch follows).
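A sketch of the shrink order, assuming ext4 on an LV that sits directly on the array; the names and sizes are placeholders, and the invariant is that the filesystem fits inside the LV and the LV inside the array at every step (shrinking ext4 requires the filesystem to be unmounted):

# resize2fs /dev/vg0/lv0 90G
# lvreduce -L 95G /dev/vg0/lv0
# mdadm --grow /dev/md0 --size=<new-per-disk-size-in-KiB>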
Best Answer
If the bitmap has not changed since the old disk was replaced by the new one, it should work to mark the disk as failed and remove it from the array. Then replace the disk and add the old one back to the array:
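Something like this, as a sketch; /dev/md0 and /dev/sdd1 are assumed names for the array and the member being swapped:

# mdadm --manage /dev/md0 --fail /dev/sdd1
# mdadm --manage /dev/md0 --remove /dev/sdd1
(physically swap the disk)
# mdadm --manage /dev/md0 --add /dev/sdd1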
I think that shutting down the machine and replacing the disks would also work, but the mdadm method has the advantage that the disks can be hot-plugged if supported by the machine.