Linux – RAID 5 with 4 disks fails to operate with one failed disk

linux, mdadm, raid

I found a question about mdadm spare disks which almost answers my question, but it isn't clear to me what is happening.

We have a RAID5 array set up with 4 disks, and in normal operation all of them are labeled active/sync:

    Update Time : Sun Sep 29 03:44:01 2013
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

Number   Major   Minor   RaidDevice State
   0     202       32        0      active sync   /dev/sdc
   1     202       48        1      active sync   /dev/sdd
   2     202       64        2      active sync   /dev/sde 
   4     202       80        3      active sync   /dev/sdf

But then when one of the disks failed, the RAID stopped working:

    Update Time : Sun Sep 29 01:00:01 2013
          State : clean, FAILED 
 Active Devices : 2
Working Devices : 3
 Failed Devices : 1
  Spare Devices : 1

Number   Major   Minor   RaidDevice State
   0     202       32        0      active sync   /dev/sdc
   1     202       48        1      active sync   /dev/sdd
   2       0        0        2      removed
   3       0        0        3      removed

   2     202       64        -      faulty spare   /dev/sde
   4     202       80        -      spare   /dev/sdf

What is really going on here??

The fix was to rebuild the RAID from scratch; luckily I could do that. Next time it'll probably have some serious data on it. I need to understand this so I can have a RAID that won't die because of a single disk failure.

I realized I didn't list what I expected vs. what happened.

I expect that a RAID5 with 3 good disks and 1 bad will operate in a degraded mode – 3 active/sync and 1 faulty.

What happened was a spare was created out of thin air and declared faulty – then a new spare was also created out of thin air and declared sound – after which the RAID was declared inoperative.
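
For reference, this is roughly what I'd expect a degraded-but-still-working RAID5 to look like in /proc/mdstat (illustrative output only; the chunk size, device ordering, and block count are assumptions, not taken from this system):

$ cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 xvdf[4] xvde[2](F) xvdd[1] xvdc[0]
      2641817856 blocks level 5, 512k chunk, algorithm 2 [4/3] [UU_U]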

This is the output from blkid:

$ blkid
/dev/xvda1: LABEL="/" UUID="4797c72d-85bd-421a-9c01-52243aa28f6c" TYPE="ext4" 
/dev/xvdc: UUID="feb2c515-6003-478b-beb0-089fed71b33f" TYPE="ext3" 
/dev/xvdd: UUID="feb2c515-6003-478b-beb0-089fed71b33f" SEC_TYPE="ext2" TYPE="ext3" 
/dev/xvde: UUID="feb2c515-6003-478b-beb0-089fed71b33f" SEC_TYPE="ext2" TYPE="ext3" 
/dev/xvdf: UUID="feb2c515-6003-478b-beb0-089fed71b33f" SEC_TYPE="ext2" TYPE="ext3" 

The TYPE and SEC_TYPE values are interesting, since the RAID holds XFS, not ext3…
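
If it matters, the md superblocks on the members can also be read directly with mdadm --examine (just a diagnostic sketch; device names as above):

$ sudo mdadm --examine /dev/xvdc    # repeat for xvdd, xvde, xvdf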

The logs for a mount attempt on this array (which ended in the failed state shown earlier, as every other mount attempt did) contain these entries:

Oct  2 15:08:51 it kernel: [1686185.573233] md/raid:md0: device xvdc operational as raid disk 0
Oct  2 15:08:51 it kernel: [1686185.580020] md/raid:md0: device xvde operational as raid disk 2
Oct  2 15:08:51 it kernel: [1686185.588307] md/raid:md0: device xvdd operational as raid disk 1
Oct  2 15:08:51 it kernel: [1686185.595745] md/raid:md0: allocated 4312kB
Oct  2 15:08:51 it kernel: [1686185.600729] md/raid:md0: raid level 5 active with 3 out of 4 devices, algorithm 2
Oct  2 15:08:51 it kernel: [1686185.608928] md0: detected capacity change from 0 to 2705221484544
Oct  2 15:08:51 it kernel: [1686185.615772] md: recovery of RAID array md0
Oct  2 15:08:51 it kernel: [1686185.621150] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct  2 15:08:51 it kernel: [1686185.627626] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  2 15:08:51 it kernel: [1686185.634024]  md0: unknown partition table
Oct  2 15:08:51 it kernel: [1686185.645882] md: using 128k window, over a total of 880605952k.
Oct  2 15:22:25 it kernel: [1686999.697076] XFS (md0): Mounting Filesystem
Oct  2 15:22:26 it kernel: [1686999.889961] XFS (md0): Ending clean mount
Oct  2 15:24:19 it kernel: [1687112.817845] end_request: I/O error, dev xvde, sector 881423360
Oct  2 15:24:19 it kernel: [1687112.820517] raid5_end_read_request: 1 callbacks suppressed
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423360 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Disk failure on xvde, disabling device.
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Operation continuing on 2 devices.
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423368 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423376 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423384 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423392 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423400 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423408 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423416 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423424 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423432 on xvde).
Oct  2 15:24:19 it kernel: [1687113.432129] md: md0: recovery done.
Oct  2 15:24:19 it kernel: [1687113.685151] Buffer I/O error on device md0, logical block 96
Oct  2 15:24:19 it kernel: [1687113.691386] Buffer I/O error on device md0, logical block 96
Oct  2 15:24:19 it kernel: [1687113.697529] Buffer I/O error on device md0, logical block 64
Oct  2 15:24:20 it kernel: [1687113.703589] Buffer I/O error on device md0, logical block 64
Oct  2 15:25:51 it kernel: [1687205.682022] Buffer I/O error on device md0, logical block 96
Oct  2 15:25:51 it kernel: [1687205.688477] Buffer I/O error on device md0, logical block 96
Oct  2 15:25:51 it kernel: [1687205.694591] Buffer I/O error on device md0, logical block 64
Oct  2 15:25:52 it kernel: [1687205.700728] Buffer I/O error on device md0, logical block 64
Oct  2 15:25:52 it kernel: [1687205.748751] XFS (md0): last sector read failed

I don't see xvdf listed there…
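
If it helps, a quick way to check what md currently thinks xvdf is doing (diagnostic sketch; output will vary):

$ cat /proc/mdstat
$ dmesg | grep xvdf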

Best Answer

This is a fundamental problem with RAID5—bad blocks on rebuild are a killer.

Oct  2 15:08:51 it kernel: [1686185.573233] md/raid:md0: device xvdc operational as raid disk 0
Oct  2 15:08:51 it kernel: [1686185.580020] md/raid:md0: device xvde operational as raid disk 2
Oct  2 15:08:51 it kernel: [1686185.588307] md/raid:md0: device xvdd operational as raid disk 1
Oct  2 15:08:51 it kernel: [1686185.595745] md/raid:md0: allocated 4312kB
Oct  2 15:08:51 it kernel: [1686185.600729] md/raid:md0: raid level 5 active with 3 out of 4 devices, algorithm 2
Oct  2 15:08:51 it kernel: [1686185.608928] md0: detected capacity change from 0 to 2705221484544
⋮

The array has been assembled, degraded, using xvdc, xvde, and xvdd. Apparently there is also a hot spare:

Oct  2 15:08:51 it kernel: [1686185.615772] md: recovery of RAID array md0
Oct  2 15:08:51 it kernel: [1686185.621150] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct  2 15:08:51 it kernel: [1686185.627626] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Oct  2 15:08:51 it kernel: [1686185.634024]  md0: unknown partition table
Oct  2 15:08:51 it kernel: [1686185.645882] md: using 128k window, over a total of 880605952k.

The 'partition table' message is unrelated. The other messages are telling you that md is attempting a recovery, probably onto a hot spare (which might be the device that failed out before, if you've attempted to remove and re-add it).
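
If you want to confirm which device md is rebuilding onto, both /proc/mdstat and mdadm --detail will show it (a sketch, not output from this system):

# cat /proc/mdstat           # the rebuild target shows a "recovery =" progress line
# mdadm --detail /dev/md0    # the target is listed with the state "spare rebuilding"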

⋮
Oct  2 15:24:19 it kernel: [1687112.817845] end_request: I/O error, dev xvde, sector 881423360
Oct  2 15:24:19 it kernel: [1687112.820517] raid5_end_read_request: 1 callbacks suppressed
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: read error not correctable (sector 881423360 on xvde).
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Disk failure on xvde, disabling device.
Oct  2 15:24:19 it kernel: [1687112.821837] md/raid:md0: Operation continuing on 2 devices.

And here md is attempting to read a sector from xvde (one of the three remaining devices). That read fails [a bad sector, probably], and since the array is already degraded, md cannot reconstruct the data from parity. It therefore kicks the disk out of the array, and with a double-disk failure, your RAID5 is dead.

I'm not sure why it's being labeled as a spare—that's weird (though I guess I normally look at /proc/mdstat, so maybe that's just how mdadm labels it). Also, I thought newer kernels were much more hesitant to kick a disk out for bad blocks—but maybe you're running something older?

What can you do about this?

Good backups. That's always an important part of any strategy to keep data alive.

Make sure the array gets scrubbed for bad blocks routinely. Your OS may already include a cron job for this. You trigger a scrub by echoing either repair or check to /sys/block/md0/md/sync_action. "Repair" will also repair any discovered parity errors (i.e., cases where the parity does not match the data on the disks).

# echo repair > /sys/block/md0/md/sync_action
#

Progress can be watched with cat /proc/mdstat, or via the various files in that sysfs directory. (You can find somewhat up-to-date documentation in the Linux Raid Wiki mdstat article.)
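
For example (a sketch; mismatch_cnt is where the parity-mismatch count ends up on reasonably recent kernels):

# echo check > /sys/block/md0/md/sync_action    # read-only scrub, as opposed to repair
# cat /proc/mdstat                              # shows scrub progress
# cat /sys/block/md0/md/sync_action             # "check", "repair", or "idle"
# cat /sys/block/md0/md/mismatch_cnt            # non-zero means parity mismatches were found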

NOTE: On older kernels—I'm not sure of the exact version—check may not fix bad blocks.

One final option is to switch to RAID6. This will require another disk (while you can run a four- or even three-disk RAID6, you probably don't want to). With new enough kernels, bad blocks are fixed on the fly whenever possible. RAID6 can survive two disk failures, so when one disk has failed it can still survive a bad block—it will map out the bad block and continue the rebuild.
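
If you go that route, mdadm can reshape an existing RAID5 into RAID6 in place once the extra disk has been added (a sketch with an assumed device name for the new disk; the reshape is slow and you want a verified backup before starting):

# mdadm /dev/md0 --add /dev/xvdg     # xvdg is the assumed new disk
# mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-reshape-backup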
