First check the disks by running a SMART long self-test on each of them:
for i in a b c d; do
  smartctl -s on -t long /dev/sd$i  # enable SMART, then start a long (extended) self-test
done
It might take a few hours to finish, but you can check each drive's test status every few minutes, e.g.:
smartctl -l selftest /dev/sda
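If you'd rather not re-run that by hand, a small polling loop works too (a sketch, assuming the same four drives as above):

while true; do
  for i in a b c d; do
    echo "=== /dev/sd$i ==="
    smartctl -l selftest /dev/sd$i | head -n 8  # the most recent self-test is listed first
  done
  sleep 300  # poll every five minutes
done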
If the status of a disk reports "not completed" because of read errors, then this disk should be considered unsafe for md1 reassembly. After the self-tests finish, you can start trying to reassemble your array. Optionally, if you want to be extra cautious, move the disks to another machine before continuing (just in case of bad RAM/controller/etc.).
Recently, I had a case exactly like this one. One drive failed; I re-added it to the array, but during the rebuild 3 of 4 drives failed altogether. The contents of /proc/mdstat were the same as yours (maybe not in the same order):
md1 : inactive sdc2[2](S) sdd2[4](S) sdb2[1](S) sda2[0](S)
But I was lucky and reassembled the array with this:
mdadm --assemble /dev/md1 --scan --force
By looking at the --examine output you provided, I can tell the following scenario happened: sdd2 failed, you removed it and re-added it, so it became a spare drive trying to rebuild. But while it was rebuilding, sda2 failed and then sdb2 failed. So the event counter is bigger in sdc2 and sdd2, which were the last active drives in the array (although sdd didn't have the chance to rebuild, so it is the most outdated of all). Because of the differences in the event counters, --force will be necessary. So you could also try this:
mdadm --assemble /dev/md1 /dev/sd[abc]2 --force
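To compare the event counters yourself before forcing anything, something like this works (a sketch; the grep pattern matches the usual mdadm --examine output):

mdadm --examine /dev/sd[abcd]2 | grep -E '^/dev/|Events'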
To conclude, I think that if the above command fails, you should try to recreate the array like this:
mdadm --create /dev/md1 --assume-clean -l5 -n4 -c64 /dev/sd[abc]2 missing
If you do the --create, the missing part is important. Don't try to add a fourth drive to the array, because then reconstruction will begin and you will lose your data. Creating the array with a missing drive will not change its contents, and you'll have the chance to get a copy elsewhere (RAID5 doesn't work the same way as RAID1).
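Taking that copy could look like this (a sketch; /mnt and /backup are placeholder paths, and noload assumes an ext3/ext4 filesystem, preventing a journal replay on mount):

mount -o ro,noload /dev/md1 /mnt  # read-only mount, no journal replay
rsync -a /mnt/ /backup/           # copy everything off before any repair attempt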
If that fails to bring the array up, try the solution (a Perl script) described here: Recreating an array.
If you finally manage to bring the array up, the filesystem will be unclean and probably corrupted. If one disk fails during a rebuild, the array is expected to stop and freeze, not doing any writes to the other disks. In this case two disks failed; maybe the system was performing write requests that it wasn't able to complete, so there is some small chance you lost some data, but also a chance that you will never notice it :-)
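To gauge the damage without changing anything, a read-only check is a reasonable first step (a sketch, assuming an ext3 filesystem; -n answers "no" to every repair prompt):

fsck.ext3 -n -f /dev/md1  # report problems only, write nothing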
edit: some clarification added.
If you just lost one disk, you should have been able to recover from that using the very much safer --assemble.
You've run create so many times now that all the UUIDs are different. sdc1 and sdd1 share a UUID (expected, as that's your working array)... the rest of the disks share a name, but all have different UUIDs. So I'm guessing none of those are the original superblocks. Too bad...
Anyway, I'd guess you're either attempting to use the wrong disks, or you're trying to use the wrong chunk size (the default has changed over time, I believe). Your old array may have also used a different superblock version (that default has definitely changed), which could offset all the sectors (and also destroy some of the data). Finally, it's possible you're using the wrong layout, though that's less likely.
It's also possible that, since your test array was read-write (from an md standpoint), the attempts to use ext3 actually did some writes, e.g. a journal replay. But that's only if it found a superblock at some point, I think.
BTW: I think you really ought to be using --assume-clean, though of course a degraded array will not try to start rebuilding. Then you probably want to immediately set it read-only.
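A sketch of what that could look like (assumes the md1 device from the question; --readonly is mdadm's standard mode switch for this):

mdadm --readonly /dev/md1  # block all writes through md until you've copied your data off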
Best Answer
Using mdadm 3.3
Since mdadm 3.3 (released 2013, Sep 3), if you have a 3.2+ kernel, you can proceed as follows:
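A sketch of the one-step replace (device names match the explanation below; adjust md0, sdd1 and sdc1 to your setup):

mdadm /dev/md0 --replace /dev/sdd1 --with /dev/sdc1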
sdd1 is the device you want to replace, sdc1 is the preferred device to do so and must be declared as a spare on your array. The --with option is optional; if not specified, any available spare will be used.

Older mdadm version
Note: You still need a 3.2+ kernel.
First, add a new drive as a spare (replace md0 and sdc1 with your RAID and disk device, respectively):
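A sketch of the add step (--add registers the new device as a spare on a running array):

mdadm /dev/md0 --add /dev/sdc1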
Then, initiate a copy-replace operation like this (sdd1 being the failing device):
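A sketch of the sysfs trigger (the dev-sdd1 directory exists because sdd1 is a member of md0):

echo want_replacement > /sys/block/md0/md/dev-sdd1/state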
Result

The system will copy all readable blocks from sdd1 to sdc1. If it comes to an unreadable block, it will reconstruct it from parity. Once the operation is complete, the former spare (here: sdc1) will become active, and the failing drive will be marked as failed (F) so you can remove it.

Note: credit goes to frostschutz and Ansgar Esztermann, who found the original solution (see the duplicate question).
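Removing the failed drive afterwards could look like this (a sketch; --remove only succeeds once the device is marked failed):

mdadm /dev/md0 --fail /dev/sdd1    # usually already marked (F) by the copy-replace
mdadm /dev/md0 --remove /dev/sdd1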
Older kernels
Other answers suggest: