First off, drive re-lettering just happens sometimes, depending on how your machine is set up. Drive letters haven't been guaranteed to be stable across reboots for quite a while now, so your drive moving on you is not a big concern by itself.
Assuming dmraid and device-mapper aren't using your devices:
Running mdadm --stop /dev/md0
should take care of your busy messages; that's most likely why it's complaining. Then you can try your assemble line again. If that doesn't work, --stop again, followed by an assemble with --run
(without --run, --assemble --scan won't start a degraded array). Then you can remove and re-add your failed disk to let it attempt a rebuild.
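The sequence above would look roughly like this (a sketch only; /dev/md0 and /dev/sde1 are taken from your output, so adjust to whatever your system currently shows):

```shell
# Stop the partially-assembled array to clear the "busy" errors
mdadm --stop /dev/md0

# Try a normal assemble first
mdadm --assemble --scan

# If it refuses to start because the array is degraded, stop it
# again and tell mdadm to start it anyway with the members it has
mdadm --stop /dev/md0
mdadm --assemble --scan --run

# Then remove and re-add the stale disk to trigger a rebuild
mdadm /dev/md0 --remove /dev/sde1
mdadm /dev/md0 --re-add /dev/sde1
```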
/dev/sde is outdated (look at the events counter). The others look OK at first glance, so I think you actually have a pretty good chance of no difficulties.
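You can see the events counters yourself with --examine (the sd[a-m]1 glob here is an assumption based on your device list; substitute your actual members):

```shell
# Print each member's device name and Events counter; an outdated
# disk such as sde1 will show a lower value than the rest
mdadm --examine /dev/sd[a-m]1 | egrep '/dev/sd|Events'
```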
You shouldn't zero any superblocks yet; the risk of data loss is far too high. If --run doesn't work, you'll want to find someone locally (or who can ssh in) who knows what they're doing to attempt a fix.
In response to Update 1
That "not enough to start the array" is never a good message to get from mdadm. It means mdadm has found 10 drives out of your 12-drive RAID5 array, and, as I hope you're aware, RAID5 can only survive one failure, not two.
Well, let's try and piece together what happened. First, over reboot, there was a drive letter change, which is annoying for us trying to figure it out, but mdraid doesn't care about that. Reading through your mdadm output, here is the remap that happened (sorted by the raid disk #):
00 sdh1 -> sdb1
02 sdk1 -> sde1 [OUTDATED]
03 sdg1 -> sda1
04 sdf1 -> sdm1
05 sdd1 -> sdk1
06 sdm1 -> sdg1
07 sdc1 -> sdj1
08 sdi1 -> sdc1
09 sde1 -> sdl1
10 sdj1 -> sdd1
11 sdl1 -> sdf1
13 sdb1 -> sdi1 [SPARE]
#02 has a lower 'events' counter than the others. That means it left the array at some point.
It'd be nice if you know some of the history of this array—e.g., is "12-drive RAID5, 1 hot spare" correct?
I'm not quite sure what the sequence of failures that led up to this is, though. It appears that at some point, device #1 failed, and a rebuild onto device #12 started.
But I can't make out exactly what happened next. Maybe you have logs—or an administrator to ask. Here is what I can't explain:
Somehow, #12 became #13. Somehow, #2 became #12.
That rebuild onto #12 should have finished, after which #12 would have become #1. Maybe it didn't; perhaps #2 failed during the rebuild (which would explain why it never finished), and someone tried removing and re-adding #2, which would make it #12. Removing and re-adding the spare would then make it #13. At that point, of course, you'd have had a two-disk failure, which at least makes the sequence make sense.
If this is what has happened, you've suffered a two-disk failure. That means you've lost data. What you do next depends on how important that data is (and on how good your backups are).
If the data is very valuable (and you don't have good backups), contact data recovery specialists. Otherwise:
If the data is valuable enough, you should use dd
to image all the disks involved (you can use larger disks with an image file per member to save money; 2 or 3 TB externals, for example). Then make a copy of the images and work on recovery using that copy (you can use loop devices to do this).
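A rough sketch of the imaging step (device and mount paths are examples; repeat per member disk):

```shell
# Image a member disk to a file on a larger scratch disk.
# conv=noerror,sync keeps going past read errors, padding bad
# sectors with zeros so offsets stay aligned
dd if=/dev/sdb of=/mnt/scratch/sdb.img bs=1M conv=noerror,sync

# Work on a copy of the image, never the original capture
cp /mnt/scratch/sdb.img /mnt/scratch/sdb.work.img

# Attach the working copy as a loop device for recovery attempts
losetup --find --show /mnt/scratch/sdb.work.img
```

If a disk has many read errors, ddrescue is generally a better imaging tool than plain dd, since it retries and logs bad regions.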
Obtain more spares. You probably have one dead disk, and at least a few questionable ones; smartctl
may be able to tell you more.
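For example (device names are placeholders; run this against each member):

```shell
# Overall SMART health verdict
smartctl -H /dev/sdb

# Attributes that most often indicate a failing disk
smartctl -A /dev/sdb | egrep 'Reallocated|Pending|Uncorrectable'

# Optionally start a long self-test and check results later
smartctl -t long /dev/sdb
```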
Next, add --force
to your --assemble
line. This makes mdadm use the outdated disk anyway, which means some sectors will now hold outdated data and some won't. Add in one of the new disks as a spare and let the rebuild finish. Hopefully you don't hit any bad blocks (which would cause the rebuild to fail; I believe the only answer then is to make the disk map them out). Next, run fsck -f
on the array. There will probably be errors. Once they're fixed, mount the filesystem and see what shape your data is in.
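Putting those steps together, the recovery attempt might look like this (the member glob matches the twelve devices from your output, and /dev/sdn1 is a hypothetical fresh spare; adjust both to your system):

```shell
# Force-assemble, letting mdadm use the outdated member
mdadm --stop /dev/md0
mdadm --assemble /dev/md0 --force /dev/sd[a-g]1 /dev/sd[i-m]1

# Add a fresh disk as a spare and watch the rebuild progress
mdadm /dev/md0 --add /dev/sdn1
watch cat /proc/mdstat

# Once the rebuild completes, check the filesystem before mounting
fsck -f /dev/md0
```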
Recommendations
In the future, do not build 12-disk RAID5 arrays; the probability of a two-disk failure is too high. Use RAID6 or RAID10 instead. Also, make sure to routinely scrub your arrays for bad blocks (echo check > /sys/block/md0/md/sync_action
).
Best Answer
The partition table is really just a piece of data that says things like "partition 1 spans tracks 10->99, partition 2 spans 100->599, partition 3 spans 600->16383".
If you delete the partition table then the data in tracks 10->99, 100->599, and 600->16383 is untouched; the OS just no longer knows how to find it. So if you then recreate the partition table exactly the same way, your data is still available. I made use of this in 2016 when I destroyed all my partition tables by mistake ( https://www.sweharris.org/post/2016-02-10-break-mbr/ )
If you want to delete the data inside the partitions as well then you either need to zero the whole disk, or else format the partitions. Most installers have an option to say "format partition" when you do your setup.
Or make sure your partitions start in different places (e.g., 9->100, 101->600, 601->16384) so the data inside doesn't look like a filesystem.