Mdadm – Accidentally ran “mdadm --create” on an existing RAID-1. The superblock is now corrupt and I am unable to recover data. Did I bork the data?

data-recovery mdadm raid1 superblock

I have /dev/sdb1 and /dev/sdc1 that were previously set up as a RAID-1 with mdadm, but then I reinstalled and lost the old configuration. Out of idiocy, I ran

sudo mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1

in an attempt to reconfigure the RAID. After I let the drives sync (oops?), now none of /dev/md0, /dev/sdb1, or /dev/sdc1 will mount. For /dev/md0, it complains about a bad magic number in the superblock. For /dev/sd{b,c}1, it complains about missing inodes.

In short, the question is this: Did I just bork all my data, or is it possible to recover the array still?

The following is the output of dumpe2fs for those partitions:

brent@codpiece:~$ sudo dumpe2fs /dev/md0 
dumpe2fs 1.42 (29-Nov-2011)
dumpe2fs: Bad magic number in super-block while trying to open /dev/md0
Couldn't find valid filesystem superblock.
brent@codpiece:~$ sudo dumpe2fs /dev/sdb1 
dumpe2fs 1.42 (29-Nov-2011)
Filesystem volume name:   <none>
Last mounted on:          /var/media
Filesystem UUID:          1462d79f-8a10-4590-8d63-3fcc105b601d
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              61054976
Block count:              244189984
Reserved block count:     12209499
Free blocks:              59069396
Free inodes:              60960671
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      965
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Wed Feb 10 21:04:42 2010
Last mount time:          Fri May 10 20:25:34 2013
Last write time:          Sun May 12 14:41:02 2013
Mount count:              189
Maximum mount count:      38
Last checked:             Wed Feb 10 21:04:42 2010
Check interval:           15552000 (6 months)
Next check after:         Mon Aug  9 22:04:42 2010
Lifetime writes:          250 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      7cd5ce46-b823-4453-aa66-00ddaff69952
Journal backup:           inode blocks
dumpe2fs: A block group is missing an inode table while reading journal inode

Edit:

It seems like @hauke-laging was correct that I created a 1.2 metadata version RAID-1 over what used to be a 1.0 metadata RAID. I've rerun mdadm --create with the right version, but now my filesystem is corrupt. Do I need to mess with the partition table, or can I simply run fsck /dev/md0?

The following is the new output of fsck and dumpe2fs:

brent@codpiece:~$ sudo fsck /dev/md0 
fsck from util-linux 2.20.1
e2fsck 1.42 (29-Nov-2011)
The filesystem size (according to the superblock) is 244189984 blocks
The physical size of the device is 244189952 blocks
Either the superblock or the partition table is likely to be corrupt!

brent@codpiece:~$ sudo dumpe2fs /dev/md0
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          1462d79f-8a10-4590-8d63-3fcc105b601d
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    (none)
Filesystem state:         not clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              61054976
Block count:              244189984
Reserved block count:     12209499
Free blocks:              240306893
Free inodes:              61054965
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      965
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Wed Feb 10 21:04:42 2010
Last mount time:          n/a
Last write time:          Mon May 13 10:38:58 2013
Mount count:              0
Maximum mount count:      38
Last checked:             Wed Feb 10 21:04:42 2010
Check interval:           15552000 (6 months)
Next check after:         Mon Aug  9 22:04:42 2010
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:           256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      7cd5ce46-b823-4453-aa66-00ddaff69952
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00215ad3
Journal start:            0


Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0x4453, unused inodes 0
  Primary superblock at 0, Group descriptors at 1-59
  Reserved GDT blocks at 60-1024
  Block bitmap at 1025 (+1025), Inode bitmap at 1041 (+1041)
  Inode table at 1057-1568 (+1057)
  23513 free blocks, 8181 free inodes, 2 directories
  Free blocks: 12576-12591, 12864-12879, <...>
  Free inodes: 
Group 1: (Blocks 32768-65535) [ITABLE_ZEROED]
  Checksum 0x348a, unused inodes 0
  Backup superblock at 32768, Group descriptors at 32769-32827
  Reserved GDT blocks at 32828-33792
  Block bitmap at 1026 (bg #0 + 1026), Inode bitmap at 1042 (bg #0 + 1042)
  Inode table at 1569-2080 (bg #0 + 1569)
  31743 free blocks, 8192 free inodes, 0 directories
  Free blocks: 43232-43239, 43264-43271, <...>
  Free inodes: 
Group 2: (Blocks 65536-98303) [ITABLE_ZEROED]
  Checksum 0x2056, unused inodes 0
  Block bitmap at 1027 (bg #0 + 1027), Inode bitmap at 1043 (bg #0 + 1043)
  Inode table at 2081-2592 (bg #0 + 2081)
  32768 free blocks, 8192 free inodes, 0 directories
  Free blocks: 66417-66432, 66445-66456, 66891, <...>
  Free inodes: 23921-24576
Group 3: (Blocks 98304-131071) [ITABLE_ZEROED]
  Checksum 0x4254, unused inodes 0
  Backup superblock at 98304, Group descriptors at 98305-98363
  Reserved GDT blocks at 98364-99328
  Block bitmap at 1028 (bg #0 + 1028), Inode bitmap at 1044 (bg #0 + 1044)
  Inode table at 2593-3104 (bg #0 + 2593)
  31743 free blocks, 8192 free inodes, 0 directories
  Free blocks: 99334-99339, 99438-99443, 99456-99459, <...>
  Free inodes: 24585-32768
Group 4: (Blocks 131072-163839) [ITABLE_ZEROED]
  Checksum 0x6a00, unused inodes 0
  Block bitmap at 1029 (bg #0 + 1029), Inode bitmap at 1045 (bg #0 + 1045)
  Inode table at 3105-3616 (bg #0 + 3105)
  32768 free blocks, 8192 free inodes, 0 directories
  Free blocks: 131074-131075, 131124-131129, <...>
  Free inodes: 32769-40960
Group 5: (Blocks 163840-196607) [ITABLE_ZEROED]
  Checksum 0x37e0, unused inodes 0
  Backup superblock at 163840, Group descriptors at 163841-163899
  Reserved GDT blocks at 163900-164864
  Block bitmap at 1030 (bg #0 + 1030), Inode bitmap at 1046 (bg #0 + 1046)
  Inode table at 3617-4128 (bg #0 + 3617)
  31743 free blocks, 8192 free inodes, 0 directories
  Free blocks: 164968-164970, 164979, <...>
  Free inodes: 40961-49152
Group 6: (Blocks 196608-229375) [ITABLE_ZEROED]
  <...>

Best Answer

Have a look at this question. I assume it is similar to your problem.

Recreating and even syncing a RAID-1 should not destroy data. Evidently the MD device now starts at a different offset, so there is data where mount looks for a superblock. This can happen in at least two ways:

You (or rather: the default setting) have created the new array with a different superblock format (see --metadata in man mdadm). Thus the superblock is in another position (or has a different size) now. Do you happen to know what the old metadata format was?
The offset has changed even with the same format due to a different default offset. See mdadm --examine /dev/sdb1 (add the output to your question).

You should look for a superblock in the first area of the disks (/dev/sdb1). Maybe this can be done with parted or similar tools. You may have to delete the respective partitions for that though (no problem, as you can easily back up and restore the partition table).

Or you create loop devices / DM devices with increasing offsets (not necessarily over the whole disk, a few MiB are enough) and try dumpe2fs -h on them. If you want to do this but don't know how, then I can provide some shell code for that.
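What follows is not the shell code offered above, only a minimal sketch of the same idea under stated assumptions. Instead of attaching a loop device at every offset, it first checks sector-aligned offsets for the ext magic number (0xEF53, stored as the bytes 53 ef at 1080 bytes into a filesystem) and only then probes a candidate with a read-only loop device. The function names find_ext_magic and probe_offset are hypothetical, and a magic match alone can be a false positive, so dumpe2fs -h remains the real test.

```shell
#!/bin/sh
# Hypothetical helper: print each offset (in bytes) below $2 on device or
# image $1, stepping by $3 bytes (default 512), where a filesystem starting
# at that offset would have the ext2/3/4 magic 0xEF53. The magic is stored
# little-endian (bytes 53 ef) 1080 bytes after the start of the filesystem.
find_ext_magic() {
    dev=$1
    limit=$2
    step=${3:-512}
    off=0
    while [ "$off" -lt "$limit" ]; do
        magic=$(dd if="$dev" bs=1 skip=$((off + 1080)) count=2 2>/dev/null |
            od -An -tx1 | tr -d ' \n')
        if [ "$magic" = "53ef" ]; then
            echo "$off"
        fi
        off=$((off + step))
    done
}

# Hypothetical helper: attach a read-only loop device at byte offset $2 of
# $1, show the superblock summary, then detach again. Needs root.
probe_offset() {
    loop=$(losetup --find --show --read-only --offset "$2" "$1") || return 1
    dumpe2fs -h "$loop"
    losetup --detach "$loop"
}
```

For example, find_ext_magic /dev/sdb1 $((4 * 1024 * 1024)) would scan the first 4 MiB, and each printed offset could then be passed to probe_offset /dev/sdb1 OFFSET. Both helpers only read from the device.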

The worst case would be that the new MD superblock has overwritten the file system superblock. In that case you may search for superblock copies (see the output of mke2fs). A mke2fs run on a dummy device of the same size may tell you the positions of the superblock copies.
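As a rough cross-check that needs no dummy device, the backup positions can also be computed by hand. The sketch below assumes the sparse_super feature (listed in the dumpe2fs output in the question), under which backups live in block group 1 and in groups that are powers of 3, 5 and 7, and it assumes a first block of 0 as shown in that output. The function name backup_superblocks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the block numbers of backup superblocks for a
# filesystem with $1 blocks in total and $2 blocks per group, assuming
# sparse_super and a first block of 0.
backup_superblocks() {
    total_blocks=$1
    per_group=$2
    groups=$(( (total_blocks + per_group - 1) / per_group ))
    {
        echo 1
        for base in 3 5 7; do
            g=$base
            while [ "$g" -lt "$groups" ]; do
                echo "$g"
                g=$((g * base))
            done
        done
    } | sort -n | while read -r g; do
        echo $((g * per_group))
    done
}

# "Block count" and "Blocks per group" from the dumpe2fs output above.
backup_superblocks 244189984 32768
```

The first three values printed are 32768, 98304 and 163840, which agree with the "Backup superblock at" lines in the dumpe2fs output. One of them could then be tried without writing anything, for instance with e2fsck -n -b 32768 -B 4096 /dev/md0. mke2fs -n on a device of the same size, as suggested above, reports the same positions without creating a filesystem.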

Edit 1:

Now I have read and understood your dumpe2fs output. Your old RAID-1 had its superblock at the end (0.9 or 1.0). Now you probably have 1.2, so a part of your file system has been overwritten. I cannot assess how big the damage may be. This is a case for e2fsck. But first you should reset the RAID to its old type. It would help to know the old version.

You can reduce the risk by putting DM devices over the complete /dev/sdb1 and /dev/sdc1, creating snapshots for them (with dmsetup directly) and creating the new array over the snapshots. That way the relevant parts of your disks are not written. From the dumpe2fs output we know that the MD device must be 1000202174464 bytes in size. This should be checked immediately after a test creation.
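The size figure follows directly from the block count and block size in the dumpe2fs output. The short sketch below redoes that arithmetic and also shows how far the 244189952-block device reported by e2fsck in the question falls short. The variable names are illustrative only.

```shell
#!/bin/sh
# Values copied from the dumpe2fs and fsck output in the question.
fs_blocks=244189984      # "Block count"
block_size=4096          # "Block size"
md_blocks=244189952      # "physical size of the device" reported by e2fsck

fs_bytes=$((fs_blocks * block_size))
missing_bytes=$(( (fs_blocks - md_blocks) * block_size ))

echo "filesystem needs: $fs_bytes bytes ($((fs_bytes / 512)) sectors)"
echo "re-created array is short by: $missing_bytes bytes"

# On the real system the array size could be compared like this (needs root):
#   [ "$(blockdev --getsize64 /dev/md0)" = "$fs_bytes" ] && echo "size matches"
```

The shortfall works out to 32 blocks, or 128 KiB, which suggests that the re-created array reserves more space for metadata than the original one did.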

Related Solutions

Ubuntu – mdadm – RAID5 array size vs. actual disk size mismatch

fdisk is the wrong tool for disks >2TB. Use parted or gdisk instead.

It appears that /dev/sdc1 and /dev/sdd1 are 2TB partitions, so that's what limits your array size. For the other disks, they have GPT so I assume they are 3TB already, but you should check.

Basically you have to stop the array, enlarge each partition to 3TB (without changing the starting offset), then start it again and follow it up with a grow:

mdadm --grow /dev/md0 --size=max

If you can't stop the array, you'll have to fail each 2TB partition individually, repartition and re-add it. This might go faster if you add a write-intent bitmap first.

mdadm --grow /dev/md0 --bitmap=internal

Then for each disk individually,

mdadm /dev/md0 --fail /dev/disk1 # check mdstat for [UUUU] first
mdadm /dev/md0 --remove /dev/disk1
parted /dev/disk -- mklabel gpt mkpart primary 1mib -1mib
mdadm /dev/md0 --re-add /dev/disk1
mdadm --wait /dev/md0 # must wait for sync

Once that's done you can remove the bitmap again (keeping it may harm performance).

mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --size=max

Finally do your resize2fs or whatever.

Should I use `mdadm --create` to recover the RAID?

There is nothing wrong with --create - if you know what you are doing.

The only problem is: You don't know.

When you create a RAID, the command is usually something short, like:

mdadm --create /dev/md42 --level=5 --raid-devices=3 /dev/sdx1 /dev/sdy1 /dev/sdz1

Dead simple, right?

Except it isn't, really. RAID has a lot more variables. There's a data offset, a chunksize, a metadata version, and let's not forget the drive order which is easy to get wrong on a re-create, as drive letters may change over time.

Here's what a proper --create command might look like instead:

mdadm --create /dev/md42 --assume-clean \
    --level=5 --chunk=512K --metadata=1.2 --data-offset=2048s \
    --raid-devices=3 /dev/sdz1 missing /dev/sdy1

And whatever that gives you, you should test it read-only. And that may not be everything. Did you know that there are several different RAID layouts, too? --create is the very last resort and the pitfalls are not obvious. Ideally, you should back up all disks, or at least the metadata areas, or operate on a copy-on-write overlay.
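One way to get such a copy-on-write overlay is a device-mapper snapshot backed by a sparse file, so that every write lands in the file instead of on the disk. This is a minimal sketch, not a vetted procedure: the helper names snapshot_table and make_overlay, the /tmp location and the 4G size of the COW file are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print a device-mapper table line that layers a
# copy-on-write snapshot over origin device $1, storing all writes on COW
# device $2. $3 is the origin size in 512-byte sectors.
# "P" makes the snapshot persistent; 8 is the COW chunk size in sectors.
snapshot_table() {
    echo "0 $3 snapshot $1 $2 P 8"
}

# Hypothetical helper: create /dev/mapper/$2 as an overlay on top of $1.
# Needs root; the 4G sparse COW file in /tmp is an arbitrary choice.
make_overlay() {
    origin=$1
    name=$2
    cow_file="/tmp/$name.cow"
    truncate -s 4G "$cow_file"
    cow_dev=$(losetup --find --show "$cow_file") || return 1
    sectors=$(blockdev --getsz "$origin")
    snapshot_table "$origin" "$cow_dev" "$sectors" | dmsetup create "$name"
}
```

After something like make_overlay /dev/sdz1 sdz1-overlay, the --create experiment would be pointed at /dev/mapper/sdz1-overlay rather than at /dev/sdz1, and a wrong guess can be discarded by removing the mapping and the COW file.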

For anything you do not provide, mdadm uses default settings. Unfortunately those are not set in stone: basically all of them have changed in the past, and they are likely to change again in the future.

So when you use --create for recovery, you have to understand RAID really well, and you need to know what your old RAID looked like exactly. And then you have to add --assume-clean or leave one of the disks as missing, just in case you made a mistake anyway. You should also make a backup, at the very least of the beginning and end of the disk so you can recover from metadata written to the wrong location.
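The backup of the beginning and end of a disk can be sketched with dd. The helper name backup_ends, the output file names and the 16 MiB default are assumptions; how much to save depends on where the metadata of the formats involved can end up.

```shell
#!/bin/sh
# Hypothetical helper: copy the first and last $3 MiB (default 16) of
# device or image $1 into directory $2 as head.img and tail.img.
backup_ends() {
    dev=$1
    out=$2
    mib=${3:-16}
    if [ -b "$dev" ]; then
        sectors=$(blockdev --getsz "$dev")        # size in 512-byte sectors
    else
        sectors=$(( $(wc -c < "$dev") / 512 ))    # regular file, e.g. an image
    fi
    count=$((mib * 2048))                         # MiB -> 512-byte sectors
    [ "$count" -le "$sectors" ] || count=$sectors
    dd if="$dev" of="$out/head.img" bs=512 count="$count" 2>/dev/null &&
    dd if="$dev" of="$out/tail.img" bs=512 skip=$((sectors - count)) \
        count="$count" 2>/dev/null
}
```

A call such as backup_ends /dev/sdx1 /root/raid-backup would be made once per member device, each with its own output directory, before any --create attempt.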

In most cases you have other options. --assemble --force is one, but it has its own pitfalls. You should --examine first and if one of the drives is more outdated than the others, you should not include that in the assembly. There is also --build as well as dmsetup for raid which does not use metadata, and might let you access your data. That doesn't mean it is safe, however - you write on it, you lose data if the settings you picked are wrong.
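To compare members before an --assemble --force, it can help to reduce each --examine report to the handful of fields that matter. The filter below is a sketch: the name examine_summary is hypothetical, and the field labels are assumed to match what mdadm prints for 1.x metadata, so they may need adjusting for other versions.

```shell
#!/bin/sh
# Hypothetical helper: read `mdadm --examine` output on stdin and print only
# the fields that matter when comparing members or re-creating an array.
# The field names are assumed to match what mdadm prints for 1.x metadata.
examine_summary() {
    grep -E '^ *(Version|Raid Level|Raid Devices|Data Offset|Chunk Size|Layout|Device Role|Update Time|Events) :' |
        sed 's/^ *//'
}

# Possible use (needs root), one summary per member device:
#   for d in /dev/sdx1 /dev/sdy1 /dev/sdz1; do
#       echo "== $d"; mdadm --examine "$d" | examine_summary
#   done
```

A member whose Events count and Update Time lag behind the others is the outdated one that should be left out of the assembly. Saving these summaries somewhere off the array also documents the offset, chunk size, metadata version and device order for later.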

In general, data recovery is a wide field. You need experience in order to be able to decide on the correct course of action. Avoid the issue if possible; make backups, document your setup, and monitor your disks so your RAID does not die in the first place.
