Reassemble mdadm-raid5


A friend of mine has an mdadm raid5 with 9 disks that no longer assembles.

After having a look at the syslog I found that the disk sdi had been kicked from the array:

Jul  6 08:43:25 nasty kernel: [   12.952194] md: bind<sdc>
Jul  6 08:43:25 nasty kernel: [   12.952577] md: bind<sdd>
Jul  6 08:43:25 nasty kernel: [   12.952683] md: bind<sde>
Jul  6 08:43:25 nasty kernel: [   12.952784] md: bind<sdf>
Jul  6 08:43:25 nasty kernel: [   12.952885] md: bind<sdg>
Jul  6 08:43:25 nasty kernel: [   12.952981] md: bind<sdh>
Jul  6 08:43:25 nasty kernel: [   12.953078] md: bind<sdi>
Jul  6 08:43:25 nasty kernel: [   12.953169] md: bind<sdj>
Jul  6 08:43:25 nasty kernel: [   12.953288] md: bind<sda>
Jul  6 08:43:25 nasty kernel: [   12.953308] md: kicking non-fresh sdi from array!
Jul  6 08:43:25 nasty kernel: [   12.953314] md: unbind<sdi>
Jul  6 08:43:25 nasty kernel: [   12.960603] md: export_rdev(sdi)
Jul  6 08:43:25 nasty kernel: [   12.969675] raid5: device sda operational as raid disk 0
Jul  6 08:43:25 nasty kernel: [   12.969679] raid5: device sdj operational as raid disk 8
Jul  6 08:43:25 nasty kernel: [   12.969682] raid5: device sdh operational as raid disk 6
Jul  6 08:43:25 nasty kernel: [   12.969684] raid5: device sdg operational as raid disk 5
Jul  6 08:43:25 nasty kernel: [   12.969687] raid5: device sdf operational as raid disk 4
Jul  6 08:43:25 nasty kernel: [   12.969689] raid5: device sde operational as raid disk 3
Jul  6 08:43:25 nasty kernel: [   12.969692] raid5: device sdd operational as raid disk 2
Jul  6 08:43:25 nasty kernel: [   12.969694] raid5: device sdc operational as raid disk 1
Jul  6 08:43:25 nasty kernel: [   12.970536] raid5: allocated 9542kB for md127
Jul  6 08:43:25 nasty kernel: [   12.973975] 0: w=1 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul  6 08:43:25 nasty kernel: [   12.973980] 8: w=2 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul  6 08:43:25 nasty kernel: [   12.973983] 6: w=3 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul  6 08:43:25 nasty kernel: [   12.973986] 5: w=4 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul  6 08:43:25 nasty kernel: [   12.973989] 4: w=5 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul  6 08:43:25 nasty kernel: [   12.973992] 3: w=6 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul  6 08:43:25 nasty kernel: [   12.973996] 2: w=7 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul  6 08:43:25 nasty kernel: [   12.973999] 1: w=8 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul  6 08:43:25 nasty kernel: [   12.974002] raid5: raid level 5 set md127 active with 8 out of 9 devices, algorithm 2

Unfortunately this went unnoticed, and on a later boot another drive (sde) was kicked as well:

Jul 14 08:02:45 nasty kernel: [   12.918556] md: bind<sdc>
Jul 14 08:02:45 nasty kernel: [   12.919043] md: bind<sdd>
Jul 14 08:02:45 nasty kernel: [   12.919158] md: bind<sde>
Jul 14 08:02:45 nasty kernel: [   12.919260] md: bind<sdf>
Jul 14 08:02:45 nasty kernel: [   12.919361] md: bind<sdg>
Jul 14 08:02:45 nasty kernel: [   12.919461] md: bind<sdh>
Jul 14 08:02:45 nasty kernel: [   12.919556] md: bind<sdi>
Jul 14 08:02:45 nasty kernel: [   12.919641] md: bind<sdj>
Jul 14 08:02:45 nasty kernel: [   12.919756] md: bind<sda>
Jul 14 08:02:45 nasty kernel: [   12.919775] md: kicking non-fresh sdi from array!
Jul 14 08:02:45 nasty kernel: [   12.919781] md: unbind<sdi>
Jul 14 08:02:45 nasty kernel: [   12.928177] md: export_rdev(sdi)
Jul 14 08:02:45 nasty kernel: [   12.928187] md: kicking non-fresh sde from array!
Jul 14 08:02:45 nasty kernel: [   12.928198] md: unbind<sde>
Jul 14 08:02:45 nasty kernel: [   12.936064] md: export_rdev(sde)
Jul 14 08:02:45 nasty kernel: [   12.943900] raid5: device sda operational as raid disk 0
Jul 14 08:02:45 nasty kernel: [   12.943904] raid5: device sdj operational as raid disk 8
Jul 14 08:02:45 nasty kernel: [   12.943907] raid5: device sdh operational as raid disk 6
Jul 14 08:02:45 nasty kernel: [   12.943909] raid5: device sdg operational as raid disk 5
Jul 14 08:02:45 nasty kernel: [   12.943911] raid5: device sdf operational as raid disk 4
Jul 14 08:02:45 nasty kernel: [   12.943914] raid5: device sdd operational as raid disk 2
Jul 14 08:02:45 nasty kernel: [   12.943916] raid5: device sdc operational as raid disk 1
Jul 14 08:02:45 nasty kernel: [   12.944776] raid5: allocated 9542kB for md127
Jul 14 08:02:45 nasty kernel: [   12.944861] 0: w=1 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul 14 08:02:45 nasty kernel: [   12.944864] 8: w=2 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul 14 08:02:45 nasty kernel: [   12.944867] 6: w=3 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul 14 08:02:45 nasty kernel: [   12.944871] 5: w=4 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul 14 08:02:45 nasty kernel: [   12.944874] 4: w=5 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul 14 08:02:45 nasty kernel: [   12.944877] 2: w=6 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul 14 08:02:45 nasty kernel: [   12.944879] 1: w=7 pa=0 pr=9 m=1 a=2 r=9 op1=0 op2=0
Jul 14 08:02:45 nasty kernel: [   12.944882] raid5: not enough operational devices for md127 (2/9 failed)

Now the array does not start anymore.
However, every disk still seems to contain the raid metadata:
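
The superblock dumps below were produced with mdadm --examine on each member disk (sdb is the SSD holding the OS and is not part of the array), roughly like this:

$ mdadm --examine /dev/sd[acdefghij]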

/dev/sda:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 8600bda9:18845be8:02187ecc:1bfad83a

    Update Time : Mon Jul 14 00:45:35 2014
       Checksum : e38d46e8 - correct
         Events : 123132

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 0
   Array State : AAA.AAA.A ('A' == active, '.' == missing)


/dev/sdc:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : fe612c05:f7a45b0a:e28feafe:891b2bda

    Update Time : Mon Jul 14 00:45:35 2014
       Checksum : 32bb628e - correct
         Events : 123132

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 1
   Array State : AAA.AAA.A ('A' == active, '.' == missing)


/dev/sdd:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 1d14616c:d30cadc7:6d042bb3:0d7f6631

    Update Time : Mon Jul 14 00:45:35 2014
       Checksum : 62bd5499 - correct
         Events : 123132

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 2
   Array State : AAA.AAA.A ('A' == active, '.' == missing)


/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : a2babca3:1283654a:ef8075b5:aaf5d209

    Update Time : Mon Jul 14 00:45:07 2014
       Checksum : f78d6456 - correct
         Events : 123123

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 3
   Array State : AAAAAAA.A ('A' == active, '.' == missing)


/dev/sdf:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : e67d566d:92aaafb4:24f5f16e:5ceb0db7

    Update Time : Mon Jul 14 00:45:35 2014
       Checksum : 9223b929 - correct
         Events : 123132

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 4
   Array State : AAA.AAA.A ('A' == active, '.' == missing)


/dev/sdg:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 2cee1d71:16c27acc:43e80d02:1da74eeb

    Update Time : Mon Jul 14 00:45:35 2014
       Checksum : 7512efd4 - correct
         Events : 123132

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 5
   Array State : AAA.AAA.A ('A' == active, '.' == missing)


/dev/sdh:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : c239f0ad:336cdb88:62c5ff46:c36ea5f8

    Update Time : Mon Jul 14 00:45:35 2014
       Checksum : c08e8a4d - correct
         Events : 123132

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 6
   Array State : AAA.AAA.A ('A' == active, '.' == missing)


/dev/sdi:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : active
    Device UUID : d06c58f8:370a0535:b7e51073:f121f58c

    Update Time : Mon Jul 14 00:45:07 2014
       Checksum : 77844dcc - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : AAAAAAA.A ('A' == active, '.' == missing)


/dev/sdj:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : f2de262f:49d17fea:b9a475c1:b0cad0b7

    Update Time : Mon Jul 14 00:45:35 2014
       Checksum : dd0acfd9 - correct
         Events : 123132

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : Active device 8
   Array State : AAA.AAA.A ('A' == active, '.' == missing)

As you can see, the two drives sde and sdi are in the active state (even though the raid is stopped), and sdi is listed as a spare.
sde has a slightly lower event count than most of the other drives (123123 instead of 123132), while sdi has an event count of 0. So sde is almost up to date, but sdi is not …
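
A quick way to compare the states and event counts side by side is to grep the --examine output; this is just a convenience sketch using the device glob from above:

$ mdadm --examine /dev/sd[acdefghij] | grep -E '^/dev|State :|Events'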

We read online that a hard power-off can cause these "kicking non-fresh" messages, and indeed my friend had hard-powered the machine off once or twice. So we followed the instructions we found online and tried to re-add sde to the array:

$ mdadm /dev/md127 --add /dev/sde
mdadm: add new device failed for /dev/sde as 9: Invalid argument

That failed, and now mdadm --examine /dev/sde shows an event count of 0 for sde as well (and it is now a spare, just like sdi):

/dev/sde:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : b8a04dbb:0b5dffda:601eb40d:d2dc37c9
           Name : nasty:stuff  (local to host nasty)
  Creation Time : Sun Mar 16 02:37:47 2014
     Raid Level : raid5
   Raid Devices : 9

 Avail Dev Size : 7814035120 (3726.02 GiB 4000.79 GB)
     Array Size : 62512275456 (29808.18 GiB 32006.29 GB)
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 689e0030:142122ae:7ab37935:c80ab400

    Update Time : Mon Jul 14 00:45:35 2014
       Checksum : 5e6c4cf7 - correct
         Events : 0

         Layout : left-symmetric
     Chunk Size : 512K

   Device Role : spare
   Array State : AAA.AAA.A ('A' == active, '.' == missing)

We know that two failed drives usually mean the end of a raid5. However, is there a way to add at least sde back to the array so that the data can be saved?

Best Answer

OK, it looks like we now have access to the raid. At least the first files we checked looked good. So here is what we did:


The raid recovery article on the kernel.org wiki suggests two possible solutions for our problem:

  1. using --assemble --force (also mentioned by derobert)
    The article says:

    [...] If the event count differs by less than 50, then the information on the drive is probably still ok. [...] If the event count closely matches but not exactly, use "mdadm --assemble --force /dev/mdX " to force mdadm to assemble the array [...]. If the event count of a drive is way off [...] that drive [...] shouldn't be included in the assembly.

    In our case the drive sde had an event count difference of 9, so there was a good chance that --force would work. However, after we executed the --add command the event count dropped to 0 and the drive was marked as a spare.

    So --force was no longer an option for us. (A sketch of what such a forced assembly would have looked like is shown after this list.)

  2. recreate the array
    This solution is explicitly marked as dangerous because you can lose data if you do something wrong. However, this seemed to be the only option we had.

    The idea is to create a new raid on the existing raid devices (that is, to overwrite the devices' superblocks) with the same configuration as the old raid, and to explicitly tell mdadm that the raid already existed and should be assumed to be clean.

    Since the event count difference was only 9 and the only real problem was the lost superblock on sde, there was a good chance that writing new superblocks would give us access to our data again … and it worked :-)
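
For reference, this is roughly what a forced assembly following option 1 would have looked like in our case, with sdi left out (its event count was way off) and sde still included (before the failed --add reset its superblock). The device list is ours; this is a sketch, not something we actually ran:

$ mdadm --stop /dev/md127
$ mdadm --assemble --force /dev/md127 /dev/sda /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdj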


Our solution

Note: This solution was specifically geared to our problem and may not work on your setup. Take these notes as an idea of how things can be done, but research what is best in your own case.

Backup
We had already lost one superblock, so this time, before working on the raid any further, we saved the first and last gigabyte of each raid device (sd[acdefghij]) using dd. For sda this looked as follows (and we repeated it for every member):

# save the first gigabyte of sda
dd if=/dev/sda of=bak_sda_start bs=4096 count=262144

# determine the size of the device
fdisk -l /dev/sda
# In this case the size was 4000787030016 bytes.

# To get the last gigabyte we need to skip everything except the last gigabyte.
# So we need to skip: 4000787030016 bytes - 1073741824 bytes = 3999713288192 bytes
# Since we read blocks of 4096 bytes we need to skip 3999713288192/4096 = 976492502 blocks.
dd if=/dev/sda of=bak_sda_end bs=4096 skip=976492502
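
The same backup can be scripted for all member disks. This is only a sketch of what such a loop could look like; it assumes the member list above and uses blockdev to read each device's size:

for dev in sda sdc sdd sde sdf sdg sdh sdi sdj; do
    size=$(blockdev --getsize64 /dev/$dev)       # device size in bytes
    skip=$(( (size - 1073741824) / 4096 ))       # blocks to skip to reach the last gigabyte
    dd if=/dev/$dev of=bak_${dev}_start bs=4096 count=262144
    dd if=/dev/$dev of=bak_${dev}_end   bs=4096 skip=$skip
done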

Gather information
When recreating the raid it is important to use the same configuration as the old raid. This is especially important if you want to recreate the array on another machine using a different mdadm version. In that case mdadm's default values may differ and could create superblocks that do not match the existing raid (see the wiki article).

In our case we used the same machine (and thus the same mdadm version) to recreate the array. However, the array had originally been created by a third-party tool, so we did not want to rely on default values and had to gather some information about the existing raid.

From the output of mdadm --examine /dev/sd[acdefghij] we got the following information about the raid (note: sdb was the SSD containing the OS and was not part of the array):

     Raid Level : raid5
   Raid Devices : 9
  Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
         Layout : left-symmetric
     Chunk Size : 512K
   Device Role : Active device 0

The Used Dev Size is given in 512-byte sectors. You can check this:
7814034432 * 512 / 1000000000 ≈ 4000.79 GB
mdadm, however, expects the --size value in kibibytes: 7814034432 * 512 / 1024 = 3907017216
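
So the value to pass to mdadm's --size option can be computed like this (a one-liner sketch):

$ echo $(( 7814034432 * 512 / 1024 ))
3907017216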

The Device Role is important: in the new raid each device must get the same role it had before. In our case:

device  role
------  ----
sda     0
sdc     1
sdd     2
sde     3
sdf     4
sdg     5
sdh     6
sdi     spare
sdj     8

Note: Drive letters (and thus the order) can change after reboot!
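
Because the letters can shift, it is worth re-checking which role each current device letter maps to right before recreating the array; a sketch using the --examine output from above:

$ mdadm --examine /dev/sd[acdefghij] | grep -E '^/dev|Device Role'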

We also need the layout and the chunk size in the next step.

Recreate raid
We can now use the information from the previous step to recreate the array:

mdadm --create --assume-clean --level=5 --raid-devices=9 --size=3907017216 \
    --chunk=512 --layout=left-symmetric /dev/md127 /dev/sda /dev/sdc /dev/sdd \
    /dev/sde /dev/sdf /dev/sdg /dev/sdh missing /dev/sdj

It is important to pass the devices in the correct order!
Moreover, we did not add sdi because its event count (0) was far too low; instead we set raid slot 7 to missing. The raid5 therefore contains 8 of its 9 devices and is assembled in degraded mode, and because it lacks a spare device no rebuild starts automatically.
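
Before touching any data it is worth confirming that the array really came up degraded with 8 of 9 devices; this is a sketch of standard checks, not a transcript of what we ran:

$ cat /proc/mdstat
$ mdadm --detail /dev/md127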

Then we used --examine to check whether the new superblocks matched our old ones, and they did :-) We were able to mount the filesystem and read the data. The next step is to back up the data, then re-add sdi and start the rebuild.
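
The re-add itself should be something along these lines (a sketch; run it only after the data has been backed up, and note that sdi still carries stale metadata):

$ mdadm --zero-superblock /dev/sdi    # optional: clear the stale superblock on sdi first
$ mdadm /dev/md127 --add /dev/sdi     # add sdi as a spare; the rebuild onto it starts automatically
$ watch cat /proc/mdstat              # monitor the rebuild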
