Using mdadm 3.3
Since mdadm 3.3 (released September 3, 2013), if you have a 3.2+ kernel, you can proceed as follows:
# mdadm /dev/md0 --add /dev/sdc1
# mdadm /dev/md0 --replace /dev/sdd1 --with /dev/sdc1
sdd1 is the device you want to replace (actually a partition on the failing drive; I prefer to create RAID sets on partitions rather than on raw disks), and sdc1 is the preferred device to replace it with; it must be declared as a spare on your array.
The --with option is optional; if it is not specified, any available spare will be used.
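Not part of the answer above, but the copy-replace shows up in the usual status interfaces, so you can follow its progress like any other rebuild:
# follow the copy-replace progress
cat /proc/mdstat
mdadm --detail /dev/md0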
Older mdadm versions
Note: You still need a 3.2+ kernel.
First, add a new drive as a spare (replace md0 and sdc1 with your RAID and disk device, respectively):
# mdadm /dev/md0 --add /dev/sdc1
Then, initiate a copy-replace operation like this (sdd1 being the failing device):
# echo want_replacement > /sys/block/md0/md/dev-sdd1/state
Result
The system will copy all readable blocks from sdd1 to sdc1. If it comes to an unreadable block, it will reconstruct it from parity. Once the operation is complete, the former spare (here: sdc1) will become active, and the failing drive will be marked as failed (F) so you can remove it.
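The removal itself is the usual manage-mode command; with the same device names as above (zeroing the superblock is optional and only useful if the disk will be reused elsewhere):
# detach the failed drive from the array before physically pulling it
mdadm /dev/md0 --remove /dev/sdd1
# optional: wipe its superblock so it is not picked up as an array member later
mdadm --zero-superblock /dev/sdd1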
Note: credit goes to frostschutz and Ansgar Esztermann who found the original solution (see the duplicate question).
Older kernels
Other answers suggest:
- Johnny's approach: convert the array to RAID6, "replace" the disk, then convert back to RAID5 (a rough sketch of this route follows below),
- Hauke Laging's approach: briefly remove the disk from the RAID5 array, make it part of a RAID1 (mirror) with the new disk, and add that mirror back to the RAID5 array (theoretical).
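A rough, untested sketch of the RAID6 route only, with assumed device names (md0 as a 4-disk RAID5, sdc1 the new disk, sdd1 the failing one); the exact reshape and --backup-file requirements depend on your mdadm and kernel versions, so check the man page before attempting this:
# add the new disk and reshape to RAID6 (a reshape needs a backup file on another device)
mdadm /dev/md0 --add /dev/sdc1
mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-grow.backup
# as RAID6 the array survives losing the bad disk, so fail and remove it
mdadm /dev/md0 --fail /dev/sdd1 --remove /dev/sdd1
# reshape back to RAID5 on the remaining disks
mdadm --grow /dev/md0 --level=5 --raid-devices=4 --backup-file=/root/md0-shrink.backup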
OK, it looks like we now have access to the raid. At least the first files we checked looked good. So here is what we did:
The raid recovery article on the kernel.org wiki suggests two possible solutions for our problem:
using --assemble --force (also mentioned by derobert)
The article says:
[...] If the event count differs by less than 50, then the information on the drive is probably still ok. [...] If the event count closely matches but not exactly, use "mdadm --assemble --force /dev/mdX " to force mdadm to assemble the array [...]. If the event count of a drive is way off [...] that drive [...] shouldn't be included in the assembly.
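The event counts the article refers to can be read straight from the member superblocks; the egrep pattern below is just one way to pick them out:
# show the event count of every member device
mdadm --examine /dev/sd[acdefghij] | egrep '/dev/sd|Events'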
In our case the drive sde had an event count difference of 9, so there was a good chance that --force would work. However, after we executed the --add command, the event count dropped to 0 and the drive was marked as a spare. So we refrained from using --force.
recreate the array
This solution is explicitly marked as dangerous because you can lose data if you do something wrong. However, this seemed to be the only option we had.
The idea is to create a new raid on the existing raid devices (that is, overwriting the devices' superblocks) with the same configuration as the old raid, and to explicitly tell mdadm that the raid already existed and should be assumed to be clean.
Since the event count difference was just 9, and the only problem was that we lost the superblock of sde, there was a good chance that writing new superblocks would get us access to our data... and it worked :-)
Our solution
Note: This solution was specifically geared to our problem and may not work on your setup. Take these notes to get an idea of how things can be done, but you need to research what is best in your case.
Backup
We had already lost a superblock, so this time we saved the first and last gigabyte of every raid device (sd[acdefghij]) using dd before working on the raid. For each device we did the following:
# save the first gigabyte of sda
dd if=/dev/sda of=bak_sda_start bs=4096 count=262144
# determine the size of the device
fdisk -l /dev/sda
# In this case the size was 4000787030016 bytes.
# To get the last gigabyte we need to skip everything except the last gigabyte.
# So we need to skip: 4000787030016 byte - 1073741824 byte = 3999713288192 byte.
# Since we read blocks of 4096 bytes we need to skip 3999713288192/4096 = 976492502 blocks.
dd if=/dev/sda of=bak_sda_end bs=4096 skip=976492502
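Should the recreation go wrong, the saved images can be written back the same way; this is only a sketch of what we would have done, we never actually needed it:
# restore the first gigabyte of sda (only if something went badly wrong!)
dd if=bak_sda_start of=/dev/sda bs=4096 count=262144
# restore the last gigabyte at the same offset it was read from
dd if=bak_sda_end of=/dev/sda bs=4096 seek=976492502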
Gather information
When recreating the raid it is important to use the same configuration as the old raid. This is especially important if you want to recreate the array on another machine using a different mdadm version. In that case mdadm's default values may differ and could create superblocks that do not match the existing raid (see the wiki article).
In our case we used the same machine (and thus the same mdadm version) to recreate the array. However, the array was originally created by a third-party tool, so we did not want to rely on default values and had to gather some information about the existing raid.
From the output of mdadm --examine /dev/sd[acdefghij] we get the following information about the raid (Note: sdb was the SSD containing the OS and was not part of the raid):
Raid Level : raid5
Raid Devices : 9
Used Dev Size : 7814034432 (3726.02 GiB 4000.79 GB)
Data Offset : 2048 sectors
Super Offset : 8 sectors
Layout : left-symmetric
Chunk Size : 512K
Device Role : Active device 0
The Used Dev Size is denominated in blocks of 512 bytes. You can check this:
7814034432*512/1000000000 ~= 4000.79
But mdadm requires the size in kibibytes: 7814034432*512/1024 = 3907017216
The Device Role is important: in the new raid, each device must have the same role as before. In our case:
device role
------ ----
sda 0
sdc 1
sdd 2
sde 3
sdf 4
sdg 5
sdh 6
sdi spare
sdj 8
Note: Drive letters (and thus the order) can change after reboot!
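Because of that, it is worth re-checking the mapping right before recreating; the commands below are just one way to tie the current drive letters to their roles and to the physical disks:
# re-check which role each current drive letter has
mdadm --examine /dev/sd[acdefghij] | egrep '/dev/sd|Device Role'
# tie the letters to physical disks via their serial numbers
ls -l /dev/disk/by-id/ | grep -v part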
We also need the layout and the chunk size in the next step.
Recreate raid
We can now use the information from the last step to recreate the array:
mdadm --create --assume-clean --level=5 --raid-devices=9 --size=3907017216 \
--chunk=512 --layout=left-symmetric /dev/md127 /dev/sda /dev/sdc /dev/sdd \
/dev/sde /dev/sdf /dev/sdg /dev/sdh missing /dev/sdj
It is important to pass the devices in the correct order!
Moreover, we did not add sdi as its event count was too low, so we set raid slot 7 to missing. Thus the raid5 contains 8 of 9 devices and will be assembled in degraded mode. And because it lacks a spare device, no rebuild will start automatically.
Then we used --examine to check whether the new superblocks matched our old superblocks. And they did :-) We were able to mount the filesystem and read the data. The next step is to back up the data, then add sdi back and start the rebuild.
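Re-adding the device and watching the rebuild would look roughly like this (array name and remaining device from above; adapt to your setup):
# add sdi back into the degraded array; the rebuild starts automatically
mdadm /dev/md127 --add /dev/sdi
# follow the rebuild progress
watch cat /proc/mdstat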