Using mdadm 3.3
Since mdadm 3.3 (released September 3, 2013), and provided you are running a 3.2+ kernel, you can proceed as follows:
# mdadm /dev/md0 --add /dev/sdc1
# mdadm /dev/md0 --replace /dev/sdd1 --with /dev/sdc1
sdd1 is the device you want to replace; sdc1 is the preferred replacement device and must already be declared as a spare on your array.
The --with option is optional; if it is not specified, any available spare will be used.
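While the copy-replace runs, you can watch its progress in /proc/mdstat. A minimal sketch of pulling out the completion percentage is shown below; the mdstat content here is a hypothetical sample (device names, block counts, and speeds are made up), since the real file only exists on a system with a live array:

```shell
# Hypothetical snapshot of /proc/mdstat during a --replace operation.
# On a real system you would read /proc/mdstat directly instead.
mdstat_sample='md0 : active raid5 sdc1[4](R) sdd1[3] sdb1[1] sda1[0]
      5860268032 blocks level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [====>................]  recovery = 23.1% (451234560/1953422677) finish=92.3min speed=271234K/sec'

# Extract the completion percentage from the recovery line.
pct=$(printf '%s\n' "$mdstat_sample" | awk '/recovery/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i }')
printf '%s\n' "$pct"
```

On a live system, `watch cat /proc/mdstat` gives the same information continuously.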
Older mdadm versions
Note: You still need a 3.2+ kernel.
First, add a new drive as a spare (replace md0 and sdc1 with your RAID and disk devices, respectively):
# mdadm /dev/md0 --add /dev/sdc1
Then, initiate a copy-replace operation like this (sdd1 being the failing device):
# echo want_replacement > /sys/block/md0/md/dev-sdd1/state
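You can confirm the write took effect by reading the state attribute back. The sketch below mimics the sysfs attribute with a throwaway file, since the real path (/sys/block/md0/md/dev-sdd1/state) requires root and a live array; on real sysfs, reading the attribute back shows the states the kernel accepted:

```shell
# Sketch only: stand in for /sys/block/md0/md/dev-sdd1/state with a temp file.
state_file=$(mktemp)
echo want_replacement > "$state_file"

# Read the attribute back to confirm the write.
readback=$(cat "$state_file")
printf '%s\n' "$readback"

rm -f "$state_file"
```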
Result
The system will copy all readable blocks from sdd1 to sdc1. If it encounters an unreadable block, it will reconstruct it from parity. Once the operation is complete, the former spare (here: sdc1) becomes active, and the failing drive is marked as failed (F) so you can remove it.
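To spot which device carries the (F) flag once the replacement has finished, you can scan the /proc/mdstat device line. This is a minimal sketch against a hypothetical sample line (the device names are made up); the removal command at the end is standard mdadm usage:

```shell
# Hypothetical /proc/mdstat device line after the replacement completed:
# the former spare sdc1 is active, the old drive sdd1 carries the (F) flag.
line='md0 : active raid5 sdc1[4] sdd1[3](F) sdb1[1] sda1[0]'

# Find the device name marked failed by stripping the [slot](F) suffix.
for tok in $line; do
  case $tok in
    *'(F)') failed_dev=${tok%%\[*} ;;
  esac
done
printf '%s\n' "$failed_dev"
```

With the failed device identified, it can then be removed with `mdadm /dev/md0 --remove /dev/sdd1`.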
Note: credit goes to frostschutz and Ansgar Esztermann, who found the original solution (see the duplicate question).
Older kernels
Other answers suggest:
- Johnny's approach: convert the array to RAID6, replace the disk, then convert back to RAID5.
- Hauke Laging's approach: briefly remove the disk from the RAID5 array, make it part of a RAID1 (mirror) with the new disk, and add that mirror back to the RAID5 array (theoretical).
Best Answer
Right now (as of late 2015), it depends on the level at which you would like to have self-healing capabilities.
I found a similar discussion here about the same issue, where one of the "linux guys"1 replied that:
Hence, from a kernel perspective, it seems that there is no intention to support this (unlike Minix, for instance). That said, I have not found the specific policy he is talking about, nor any direct statement by Linus on the matter.
From a user-space perspective, there seem to be at least attempts to deal with this issue at the file-system level. To summarize another post and the corresponding comments: it is believed that, whereas other OSes deal with data corruption much better,
btrfs
seems to be well on its way to implementing this feature for Linux-based OSs too. However, although claimed to be stable, it is by no means as powerful yet as Sun's (Solaris-born) ZFS, as can be read here2.
1 i.e. Chris Snook, a former Red Hat associate
2 a very exhaustive blog post about benchmarking btrfs, which comes to a rather negative conclusion (as of 2015/09/16)