Using mdadm 3.3
Since mdadm 3.3 (released September 3, 2013), and provided you are running a 3.2+ kernel, you can proceed as follows:
# mdadm /dev/md0 --add /dev/sdc1
# mdadm /dev/md0 --replace /dev/sdd1 --with /dev/sdc1
sdd1 is the device you want to replace; sdc1 is the preferred replacement device and must already be declared as a spare on your array.
The --with option is optional; if it is not specified, any available spare will be used.
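While the copy-replace runs, you can watch its progress in /proc/mdstat. A minimal sketch of pulling out the completion percentage is shown below; the mdstat content here is a hypothetical sample (device names, block counts, and speeds are made up), since the real file only exists on a system with a live array:

```shell
# Hypothetical snapshot of /proc/mdstat during a --replace operation.
# On a real system you would read /proc/mdstat directly instead.
mdstat_sample='md0 : active raid5 sdc1[4](R) sdd1[3] sdb1[1] sda1[0]
      5860268032 blocks level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      [====>................]  recovery = 23.1% (451234560/1953422677) finish=92.3min speed=271234K/sec'

# Extract the completion percentage from the recovery line.
pct=$(printf '%s\n' "$mdstat_sample" | awk '/recovery/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) print $i }')
printf '%s\n' "$pct"
```

On a live system, `watch cat /proc/mdstat` gives the same information continuously.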
Older mdadm versions
Note: You still need a 3.2+ kernel.
First, add a new drive as a spare (replace md0 and sdc1 with your RAID and disk devices, respectively):
# mdadm /dev/md0 --add /dev/sdc1
Then, initiate a copy-replace operation like this (sdd1 being the failing device):
# echo want_replacement > /sys/block/md0/md/dev-sdd1/state
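You can confirm the write took effect by reading the state attribute back. The sketch below mimics the sysfs attribute with a throwaway file, since the real path (/sys/block/md0/md/dev-sdd1/state) requires root and a live array; on real sysfs, reading the attribute back shows the states the kernel accepted:

```shell
# Sketch only: stand in for /sys/block/md0/md/dev-sdd1/state with a temp file.
state_file=$(mktemp)
echo want_replacement > "$state_file"

# Read the attribute back to confirm the write.
readback=$(cat "$state_file")
printf '%s\n' "$readback"

rm -f "$state_file"
```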
Result
The system will copy all readable blocks from sdd1 to sdc1. If it encounters an unreadable block, it will reconstruct it from parity. Once the operation is complete, the former spare (here: sdc1) becomes active, and the failing drive is marked as failed (F) so you can remove it.
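To spot which device carries the (F) flag once the replacement has finished, you can scan the /proc/mdstat device line. This is a minimal sketch against a hypothetical sample line (the device names are made up); the removal command at the end is standard mdadm usage:

```shell
# Hypothetical /proc/mdstat device line after the replacement completed:
# the former spare sdc1 is active, the old drive sdd1 carries the (F) flag.
line='md0 : active raid5 sdc1[4] sdd1[3](F) sdb1[1] sda1[0]'

# Find the device name marked failed by stripping the [slot](F) suffix.
for tok in $line; do
  case $tok in
    *'(F)') failed_dev=${tok%%\[*} ;;
  esac
done
printf '%s\n' "$failed_dev"
```

With the failed device identified, it can then be removed with `mdadm /dev/md0 --remove /dev/sdd1`.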
Note: credit goes to frostschutz and Ansgar Esztermann, who found the original solution (see the duplicate question).
Older kernels
Other answers suggest:
- Johnny's approach: convert the array to RAID6, replace the disk, then convert back to RAID5.
- Hauke Laging's approach: briefly remove the disk from the RAID5 array, make it part of a RAID1 (mirror) with the new disk, and add that mirror back to the RAID5 array (theoretical).
Best Answer
Right now (as of late 2015), it depends on the level at which you would like to have self-healing capabilities.
I found a similar discussion here about the same issue, where one of the "linux guys"1 replied that:
Hence, from a kernel perspective, it seems that there is no intention to support this (unlike Minix, for instance). That said, I have not found the specific policy he is talking about, nor any direct statement by Linus on the matter.
From a user-space perspective, there seem to be at least attempts to deal with this issue at the file-system level. To summarize another post and the corresponding comments: it is believed that, whereas other OSes deal with data corruption much better,
btrfs
seems to be well on its way to implementing this feature for Linux-based OSs too. However, although claimed to be stable, it is by no means as powerful yet as Sun's (Solaris-born) ZFS, as can be read here2.
1 i.e. Chris Snook, a former Red Hat associate
2 a very exhaustive blog post about benchmarking btrfs, which comes to a rather negative conclusion (as of 2015/09/16)