Bit Rot – Detection and Correction with mdadm

mdadm, raid

I'm about to reorganise all the HDDs in my home Linux NAS and would like to use mdadm RAID for data protection and for its flexibility in reshaping arrays. Before I commit to mdadm, though, I'd like to know how it handles bit rot, specifically the kinds of bit rot that do not result in unrecoverable read error (URE) messages being reported by the HDD.

Given that I'll likely have at least 21 TB of HDDs across 8 disks in the NAS, and given the various quoted probabilities of HDD failure, I think it's reasonably likely that during a rebuild after a single disk failure I'll encounter some form of bit rot on the remaining disks. If it is an unrecoverable read error that one of the drives actually reports as an error, I believe RAID6 should handle that fine (should it?). However, if the data read from a disk is bad but not reported as such by the disk, then I can't see how this can be automatically corrected even with RAID6. Is this something we need to be concerned about? Given the article "It is 2010 and RAID5 still works", and my own successful experiences at home and at work, things are not necessarily as doom-and-gloom as the buzzwords and marketing would have us believe, but I hate having to restore from backups just because a HDD failed.
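For a rough sense of scale, the chance of hitting a reported URE during a rebuild can be estimated from the manufacturer's quoted error rate. The figures below are assumptions for illustration (a typical consumer-drive spec of 1 error per 10^14 bits read, and ~21 TB read from the surviving disks), not measured values:

```python
import math

# Rough estimate of hitting at least one URE while reading the surviving
# disks during a rebuild. Assumes independent errors at the quoted rate.
URE_RATE = 1e-14        # errors per bit read (assumed consumer-drive spec)
BITS_READ = 21e12 * 8   # ~21 TB of surviving data, expressed in bits

# P(at least one URE) = 1 - (1 - rate)^bits, approximated as 1 - exp(-rate*bits)
p_ure = 1 - math.exp(-URE_RATE * BITS_READ)
print(f"P(>=1 URE during rebuild): {p_ure:.1%}")
```

Under those assumed numbers the probability comes out around 80%, which is exactly the argument for double parity: with RAID6, a single *reported* URE during a single-disk rebuild is still recoverable from the second parity block.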

Given that the usage pattern will be "write at most a few times, read occasionally", I'll need to perform data scrubbing. The Arch Linux wiki gives the mdadm command for scrubbing an array as

echo check > /sys/block/md0/md/sync_action

then to monitor the progress

cat /proc/mdstat
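If you want to watch the scrub from a script rather than eyeballing `cat /proc/mdstat`, the progress line can be parsed. This is a minimal sketch; the sample text below is illustrative, but the `check = N%` field matches what the md driver prints while a scrub is running:

```python
import re

def scrub_progress(mdstat_text: str) -> dict:
    """Extract per-array check/resync progress from /proc/mdstat contents.

    Returns {array_name: percent_complete}; arrays with no running
    operation are omitted.
    """
    progress = {}
    current = None
    for line in mdstat_text.splitlines():
        m = re.match(r"^(md\d+)\s*:", line)
        if m:
            current = m.group(1)
        # e.g. "[====>...........]  check = 23.1% (1353862656/5860270080)"
        m = re.search(r"(check|resync|recover)\s*=\s*([\d.]+)%", line)
        if m and current:
            progress[current] = float(m.group(2))
    return progress

sample = """\
Personalities : [raid6]
md0 : active raid6 sda1[0] sdb1[1] sdc1[2] sdd1[3]
      5860270080 blocks level 6, 512k chunk, algorithm 2 [4/4] [UUUU]
      [====>................]  check = 23.1% (1353862656/5860270080)
"""
print(scrub_progress(sample))  # {'md0': 23.1}
```

In practice you would feed it `open("/proc/mdstat").read()` instead of the sample string.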

My understanding is that this will read all sectors of all disks and check that the data matches the parity and vice versa. However, I notice the docs heavily emphasise that there are significant circumstances in which the "check" operation can only detect problems, not auto-correct them, leaving it to the user to fix.

What mdadm RAID level(s) should I choose to maximise my protection from bit rot, what maintenance and other protective steps should I take, and what will this not protect me from?

Edit: I'm not looking to start a RAID vs. ZFS (or any other technology) debate. I want to know specifically about mdadm RAID, which is also why I'm asking on Unix & Linux and not on Super User.

Edit: is the answer that mdadm can only correct UREs that are reported by the disk subsystem during a data scrub, and can detect silent bit rot during a scrub but cannot/will not fix it?

Best Answer

I don't have enough rep to comment, but I want to point out that the mdadm system on Linux DOES NOT correct such errors. If you tell it to "fix" errors during a scrub (the "repair" sync_action) of, say, a RAID6 array and it finds an inconsistency, it will "fix" it by assuming the data blocks are correct and recalculating the parity.
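This behaviour can be illustrated with plain XOR parity (simplified here to single RAID5-style parity; this is a conceptual sketch, not mdadm code). A parity mismatch tells the scrub that *something* is wrong, but not *which* block, so "repair" just rewrites the parity and the rotted data survives:

```python
# Illustration: with simple XOR parity, a mismatch is detectable but not
# attributable to a specific strip, so "repair" recomputes parity from
# the (possibly corrupt) data rather than restoring the data.
data = [0b1010, 0b0110, 0b1100]       # three data strips
parity = data[0] ^ data[1] ^ data[2]  # stored parity strip

# Silent bit rot flips a bit in one data strip; the disk reports no error.
data[1] ^= 0b0001

# A scrub ("check") recomputes parity and compares it to the stored copy.
mismatch = (data[0] ^ data[1] ^ data[2]) != parity
print("mismatch detected:", mismatch)  # True: the scrub sees an inconsistency

# md's "repair" action: trust the data, recompute the parity.
parity = data[0] ^ data[1] ^ data[2]
print("rotted value kept:", data[1] == 0b0111)  # True: corruption persists
```

In principle RAID6's two independent syndromes could identify which single strip is bad, but md's repair path does not attempt that analysis; it recalculates parity from the data just as above.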
