I have a RAID5 array running and now also a RAID1 that I set up yesterday. Since RAID5 calculates parity, it should be able to catch silent data corruption on one disk. For RAID1, however, the disks are just mirrors. The more I think about it, the more I figure that RAID1 is actually quite risky. Sure, it will save me from a disk failure, but it might not be as good at protecting the data on disk (which is actually more important to me).
- How does Linux software RAID actually store RAID1 type data on disk?
- How does it know which spindle is giving corrupt data (if the disk subsystem is not reporting any errors)?
If RAID1 really gives me disk protection rather than data protection, are there some tricks I can do with mdadm to create a two-disk "RAID5-like" setup? I.e. lose capacity but keep redundancy for the data as well?
Best Answer
Focusing on the actual questions...
Even RAID 5 will not be able to correct silent bit rot, but it can detect it during a data scrub. It will, however, be able to correct a single block that the disk has reported as an Unrecoverable Read Error (URE). Note that not all drives in a RAID 5 stripe are read during a normal data read, so if the error exists on the unused disk of the stripe it will go undetected until you perform a data scrub. With any standard RAID level, silent bit rot detection can only occur during data scrubbing. RAID 5 cannot do even this during the rebuild of a failed disk, which is what most of the concerns about RAID 5 are these days.
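The detection-but-not-correction point above can be sketched in a few lines. This is a toy model of a RAID 5 stripe, not mdadm internals: a scrub recomputes the XOR parity and notices a mismatch, but with only single parity it cannot tell which member holds the bad data.

```python
def xor_blocks(blocks):
    """XOR equal-length byte strings together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# A 3-disk RAID 5 stripe: two data blocks plus their XOR parity.
d0 = b"hello world_____"
d1 = b"raid5 stripe____"
parity = xor_blocks([d0, d1])

# Scrub of a clean stripe: data XOR parity is all zeroes.
assert xor_blocks([d0, d1, parity]) == bytes(16)

# Silent bit rot flips a bit on one member; the disk reports no error.
rotted = bytearray(d1)
rotted[0] ^= 0x04
mismatch = xor_blocks([d0, bytes(rotted), parity]) != bytes(16)
print("scrub detected mismatch:", mismatch)  # True: detected, but which of
                                             # the three members is wrong?
```

The scrub sees that data and parity disagree, but any one of the three blocks could be the corrupted one, so a single-parity scheme has no safe way to repair it.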
This is why most bit rot will be reported to mdadm as an Unrecoverable Read Error (URE) by the disk subsystem. However, there are still risks to your data that will not result in any error being reported by the disk, such as the ones described on the Server Fault page Is bit rot on hard drives a real problem? What can be done about it?
RAID 6, and RAID 1 arrays with at least 3 disks, are the only standard RAID levels that have the potential to detect and correct some forms of silent bit rot that the individual disks do not report as errors, by using a forward-error-correction-style voting system; I do not know, though, whether mdadm implements the required code for this.
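A minimal sketch of that voting idea, assuming three mirror copies (a 3-disk RAID 1): with an odd number of copies, a simple majority vote can both detect and repair one silently corrupted copy. Note this only illustrates the principle; mdadm's actual `repair` action may behave differently (e.g. copying from the first member rather than voting).

```python
from collections import Counter

def vote(copies):
    """Return the majority value among mirror copies.

    With 3 copies this tolerates one silently corrupted copy;
    with no majority the corruption is detectable but not correctable.
    """
    winner, count = Counter(copies).most_common(1)[0]
    if count <= len(copies) // 2:
        raise ValueError("no majority: corruption detected but not correctable")
    return winner

# One of three mirrors has silently rotted.
copies = [b"good data", b"good data", b"g00d data"]
print(vote(copies))  # b'good data' wins 2-to-1
```

With only two mirrors (a plain RAID 1 pair) the same vote has no majority when the copies disagree, which is exactly why a two-disk mirror can only detect, never arbitrate, silent corruption.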
FYI: I noticed that the Synology DS1813+ uses mdadm for both data and system partitions, and it uses RAID 1 across all 8 disks for the system partitions.
As you may have observed, this places a lot of reliance on the disk being able to report bad data as an error. Everyone says to use ZFS to solve this issue; I believe ZFS's main data integrity improvements are that it effectively scrubs more frequently by checking mirrors/parity on every read, and that it keeps independent block-level checksums (which means many silently corrupted blocks are no longer silent, and are corrected when possible). It may also implement the voting logic described above for silent data corruption.
To test whether a particular system can detect and/or correct silent data corruption, use the Linux dd command to write random data over part of one of the partitions in the array, then check whether the data read back from the array is still good. Warning: do not run this test on a system holding data you want to keep, as your system may fail the test. For standard RAID levels you will need to perform a data scrub between corrupting the data and the test read.
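The corrupt-then-scrub procedure can be rehearsed harmlessly with in-memory "disks" before touching real partitions. This is a hypothetical simulation of a two-way mirror, not the real dd test: it corrupts one copy, then plays the role of the scrub by comparing the mirrors.

```python
import os

# Two in-memory "disks" forming a RAID 1 mirror.
disk_a = bytearray(b"important data! " * 4)
disk_b = bytearray(disk_a)

# Simulate the dd step: overwrite 4 bytes of one member with random data.
disk_b[8:12] = os.urandom(4)
# Guard against the rare case where the random bytes match the original.
while disk_b[8:12] == disk_a[8:12]:
    disk_b[8:12] = os.urandom(4)

# Simulate the scrub step: compare the mirrors byte by byte.
mismatched = sum(1 for x, y in zip(disk_a, disk_b) if x != y)
print("scrub found", mismatched, "mismatched bytes")
# With only two copies the scrub knows the mirrors disagree,
# but it has no way to tell which copy holds the good data.
```

On a real array the equivalent scrub is triggered through md's sync_action interface, and the array's mismatch count tells you whether the corruption was noticed.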