How to check ‘mdadm’ RAIDs while running

mdadm raid

I'm starting to get a collection of computers at home, and to support them I have my "server" Linux box running a RAID array.

It's currently mdadm RAID-1, going to RAID-5 once I have more drives (and then, I hope, RAID-6). However, I've heard various stories about data getting corrupted on one drive without you ever noticing because the other drive is the one being used, up until the point when the first drive fails and you find that your second drive is also screwed (and the 3rd, 4th, 5th drive).

Obviously backups are important and I'm taking care of that too. I know I've previously seen scripts that claim to help with this problem by letting you check your RAID while it's running. Looking for those scripts again now, though, I can't find anything similar to what I ran before, and I feel I'm out of date and don't understand what has changed.

How would you check a running RAID to make sure all disks are still performing normally?

I monitor SMART on all the drives and also have mdadm set to email me in case of failure but I'd like to know my drives occasionally "check" themselves too.

Best Answer

The point of RAID with redundancy is that it will keep going as long as it can, but obviously it will detect errors that put it into a degraded mode, such as a failing disk. You can show the current status of an array with mdadm -D:

```
# mdadm -D /dev/md0
<snip>
       0       8        5        0      active sync   /dev/sda5
       1       8       23        1      active sync   /dev/sdb7
```

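Furthermore, if you add --test (-t), the exit status of mdadm -D reflects the state of the array: it is nonzero if there is any problem such as a failed component (1 indicates an error that the RAID mode compensates for, 2 indicates a complete failure, and 4 means mdadm could not get information about the device).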
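A cron job or monitoring script can act on that exit status. Here is a minimal sketch, assuming the array is /dev/md0 and that mdadm is run with --test, which the man page documents as setting the exit status to reflect the array state. The helper name md_status_text is made up for illustration.

```shell
#!/bin/sh
# Translate the exit status of `mdadm --detail --test` into a short message.
# md_status_text is a hypothetical helper name, not part of mdadm.
md_status_text() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "degraded: at least one failed device" ;;
        2) echo "failed: array unusable" ;;
        *) echo "unknown: could not query array (status $1)" ;;
    esac
}

# Usage (needs root and a real array; /dev/md0 is an assumption):
#   mdadm --detail --test /dev/md0 >/dev/null 2>&1
#   md_status_text "$?"
```

Keeping the mapping in a function means the same script can be pointed at several arrays in a loop.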

You can also get a quick summary of the status of all RAID devices by looking at /proc/mdstat. More detailed information about a RAID device is available under /sys/class/block/md*/md/*; see Documentation/md.txt in the kernel documentation (Documentation/admin-guide/md.rst in newer kernel trees). Some /sys entries are writable as well; for example, you can trigger a full check of md0 with echo check >/sys/class/block/md0/md/sync_action.
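These /proc and /sys interfaces can be wrapped in a few small shell functions. The sketch below rests on several assumptions: the array is md0, the function names are invented for illustration, and the /proc/mdstat parsing relies on the usual [UU] / [U_] status markers, where an underscore stands for a missing member. Writing check starts a read-only pass, and mismatch_cnt then reports how many sectors were found to disagree.

```shell
#!/bin/sh
# Hypothetical helpers around the md sysfs and /proc interfaces.
# Each takes an optional path so it can be pointed at another array
# (or at a copy of the files when experimenting).

MD_DIR_DEFAULT=/sys/class/block/md0/md   # assumption: the array is md0

# Ask the kernel to start a read-only consistency check (needs root).
start_check() {
    echo check > "${1:-$MD_DIR_DEFAULT}/sync_action"
}

# Print the current sync action: "idle" once no check is running.
check_state() {
    cat "${1:-$MD_DIR_DEFAULT}/sync_action"
}

# Print the mismatch counter from the most recent check.
mismatch_count() {
    cat "${1:-$MD_DIR_DEFAULT}/mismatch_cnt"
}

# Print the names of arrays whose status marker shows a missing member,
# e.g. [U_] instead of [UU]. Reads /proc/mdstat unless a file is given.
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}
```

A monthly cron job could call start_check, poll check_state until it reports idle, and then mail the output of mismatch_count. Some distributions already schedule a check like this, so it is worth looking at the existing cron jobs or timers before adding your own.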

In addition to these spot checks, mdadm can notify you as soon as something bad happens. Make sure that you have MAILADDR root in /etc/mdadm.conf (/etc/mdadm/mdadm.conf on Debian and its derivatives); some distributions, Debian among them, set this up automatically. Then you will receive an email notification as soon as an error (a degraded array) occurs.
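To see which address mdadm will actually use, you can read it back from the config file. This is only a sketch: the helper name mdadm_mailaddr is made up, the default path is an assumption, and it does not handle abbreviated keywords.

```shell
#!/bin/sh
# Print the MAILADDR value from an mdadm config file.
# mdadm_mailaddr is a hypothetical helper; the default path is an assumption
# (Debian-based systems keep the file at /etc/mdadm/mdadm.conf).
# Keywords are matched case-insensitively; abbreviated forms are not handled.
mdadm_mailaddr() {
    awk 'toupper($1) == "MAILADDR" { print $2 }' "${1:-/etc/mdadm.conf}"
}

# To confirm that alerts really arrive, mdadm can send a test message
# for every array it finds (run as root):
#   mdadm --monitor --scan --oneshot --test
```

If the function prints nothing, no MAILADDR line was found in that file.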

Make sure that you do receive mail sent to root on the local machine. Some modern distributions omit this because they assume that all email goes through external providers, but receiving local mail is necessary for any serious system administrator. Test this by sending root a mail: echo hello | mail -s test root@localhost. Usually, a proper email setup requires two things:

1. Run an MTA on your local machine. The MTA must be set up at least to allow local mail delivery. All distributions come with suitable MTAs; pick anything (but not nullmailer if you want the email to be delivered locally).
2. Redirect mail going to system accounts (at least root) to an address that you read regularly. This can be your account on the local machine, or an external email address. With most MTAs, the address can be configured in /etc/aliases; you should have a line like
```
root: djsmiley2k
```
for local delivery, or
```
root: djsmiley2k@mail-provider.example.com
```
for remote delivery. If you choose remote delivery, make sure that your MTA is configured for that. Depending on your MTA, you may need to run the newaliases command after editing /etc/aliases.
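A quick way to confirm where root's mail is redirected is to read the alias back. This is a sketch that assumes a plain /etc/aliases file in the usual name: target format; the helper name root_alias is made up for illustration.

```shell
#!/bin/sh
# Print the target of the root alias from an aliases-format file.
# root_alias is a hypothetical helper; /etc/aliases is the default path.
root_alias() {
    awk -F: 'tolower($1) == "root" {
                 sub(/^[ \t]+/, "", $2)   # strip leading whitespace
                 sub(/[ \t]+$/, "", $2)   # strip trailing whitespace
                 print $2
             }' "${1:-/etc/aliases}"
}

# Usage:
#   root_alias                # reads /etc/aliases
#   root_alias /path/to/file  # reads another aliases file
```

An empty result means there is no root entry, so mail to root stays in root's local mailbox.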

Related Solutions

Bit rot detection and correction with mdadm

I want to point out that the mdadm system in Linux DOES NOT correct any errors. If you tell it to "fix" errors during a scrub of, say, RAID6 and it finds an inconsistency, it will "fix" it by assuming the data portions are correct and recalculating the parity.

CentOS – mdadm: can’t remove components in RAID 1

It's because the device nodes no longer exist on your system (probably udev removed them when the drive died). You should be able to remove them by using the keyword failed or detached instead:

```
mdadm -r /dev/md0 failed     # all failed devices
mdadm -r /dev/md0 detached   # failed ones that aren't in /dev anymore
```

If your version of mdadm is too old to do that, you might be able to get it to work by recreating the device node with mknod. Or, honestly, just ignore it: it's not really a problem, and it should go away the next time you reboot.
