Replacing RAID hard drives before failure (3 years old!)

Tags: grub2, hard-disk, mdadm, raid, sfdisk

I'm thinking that the smart thing to do with my RAID setup is to replace the drives before they start failing, as they get old… I can't really afford a lot of cloud backup space, and I want to get a jump on the guaranteed eventual failure of my drives due to wear.

I have three 2TB drives with GPT, GRUB, a small RAID1 system partition, and a large RAID5 home partition. I'm using Arch Linux.

I was going to replace the drives one at a time. I wanted to post my plan of action and see if anyone could think of a reason why it wouldn't work or if there was a better way to do it.

step one:

Figure out which device (i.e. /dev/sda) I am replacing by unplugging it physically and checking /proc/mdstat to see which /dev/sdx fails.
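
For reference, checking the array state at this point would look roughly like this (/dev/md0 being the array name used later in this post):

cat /proc/mdstat            # a degraded array shows a missing member, e.g. [U_U] instead of [UUU]
mdadm --detail /dev/md0     # lists each member device and its current state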

step two:

Plug it back in and use sfdisk to copy the partition table

sfdisk -d /dev/sdx > partition.layout

step three:

Put in a new physical drive of the same size

step four:

sfdisk /dev/sdx < partition.layout
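
Since the drives use GPT, note that older sfdisk versions (before util-linux 2.26) don't understand GPT at all. An alternative sketch using sgdisk from gptfdisk, with /dev/sd_old and /dev/sd_new as placeholders for the old and new disks:

# replicate the partition table from the old disk (positional argument)
# onto the new disk (the -R argument), then randomize the copied GUIDs
sgdisk -R /dev/sd_new /dev/sd_old
sgdisk -G /dev/sd_new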

step five:

Use mdadm to add the new drive's partitions to the arrays, based on the instructions on the Arch Wiki.

mdadm --add /dev/md0 /dev/sdx1
mdadm --add /dev/md1 /dev/sdx2
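
Before moving on to the next drive, it's worth confirming the rebuild is actually running and letting it finish; roughly:

watch -n 60 cat /proc/mdstat                        # recovery progress for both arrays
mdadm --detail /dev/md0 | grep -E 'State|Rebuild'   # overall state and rebuild status
mdadm --detail /dev/md1 | grep -E 'State|Rebuild'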

step six:

Reinstall GRUB? Wait for the resync to complete, then repeat the whole process with the other two drives?
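
Assuming BIOS booting from GPT (which needs a BIOS boot partition on each disk; adjust if booting via UEFI), reinstalling the bootloader onto the new disk would be roughly:

grub-install --target=i386-pc /dev/sdx    # /dev/sdx = the new disk, whole device, not a partition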

I guess my question is mostly: will this work out? Is there anything I'm missing? I don't want to miss something obvious and lose all my data.

Thank you very much for any assistance/insight.

Edit:

Just to get the results of the discussion down in the same place, I wanted to say that I figured out how to have mdadm and smartmontools (smartd) monitor my hard drives and notify me via email if things start going bad. I set up ssmtp with a Gmail account that I have synced to my phone.
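
For anyone finding this later, the relevant configuration is roughly the following; the address is a placeholder, and this assumes ssmtp is already set up as the system mailer:

# /etc/mdadm.conf: mdadm --monitor (mdmonitor.service on Arch) mails this address on failures
MAILADDR you@example.com

# /etc/smartd.conf: monitor all disks, short self-test daily at 02:00, long self-test Saturdays at 03:00
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m you@example.com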

Since I already bought the new drives, I'm going to keep them around and swap them in as things go bad; it is my understanding that eventually all hard drives fail. Thanks for the suggestions and pro tips on how to do that (without degrading the array). Once I can afford an upgrade I'm going to use ZFS with an ECC motherboard/memory/etc., and thanks for the tips in that direction too. Thanks a lot, you guys really helped 😀

Best Answer

That's a bad idea: you'd be deliberately degrading your RAID, and resyncs might fail unexpectedly. It's better to hook the new disk up to the system (so you have n+1 disks) and then use mdadm --replace to sync it in. That way the RAID never degrades in between.
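
A sketch of what that looks like, using the array and partition names from the question and /dev/sdd1 as a placeholder for the extra disk's partition (--replace needs a reasonably recent kernel and mdadm):

mdadm /dev/md0 --add-spare /dev/sdd1                   # the new partition joins the array as a spare
mdadm /dev/md0 --replace /dev/sda1 --with /dev/sdd1    # copy sda1's data onto the spare while redundancy is kept
mdadm /dev/md0 --remove /dev/sda1                      # once the replacement finishes, the old member is marked faulty and can be removed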

You don't have to fail or remove drives to find out which is which. You can see a device's role number in mdadm --examine (in mdstat output, [UUU] corresponds to role numbers [012]), and you can check the drive's serial number with hdparm or smartctl and compare it to the sticker on the drive itself.
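
Concretely, with /dev/sda and /dev/sda1 standing in for any member:

mdadm --examine /dev/sda1 | grep 'Device Role'    # e.g. "Device Role : Active device 0"
smartctl -i /dev/sda | grep -i serial             # serial number to match against the label on the drive
hdparm -I /dev/sda | grep -i serial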

For partitions, it might be better to use GPT nowadays instead of an MS-DOS (MBR) label. If you are not only replacing disks but also upgrading them in size, you might have no other choice anyhow, since MS-DOS partitions pretty much stop at 2TB.

Personally I don't do this at all. So what if the disks are 3 years old? Disks live a lot longer than that, and new disks die all the same.

It's much more important to test your disks on a regular (automated) basis, and to replace a disk once it shows its first pending, uncorrectable, or reallocated sector, a read error in a self-test, or other issues.
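
A minimal sketch of that kind of check on a single disk (smartd can automate the same thing, as the edit above notes):

smartctl -t long /dev/sda          # start a long self-test; it runs in the background on the drive
smartctl -l selftest /dev/sda      # self-test log: anything other than "Completed without error" is bad news
smartctl -a /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'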

Even more important is having backups of any data you don't want to lose.

You could also switch to RAID6 for more redundancy, but the case of two disks dying at the same time is highly unlikely as long as you actively check for errors. Don't let your rebuild be your first read test in years.
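
The "actively check for errors" part is what md calls a scrub; many distros schedule one via cron or a systemd timer, and triggering it by hand looks roughly like this for the md0 array from the question:

echo check > /sys/block/md0/md/sync_action    # read the whole array, fixing unreadable sectors from redundancy
cat /proc/mdstat                              # the scrub shows up as a "check" in the resync line
cat /sys/block/md0/md/mismatch_cnt            # a non-zero count after a check deserves a closer look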
