Replacing RAID hard drives before failure (3 years old!)

Tags: grub2, hard-disk, mdadm, raid, sfdisk

I'm thinking that the smart thing to do with my RAID setup is to replace the drives before they start failing, as they get old… I can't really afford a lot of cloud backup space, and I want to get a jump on the guaranteed eventual failure of my drives due to wear.

I have three 2TB drives with GPT, GRUB, a small RAID1 system partition, and a large RAID5 home partition. I'm using Arch Linux.

I was going to replace the drives one at a time. I wanted to post my plan of action and see if anyone could think of a reason why it wouldn't work or if there was a better way to do it.

step one:

Figure out which device (i.e. /dev/sda) I am replacing by unplugging it physically and checking /proc/mdstat to see which /dev/sdx fails.
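
For reference, checking the array state at this point would look roughly like this (/dev/md0 being the array name used later in this post):

cat /proc/mdstat            # a degraded array shows a missing member, e.g. [U_U] instead of [UUU]
mdadm --detail /dev/md0     # lists each member device and its current state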

step two:

Plug it back in and use sfdisk to copy the partition table

sfdisk -d /dev/sdx > partition.layout

step three:

Put in a new physical drive of the same size

step four:

sfdisk /dev/sdx < partition.layout
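
Since the drives use GPT, note that older sfdisk versions (before util-linux 2.26) don't understand GPT at all. An alternative sketch using sgdisk from gptfdisk, with /dev/sd_old and /dev/sd_new as placeholders for the old and new disks:

# replicate the partition table from the old disk (positional argument)
# onto the new disk (the -R argument), then randomize the copied GUIDs
sgdisk -R /dev/sd_new /dev/sd_old
sgdisk -G /dev/sd_new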

step five:

Use mdadm to add the new drive's partitions to the arrays, based on the instructions on the Arch Wiki.

mdadm --add /dev/md0 /dev/sdx1
mdadm --add /dev/md1 /dev/sdx2
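
Before moving on to the next drive, it's worth confirming the rebuild is actually running and letting it finish; roughly:

watch -n 60 cat /proc/mdstat                        # recovery progress for both arrays
mdadm --detail /dev/md0 | grep -E 'State|Rebuild'   # overall state and rebuild status
mdadm --detail /dev/md1 | grep -E 'State|Rebuild'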

step six:

Reinstall GRUB? Wait for the resync to complete, then repeat the whole process with the other two drives?
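
Assuming BIOS booting from GPT (which needs a BIOS boot partition on each disk; adjust if booting via UEFI), reinstalling the bootloader onto the new disk would be roughly:

grub-install --target=i386-pc /dev/sdx    # /dev/sdx = the new disk, whole device, not a partition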

I guess my question is mostly: will this work out? Is there anything I'm missing? I don't want to miss something obvious and lose all my data.

Thank you very much for any assistance/insight.

Edit:

Just to get the results of the discussion down in the same place, I wanted to say that I figured out how to have mdadm and smartmontools (smartd) monitor my hard drives and notify me via email if things start going bad. I set up ssmtp with a Gmail account that I have synced to my phone.
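
For anyone finding this later, the relevant configuration is roughly the following; the address is a placeholder, and this assumes ssmtp is already set up as the system mailer:

# /etc/mdadm.conf: mdadm --monitor (mdmonitor.service on Arch) mails this address on failures
MAILADDR you@example.com

# /etc/smartd.conf: monitor all disks, short self-test daily at 02:00, long self-test Saturdays at 03:00
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m you@example.com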

Since I already bought the new drives, I'm going to keep them around and swap them in as things go bad; it is my understanding that eventually all hard drives fail. Thanks for the suggestions and pro tips on how to do that (without degrading the array). Once I can afford an upgrade I'm going to use ZFS with an ECC motherboard/memory/etc., and thanks for the tips in that direction too. Thanks a lot, you guys really helped 😀

Best Answer

That's a bad idea: you'd be deliberately degrading your RAID, and resyncs might fail unexpectedly. It's better to hook the new disk up to the system (so you have n+1 disks) and then use mdadm --replace to sync it in. That way the RAID never degrades in between.
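
A sketch of what that looks like, using the array and partition names from the question and /dev/sdd1 as a placeholder for the extra disk's partition (--replace needs a reasonably recent kernel and mdadm):

mdadm /dev/md0 --add-spare /dev/sdd1                   # the new partition joins the array as a spare
mdadm /dev/md0 --replace /dev/sda1 --with /dev/sdd1    # copy sda1's data onto the spare while redundancy is kept
mdadm /dev/md0 --remove /dev/sda1                      # once the replacement finishes, the old member is marked faulty and can be removed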

You don't have to fail or remove drives to find out which is which. You can see a device's role number in mdadm --examine (in mdstat output, [UUU] corresponds to role numbers [012]), and you can check the drive's serial number with hdparm or smartctl and compare it to the sticker on the drive itself.
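
Concretely, with /dev/sda and /dev/sda1 standing in for any member:

mdadm --examine /dev/sda1 | grep 'Device Role'    # e.g. "Device Role : Active device 0"
smartctl -i /dev/sda | grep -i serial             # serial number to match against the label on the drive
hdparm -I /dev/sda | grep -i serial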

For partitions, it might be better to use GPT nowadays instead of an MS-DOS (MBR) label. If you are not only replacing disks but also upgrading them in size, you might have no other choice anyhow, since MS-DOS partitions pretty much stop at 2TB.

Personally I don't do this at all. So what if the disks are 3 years old? Disks live a lot longer than that, and new disks die all the same.

It's much more important to test your disks on a regular (automated) basis, and to replace a disk once it shows its first pending, uncorrectable, or reallocated sector, a read error in a self-test, or other issues.
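
A minimal sketch of that kind of check on a single disk (smartd can automate the same thing, as the edit above notes):

smartctl -t long /dev/sda          # start a long self-test; it runs in the background on the drive
smartctl -l selftest /dev/sda      # self-test log: anything other than "Completed without error" is bad news
smartctl -a /dev/sda | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'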

Even more important is having backups of any data you don't want to lose.

You could also switch to RAID6 for more redundancy, but the case of two disks dying at the same time is highly unlikely as long as you actively check for errors. Don't let your rebuild be your first read test in years.
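
The "actively check for errors" part is what md calls a scrub; many distros schedule one via cron or a systemd timer, and triggering it by hand looks roughly like this for the md0 array from the question:

echo check > /sys/block/md0/md/sync_action    # read the whole array, fixing unreadable sectors from redundancy
cat /proc/mdstat                              # the scrub shows up as a "check" in the resync line
cat /sys/block/md0/md/mismatch_cnt            # a non-zero count after a check deserves a closer look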
