mdadm: previously working; after “failure”, cannot join array due to disk size

mdadm, raid, software-raid

Abstract

I had a functional RAID 5 array; I rebooted the box, and then mdadm couldn't re-assemble one component.

Seeing that it was only one part, I thought it would be easy to just re-sync. But that turned out not to work, because apparently now the device is not large enough to join the array!?

Initial Raid Setup

Sadly, it is rather complicated. I have a RAID 5 combining two 3 TB disks with two linear arrays (each consisting of a 1 TB and a 2 TB disk).
I did not partition the disks; that is, the arrays span the physical disks directly. In hindsight this is probably what caused the initial failure.
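
For illustration, the layout was roughly equivalent to the following (only /dev/md0, /dev/md2, /dev/sda and /dev/sdc correspond to names that appear later; the other device names are placeholders):

# mdadm --create /dev/md1 --level=linear --raid-devices=2 /dev/sdd /dev/sde
# mdadm --create /dev/md2 --level=linear --raid-devices=2 /dev/sda /dev/sdc
# mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdf /dev/sdg /dev/md1 /dev/md2

So md0 is the RAID 5 over the two 3 TB disks plus the two linear arrays.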

After the fateful reboot

mdadm would refuse to assemble one of the linear arrays, claiming that no superblock existed (checking with mdadm --examine on both components didn't return anything). Stranger yet, they apparently still had some partition-table remnants on them.
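
For reference, the check amounted to little more than this, run on the two component disks:

# mdadm --examine /dev/sda /dev/sdc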

At this point I thought the quickest solution would be to just re-create the linear array, add it to the bigger RAID 5 array, and have it re-sync.
Hence I opted to just remove those partition table entries, that is, wipe the disks back to free space.
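Something along these lines (illustrative; wipefs is just one way to clear stale signatures, deleting the partitions with fdisk achieves the same):

# wipefs -a /dev/sda
# wipefs -a /dev/sdc

Then I created a linear array spanning both of the disks: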

# mdadm --create /dev/md2 --level=linear --raid-devices=2 /dev/sda /dev/sdc

However, when I try to add it back to the array, I get

# mdadm --add /dev/md0 /dev/md2        
mdadm: /dev/md2 not large enough to join array

So am I correct in guessing that the disks shrank?

Counting blocks

I guess it's time for some block counts!

The two components of the linear array:

RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0   1000204886016   /dev/sda
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0   2000398934016   /dev/sdc
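
(For reference, a listing like the one above can be produced with blockdev, e.g.:)

# blockdev --report /dev/sda /dev/sdc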

If mdadm's linear mode had no overhead, the resulting array (the sum of the two sizes, 3000603820032 bytes) would be bigger than one of the 3 TB drives (3000592982016 bytes). But that is not the case:

/proc/mdstat reports that the linear array has a size of 2930015024 KiB, which is 120016 KiB less than the required size:

# mdadm --detail /dev/md0 | grep Dev\ Size
Used Dev Size : 2930135040 (2794.39 GiB 3000.46 GB)
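
To spell the arithmetic out (all values in 1 KiB blocks, taken straight from the outputs above):

(1000204886016 + 2000398934016) / 1024 = 2930277168    raw capacity of sda + sdc
2930277168 - 2930015024 = 262144                       lost somewhere (= 256 MiB)
2930135040 - 2930015024 = 120016                       shortfall vs. the required size (~117 MiB)

So roughly a quarter of a gigabyte of raw capacity is going somewhere other than data.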

But that… is terribly fishy! Before the reboot, an (albeit earlier) incarnation of this linear array was part of the bigger array!

What I believe happened

After the reboot, mdadm recognized that a part of the array was missing. Since it was the smallest member, the array device size was automagically grown to fill up the next smallest device.

But that does not sound like uh, sensible behavior, does it?

An alternative explanation would be that for some reason I am no longer creating a maximum-size linear array, but… that seems sort of nonsensical as well.

What I have been pondering doing

Shrink the degraded array to exclude the "broken" linear array and then try to --add and --grow again. But I am afraid that does not actually change the device size.
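
The thing that would actually change the device size is presumably --grow with an explicit --size (given in KiB), something like this (untested, and the filesystem / LV on top would have to be shrunk accordingly first):

# mdadm --grow /dev/md0 --size=2930015024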

Since I do not understand what exactly went wrong, I would prefer to first understand what caused this problem in the first place before doing anything hasty.

Best Answer

So uh... I guess... well... the disks... shrank?

The area mdadm reserves for metadata by default probably grew... I've had some cases recently where mdadm wasted a whopping 128 MiB for no apparent reason. You want to check mdadm --examine /dev/device* for the Data Offset entry. Ideally it should be no more than 2048 sectors.
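
For example, with the device names from the question:

# mdadm --examine /dev/sda /dev/sdc | grep Offset

With 1.x metadata this prints a "Data Offset : ... sectors" line for each device.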

If that is indeed the problem, you could use mdadm --create along with the --data-offset= parameter to make mdadm waste less space for metadata.
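
A sketch of that (untested here; re-creating /dev/md2 is acceptable in this case because its data is gone anyway, and --data-offset needs a reasonably recent mdadm):

# mdadm --stop /dev/md2
# mdadm --create /dev/md2 --level=linear --raid-devices=2 --data-offset=1M /dev/sda /dev/sdc

1M means one megabyte per member, i.e. the 2048 sectors mentioned above.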

If that's still not sufficient, you'd have to either try your luck with the old 0.90 metadata (which might be the most space efficient as it uses no such offset), or shrink the other side of the RAID a little (remember to shrink the LV / filesystem first).
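
The 0.90 variant would look something like this (again only a sketch; 0.90 superblocks live at the end of each device, so there is no data offset at the front, but the format has its own limits, e.g. roughly 2 TiB per component):

# mdadm --create /dev/md2 --level=linear --raid-devices=2 --metadata=0.90 /dev/sda /dev/sdc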
