I want to use two 3 TB drives in a mdadm raid1 setup (using Debian Squeeze).
The drives use 4k hardware sectors instead of the traditional 512 byte ones.
I am a bit confused because on the one hand the kernel reports:
$ cat /sys/block/sdb/queue/hw_sector_size
512
But on the other hand, fdisk reports:
# fdisk -l /dev/sdb
Disk /dev/sdb: 3000.6 GB, 3000592982016 bytes
255 heads, 63 sectors/track, 364801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
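The fdisk output already shows what matters in practice: with 512 B logical and 4096 B physical sectors, a partition is 4k-aligned when its starting LBA is a multiple of 8. A quick sanity check (the start sector 2048 is just an illustrative value, not taken from this disk):

```shell
# 4 KiB alignment check for a 512e drive (512 B logical, 4096 B physical):
# a partition is aligned when its starting LBA is a multiple of 8,
# because 8 * 512 B = 4096 B.
start=2048              # example start sector (assumption, not from this disk)
echo $(( start % 8 ))   # 0 means 4 KiB-aligned
```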
Thus, it seems that the kernel has some idea that the drive uses 4k sectors.
The mdadm man page is a bit cryptic about the chunk size and raid1:
-c, --chunk= Specify chunk size of kibibytes. The default when creating an array is 512KB. To ensure compatibility with earlier versions, the default when building an array with no persistent metadata is 64KB. This is only meaningful for RAID0, RAID4, RAID5, RAID6, and RAID10.
Why is it not meaningful for raid1?
Looking at /proc/mdstat, the raid1 device md8 has 2930265424 blocks, i.e.
3000591794176/2930265424/2 = 512
Does mdadm then use a block size of 512 bytes? (/2 because it is a two-way mirror)
And is chunk-size a different concept than blocksize?
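For what it's worth, the "blocks" figure in /proc/mdstat is in units of 1 KiB regardless of the drive's sector size, so the numbers work out without any 512-byte block being involved:

```shell
# /proc/mdstat counts in 1 KiB blocks, not sectors:
blocks=2930265424           # md8 size as shown in /proc/mdstat
echo $(( blocks * 1024 ))   # 3000591794176 bytes = the array size
```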
Trying to let mdadm explain a device:
# mdadm -E /dev/sdb -v -v
Avail Dev Size : 5860531120 (2794.52 GiB 3000.59 GB)
Array Size : 5860530848 (2794.52 GiB 3000.59 GB)
Where
3000591794176/5860530848 = 512
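The sizes printed by mdadm -E are in 512-byte sectors (the traditional unit for LBA counts), which is why dividing the byte size by them yields 512:

```shell
# mdadm -E reports sizes in 512 B sectors:
array_sectors=5860530848          # "Array Size" from mdadm -E
echo $(( array_sectors * 512 ))   # 3000591794176 bytes
```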
With a default mkfs.xfs on the md device, it reports:
sectsz=512
bsize=4096
I corrected this by running mkfs.xfs -s size=4096 /dev/md8
Edit: Testing a bit around I noticed following things:
It seems that the initial resync is done with a block size of 128k (and not 512 bytes):
md: resync of RAID array md8
md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
md: using 128k window, over a total of 2930265424 blocks.
The displayed speed via /proc/mdstat is consistent with that block size (with 512 bytes one would expect a performance hit):
[>....................] resync = 3.0% (90510912/2930265424) finish=381.1min speed=124166K/sec
(For example, when disabling the write cache the displayed speed immediately drops to 18M/sec.)
Under /sys there are actually some more relevant files besides hw_sector_size:
# cat /sys/block/sdb/queue/physical_block_size
4096
# cat /sys/block/sdb/queue/logical_block_size
512
That means that the drive does not lie to the kernel about its 4k sector size, and the kernel has some 4k sector support (as the output of fdisk -l suggested).
Googling around turned up a few reports of WD disks that do not report their 4k sector size – fortunately this 3 TB WD disk does not have that problem; perhaps WD fixed the firmware in current disks.
Best Answer
Chunk size does not apply to raid1 because there is no striping; essentially the entire disk is one chunk. In short, you do not need to worry about the 4k physical sector size. Recent versions of mdadm use the information from the kernel to make sure that the start of data is aligned to a 4 KiB boundary. Just make sure you are using a 1.x metadata format.
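To tie this together, a creation command along these lines (device names are illustrative, not taken from the question) selects 1.2 metadata explicitly, so the data offset stays 4 KiB-aligned:

```shell
# Illustrative only -- adjust device names to your setup.
# Metadata 1.2 places the superblock near the start of the device, and
# recent mdadm keeps the data offset 4 KiB-aligned for 4k-sector drives.
mdadm --create /dev/md8 --level=1 --raid-devices=2 --metadata=1.2 \
      /dev/sdb /dev/sdc
# Then create the filesystem with a matching sector size:
mkfs.xfs -s size=4096 /dev/md8
```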