md RAID1, ext3 and 4K sectors: slow directory operations

ext3, md, mdadm, raid1, software-raid

I recently moved from a hardware RAID1 enclosure to using two eSATA drives with md. Everything seems to be working fine, except that directory traversals/listings sometimes crawl (on the order of tens of seconds). I am using an ext3 filesystem with the block size set to 4K.
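For context, the array and filesystem were set up roughly along these lines (a sketch from memory; the exact flags are assumptions rather than a transcript, and the mount point is a placeholder):

mdadm --create /dev/md127 --level=1 --raid-devices=2 --metadata=1.2 /dev/sda1 /dev/sdb1
mkfs.ext3 -b 4096 /dev/md127                       # 4K filesystem block size
mount -o noatime,nodiratime /dev/md127 /mnt/raid   # mount options as noted in the edits below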

Here is the output from the commands that seem most relevant:

mdadm --detail:

/dev/md127:
        Version : 1.2
  Creation Time : Sat Nov 16 09:46:52 2013
     Raid Level : raid1
     Array Size : 976630336 (931.39 GiB 1000.07 GB)
  Used Dev Size : 976630336 (931.39 GiB 1000.07 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Tue Nov 19 01:07:59 2013
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Events : 19691

    Number   Major   Minor   RaidDevice State
       2       8       17        0      active sync   /dev/sdb1
       1       8        1        1      active sync   /dev/sda1

fdisk -l /dev/sd{a,b}:

Disk /dev/sda: 1000.2 GB, 1000204886016 bytes, 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0xb410a639

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048  1953525167   976761560   83  Linux

Disk /dev/sdb: 1000.2 GB, 1000204886016 bytes, 1953525168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x261c8b44

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1            2048  1953525167   976761560   83  Linux

time dumpe2fs /dev/md127 |grep size:

dumpe2fs 1.42.7 (21-Jan-2013)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Block size:               4096
Fragment size:            4096
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal size:             128M

real    2m14.242s
user    0m2.286s
sys     0m0.352s

The way I understand it, I've got 4K physical sectors on these drives (recent WD Reds), but the partitions/filesystems appear to be properly aligned. Since it looks like I'm using md metadata version 1.2, I think I'm also fine there (based on "mdadm raid1 and what chunksize (or blocksize) on 4k drives?"). The one thing I haven't found an answer for online is whether having an inode size of 256 would cause problems. Not all operations are slow; the buffer cache seems to do a great job of keeping things zippy (as it should).
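For anyone who wants to double-check the alignment claim, the arithmetic from the fdisk output above works out, and the kernel can confirm it (a quick sketch; nothing here modifies anything):

# partition start: 2048 sectors * 512 bytes = 1048576 bytes, an exact multiple of 4096
fdisk -l /dev/sda | grep '^/dev/sda1'
blockdev --getalignoff /dev/sda1     # prints 0 for an aligned partition
blockdev --getalignoff /dev/md127    # and for the md device on top of it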

My kernel version is 3.11.2.

EDIT: new info, 2013-11-19

mdadm --examine /dev/sd{a,b}1 | grep -i offset
    Data Offset : 262144 sectors
   Super Offset : 8 sectors
    Data Offset : 262144 sectors
   Super Offset : 8 sectors

Also, I am mounting the filesystem with noatime,nodiratime. I'm not really willing to mess with the journaling much, since if I care enough to run RAID1, weakening the journal might be self-defeating. I am tempted to turn on directory indexing.
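For reference, the relevant fstab line / remount looks something like this (the mount point is a placeholder of mine, not the real one):

# /etc/fstab entry (mount point assumed)
/dev/md127   /mnt/raid   ext3   noatime,nodiratime   0   2

# or applied in place:
mount -o remount,noatime,nodiratime /mnt/raid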

EDIT 2013-11-20

Yesterday I tried turning on directory indexing for ext3 and ran e2fsck -D -f to see if that would help. Unfortunately, it hasn't. I am starting to suspect it may be a hardware issue (or is md RAID1 over eSATA just a really bad idea?). I'm thinking of taking each of the drives offline and seeing how they perform on their own; a rough sketch of both is below.
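The directory-indexing change was essentially the following, and the single-drive test I have in mind would go something like the rest of this sketch (device names as above; the mdadm/dd lines are a plan, not something I have run yet):

# enable directory indexing and rebuild the indexes (run with the filesystem unmounted)
tune2fs -O dir_index /dev/md127
e2fsck -D -f /dev/md127

# planned single-drive test: drop one mirror, read from it directly, then re-add it
mdadm /dev/md127 --fail /dev/sdb1 --remove /dev/sdb1
hdparm -tT /dev/sdb                              # raw/cached sequential read timing
dd if=/dev/sdb1 of=/dev/null bs=1M count=1024    # simple streaming read
mdadm /dev/md127 --add /dev/sdb1                 # re-add; md will resync the mirror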

EDIT 2013-11-21

iostat -kx 10 |grep -P "(sda|sdb|Device)":

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.37     1.17    0.06    0.11     1.80     5.10    84.44     0.03  165.91   64.66  221.40 100.61   1.64
sdb              13.72     1.17    2.46    0.11   110.89     5.10    90.34     0.08   32.02    6.46  628.90   9.94   2.55
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

I truncated the output past this point since it was all zeros.

I really feel like this should be irrespective of ext4 vs. ext3, because it isn't just a little slower: it can take on the order of tens of seconds to over a minute to tab-autocomplete or run an ls.

EDIT: Likely a hardware issue, will close question when confirmed

The more I think about it, the more I wonder if it's my eSATA card. I'm currently using this one: http://www.amazon.com/StarTech-PEXESAT32-Express-eSATA-Controller/dp/B003GSGMPU
However, I just checked dmesg and it's littered with messages like these:

[363802.847117] ata1.00: status: { DRDY }
[363802.847121] ata1: hard resetting link
[363804.979044] ata2: softreset failed (SRST command error)
[363804.979047] ata2: reset failed (errno=-5), retrying in 8 secs
[363804.979059] ata1: softreset failed (SRST command error)
[363804.979064] ata1: reset failed (errno=-5), retrying in 8 secs
[363812.847047] ata1: hard resetting link
[363812.847061] ata2: hard resetting link
[363814.979063] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 10)
[363814.979106] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 10)
....
[364598.751086] ata2.00: status: { DRDY }
[364598.751091] ata2: hard resetting link
[364600.883031] ata2: softreset failed (SRST command error)
[364600.883038] ata2: reset failed (errno=-5), retrying in 8 secs
[364608.751043] ata2: hard resetting link
[364610.883050] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 10)
[364610.884328] ata2.00: configured for UDMA/100
[364610.884336] ata2.00: device reported invalid CHS sector 0
[364610.884342] ata2: EH complete

I am also going to buy shorter shielded eSATA cables as I'm wondering if there is some interference going on.

Best Answer

THIS ENDED UP BEING A HARDWARE ISSUE

Switching to the new shielded cables did not help, but replacing the old card with this one (http://www.amazon.com/gp/product/B000NTM9SY) did get rid of the error messages and the strange behavior. I will post an update if anything changes.

IMPORTANT NOTE FOR SATA ENCLOSURES:

Even after doing the above, any drive operation would be incredibly slow (stalling for 10-30 seconds) whenever the drives had been idle for a while. The enclosure I'm using has an eSATA interface but is powered by USB. I determined this was because it didn't have enough power to spin the drives up, so I tried a couple of things (the hdparm invocations are collected in a sketch after this list):

  • Using an external higher-current USB power source (in case the ports were only supplying the 500 mA minimum)
  • Disabling spin-down with hdparm -S 0 /dev/sdX (this alleviated the problem greatly, but did not resolve it completely)
  • Disabling advanced power management with hdparm -B 255 /dev/sdX (again, this did not fully resolve it)
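Concretely, the hdparm side of that looks like this (/dev/sdX stands for each member drive; the read-back lines just confirm the settings took):

hdparm -S 0 /dev/sdX      # disable the spindown timer
hdparm -B 255 /dev/sdX    # disable Advanced Power Management
hdparm -B /dev/sdX        # read back the APM level
hdparm -C /dev/sdX        # check the current power state (active/idle vs. standby)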

Eventually, I discovered that Western Digital has a jumper setting for Reduced Power Spinup (RPS), designed especially for this use case!

The drives I am using are WD Red WD10JFCX 1TB IntelliPower 2.5" drives; WD's jumper diagram is here: http://support.wdc.com/images/kb/scrp_connect.jpg

Note that I am still running with all the power-management and spin-down features disabled (still -B 255 and -S 0 via hdparm).
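One practical note: these hdparm settings typically do not survive a power cycle, so something has to reapply them at boot. A minimal sketch, assuming you simply re-run the commands from a boot script (the exact mechanism, whether rc.local, a systemd unit, or your distro's hdparm config, is up to you):

# reapply at boot (mechanism is distro-dependent; this is just the idea)
hdparm -B 255 -S 0 /dev/sda
hdparm -B 255 -S 0 /dev/sdb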

Final Verdict

Unfortunately, the RPS jumper did not solve all of my problems; it just reduced their magnitude and frequency. I believe the issues were ultimately due to the enclosure not being able to provide enough power (even when using an AC-to-USB adapter). I eventually bought this enclosure:

http://www.amazon.com/MiniPro-eSATA-6Gbps-External-Enclosure/dp/B003XEZ33Y

and everything has been working flawlessly for the last three weeks.
