Linux – Kernel failure with software RAID-1 due to single drive read errors

fedoralinuxraid1software-raid

I am running Fedora 19 (kernel 3.11.3-201.fc19.x86_64) with a software RAID-1 (mdadm) setup on two identical Seagate 1GB drives:

# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdb3[1] sda3[0]
      973827010 blocks super 1.2 [2/2] [UU]

unused devices: <none>

Recently, some errors are presenting on one of the two drives: smartd has detected "1 Currently unreadable (pending) sectors" and "1 Offline uncorrectable sectors". The RAID array "rescheduled" some sectors. Then about a day later, the kernel produced various I/O messages/exceptions:

Oct 18 06:39:20 x smartd[461]: Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors
Oct 18 06:39:20 x smartd[461]: Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors
...
Oct 18 07:09:20 x smartd[461]: Device: /dev/sdb [SAT], 1 Currently unreadable (pending) sectors
Oct 18 07:09:20 x smartd[461]: Device: /dev/sdb [SAT], 1 Offline uncorrectable sectors
...
Oct 18 07:30:28 x kernel: [467502.192792] md/raid1:md1: sdb3: rescheduling sector 1849689328
Oct 18 07:30:28 x kernel: [467502.192822] md/raid1:md1: sdb3: rescheduling sector 1849689336
Oct 18 07:30:28 x kernel: [467502.192846] md/raid1:md1: sdb3: rescheduling sector 1849689344
Oct 18 07:30:28 x kernel: [467502.192870] md/raid1:md1: sdb3: rescheduling sector 1849689352
Oct 18 07:30:28 x kernel: [467502.192895] md/raid1:md1: sdb3: rescheduling sector 1849689360
Oct 18 07:30:28 x kernel: [467502.192919] md/raid1:md1: sdb3: rescheduling sector 1849689368
Oct 18 07:30:28 x kernel: [467502.192943] md/raid1:md1: sdb3: rescheduling sector 1849689376
Oct 18 07:30:28 x kernel: [467502.192966] md/raid1:md1: sdb3: rescheduling sector 1849689384
Oct 18 07:30:28 x kernel: [467502.192991] md/raid1:md1: sdb3: rescheduling sector 1849689392
Oct 18 07:30:28 x kernel: [467502.193035] md/raid1:md1: sdb3: rescheduling sector 1849689400
...
Oct 19 06:26:08 x kernel: [550035.944191] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 19 06:26:08 x kernel: [550035.944224] ata3.01: BMDMA stat 0x64
Oct 19 06:26:08 x kernel: [550035.944248] ata3.01: failed command: READ DMA EXT
Oct 19 06:26:08 x kernel: [550035.944274] ata3.01: cmd 25/00:08:15:fb:9c/00:00:6c:00:00/f0 tag 0 dma 4096 in
Oct 19 06:26:08 x kernel: [550035.944274]          res 51/40:00:1c:fb:9c/40:00:6c:00:00/10 Emask 0x9 (media error)
Oct 19 06:26:08 x kernel: [550035.944322] ata3.01: status: { DRDY ERR }
Oct 19 06:26:08 x kernel: [550035.944340] ata3.01: error: { UNC }
Oct 19 06:26:08 x kernel: [550036.573438] ata3.00: configured for UDMA/133
Oct 19 06:26:08 x kernel: [550036.621444] ata3.01: configured for UDMA/133
Oct 19 06:26:08 x kernel: [550036.621507] sd 2:0:1:0: [sdb] Unhandled sense code
Oct 19 06:26:08 x kernel: [550036.621516] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550036.621523] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 19 06:26:08 x kernel: [550036.621530] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550036.621537] Sense Key : Medium Error [current] [descriptor]
Oct 19 06:26:08 x kernel: [550036.621555] Descriptor sense data with sense descriptors (in hex):
Oct 19 06:26:08 x kernel: [550036.621562]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Oct 19 06:26:08 x kernel: [550036.621606]         6c 9c fb 1c 
Oct 19 06:26:08 x kernel: [550036.621626] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550036.621638] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 19 06:26:08 x kernel: [550036.621646] sd 2:0:1:0: [sdb] CDB: 
Oct 19 06:26:08 x kernel: [550036.621653] Read(10): 28 00 6c 9c fb 15 00 00 08 00
Oct 19 06:26:08 x kernel: [550036.621692] end_request: I/O error, dev sdb, sector 1822227228
Oct 19 06:26:08 x kernel: [550036.621719] raid1_end_read_request: 9 callbacks suppressed
Oct 19 06:26:08 x kernel: [550036.621727] md/raid1:md1: sdb3: rescheduling sector 1816361448
Oct 19 06:26:08 x kernel: [550036.621782] ata3: EH complete
Oct 19 06:26:08 x kernel: [550041.155637] ata3.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Oct 19 06:26:08 x kernel: [550041.155669] ata3.01: BMDMA stat 0x64
Oct 19 06:26:08 x kernel: [550041.155694] ata3.01: failed command: READ DMA EXT
Oct 19 06:26:08 x kernel: [550041.155719] ata3.01: cmd 25/00:08:15:fb:9c/00:00:6c:00:00/f0 tag 0 dma 4096 in
Oct 19 06:26:08 x kernel: [550041.155719]          res 51/40:00:1c:fb:9c/40:00:6c:00:00/10 Emask 0x9 (media error)
Oct 19 06:26:08 x kernel: [550041.155767] ata3.01: status: { DRDY ERR }
Oct 19 06:26:08 x kernel: [550041.155785] ata3.01: error: { UNC }
Oct 19 06:26:08 x kernel: [550041.343437] ata3.00: configured for UDMA/133
Oct 19 06:26:08 x kernel: [550041.391438] ata3.01: configured for UDMA/133
Oct 19 06:26:08 x kernel: [550041.391501] sd 2:0:1:0: [sdb] Unhandled sense code
Oct 19 06:26:08 x kernel: [550041.391510] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550041.391517] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Oct 19 06:26:08 x kernel: [550041.391525] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550041.391532] Sense Key : Medium Error [current] [descriptor]
Oct 19 06:26:08 x kernel: [550041.391546] Descriptor sense data with sense descriptors (in hex):
Oct 19 06:26:08 x kernel: [550041.391553]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 
Oct 19 06:26:08 x kernel: [550041.391596]         6c 9c fb 1c 
Oct 19 06:26:08 x kernel: [550041.391616] sd 2:0:1:0: [sdb]  
Oct 19 06:26:08 x kernel: [550041.391624] Add. Sense: Unrecovered read error - auto reallocate failed
Oct 19 06:26:08 x kernel: [550041.391636] sd 2:0:1:0: [sdb] CDB: 
Oct 19 06:26:08 x kernel: [550041.391643] Read(10): 28 00 6c 9c fb 15 00 00 08 00
Oct 19 06:26:08 x kernel: [550041.391681] end_request: I/O error, dev sdb, sector 1822227228
Oct 19 06:26:08 x kernel: [550041.391737] ata3: EH complete
Oct 19 06:26:08 x kernel: [550041.409686] md/raid1:md1: read error corrected (8 sectors at 1816363496 on sdb3)
Oct 19 06:26:08 x kernel: [550041.409705] handle_read_error: 9 callbacks suppressed
Oct 19 06:26:08 x kernel: [550041.409709] md/raid1:md1: redirecting sector 1816361448 to other mirror: sda3

The computer continued to log some entries into syslog until about an hour later, and then finally became completely unresponsive. No "kernel oops" was recorded in syslog. The computer had to be rebooted, at which point the raid array was in a resync status. After the resync, all appears well and the drives seem to be working normally.

I also noted that all the rescheduled sectors are exactly 8 sectors apart, which seems very strange to me.

Lastly, after reboot and approximately a day or two later but significantly after the raid resync was complete, the drive reset the unreadable (pending) and offline uncorrectable sector count, which I believe is normal as the drive takes those sectors offline and remaps them:

Oct 20 01:05:42 x kernel: [    2.186400] md: bind<sda3>
Oct 20 01:05:42 x kernel: [    2.204826] md: bind<sdb3>
Oct 20 01:05:42 x kernel: [    2.209618] md: raid1 personality registered for level 1
Oct 20 01:05:42 x kernel: [    2.210079] md/raid1:md1: not clean -- starting background reconstruction
Oct 20 01:05:42 x kernel: [    2.210087] md/raid1:md1: active with 2 out of 2 mirrors
Oct 20 01:05:42 x kernel: [    2.210122] md1: detected capacity change from 0 to 997198858240
Oct 20 01:05:42 x kernel: [    2.210903] md: resync of RAID array md1
Oct 20 01:05:42 x kernel: [    2.210911] md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Oct 20 01:05:42 x kernel: [    2.210915] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for resync.
Oct 20 01:05:42 x kernel: [    2.210920] md: using 128k window, over a total of 973827010k.
Oct 20 01:05:42 x kernel: [    2.241676]  md1: unknown partition table
...
Oct 20 06:33:10 x kernel: [19672.235467] md: md1: resync done.
...
Oct 21 05:35:50 x smartd[455]: Device: /dev/sdb [SAT], No more Currently unreadable (pending) sectors, warning condition reset after 1 email
Oct 21 05:35:50 x smartd[455]: Device: /dev/sdb [SAT], No more Offline uncorrectable sectors, warning condition reset after 1 email

smartctl -a /dev/sdb now shows

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
  5 Reallocated_Sector_Ct   0x0033   096   096   036    Pre-fail  Always       -       195
...
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
...

Some questions arise:

1) Why would all the rescheduled sectors be exactly 8 apart?

2) Why would the kernel become unresponsive and require a reboot? Shouldn't this be exactly the situation RAID-1 is meant to handle without the system blowing up?

3) Why would the unreadable and offline uncorrectable counts reset only 23 hours after the raid resync was complete?

Best Answer

1) Why would all the rescheduled sectors be exactly 8 apart?

Such gaps in the sector numbers are to be expected, the question is more how large those gaps would be (4k or larger). 8x 512 bytes are 4k which is the sector size most filesystems work with. So the filesystem probably requested to read 4k from the RAID, the RAID asks the /dev/sdb for that data. The first sector of that read fails (that's the sector number you see in your logs), the RAID switches to /dev/sda and serves the 4k from there. Then the filesystem requests to read the next 4k, back to /dev/sdb with sector number +8 which fails again which is again what you see in your logs ...

2) Why would the kernel become unresponsive and require a reboot?

Shouldn't happen normally. The problem is that the reallocation situation is about the most expensive you can get. Each read that fails, has to be redirected to the other disk, has to be rewritten on the original disk, etc. If it fills your log file at the same time, causing in turn new write requests, which in turn have to be reallocated again, etc. It would be cheaper to just kick the disk out entirely in this case.

It's also a question of how the rest of the hardware (such as the SATA controller) handles failing drives. If the controller itself gets a hiccup from it, it will hurt performance even more.

It's hard to tell what happened exactly if there is no log entry; it's a weak point of the Linux kernel, there is no straightforward solution to preserve those last messages when things really go south.

3) Why would the unreadable and offline uncorrectable counts reset only 23 hours after the raid resync was complete?

Some values are updated only when you do offline data collection (UPDATED Offline column), that may take some time. If the disk is set to do this automatically, it depends on the disk, e.g. every four hours. If you don't want to rely on the disk, you should set it up using smartmontools.

Related Solutions

Linux – Poor IO due to LUKS/Software RAID/LVM ordering

Your layering is suboptimal because putting the raid 5 on top of the encryption means that you increase the number of encrypt/decrypt operations by 25 % - since 4 * 4 TB are encrypted.

When putting the encryption on top of the raid 5 only 3 * 4 TB are encrypted.

The reasoning behind that is: you don't have to encrypt parity data (which takes up 4 TB in your example) of encrypted data because it does not increase your security.

Your presumption about multiple kcrypt processes is just that. When basing decisions on it, it is a premature optimization that may have quite the opposite effect. Your i7 is quite beefy, probably even including some special instructions that help to speed up AES - and the Linux kernel includes several optimized variants of cryptographic primitives that are automatically selected during boot.

You can verify if the optimized routines for your CPU are used via looking at /proc/cpuinfo (e.g. flag aes there), /proc/crypto, lsmod (unless the aes modules are compiled into the kernel) and the kernel log.

You should benchmark the throughput of kryptd without involving any slow disks to see what the upper bound really is (i.e. on a RAM disk using iozone).

To be able to diagnose potential performance issues later it is also useful to benchmark your RAID-setup of choice without any encryption to get an upper bound on that end.

In addition to the crypto topic, RAID 5 involves more IO-Operations than RAID 1 or 10. Since storage is kind of cheap perhaps it is an option to buy more harddisks and use another RAID level.

Ubuntu – mdadm – RAID5 array size vs. actual disk size mismatch

fdisk is the wrong tool for disks >2TB. Use parted or gdisk instead.

It appears that /dev/sdc1 and /dev/sdd1 are 2TB partitions, so that's what limits your array size. For the other disks, they have GPT so I assume they are 3TB already, but you should check.

Basically you have to stop the array, enlarge each partition to 3TB (without changing the starting offset), then start it again and follow it up with a grow:

mdadm --grow /dev/md0 --size=max

If you can't stop the array, you'll have to fail each 2TB partition individually, repartition and re-add it. This might go faster if you add a write-intent bitmap first.

mdadm --grow /dev/md0 --bitmap=internal

Then for each disk individually,

mdadm /dev/md0 --fail /dev/disk1 # check mdstat for [UUUU] first
mdadm /dev/md0 --remove /dev/disk1
parted /dev/disk -- mklabel gpt mkpart primary 1mib -1mib
mdadm /dev/md0 --re-add /dev/disk1
mdadm --wait /dev/md0 # must wait for sync

Once that's done you can remove the bitmap again (keeping it may harm performance).

mdadm --grow /dev/md0 --bitmap=none
mdadm --grow /dev/md0 --size=max

Finally do your resize2fs or whatever.

Best Answer

Related Solutions

Linux – Poor IO due to LUKS/Software RAID/LVM ordering

Ubuntu – mdadm – RAID5 array size vs. actual disk size mismatch

Related Question