Linux Kernel – Translate MD RAID5 Internal Sector Numbers to Offsets

Tags: corruption, linux-kernel, mdraid, xfs

TL;DR summary: How do I translate an md sector number into offset(s) within the /dev/mdX device, and how do I investigate that location with xfs_db? The sector number comes from sh->sector in linux/drivers/md/raid5.c:handle_parity_checks5().

I don't know MD internals, so I don't know exactly what to do with the output from the printk logging I added.

Offsets into the component devices (for dd or a hex editor/viewer) would also be interesting.

I suppose I should ask this on the Linux-raid mailing list. Is it subscribers-only, or can I post without subscribing?


I have xfs directly on top of MD RAID5 of 4 disks in my desktop (no LVM). A recent scrub detected a non-zero mismatch_cnt (8 in fact, because md operates on 4kiB pages at a time).

This is a RAID5, not RAID1/RAID10 where mismatch_cnt != 0 can happen during normal operation. (The other links at the bottom of this wiki page might be useful to some people.)

I could just blindly repair, but then I'd have no idea which file to check for possible corruption, besides losing any chance to choose which way to reconstruct. Frostschutz's answer on a similar question is the only suggestion I found for tracking back to a difference in the filesystem. It's cumbersome and slow, and I'd rather use something better to narrow it down to a few files first.


Kernel patch to add logging

Bizarrely, md's check feature doesn't report where an error was found. I added a printk in md/raid5.c to log sh->sector in the if branch that increments mddev->resync_mismatches in handle_parity_checks5() (tiny patch published on GitHub, originally based on 4.5-rc4 from kernel.org). For this to be OK for general use, it would probably need to avoid flooding the logs during repairs with a lot of mismatches (maybe only log if the new value of resync_mismatches is < 1000?). Also, maybe only log for check and not repair.

I'm pretty sure I'm logging something useful (even though I don't know MD internals!), because the same function prints that sector number in the error-handling case of the switch.

I compiled my modified kernel and booted it, then re-ran the check:

[  399.957203] md: data-check of RAID array md125
...
[  399.957215] md: using 128k window, over a total of 2441757696k.
...
[21369.258985] md/raid:md125: check found mismatch at sector 4294708224    <-- custom log message
[25667.351869] md: md125: data-check done.
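
For the record, the check itself is driven through sysfs; for my array (md125) that's something like the following, and mismatch_cnt can be read back once it finishes:

echo check > /sys/block/md125/md/sync_action     # start a read-only scrub
cat /sys/block/md125/md/sync_action              # "check" while running, "idle" when done
cat /sys/block/md125/md/mismatch_cnt             # 8 in my case (one 4kiB page = 8 sectors)
dmesg | grep 'check found mismatch'              # the line added by my patch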

Now I don't know exactly what to do with that sector number. Is sh->sector * 512 a linear address inside /dev/md/t-r5 (aka /dev/md125)? Is it a sector number within each component device (so it refers to three data and one parity sector)? I'm guessing the latter, since a parity-mismatch in RAID5 means N-1 sectors of the md device are in peril, offset from each other by the stripe unit. Is sector 0 the very start of the component device, or is it the sector after the superblock or something? Was there more information in handle_parity_checks5() that I should have calculated / logged?

If I wanted to get just the mismatching blocks, is this correct?

dd if=/dev/sda6 of=mmblock.0 bs=512 count=8 skip=4294708224
dd if=/dev/sdb6 of=mmblock.1 bs=512 count=8 skip=4294708224
dd if=/dev/sdc6 of=mmblock.2 bs=512 count=8 skip=4294708224
dd if=/dev/sdd  of=mmblock.3 bs=512 count=8 skip=4294708224  ## not a typo: my 4th component is a smaller full-disk

# i.e.
sec_block() {
    for dev in {a,b,c}6 d; do
        dd if=/dev/sd"$dev" of="sec$1.$dev" skip="$1" bs=512 count=8
    done
}
sec_block 123456

I'm guessing not, because I get 4k of zeros from all four raid components, and 0^0 == 0, so that should be the correct parity, right?
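
(As a quick sanity check on that observation, something like the following confirms each extracted block really is all zeros; the file names are the mmblock.* ones from the dd commands above.)

for f in mmblock.[0-3]; do
    cmp -s -n 4096 "$f" /dev/zero && echo "$f: all zeros" || echo "$f: differs"
done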

One other place I've seen sector addresses used with md is sync_min and sync_max (in sysfs). Neil Brown on the linux-raid list, responding to a question about a failed drive with sector numbers from hdrecover, used the full-disk sector number as an MD sector number. That's not right, is it? Wouldn't md sector numbers be relative to the component devices (partitions in that case), not to the full device that the partition is part of?
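
If they do turn out to be component-relative (as I suspect), then re-checking just a narrow window around my logged sector should look something like the sketch below. This is untested and based on my reading of the md sysfs docs; I picked values that keep sync_max a multiple of my 512k chunk (1024 sectors), which I believe is required.

sec=4294708224                                           # from my custom log message (chunk-aligned)
echo $(( sec - 1024 )) > /sys/block/md125/md/sync_min
echo $(( sec + 1024 )) > /sys/block/md125/md/sync_max
echo check             > /sys/block/md125/md/sync_action
cat /sys/block/md125/md/mismatch_cnt                     # after the (short) check finishes
echo 0                 > /sys/block/md125/md/sync_min    # restore defaults
echo max               > /sys/block/md125/md/sync_max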


Linear sector to XFS filename:

Before realizing that the md sector number was probably for the components, not the RAID device, I tried using it in read-only xfs_db:

Dave Chinner's very brief suggestion on how to find out how XFS is using a given block didn't seem to work at all for me. (I would have expected some kind of result for some sector, since the number shouldn't be beyond the end of the device even if it's not the mismatched sector.)

# xfs_db -r /dev/md/t-r5 
xfs_db> convert daddr 4294708224 fsblock
0x29ad5e00 (699227648)
xfs_db> blockget -nv -b 699227648
xfs_db> blockuse -n       # with or without -c 8
must run blockget first

huh? What am I doing wrong here? I guess this should be a separate question. I'll replace this with a link if/when I ask it or find an answer to this part somewhere else.

My RAID5 is essentially idle, with no write activity and minimal read (and noatime, so reads aren't producing writes).


Extra stuff about my setup, nothing important here

Many of my files are video or other compressed data that give an effective way to tell whether the data is correct or not (either internal checksums in the file format, or just whether it decodes without errors). That would make this read-only loopback method viable, once I know which file to check. I didn't want to run a 4-way diff of every file in the filesystem to find the mismatch first, though, when the kernel has the necessary information while checking, and could easily log it.


my /proc/mdstat for my bulk-data array:

md125 : active raid5 sdd[3] sda6[0] sdb6[1] sdc6[4]
      7325273088 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/19 pages [0KB], 65536KB chunk

It's on partitions on three Toshiba 3TB drives, and a non-partitioned WD25EZRS green-power (slow) drive which I'm replacing with another Toshiba. (I'm using mdadm --replace to do it online with no gaps in redundancy. I realized after one copy that I should check the RAID health before as well as after, to detect problems. That's when I detected the mismatch. It's possible it's been around for a long time, since I had some crashes almost a year ago, but I don't have old logs and mdadm doesn't seem to send mail about this by default (Ubuntu 15.10).)

My other filesystems are on RAID10f2 devices made from earlier partitions on the three larger HDs (and RAID0 for /var/tmp). The RAID5 is just for bulk-storage, not /home or /.

My drives are all fine: SMART error counts and bad-block counters are 0 on all drives, and short + long SMART self-tests passed.


There are near-duplicates of this question elsewhere which don't have answers.

Best Answer

TL;DR: sh->sector is the sector number within each physical disk, counted from the start of the data section (i.e. after the data offset).


Setup

Here's a simple test setup to illustrate:

  • /dev/raidme/rd[0-3], 2GB devices
  • /dev/md127 created as a raid5 over these 4, initialized as xfs and filled with random data (a rough loop-device recipe for a similar setup is sketched just below)
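
For anyone wanting to reproduce a similar throwaway setup with loop devices instead of the rd* devices above (all names here are made up), something like this should do:

# four 2GiB sparse backing files on loop devices
for i in 0 1 2 3; do
    truncate -s 2G rd$i.img
    losetup -f --show rd$i.img        # prints the /dev/loopN it picked; use those below
done
mdadm --create /dev/md127 --level=5 --raid-devices=4 /dev/loop0 /dev/loop1 /dev/loop2 /dev/loop3
mkfs.xfs /dev/md127
mount /dev/md127 /mnt
mkdir /mnt/d.1
dd if=/dev/urandom of=/mnt/d.1/f.1 bs=1M count=1500   # fill most of it with random data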

Now, to get started, get a non-zero block and overwrite it:

# dd if=/dev/raidme/rd0 bs=1k count=1 skip=10240 | hexdump -C | head
...
# dd if=/dev/zero of=/dev/raidme/rd0 bs=1k count=1 seek=10240
...
# dd if=/dev/raidme/rd0 bs=1k count=1 skip=10240 | hexdump -C | head
1024 bytes (1.0 kB, 1.0 KiB) copied, 8.6021e-05 s, 11.9 MB/s
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000400

Make sure the dm/md cache is flushed by stopping and reassembling the array, then run a check:

# mdadm --stop /dev/md127
# mdadm --assemble /dev/md127 /dev/raidme/rd*
# echo check > /sys/class/block/md127/md/sync_action
# dmesg | tail
...
[ 1188.057900] md/raid:md127: check found mismatch at sector 16384

Block on disks

Okay, so first let's check that 16384 matches what we wrote. My raid has a 512k chunk, so I made sure I wrote something aligned to be easy to match: we wrote at 1024*10240, i.e. 0xa00000.

Your patch reports sector 16384. One thing to be aware of is that the data doesn't start at offset 0:

# mdadm -E /dev/raidme/rd0 | grep "Data Offset"
    Data Offset : 4096 sectors

So printf "%x\n" $(((4096+16384)*512)) says that's 0xa00000 as well. Good.
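
To generalize that for any component device and logged sector, a small helper (hypothetical, names made up) can pull the Data Offset itself and dump the 4k block in question:

# hexdump the 4k block a logged mismatch sector refers to on one component,
# accounting for that component's Data Offset
dump_component_block() {
    local dev=$1 sector=$2 data_offset
    data_offset=$(mdadm -E "$dev" | awk '/Data Offset/ {print $4}')
    dd if="$dev" bs=512 skip=$(( data_offset + sector )) count=8 2>/dev/null | hexdump -C
}
dump_component_block /dev/raidme/rd0 16384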


Block in md

Now, to get where that is on the md end, it's actually easier: it's simply the reported sector position times the number of data disks, e.g. for me with 4 disks (3+1) that's 3.

Here, that means 16384*3*512, i.e. 0x1800000. I filled the disk quite well, so it's easy to check by just reading the disk and looking for 1k of zeroes:

# dd if=/dev/md127 bs=1M | hexdump -C | grep -C 3 '00 00 00 00 00 00'
... some false positives...
01800000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
01800400  6b a8 9e e0 ad 88 a8 de  dd 2e 68 00 d8 7a a3 52  |k.........h..z.R|
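
Rather than grepping the whole device, you can also read the array directly at the computed offset (this works out neatly here because the logged sector is chunk-aligned); the other two data slices of the same stripe sit one chunk, i.e. 1024 sectors, further along:

dd if=/dev/md127 bs=512 skip=$(( 16384 * 3 ))        count=8 2>/dev/null | hexdump -C
dd if=/dev/md127 bs=512 skip=$(( 16384 * 3 + 1024 )) count=8 2>/dev/null | hexdump -C
dd if=/dev/md127 bs=512 skip=$(( 16384 * 3 + 2048 )) count=8 2>/dev/null | hexdump -C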

Block in xfs

Cool. Let's see where that is in xfs now. 16384*3 is 49152 (daddr takes sector number):

# xfs_db -r /dev/md127
xfs_db> blockget -n
xfs_db> daddr 49152
xfs_db> blockuse -n
block 6144 (0/6144) type data inode 2052 d.1/f.1
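
The same lookup can be scripted non-interactively, since xfs_db runs -c commands in sequence; this should print the same block/inode/path line:

xfs_db -r -c 'blockget -n' -c 'daddr 49152' -c 'blockuse -n' /dev/md127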

Surely enough, the zeroes are in that file:

# dd if=/mnt/d.1/f.1 bs=1M | hexdump -C | grep -C 3 '00 00 00 00 00'
...
03680000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
03680400  6b a8 9e e0 ad 88 a8 de  dd 2e 68 00 d8 7a a3 52  |k.........h..z.R|

If we overwrite that file, the zeroes are gone from /dev/raidme/rd0 at the correct offset too (just dd another file over it). If you write zeroes into /dev/raidme/rd0 again (make sure you stop/start the array again), the zeroes are back. Looks good.

There's one more problem, though: if your chunk size is as big as mine here (512k), then we don't have a single block to deal with but 1.5MB of possibly corrupted data... Often enough that'll all be in a single file, but you need to check that back in xfs_db. Remember the inode earlier was 2052.

xfs_db> inode 2052
xfs_db> bmap
data offset 0 startblock 256 (0/256) count 17536 flag 0
data offset 17536 startblock 122880 (0/122880) count 4992 flag 0
data offset 22528 startblock 91136 (0/91136) count 3072 flag 0

A block is 4096 bytes here (see xfs_info), so our 1.5MB is 384 blocks. Our possibly corrupted segment is blocks 6144 to 6528 - well within the first extent of this file.
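
Spelled out as shell arithmetic, with the numbers from above:

echo $(( 512 * 1024 * 3 / 4096 ))   # 4k blocks covered by one 3-data-disk stripe: 384
echo $(( 49152 / 8 ))               # daddr -> fsblock (8 x 512B sectors per 4k block): 6144
echo $(( 6144 + 384 ))              # end of the suspect block range: 6528
echo $(( 256 + 17536 - 1 ))         # last fsblock of the first extent: 17791, so 6144..6528 fits inside it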

Something else to look at would be to extract the blocks by hand and check exactly where the parity doesn't match, which will hopefully give you 3 smaller chunks to look at.
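
A rough sketch of that comparison, assuming the three data chunks and the parity chunk of the suspect stripe have been extracted into hypothetical files chunk.d0, chunk.d1, chunk.d2 and chunk.p (remember which component holds parity rotates from stripe to stripe with the default left-symmetric layout, so pick the files accordingly):

# XOR the data chunks byte by byte and compare against the stored parity;
# slow, but fine for a one-off 512k chunk
paste -d' ' <(xxd -p -c1 chunk.d0) <(xxd -p -c1 chunk.d1) <(xxd -p -c1 chunk.d2) |
while read -r a b c; do
    printf '%02x\n' $(( 0x$a ^ 0x$b ^ 0x$c ))
done | xxd -r -p > parity.computed
cmp -l chunk.p parity.computed      # lists the (1-based) byte offsets that disagree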


Lastly, about your patch: I'm not an md dev myself, but as an ex-mdadm raid5 user I would have been pretty interested. I'd say it's definitely worth the effort to push it in a bit. The cleanup you mentioned might be useful, and I'm sure the devs will have some comments once you submit a patch, but heck, md needs to be more verbose about these errors!
