TL;DR summary: Translate an md sector number into offset(s) within the /dev/mdX device, and how to investigate it with xfs_db. The sector number is from sh->sector in linux/drivers/md/raid5.c:handle_parity_checks5().
I don't know MD internals, so I don't know exactly what to do with the output from the printk logging I added. Offsets into the component devices (for dd or a hex editor/viewer) would also be interesting.
I suppose I should ask this on the Linux-raid mailing list. Is it subscribers-only, or can I post without subscribing?
I have xfs directly on top of MD RAID5 of 4 disks in my desktop (no LVM). A recent scrub detected a non-zero mismatch_cnt (8 in fact, because md operates on 4kiB pages at a time).

This is a RAID5, not RAID1/RAID10 where mismatch_cnt != 0 can happen during normal operation. (The other links at the bottom of this wiki page might be useful to some people.)

I could just blindly repair, but then I'd have no idea which file to check for possible corruption, besides losing any chance to choose which way to reconstruct. Frostschutz's answer on a similar question is the only suggestion I found for tracking back to a difference in the filesystem. It's cumbersome and slow, and I'd rather use something better to narrow it down to a few files first.
Kernel patch to add logging
Bizarrely, md's check feature doesn't report where an error was found. I added a printk in md/raid5.c to log sh->sector in the if branch that increments mddev->resync_mismatches in handle_parity_checks5() (tiny patch published on github, originally based on 4.5-rc4 from kernel.org). For this to be OK for general use, it would probably need to avoid flooding the logs in repairs with a lot of mismatches (maybe only log if the new value of resync_mismatches is < 1000?). Also, maybe only log for check and not repair.
I'm pretty sure I'm logging something useful (even though I don't know MD internals!), because the same function prints that sector number in the error-handling case of the switch.
I compiled my modified kernel and booted it, then re-ran the check:
[ 399.957203] md: data-check of RAID array md125
...
[ 399.957215] md: using 128k window, over a total of 2441757696k.
...
[21369.258985] md/raid:md125: check found mismatch at sector 4294708224 <-- custom log message
[25667.351869] md: md125: data-check done.
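For reference, this is roughly how the scrub was kicked off and monitored from userspace; a sketch using the standard md sysfs interface, nothing here is specific to the patch:

echo check > /sys/block/md125/md/sync_action   # start a read-only scrub
cat /proc/mdstat                               # watch progress
cat /sys/block/md125/md/mismatch_cnt           # result once it finishes
dmesg | grep 'check found mismatch'            # the sector logged by the patched kernel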
Now I don't know exactly what to do with that sector number. Is sh->sector * 512 a linear address inside /dev/md/t-r5 (aka /dev/md125)? Is it a sector number within each component device (so it refers to three data and one parity sector)? I'm guessing the latter, since a parity-mismatch in RAID5 means N-1 sectors of the md device are in peril, offset from each other by the stripe unit. Is sector 0 the very start of the component device, or is it the sector after the superblock or something? Was there more information in handle_parity_checks5() that I should have calculated / logged?
If I wanted to get just the mismatching blocks, is this correct?
dd if=/dev/sda6 of=mmblock.0 bs=512 count=8 skip=4294708224
dd if=/dev/sdb6 of=mmblock.1 bs=512 count=8 skip=4294708224
dd if=/dev/sdc6 of=mmblock.2 bs=512 count=8 skip=4294708224
dd if=/dev/sdd of=mmblock.3 bs=512 count=8 skip=4294708224 ## not a typo: my 4th component is a smaller full-disk
# i.e.
sec_block() { for dev in {a,b,c}6 d; do dd if=/dev/sd"$dev" of="sec$1.$dev" skip="$1" bs=512 count=8;done; }; sec_block 123456
I'm guessing not, because I get 4k of zeros from all four raid components, and 0^0 == 0 (XOR), so that should be the correct parity, right?
One other place I've seen sector addresses used with md is for sync_min and sync_max (in sysfs). Neil Brown, responding on the linux-raid list to a question about a failed drive with sector numbers from hdrecover, used the full-disk sector number as an MD sector number. That's not right, is it? Wouldn't md sector numbers be relative to the component devices (partitions in that case), not the full device that the partition is a part of?
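If the sector really is component-relative, sync_min / sync_max could also be used to re-run a check over just a narrow window around it and confirm the mismatch is still counted. A sketch; the window below is an arbitrary one chunk (1024 sectors) either side of the logged sector:

echo 4294707200 > /sys/block/md125/md/sync_min
echo 4294709248 > /sys/block/md125/md/sync_max
echo check > /sys/block/md125/md/sync_action
cat /sys/block/md125/md/mismatch_cnt

# reset the window afterwards
echo 0 > /sys/block/md125/md/sync_min
echo max > /sys/block/md125/md/sync_max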
Linear sector to XFS filename:
Before realizing that the md sector number was probably for the components, not the RAID device, I tried using it in read-only xfs_db:

Dave Chinner's very brief suggestion on how to find how XFS is using a given block didn't seem to work at all for me. (I would have expected some kind of result for some sector, since the number shouldn't be beyond the end of the device even if it's not the mismatched sector.)
# xfs_db -r /dev/md/t-r5
xfs_db> convert daddr 4294708224 fsblock
0x29ad5e00 (699227648)
xfs_db> blockget -nv -b 699227648
xfs_db> blockuse -n # with or without -c 8
must run blockget first
Huh? What am I doing wrong here? I guess this should be a separate question; I'll replace this with a link if/when I ask it or find an answer to this part somewhere else.
My RAID5 is essentially idle, with no write activity and minimal reads (and noatime, so reads aren't producing writes).
Extra stuff about my setup, nothing important here
Many of my files are video or other compressed data that give an effective way to tell whether the data is correct or not (either internal checksums in the file format, or just whether it decodes without errors). That would make this read-only loopback method viable, once I know which file to check. I didn't want to run a 4-way diff of every file in the filesystem to find the mismatch first, though, when the kernel has the necessary information while checking, and could easily log it.
My /proc/mdstat for my bulk-data array:
md125 : active raid5 sdd[3] sda6[0] sdb6[1] sdc6[4]
7325273088 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
bitmap: 0/19 pages [0KB], 65536KB chunk
It's on partitions on three Toshiba 3TB drives, and a non-partitioned WD25EZRS green-power (slow) drive which I'm replacing with another Toshiba (using mdadm --replace to do it online with no gaps in redundancy). I realized after one copy that I should check the RAID health before as well as after, to detect problems. That's when I detected the mismatch. It's possible it's been around for a long time, since I had some crashes almost a year ago, but I don't have old logs and mdadm doesn't seem to send mail about this by default (Ubuntu 15.10).
My other filesystems are on RAID10f2 devices made from earlier partitions on the three larger HDs (and RAID0 for /var/tmp). The RAID5 is just for bulk storage, not /home or /.
My drives are all fine: SMART error counts and bad-block counters are 0 on all drives, and short + long SMART self-tests passed.
Near-duplicates of this question which don't have answers:
- What chunks are mismatched in a Linux md array?
- http://www.spinics.net/lists/raid/msg49459.html
- MDADM mismatch_cnt > 0. Any way to identify which blocks are in disagreement?
- Other things already linked inline, but most notably frostschutz's read-only loopback idea.
- scrubbing on the Arch wiki RAID page
Best Answer
TL;DR: sh->sector is the sector number within the physical disks, counted from the start of the data section (i.e. after the md data offset).
Setup
Here's a simple test setup to illustrate:
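Roughly something like this; a minimal sketch assuming 4 small loop-backed components named /dev/raidme/rd0..rd3 (the names used below), a 4-disk RAID5 with a 512k chunk, and xfs mounted at /mnt/raidme. The backing-file sizes and the symlink trick for the device names are just one way to get there:

# 4 small backing files with non-zero contents, attached to loop devices
mkdir -p /dev/raidme /mnt/raidme
for i in 0 1 2 3; do
    dd if=/dev/urandom of=/tmp/rd$i bs=1M count=200
    ln -s "$(losetup --show -f /tmp/rd$i)" /dev/raidme/rd$i
done

# 4-disk RAID5 with a 512k chunk, xfs on top
mdadm --create /dev/md/raidme --level=5 --chunk=512 --raid-devices=4 /dev/raidme/rd*
mkfs.xfs /dev/md/raidme
mount /dev/md/raidme /mnt/raidme

# fill most of the filesystem with non-zero data so a block of zeroes stands out
dd if=/dev/urandom of=/mnt/raidme/filler bs=1M count=400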
Now, to get started, get a non-zero block and overwrite it:
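Something along these lines, assuming the device names from the sketch above; the byte offset 1024*10240 (0xa00000, i.e. 10 MiB, 512k-aligned) is the one the rest of the answer refers to:

# confirm there's real data at byte offset 0xa00000 on the first component
dd if=/dev/raidme/rd0 bs=1k skip=10240 count=1 | hexdump -C | head

# then clobber 1k of it with zeroes
dd if=/dev/zero of=/dev/raidme/rd0 bs=1k seek=10240 count=1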
Make sure the dm/md cache is flushed by stopping and reassembling the array, then check:
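A sketch of that step, again with the assumed names above (the reassembled array will show up as some /dev/mdX, /dev/md127 here; adjust the sysfs paths to match):

umount /mnt/raidme
mdadm --stop /dev/md/raidme
mdadm --assemble /dev/md/raidme /dev/raidme/rd*

# scrub it and watch the mismatch get counted
echo check > /sys/block/md127/md/sync_action
cat /sys/block/md127/md/mismatch_cnt
dmesg | tail        # the patched kernel logs the mismatching sector here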
Block on disks
Okay, so first let's check that 16384 matches what we wrote. My raid has a 512k stripe, so I made sure I wrote something aligned to be easy to match: we wrote at 1024*10240, i.e. 0xa00000. Your patch gives the info 16384; one thing to be aware of is that the data doesn't start at 0:
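The data offset is recorded in each component's md superblock; a sketch of reading it with mdadm --examine (the 4096-sector value below is what the calculation that follows assumes):

mdadm --examine /dev/raidme/rd0 | grep -i 'data offset'
#     Data Offset : 4096 sectors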
printf "%x\n" $(((4096+16384)*512))
says that's0xa00000
as well. Good.Block in md
Now, to get where that is on the md end, it's actually easier: it's simply the position given in sectors times number_of_stripes, i.e. the number of data disks (for me, with 4 disks (3+1), that's 3). Here, it means 16384*3*512, i.e. 0x1800000. I filled the disk quite well, so it's easy to check by just reading the md device and looking for 1k of zeroes:
Block in xfs

Cool. Let's see where that is in xfs now. 16384*3 is 49152 (daddr takes a sector number), so xfs_db can tell us which inode owns that block. Surely enough, the zeroes are in that file:
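A sketch of both steps, using the xfs_db commands already shown in the question and find to map the inode back to a path; the exact xfs_db incantation may need tweaking, the inode number 2052 is the one from this run, and the file path is a placeholder:

xfs_db -r /dev/md127
xfs_db> convert daddr 49152 fsblock
0x1800 (6144)
xfs_db> blockget -nv -b 6144
xfs_db> blockuse -n
# should report the owning inode (2052 here)

# map the inode to a filename and confirm the 1k of zeroes is in it
find /mnt/raidme -inum 2052
hexdump -C /mnt/raidme/path/to/that/file | grep -C 2 '00 00 00 00 00 00 00 00'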
If we overwrite that file, the zeroes are gone in /dev/raidme/rd0 at the correct offset too (just dd it over with another file). If you write zeroes into /dev/raidme/rd0 again (making sure to stop/start the array again first), then the zeroes are back in the file. Looks good.
There's one more problem, though: if your stripe size is as big as mine here (512k), then we don't have a single block to deal with but 1.5MB of possibly corrupted data... Often enough that'll be in a single file, but you need to check that, back in xfs_db. Remember, the inode earlier was 2052.
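One way to check is to dump that inode's extent map in xfs_db and see where the suspect blocks land; a sketch (inode and bmap are standard xfs_db commands, output not reproduced here):

xfs_db -r /dev/md127
xfs_db> inode 2052
xfs_db> bmap
# lists the file's extents (file-offset block, start fsblock, block count),
# so you can see whether fsblocks 6144..6528 fall inside this file and where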
A block is 4096 bytes here (see xfs_info), so our 1.5MB is 384 blocks. Our corrupted segment is blocks 6144 to 6528, well within the first segment of this file.

Something else to look at would be to extract the blocks by hand and check where exactly the checksums don't match, which will hopefully give you 3 smaller chunks to look at.
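A rough way to do that by hand, following the same dd pattern as in the question: pull the matching 512k chunk off each component and XOR them together; for a consistent RAID5 stripe the XOR of all members is zero, so any non-zero byte marks a spot where data and parity disagree. This sketch assumes the 4096-sector data offset and sector 16384 from above, plus GNU od and gawk:

# grab the matching 512k chunk from each component (sector 4096+16384, byte 0xa00000)
for i in 0 1 2 3; do
    dd if=/dev/raidme/rd$i of=chunk.$i bs=512 skip=$((4096+16384)) count=1024
done

# XOR the four chunks byte by byte and print the first few offsets that aren't zero
paste <(od -An -v -tu1 -w1 chunk.0) <(od -An -v -tu1 -w1 chunk.1) \
      <(od -An -v -tu1 -w1 chunk.2) <(od -An -v -tu1 -w1 chunk.3) |
gawk '{ if (xor(xor($1,$2), xor($3,$4)) != 0) print NR-1 }' | head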
Lastly, about your patch: I'm not an md dev myself, but as an ex-mdadm raid5 user I would have been pretty interested. I'd say it's definitely worth the effort to push it in a bit. The cleanup you mentioned might be useful, and I'm sure the devs will have some comments once you submit a patch, but heck, md needs to be more verbose about these errors!