A weird issue related to ext4/LVM/RAID-5 after partition recovery


I have 3 hard disks, referred to below as /dev/sda, /dev/sdb and /dev/sdc, newest first. Note: /dev/sdc has one primary partition /dev/sdc1, one extended partition /dev/sdc2, and 3 logical partitions /dev/sdc5, /dev/sdc6 and /dev/sdc7.

I created a degraded RAID 5 device /dev/md0 from /dev/sda5 and /dev/sdb5 (planning to add /dev/sdc5 later to bring the array to a normal state), then used /dev/md0 as the only PV of an LVM volume group, and created an LV with an ext4 file system, /dev/mapper/vg0-lv0.

Unfortunately, while exploring and playing with LVM, I ran dd if=/dev/zero of=/dev/sdc1 bs=64M count=10 after deleting /dev/sdc1. So the zeros were actually written to /dev/sdc2, which destroyed the part of the partition table stored on /dev/sdc2 and the beginning of /dev/sdc5.

When I realized this, I immediately made an image of /dev/sdc with dd: dd if=/dev/sdc of=/mount-point-of-vg0-lv0/sdc.img.

Several days later, I finally had time to try to recover the data on /dev/sdc, actually only /dev/sdc7, since it's the only partition without a backup. I ran testdisk on the image file sdc.img, used its Quick Search feature to rebuild the partition table, and attached the image to /dev/loop0 with losetup. /dev/loop0p7 (the image of /dev/sdc7) was back and mountable, and all files seemed OK. Then I ran find /mount-point-of-loop0p7 -type f -exec md5sum {} \; > sdc7_img.md5sum to build an MD5 checksum list for all files on /dev/loop0p7.

On the physical /dev/sdc device, testdisk's Quick Search didn't find all the partitions, but Deep Search did. I then built an MD5 checksum list, sdc7.md5sum, for all files on the physical /dev/sdc7 with a similar command. When I compared it to sdc7_img.md5sum, I found that 4 files differed. Comparing them manually, I noticed that each file differed by only 1 byte. Because one file has a CRC32 in its name, I could confirm that the copy from the physical /dev/sdc7 is the correct one.
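The comparison of the two checksum lists could be done along these lines. This is only a rough sketch: the function names and the mount point /mount-point-of-sdc7 are hypothetical, and it assumes plain md5sum output (a 32-character digest, two separator characters, then the path) with no escaped file names.

```python
import os

def load_md5_list(list_path, root):
    """Parse md5sum output into {path relative to root: digest}.

    Assumes unescaped lines of the form "<32 hex chars>  <path>".
    """
    sums = {}
    with open(list_path, encoding="utf-8", errors="surrogateescape") as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line:
                continue
            digest, name = line[:32], line[34:]
            # Strip the mount point so both lists use comparable keys.
            sums[os.path.relpath(name, root)] = digest
    return sums

def changed_files(a, b):
    """Return sorted relative paths present in both lists with different digests."""
    return sorted(p for p in a.keys() & b.keys() if a[p] != b[p])

# Hypothetical usage:
# img = load_md5_list("sdc7_img.md5sum", "/mount-point-of-loop0p7")
# phys = load_md5_list("sdc7.md5sum", "/mount-point-of-sdc7")
# print(changed_files(img, phys))
```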

So my question is: why did this odd thing happen? I've already run fsck.ext4 -c -c /dev/mapper/vg0-lv0 to confirm it has no bad blocks. A 4-byte difference in 1.2TB of data is a tiny fraction, but it leaves me with no confidence in storing data on /dev/mapper/vg0-lv0 in the future.

Update: I should mention that all of these operations were done in the latest ArchLinux running in VirtualBox 4.1.16, which runs on Windows 7. /dev/sda, /dev/sdb and /dev/sdc are all mapped to physical hard disks via VBoxManage internalcommands createrawvmdk. The only error VirtualBox reported was VERR_ACCESS_DENIED, while I was generating the md5sums for the physical /dev/sdc7; after I took the physical disk offline with Win7's diskpart, no further errors were raised.

Best Answer

There are a couple of things that could have happened. First, you didn't mention unmounting sdc7 before imaging the disk, so it could be that data was being written at the time. I'm going to guess that wasn't the case, though, or you wouldn't be asking. I can't fault your reaction of "first thing, image the disk"; that's a pretty good reaction. Though I note that before you rebooted, the kernel still had the partition table in memory; you could have checked /proc/partitions.

First thing to check is for memory errors. You could have bad RAM. Your data no doubt went through RAM several times. I'm assuming you don't have ECC memory, which would probably catch this.

Hard disks also have errors. Looking at the spec sheets for a few random consumer hard drives, they quote 1 error per 100 Tbit read. You copied 1.2TB at least a few times (read from source, read from destination), so that is something like 19 Tbit read. Having a bit error in that is believable. (Unfortunately, the spec sheets don't give an error rate for writes.)
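To put a rough number on that, here is a back-of-the-envelope sketch. It takes the spec-sheet figure at face value and assumes errors are independent (a Poisson approximation), which real drives may not satisfy; the function name is just for illustration.

```python
import math

def read_error_estimate(data_bytes, passes, bits_per_error):
    """Return (bits read, expected errors, probability of at least one error).

    Assumes independent errors at a constant rate (Poisson approximation).
    """
    bits_read = data_bytes * 8 * passes
    expected = bits_read / bits_per_error
    p_at_least_one = 1 - math.exp(-expected)
    return bits_read, expected, p_at_least_one

# 1.2 TB read twice (source + destination), 1 error per 100 Tbit (1e14 bits)
bits, expected, p = read_error_estimate(1.2e12, 2, 1e14)
print(f"{bits / 1e12:.1f} Tbit read, {expected:.3f} expected errors, "
      f"P(at least one) = {p:.2f}")
```

Under these assumptions that is about 19 Tbit read, roughly 0.19 expected errors, and a chance of at least one error somewhere around 17%.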

Was there any rhyme or reason behind the single-byte corruptions? cmp -l can tell you the bytes that vary. E.g., if it were always the same offset in a page (your page size is probably 4K), and always the same bit, that'd point almost conclusively to defective RAM. Even if it's only always the same bit, or the same offset, that'd be pretty conclusive. (And did you have a CRC32 for all four files, or just one?)
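If you'd rather script that check than read cmp -l output by hand, something like the following sketch would do. It assumes a 4096-byte page size and compares only up to the length of the shorter file; the function names are made up for this example.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumption: 4K pages, as guessed above

def diff_bytes(path_a, path_b, chunk=1 << 20):
    """Yield (offset, byte_a, byte_b) for each differing byte (0-based offsets)."""
    offset = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a, b = fa.read(chunk), fb.read(chunk)
            if not a or not b:
                break
            if a != b:
                for i, (x, y) in enumerate(zip(a, b)):
                    if x != y:
                        yield offset + i, x, y
            offset += min(len(a), len(b))

def summarize(diffs):
    """Count offsets within a page and XOR masks of the flipped bits."""
    page_offsets = Counter()
    bit_masks = Counter()
    for offset, x, y in diffs:
        page_offsets[offset % PAGE_SIZE] += 1
        bit_masks[x ^ y] += 1  # a single flipped bit gives a power of two
    return page_offsets, bit_masks
```

Feed it the differences from all four file pairs: if one page offset or one bit mask dominates the counters, that is the pattern to be suspicious of.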
