(Note: This presumes you're not getting a bunch of kernel errors (check dmesg or journalctl -b -k) or tons of CRC errors indicated in the drives' SMART status. If you are, there are a few software things to try first, like turning off NCQ.)
Usually, this means bad RAM, even when memtest86+ passes (how long did you run it for?), unless you have ECC RAM, which I doubt from those specs.
Make sure you haven't done something crazy, like find 1+ meter SATA cables and wrap them around the CPU. That said, SATA data transfer has CRCs, so you should be seeing tons of errors if the corruption is happening there. SATA cables are cheap, so you can always try replacing them.
The next step, if you don't just want to replace the RAM, is to try to narrow down when the corruption is happening.
On each drive, repeatedly run md5sum or similar on a large file showing the issue (it needs to be something like 2x RAM in size, to stop it from being read back from cache) or on a set of files. Do it a lot of times, like for hours. Do you always get the same result? If not, then there is corruption on the read path; if you always get the same result, then there probably isn't corruption on read. That'd make RAM unlikely.
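The repeated-read check above can be scripted. Here is a minimal Python sketch, not a polished tool: the names hash_file and repeated_read_check are made up for illustration, and it uses SHA-256 from hashlib in place of md5sum (any stable hash works for this purpose). It assumes the file is big enough (around 2x RAM) that each pass really hits the disk; the script does nothing by itself to bypass the page cache.

```python
import hashlib


def hash_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_read_check(path, passes):
    """Hash `path` `passes` times and return the set of distinct digests seen.

    A single digest means every pass read identical bytes. More than one
    digest means the data changed somewhere between the disk and the hash.
    """
    seen = set()
    for i in range(passes):
        d = hash_file(path)
        if d not in seen:
            # Report each digest the first time it shows up.
            print(f"pass {i + 1}: new digest {d}")
        seen.add(d)
    return seen
```

Calling repeated_read_check on the suspect file with a large pass count (enough to run for hours) and checking that the returned set has exactly one element corresponds to "do you always get the same result?". Run it once per drive.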
If you get read corruption on both disks, start with replacing the RAM. If that doesn't fix it, you can try the power supply and finally the SATA controller (which is likely soldered to the mobo, so you'd have to replace that).
If you get read corruption on one disk (not both), replace the disk. If that doesn't fix it, and you have a backplane (for hot swap in the server), it may be defective. You can try replacing the cables as well. Try a different SATA port. The presumption here is that one bad disk may happen, but two is pretty unlikely. Honestly... I'd swap RAM before presuming two bad disks.
If both disks consistently read back the same data, first confirm you're actually checking enough data to be sure it's not cached; I'd want at least twice RAM. You'd then repeatedly write some known data to each disk, and see if reading it back gives a different value. Then pretty much the same solutions as above.
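A rough Python sketch of that write-then-read test follows. The function names (write_pattern, read_digest, write_read_check) are hypothetical, and the "known data" is assumed to be seeded pseudo-random bytes so the same pattern can be regenerated later from the seed. As with the read test, it only means something if size_bytes is well beyond RAM (I'd want 2x), otherwise the read-back is likely served from cache.

```python
import hashlib
import os
import random


def write_pattern(path, size_bytes, seed, chunk_size=1024 * 1024):
    """Write `size_bytes` of seeded pseudo-random data to `path`.

    Returns the SHA-256 hex digest of the data as generated in memory.
    """
    rng = random.Random(seed)
    digest = hashlib.sha256()
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            chunk = rng.getrandbits(8 * n).to_bytes(n, "little")
            digest.update(chunk)
            f.write(chunk)
            remaining -= n
        # Push the data out of Python's buffers and ask the OS to flush it.
        f.flush()
        os.fsync(f.fileno())
    return digest.hexdigest()


def read_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of what is read back from `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_read_check(path, size_bytes, seed):
    """Write the pattern, read it back, and report whether the digests match."""
    expected = write_pattern(path, size_bytes, seed)
    return read_digest(path) == expected
```

Run write_read_check in a loop with a different seed each time, on a scratch file on each disk. A False result means the data changed somewhere between being generated and being read back.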
PS: Corruption like this is insidious. In particular, it may have corrupted random bits of your Linux distro, not just your data. After fixing the cause, it's usually best to re-install. At minimum, you need to check every distro-provided file against known-good checksums; some distros provide utilities for doing that. That still won't confirm no damage to dynamic distro data files (e.g., installed package lists), but at least you can be sure the binaries are OK.
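Where the distro ships a verification utility (debsums on Debian-family systems and rpm -Va on RPM-based ones are two examples), use that. If all you have is a list of known-good checksums, the check can be sketched in Python along these lines. This is an illustration under assumptions: the manifest is assumed to be in md5sum output format (hex digest, a space, a one-character mode flag, then a path), and verify_manifest and file_digest are made-up names.

```python
import hashlib
import os


def file_digest(path, algorithm="md5", chunk_size=1024 * 1024):
    """Return the hex digest of the file at `path` using `algorithm`."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(manifest_path, root="/", algorithm="md5"):
    """Check files under `root` against an md5sum-style manifest.

    Returns (mismatched, missing): names whose digest differs from the
    manifest, and names that are not present as regular files.
    """
    mismatched, missing = [], []
    with open(manifest_path, "r", encoding="utf-8") as manifest:
        for line in manifest:
            line = line.rstrip("\n")
            if not line:
                continue
            expected, rest = line.split(" ", 1)
            # md5sum writes a one-character mode flag (' ' or '*') before the name.
            name = rest[1:] if rest[:1] in (" ", "*") else rest
            path = os.path.join(root, name)
            if not os.path.isfile(path):
                missing.append(name)
            elif file_digest(path, algorithm) != expected.lower():
                mismatched.append(name)
    return mismatched, missing
```

Expect some mismatches on configuration files you edited yourself; it's mismatched binaries and libraries that point to damage.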
Best Answer
That btrfs-corrupt-block is not in the btrfs-progs packages is probably because the developers did not want the average user to accidentally start it and corrupt anything. The program is not a target in the btrfs-progs Makefile and would not be compiled and included by a package builder unless they applied a distro-specific patch first. The program is more of a testing tool for developers of btrfs.
The source, however, is in the main repository; you can just check that out and compile it.