Can `btrfs scrub` remember the bad blocks/sectors

btrfs

I have a USB thumb drive formatted in btrfs format. But I found if I copy a 1 GB file onto this disk, the file is corrupted. And btrfs scrub reports there're many uncorrectable errors. But if I keep copying more files on the disk, those files seems to be fine. So I think only a small consecutive blocks on disk is corrupted.

I wonder if btrfs scrub can scrub my entire drive and remember the corrupted blocks and do not use them in the future.

Best Answer

It probably can, but that won't help you due to how flash media work.

In contrast to a hard disk which can write or erase individual bits, while a flash medium can write individual bits it can only erase them a whole erase block at a time. The size of an erase block can differ, but it's often something like 128k. Since that's a lot to erase and rewrite if we only want to change one 'sector' (the size unit with which hard disks and operating systems deal), the thumb drive will split the erase block up into sector-sized units. When you change something it will mark the sector on which you've just changed something as "no longer in use", and then write the modified version somewhere else. After a while, it will see that the erase block has no active sectors anymore, and erase the block.

What this means is that if one sector is broken, the next time you write to that sector it will not be broken anymore, since it will now be a different sector.

In addition, flash tends to wear out after a number of write cycles, at which point it will fail (the exact number differs based on the quality of the flash chips, but is rarely less than something like 100000). For this purpose as well as for the extra space needed for the erase block stuff, a thumb drive has some extra capacity that is not announced; e.g., a 4g thumbdrive might expose 4000M but have 4096M internally, or 4200M, or some such. When a particular erase block starts to fail after too many write/erase cycles, your thumbdrive will mark it as such and no longer use it. It can do this for a while, but eventually the extra space will have been used up; at this point, when it tries to copy a sector to make a requested change, it will not find an empty sector anymore and can only produce a write error.

When your thumb drive reaches that point, as yours seems to have, it's time to replace it; it won't be long now before you'll start losing data (if that hasn't already happened)

Mitigation (for "older" kernels)

The negative effect can be mitigated by increasing the amount of queued requests in the IO scheduler queue like this:

echo 4096 | sudo tee /sys/block/sdc/queue/nr_requests

In my case this nearly triples (~56MB/s) the throughput for the 4GB random data test explained in my question. Of course, the performance still falls short 100MB/s compared to unencrypted IO.

Investigation

Multicore `blktrace`

I further investigated the problematic scenario in which a btrfs is placed on a top of a LUKS encrypted block device. To show me what write instructions are issued to the actual block device, I used blktrace like this:

sudo blktrace -a write -d /dev/sdc -o - | blkparse -b 1 -i - | grep -w D

What this does is (as far as I was able to comprehend) trace IO request to /dev/sdc which are of type "write", then parse this to human readable output but further restrict the output to action "D", which is (according to man blkparse) "IO issued to driver".

The result was something like this (see about 5000 lines of output of the multicore log):

8,32   0    32732   127.148240056     3  D   W 38036976 + 240 [ksoftirqd/0]
8,32   0    32734   127.149958221     3  D   W 38038176 + 240 [ksoftirqd/0]
8,32   0    32736   127.160257521     3  D   W 38038416 + 240 [ksoftirqd/0]
8,32   1    30264   127.186905632    13  D   W 35712032 + 240 [ksoftirqd/1]
8,32   1    30266   127.196561599    13  D   W 35712272 + 240 [ksoftirqd/1]
8,32   1    30268   127.209431760    13  D   W 35713872 + 240 [ksoftirqd/1]

Column 1: major,minor of the block device
Column 2: CPU ID
Column 3: sequence number
Column 4: time stamp
Column 5: process ID
Column 6: action
Column 7: RWBS data (type, sector, length)

This is a snipped of the output produced while dd'ing the 4GB random data onto the mounted filesystem. It is clear that at least two processes are involved. The remaining log shows that all four processors are actually working on it. Sadly, the write requests are not ordered anymore. While CPU0 is writing somewhere around the 38038416th sector, CPU1, which is scheduled afterwards, is writing somewhere around the 35713872nd sector. That's bad.

Singlecore `blktrace`

I did the same experiment after disabling multi-threading and disabling the second core of my CPU. Of course, only one processor is involved in writing to the stick. But more importantly, the write request are properly sequential, which is why the full write performance of ~170MB/s is achieved in the otherwise same setup.

Have a look at about 5000 lines of output in the singlecore log.

Discussion

Now that I know the cause and the proper google search terms, the information about this problem is bubbling up to the surface. As it turns out, I am not the first one to notice.

Four years ago, a patch brought multi-threaded dm-crypt to the kernel. That commit pretty much matches my findings exactly.
Two years ago, patches were discussed improving dm-crypt performance, including re-ordering of write requests.
One year ago, the topic was still discussed.
Recently, a patch enabling sorting for dm-crypt was finally commited to the kernel.
There is an interesting email with performance tests (which I did not read very much of) concerning this phenomenon.

Fixed in current kernels (>=4.0.2)

Because I (later) found the kernel commit obviously targeted at this exact problem, I wanted to try an updated kernel. [After compiling it myself and then finding out it's already in debian/sid] It turns out that the problem is indeed fixed. I don't know the exact kernel release in which the fix appeared, but the original commit will give clues to anyone interested.

For the record:

$ uname -a
Linux t440p 4.0.0-1-amd64 #1 SMP Debian 4.0.2-1 (2015-05-11) x86_64 GNU/Linux
$ dd if=/home/schlimmchen/Documents/random of=/mnt/dd-test bs=1M conv=fsync
4294967296 bytes (4.3 GB) copied, 29.7559 s, 144 MB/s

A hat tip to Mikulas Patocka, who authored the commit.

Is it possible to explicitly retrieve full file contents with bad checksum on a btrfs filesystem

As a last resort you can try

btrfs check --init-csum-tree /tmp/copy_of_the_device.bin

This command will change the filesystem and the result can be worse than before, so run this only on a dd or ddrescue copy of the filesystem.

Best Answer

Related Solutions

Improving General dm-crypt (LUKS) Write Performance