Merge nonzero blocks of huge (sparse) file A into huge file B

Tags: binary, files, large-files

I have two partial disk images from a failing hard drive. File B contains the bulk of the disk's contents, with gaps where sector reads failed. File A is the result of telling ddrescue to retry all the failed sectors, so it is almost entirely gaps, but contains a few places where rereads succeeded. I now need to merge the interesting contents of File A back into File B. The algorithm is simple:

while not eof(A):
    read 512 bytes from A
    if any of them are nonzero:
        seek to corresponding offset in B
        write bytes into B

and I could sit down and write this myself, but I would first like to know if someone else has already written and debugged it.
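(For reference, the loop above is only a few lines of real code. Here is a sketch in Python; the file names at the bottom are placeholders, and nothing here has been tested against real disk images.)

```python
# Sketch of the merge loop above. The 512-byte block size matches the
# sector size; adjust it if your images use another unit.
BLOCK = 512

def merge_nonzero(src_path, dst_path, block=BLOCK):
    """Write every block of src that contains a nonzero byte into dst
    at the same offset, leaving the rest of dst untouched."""
    with open(src_path, "rb") as src, open(dst_path, "r+b") as dst:
        offset = 0
        while True:
            chunk = src.read(block)
            if not chunk:          # eof(A)
                break
            if any(chunk):         # any nonzero byte in this block
                dst.seek(offset)   # seek to corresponding offset in B
                dst.write(chunk)   # write bytes into B
            offset += len(chunk)

# Usage (paths are placeholders for your two images):
# merge_nonzero("A", "B")
```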

(To complicate matters, due to limited space, File B and File A are on two different computers — this is why I didn't just tell ddrescue to attempt to fill in the gaps in B in the first place — but A can be transferred over the network relatively easily, being sparse.)

Best Answer

Your algorithm is implemented in GNU dd.

dd bs=512 if=A of=B conv=sparse,notrunc

Please verify this beforehand with some test files of your choice; you don't want to inadvertently damage your File B. A safer algorithm would also check that B has zeroes at the target position before writing, but that is something dd cannot do.
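One way to do that dry run, assuming GNU dd is on your PATH; the file names and contents below are made up for the test:

```python
# Build a tiny A and B, run the dd command from the answer, and check
# that only the nonzero block of A lands in B.
import os, subprocess, tempfile

d = tempfile.mkdtemp()
a, b = os.path.join(d, "A"), os.path.join(d, "B")

# B: three 512-byte blocks of 0xBB (stands in for the bulk image).
with open(b, "wb") as f:
    f.write(b"\xbb" * 1536)
# A: all zeroes except the middle block (a successful reread).
with open(a, "wb") as f:
    f.write(bytes(512) + b"\xaa" * 512 + bytes(512))

subprocess.run(
    ["dd", "bs=512", f"if={a}", f"of={b}", "conv=sparse,notrunc"],
    check=True, stderr=subprocess.DEVNULL)

with open(b, "rb") as f:
    merged = f.read()
# Only the middle block should have been overwritten.
assert merged == b"\xbb" * 512 + b"\xaa" * 512 + b"\xbb" * 512
```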

As for two different computers, you have several options. Use a network filesystem that supports seeks on writes (not all do); transfer the file beforehand; or pipe through SSH like so:

dd if=A | ssh -C B-host dd of=B conv=sparse,notrunc
# or the other way around
ssh -C A-host dd if=A | dd of=B conv=sparse,notrunc

The ssh -C option enables compression; without it you'd be transferring gigabytes of zeroes over the network.