File Reading – How to Read the Middle of a Large File

dd, files

I have a 1 TB file. I would like to read from byte 12345678901 to byte 19876543212 and put that on standard output on a machine with 100 MB RAM.

I can easily write a Perl script that does this. sysread delivers 700 MB/s (which is fine), but syswrite only delivers 30 MB/s. I would like something more efficient, preferably something that is installed on every Unix system and that can deliver on the order of 1 GB/s.

My first idea is:

dd if=1tb skip=12345678901 bs=1 count=$((19876543212-12345678901))

But that is not efficient.

Edit:

I have no idea how I measured syswrite wrong. This delivers 3.5 GB/s:

perl -e 'sysseek(STDIN,shift,0) || die; $left = shift;
         while($read = sysread(STDIN,$buf, ($left > 32768 ? 32768 : $left))){
            $left -= $read; syswrite(STDOUT,$buf);
         }' 12345678901 $((19876543212-12345678901)) < bigfile

and avoids the yes | dd bs=1024k count=10 | wc nightmare.

Best Answer

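This is slow because of the small block size: with bs=1, dd issues one read and one write system call per byte. With a recent GNU dd (coreutils v8.16+), the simplest way is to pass the skip_bytes and count_bytes input flags, which make skip and count byte counts rather than block counts: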

in_file=1tb

start=12345678901
end=19876543212
block_size=4096

copy_size=$(( $end - $start ))

dd if="$in_file" iflag=skip_bytes,count_bytes,fullblock bs="$block_size" \
  skip="$start" count="$copy_size"
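To see exactly which bytes this command selects, it can help to run it on a tiny file first. The following is a minimal sketch, assuming GNU dd (coreutils 8.16 or later); the variable names demo_file and result, and the deliberately small bs=4, are illustrative and not part of the original answer.

```shell
# Small-scale check of the byte-granular skip/count flags (GNU dd only).
demo_file=$(mktemp)
printf '%s' 'abcdefghijklmnopqrstuvwxyz' > "$demo_file"

start=3
end=10

# skip and count are interpreted as bytes because of skip_bytes,count_bytes.
result=$(dd if="$demo_file" iflag=skip_bytes,count_bytes,fullblock bs=4 \
  skip="$start" count="$(( end - start ))" 2>/dev/null)

echo "$result"   # prints "defghij": offsets 3 through 9; offset 10 ("k") is excluded
rm -f "$demo_file"
```

Offsets are zero-based and the range is half-open, so if the byte at offset end must be included, the count needs to be end - start + 1.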

Update

The fullblock option was added above as per @Gilles's answer. At first I thought that it might be implied by count_bytes, but this is not the case.

The issues mentioned there are a potential problem for the approach below: if dd's read or write calls are interrupted for any reason, data will be lost. This is unlikely in most cases (the odds are reduced somewhat since we are reading from a file and not a pipe).

Using a dd without the skip_bytes and count_bytes options is more difficult:

in_file=1tb

start=12345678901
end=19876543212
block_size=4096

copy_full_size=$(( $end - $start ))
copy1_size=$(( $block_size - ($start % $block_size) ))
copy2_start=$(( $start + $copy1_size ))
copy2_skip=$(( $copy2_start / $block_size ))
copy2_blocks=$(( ($end - $copy2_start) / $block_size ))
copy3_start=$(( ($copy2_skip + $copy2_blocks) * $block_size ))
copy3_size=$(( $end - $copy3_start ))

{
  dd if="$in_file" bs=1 skip="$start" count="$copy1_size"
  dd if="$in_file" bs="$block_size" skip="$copy2_skip" count="$copy2_blocks"
  dd if="$in_file" bs=1 skip="$copy3_start" count="$copy3_size"
}
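The same arithmetic can be checked end to end on a small file before trusting it with a 1 TB one. The sketch below wraps the three dd calls in a function; the name extract_range and the demo file are hypothetical, and it assumes the requested range reaches at least the first block boundary after start (true for any large range, not necessarily for a tiny one).

```shell
# extract_range FILE START END BLOCK_SIZE   (hypothetical helper name)
# Writes the bytes at offsets START..END-1 of FILE to standard output,
# using only the bs, skip and count operands.
# Assumes END is at or beyond the first block boundary after START.
extract_range() {
  in_file=$1 start=$2 end=$3 block_size=$4

  copy1_size=$(( block_size - (start % block_size) ))   # unaligned head
  copy2_start=$(( start + copy1_size ))                 # first aligned offset
  copy2_skip=$(( copy2_start / block_size ))
  copy2_blocks=$(( (end - copy2_start) / block_size ))  # whole blocks in the middle
  copy3_start=$(( (copy2_skip + copy2_blocks) * block_size ))
  copy3_size=$(( end - copy3_start ))                   # unaligned tail

  {
    dd if="$in_file" bs=1 skip="$start" count="$copy1_size"
    dd if="$in_file" bs="$block_size" skip="$copy2_skip" count="$copy2_blocks"
    dd if="$in_file" bs=1 skip="$copy3_start" count="$copy3_size"
  } 2>/dev/null
}

# Demo on a 40-byte file with an 8-byte block size.
demo_file=$(mktemp)
printf '%s' 'abcdefghijklmnopqrstuvwxyz0123456789ABCD' > "$demo_file"
out=$(extract_range "$demo_file" 3 29 8)
echo "$out"   # prints "defghijklmnopqrstuvwxyz012" (5 + 16 + 5 bytes)
rm -f "$demo_file"
```

Only the head and tail, each shorter than or equal to one block, are copied one byte at a time; everything in between goes through the block-sized dd.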

You could also experiment with different block sizes, but the gains are unlikely to be dramatic. See "Is there a way to determine the optimal value for the bs parameter to dd?"

Related Solutions

Creating an arbitrarily large “fake” file

You can create a sparse file on certain filesystems, which will appear to be a certain size, but won't actually use that much space on disk.

$ dd if=/dev/null of=sparse bs=1024 count=1 seek=524288000
0+0 records in
0+0 records out
0 bytes (0 B) copied, 2.4444e-05 s, 0.0 kB/s
$ ls -l sparse 
-rw-rw-r--. 1 ignacio ignacio 536870912000 May  9 22:25 sparse
$ du -h sparse
0   sparse
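A scaled-down version of the same command makes the difference between apparent size and allocated space easy to inspect. This is a sketch only, assuming GNU dd; the 1 MiB size and the variable names are illustrative, and how much space du reports depends on the filesystem, so no particular value is claimed for it.

```shell
# Create a file that appears to be 1 MiB by seeking past the start
# and writing nothing (reading /dev/null yields zero bytes).
sparse_file=$(mktemp)
dd if=/dev/null of="$sparse_file" bs=1024 count=1 seek=1024 2>/dev/null

apparent_size=$(( $(wc -c < "$sparse_file") ))        # size as seen by ls -l
allocated_kib=$(du -k "$sparse_file" | cut -f1)       # blocks actually in use

echo "apparent size: $apparent_size bytes"
echo "allocated:     $allocated_kib KiB"
rm -f "$sparse_file"
```

On a filesystem that supports sparse files the allocated figure should be far smaller than the apparent size.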

Files – How to Write One File into Another Using dd

To overwrite the start of the destination file without truncating it, give the notrunc conversion directive:

$ dd if=out/one.img of=out/go.img conv=notrunc

If you want the source file's data appended to the destination, use the seek directive to skip past the existing data (this example assumes the destination is 9 KiB long):

$ dd if=out/one.img of=out/go.img bs=1k seek=9

This tells dd that the block size is 1 KiB, so that the seek goes forward by 9 KiB before doing the write.

You can also combine the two forms. For example, to overwrite the second 1 KiB block in the file with a 1 KiB source:

$ dd if=out/one.img of=out/go.img bs=1k seek=1 conv=notrunc

That is, it skips the first 1 KiB of the output file, overwrites the data it finds there with data from the input file, then closes the output without truncating it.
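The effect of conv=notrunc is easiest to see side by side on a few bytes. The following sketch uses a 4-byte block size and temporary files in place of out/one.img and out/go.img; those substitutions and the variable names are assumptions made for illustration.

```shell
src=$(mktemp)
dst=$(mktemp)
printf '%s' 'xxxx' > "$src"

# With conv=notrunc: the second 4-byte block is replaced, the rest survives.
printf '%s' 'AAAABBBBCCCC' > "$dst"
dd if="$src" of="$dst" bs=4 seek=1 conv=notrunc 2>/dev/null
with_notrunc=$(cat "$dst")
echo "$with_notrunc"      # prints "AAAAxxxxCCCC"

# Without conv=notrunc: the blocks seeked over are kept, but everything
# after the newly written data is discarded.
printf '%s' 'AAAABBBBCCCC' > "$dst"
dd if="$src" of="$dst" bs=4 seek=1 2>/dev/null
without_notrunc=$(cat "$dst")
echo "$without_notrunc"   # prints "AAAAxxxx"

rm -f "$src" "$dst"
```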
