File Reading – How to Read the Middle of a Large File


I have a 1 TB file. I would like to read from byte 12345678901 to byte 19876543212 and put that on standard output on a machine with 100 MB RAM.

I can easily write a perl script that does this. sysread delivers 700 MB/s (which is fine), but syswrite only delivers 30 MB/s. I would like something more efficient, preferably something that is installed on every Unix system and that can deliver on the order of 1 GB/s.

My first idea is:

dd if=1tb skip=12345678901 bs=1 count=$((19876543212-12345678901))

But that is not efficient.

Edit:

I have no idea how I measured syswrite wrong. This delivers 3.5 GB/s:

perl -e 'sysseek(STDIN,shift,0) || die; $left = shift;
         while($read = sysread(STDIN,$buf, ($left > 32768 ? 32768 : $left))){
            $left -= $read; syswrite(STDOUT,$buf);
         }' 12345678901 $((19876543212-12345678901)) < bigfile

and avoids the yes | dd bs=1024k count=10 | wc nightmare.

Best Answer

The bs=1 approach is slow because of the small block size: dd issues one read and one write per byte. Using a recent GNU dd (coreutils v8.16+), the simplest way is to use the skip_bytes and count_bytes options:

in_file=1tb

start=12345678901
end=19876543212
block_size=4096

copy_size=$(( $end - $start ))

dd if="$in_file" iflag=skip_bytes,count_bytes,fullblock bs="$block_size" \
  skip="$start" count="$copy_size"
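
To sanity-check the byte-offset semantics before running against the real file, the same command can be tried on a tiny sample. The sketch below assumes GNU dd from coreutils 8.16 or later; the temporary file, its contents, and the small offsets are made up for illustration.

```shell
# Hypothetical 50-byte sample file: "0123456789" repeated five times,
# so the byte at offset n is the digit n % 10.
sample=$(mktemp)
for i in 1 2 3 4 5; do printf '0123456789'; done > "$sample"

start=13       # first byte to copy (0-based offset)
end=27         # offset just past the last byte to copy
block_size=4   # deliberately tiny so the range is not block-aligned

copy_size=$(( end - start ))

# Same flags as above; dd's transfer statistics go to stderr and are discarded.
out=$(dd if="$sample" iflag=skip_bytes,count_bytes,fullblock bs="$block_size" \
  skip="$start" count="$copy_size" 2>/dev/null)

echo "$out"   # prints 34567890123456
rm -f "$sample"
```

As written, end is treated as exclusive: the command copies end - start bytes beginning at offset start.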

Update

The fullblock option was added above, as per @Gilles' answer. At first I thought that it might be implied by count_bytes, but this is not the case.

The issues mentioned there are a potential problem for the version below: if dd's read/write calls are interrupted for any reason, data will be lost. This is unlikely in most cases (the odds are reduced somewhat since we are reading from a file and not a pipe).
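
The effect of fullblock is easiest to see on a pipe, as in the yes | dd pipeline from the question. A rough sketch, assuming GNU dd (for iflag=fullblock) and a yes command on the PATH; the variable names are illustrative:

```shell
# Without fullblock, count= counts input blocks as read; a read from a pipe
# may return fewer than bs bytes, so the total can fall short of 10 * 1024k.
partial=$(yes | dd bs=1024k count=10 2>/dev/null | wc -c)

# With fullblock, dd keeps reading until each block is full, so exactly
# 10 * 1048576 = 10485760 bytes come out.
full=$(yes | dd bs=1024k count=10 iflag=fullblock 2>/dev/null | wc -c)

echo "without fullblock: $partial bytes (may vary from run to run)"
echo "with fullblock:    $full bytes"
```

On a regular file short reads are much less likely, which is why the flag mostly matters for pipes.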


Using a dd without the skip_bytes and count_bytes options is more difficult:

in_file=1tb

start=12345678901
end=19876543212
block_size=4096

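# Assumes the range extends at least to the next block boundary after $start.
copy_full_size=$(( $end - $start ))
# Part 1: bytes from $start up to the next block boundary, copied one at a time.
copy1_size=$(( $block_size - ($start % $block_size) ))
# Part 2: the whole blocks in the middle, copied at full block size.
copy2_start=$(( $start + $copy1_size ))
copy2_skip=$(( $copy2_start / $block_size ))
copy2_blocks=$(( ($end - $copy2_start) / $block_size ))
# Part 3: the leftover bytes after the last whole block, copied one at a time.
copy3_start=$(( ($copy2_skip + $copy2_blocks) * $block_size ))
copy3_size=$(( $end - $copy3_start ))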

{
  dd if="$in_file" bs=1 skip="$start" count="$copy1_size"
  dd if="$in_file" bs="$block_size" skip="$copy2_skip" count="$copy2_blocks"
  dd if="$in_file" bs=1 skip="$copy3_start" count="$copy3_size"
}
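
The arithmetic above is easy to get wrong by one, so it may be worth checking on a small file first. The following sketch wraps the same three dd calls in a hypothetical helper, dd_range, and compares its output with tail and head on made-up sample data. It assumes head -c is available (it is in GNU and BSD userlands).

```shell
# dd_range FILE START END BLOCK_SIZE  (hypothetical helper, not a real tool)
# Copies bytes START..END-1 of FILE to stdout using the three-part split.
# Like the script above, it assumes the range reaches the next block boundary.
dd_range() {
  in_file=$1 start=$2 end=$3 block_size=$4
  copy1_size=$(( block_size - (start % block_size) ))
  copy2_start=$(( start + copy1_size ))
  copy2_skip=$(( copy2_start / block_size ))
  copy2_blocks=$(( (end - copy2_start) / block_size ))
  copy3_start=$(( (copy2_skip + copy2_blocks) * block_size ))
  copy3_size=$(( end - copy3_start ))
  {
    dd if="$in_file" bs=1 skip="$start" count="$copy1_size"
    dd if="$in_file" bs="$block_size" skip="$copy2_skip" count="$copy2_blocks"
    dd if="$in_file" bs=1 skip="$copy3_start" count="$copy3_size"
  } 2>/dev/null   # discard dd's transfer statistics
}

# Sample data: 50 bytes in which the byte at offset n is the digit n % 10.
sample=$(mktemp)
for i in 1 2 3 4 5; do printf '0123456789'; done > "$sample"

got=$(dd_range "$sample" 13 47 8)
# Reference: tail -c +N starts at the N-th byte (1-based), hence start + 1.
want=$(tail -c +14 "$sample" | head -c 34)

[ "$got" = "$want" ] && echo "match: $got"
rm -f "$sample"
```

With these values the split is 3 single bytes, then 3 blocks of 8 bytes, then 7 single bytes, for 34 bytes in total.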

You could also experiment with different block sizes, but the gains won't be very dramatic. See: Is there a way to determine the optimal value for the bs parameter to dd?
