dd – Best Way to Remove Bytes from Start of a File

Today I had to remove the first 1131 bytes from an 800MB mixed text / binary file, a filtered subversion dump I'm hacking for a new repository. What's the best way to do this?

To start with I tried

dd bs=1 skip=1131 if=filtered.dump of=trimmed.dump

but after the skip this copies the remainder of the file a byte at a time, i.e. very slowly. In the end I worked out I needed 405 bytes to round this up to three blocks of 512 which I could skip

dd if=/dev/zero of=405zeros bs=1 count=405
cat 405zeros filtered.dump | dd bs=512 skip=3 of=trimmed.dump

which completed fairly quickly but there must have been a simpler / better way? Is there another tool I've forgotten about?

Best Answer

You can switch bs and skip options:

dd bs=1131 skip=1 if=filtered.dump of=trimmed.dump

This way the operation can benefit from a greater block.

Otherwise, you could try with tail (although it's not safe to use it with binary files):

tail -c +1132 filtered.dump >trimmed.dump

Finally, you may use 3 dd instances to write something like this:

dd if=filtered.dump bs=512k | { dd bs=1131 count=1 of=/dev/null; dd bs=512k of=trimmed.dump; }

where the first dd prints its standard output filtered.dump; the second one just reads 1131 bytes and throws them away; then, the last one reads from its standard input the remaining bytes of filtered.dump and write them to trimmed.dump.

Related Solutions

Shell – dd remove range of bytes

It is a matter of specifying blocksize, count, and skip:

$ cat hello.txt
hello doge world
$ { dd bs=1 count=2 ; dd skip=3 bs=1 count=1 ; dd skip=6 bs=1 ; } <hello.txt 2>/dev/null
he orld

The above uses three invocations of dd. The first gets the first two characters he. The second skips to the end of hello and copies the space which follows. The third skips into the last word world copying all but its first character.

This was done with GNU dd but BSD dd looks like it should work also.

POSIX – How to Read an Exact Number of Bytes from a File

Unfortunately, to manipulate the content of a binary file, dd is pretty much the only tool in POSIX. Although most modern implementations of text processing tools (cat, sed, awk, …) can manipulate binary files, this is not required by POSIX: some older implementations do choke on null bytes, input not terminated by a newline, or invalid byte sequences in the ambient character encoding.

It is possible, but difficult, to use dd safely. The reason I spend a lot of energy steering people away from it is that there's a lot of advice out there that promotes dd in situations where it is neither useful nor safe.

The problem with dd is its notion of blocks: it assumes that a call to read returns one block; if read returns less data, you get a partial block, which throws things like skip and count off. Here's an example that illustrates the problem, where dd is reading from a pipe that delivers data relatively slowly:

yes hello | while read line; do echo $line; done | dd ibs=4 count=1000 | wc -c

On a bog-standard Linux (Debian jessie, Linux kernel 3.16, dd from GNU coreutils 8.23), I get a highly variable number of bytes, ranging from about 3000 to almost 4000. Change the input block size to a divisor of 6, and the output is consistently 4000 bytes as one would naively expect — the input to dd arrives in bursts of 6 bytes, and as long as a block doesn't span multiple bursts, dd gets to read a complete block.

This suggests a solution: use an input block size of 1. No matter how the input is produced, there's no way for dd to read a partial block if the input block size is 1. (This is not completely obvious: dd could read a block of size 0 if it's interrupted by a signal — but if it's interrupted by a signal, the read system call returns -1. A read returning 0 is only possible if the file is opened in non-blocking mode, and in that case a read had better not be considered to have been performed at all. In blocking mode, read only returns 0 at the end of the file.)

dd ibs=1 count="$number_of_bytes"

The problem with this approach is that it can be slow (but not shockingly slow: only about 4 times slower than head -c in my quick benchmark).

POSIX defines other tools that read binary data and convert it to a text format: uuencode (outputs in historical uuencode format or in Base64), od (outputs an octal or hexadecimal dump). Neither is well-suited to the task at hand. uuencode can be undone by uudecode, but counting bytes in the output is awkward because the number of bytes per line of output is not standardized. It's possible to get well-defined output from od, but unfortunately there's no POSIX tool to go the other way round (it can be done but only through slow loops in sh or awk, which defeats the purpose here).

Best Answer

Related Solutions

Shell – dd remove range of bytes

POSIX – How to Read an Exact Number of Bytes from a File

Related Question