Bash – How to count the number of times a byte sequence occurs in a file


I want to count how many times a certain sequence of bytes occurs inside a file. For example, I want to find out how many times the number 0xdeadbeef occurs inside an executable file. Right now I am doing that using grep:

#!/usr/bin/fish
grep -c \Xef\Xbe\Xad\Xde my_executable_file

(The bytes are written in reverse order because my CPU is little-endian)

However, I have two problems with my approach:

  • Those \Xnn escape sequences only work in the fish shell.
  • grep is actually counting the number of lines that contain my magic number. If the pattern occurs twice in the same line it will only count once.

Is there a way to fix these problems? How can I make this one-liner run in the Bash shell and accurately count the number of times the pattern occurs inside the file?

Best Answer

This is the one-liner solution requested (for recent shells that have "process substitution"):

grep -o "ef be ad de" <(hexdump -v -e '/1 "%02x "' infile.bin) | wc -l

If no "process substitution" <(…) is available, just use grep as a filter:

hexdump -v -e '/1 "%02x "' infile.bin  | grep -o "ef be ad de" | wc -l

Below is the detailed description of each part of the solution.

Byte values from hex numbers:

Your first problem is easy to resolve:

Those \Xnn escape sequences only work in the fish shell.

Change the uppercase X to a lowercase x and use printf (for most shells):

$ printf -- '\xef\xbe\xad\xde'

Or use:

$ /usr/bin/printf -- '\xef\xbe\xad\xde'

The external printf helps in shells whose built-in printf does not implement the '\x' representation.

Of course, translating hex to octal will work on (almost) any shell:

$ "$sh" -c 'printf '\''%b'\'' "$(printf '\''\\0%o'\'' $((0xef)) $((0xbe)) $((0xad)) $((0xde)) )"'

Where "$sh" is any (reasonable) shell. But it is quite difficult to keep it correctly quoted.
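The hex-to-octal idea is easier to keep correctly quoted inside a small function. The following is a minimal sketch, assuming a POSIX shell whose arithmetic expansion accepts 0x constants; hex_to_bytes is a hypothetical name chosen for illustration.

```shell
# hex_to_bytes: hypothetical helper, prints the raw bytes for the given hex values
hex_to_bytes() {
    for h in "$@"; do
        # $((0x$h)) turns the hex digits into a decimal number,
        # the inner printf renders it as three octal digits,
        # and the outer printf emits the byte for that \ooo escape.
        printf "\\$(printf '%03o' "$((0x$h))")"
    done
}

# Check the result by dumping it back to hex
bytes_hex=$(hex_to_bytes ef be ad de | od -An -v -tx1 | tr -d ' \n')
echo "$bytes_hex"
```

Octal escapes in the printf format string are specified by POSIX, which is why this form should not depend on '\x' support.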

Binary files.

The most robust solution is to transform both the file and the byte sequence to an encoding that has no issues with odd byte values such as 0x0A (newline) or 0x00 (null byte). Both are quite difficult to handle correctly with tools designed and adapted to process "text files".

A transformation like base64 may seem valid, but it has a problem: every input byte may have up to three different output representations, depending on whether it is the first, second, or third byte of its 3-byte (24-bit) group.

$ echo "abc" | base64
YWJjCg==

$ echo "-abc" | base64
LWFiYwo=

$ echo "--abc" | base64
LS1hYmMK

$ echo "---abc" | base64        # Note that YWJj repeats.
LS0tYWJjCg==

Hex transform.

That's why the most robust transformation is one that starts on each byte boundary, like the simple hex representation.
We can get a file with the hex representation of the input with any of these tools:

$ od -vAn -tx1 infile.bin | tr -d '\n'   > infile.hex
$ hexdump -v -e '/1 "%02x "' infile.bin  > infile.hex
$ xxd -c1 -p infile.bin | tr '\n' ' '    > infile.hex

The byte sequence to search for is already in hex in this case:

$ var="ef be ad de"

But it could also be transformed. An example of a round trip hex-bin-hex follows:

$ echo "ef be ad de" | xxd -p -r | od -vAn -tx1
ef be ad de

The search string may also be derived from the binary representation. Any of the three tools presented above (od, hexdump, or xxd) will do. Just make sure to include the spaces, so that a match can only start on a byte boundary (no nibble shift allowed):

$ a="$(printf "\xef\xbe\xad\xde" | hexdump -v -e '/1 "%02x "')"
$ echo "$a"
ef be ad de
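The need for the spaces can be shown with a small experiment. This is only a sketch with made-up bytes (0x0e 0xfb 0xe0), using od and octal escapes for portability: the byte pair ef be does not occur in the input, yet a hex dump without separators appears to contain it.

```shell
# Made-up input: bytes 0x0e 0xfb 0xe0 (the byte pair "ef be" is NOT present)
packed=$(printf '\016\373\340' | od -An -v -tx1 | tr -d ' \n')
spaced=$(printf '\016\373\340' | od -An -v -tx1 | tr '\n' ' ' | tr -s ' ')

# Without separators the pattern matches across a nibble boundary
false_hits=$(printf '%s\n' "$packed" | grep -c 'efbe' || true)
# With one space per byte the pattern can only match whole bytes
real_hits=$(printf '%s\n' "$spaced" | grep -c 'ef be' || true)

echo "packed: $false_hits, spaced: $real_hits"
```

The `|| true` only keeps the script going when grep finds nothing and returns a non-zero status.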

If the binary file looks like this:

$ cat infile.bin | xxd
00000000: 5468 6973 2069 7320 efbe adde 2061 2074  This is .... a t
00000010: 6573 7420 0aef bead de0a 6f66 2069 6e70  est ......of inp
00000020: 7574 200a dead beef 0a66 726f 6d20 6120  ut ......from a 
00000030: 6269 0a6e 6172 7920 6669 6c65 2e0a 3131  bi.nary file..11
00000040: 3232 3131 3232 3131 3232 3131 3232 3131  2211221122112211
00000050: 3232 3131 3232 3131 3232 3131 3232 3131  2211221122112211
00000060: 3232 0a

Then, a simple grep search will give the list of matched sequences:

$ grep -o "$a" infile.hex | wc -l
2
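To reproduce this kind of count without the sample file, here is a minimal sketch. The test data is made up for illustration: the little-endian sequence appears twice before the first newline, followed by the big-endian byte order once. It uses od, with the whitespace normalized by tr, in place of hexdump.

```shell
tmp=$(mktemp)
# Made-up data: "ef be ad de" twice on the same "line", then "de ad be ef" once
printf 'A\357\276\255\336B\357\276\255\336\n\336\255\276\357\n' > "$tmp"

# Hex dump as one long line of space-separated bytes
hex=$(od -An -v -tx1 "$tmp" | tr '\n' ' ' | tr -s ' ')

# grep -o prints every match on its own line; wc -l counts those lines
count=$(printf '%s\n' "$hex" | grep -o 'ef be ad de' | wc -l)
echo "$count"
rm -f "$tmp"
```

Both occurrences are counted even though no newline separates them, and the reversed byte order is not matched.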

One Line?

It all may be performed in one line:

$ grep -o "ef be ad de" <(xxd -c 1 -p infile.bin | tr '\n' ' ') | wc -l

For example, searching for 11221122 in the same file will need these two steps:

$ a="$(printf '11221122' | hexdump -v -e '/1 "%02x "')"
$ grep -o "$a" <(xxd -c1 -p infile.bin | tr '\n' ' ') | wc -l
4

To "see" the matches:

$ grep -o "$a" <(xxd -c1 -p infile.bin | tr '\n' ' ')
31 31 32 32 31 31 32 32
31 31 32 32 31 31 32 32
31 31 32 32 31 31 32 32
31 31 32 32 31 31 32 32

$ grep "$a" <(xxd -c1 -p infile.bin | tr '\n' ' ')

… 0a 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 0a
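The search can also be wrapped in small functions so that the pattern and the file become arguments. This is a sketch rather than a definitive implementation: count_hex and ascii_to_hex are hypothetical names, the sample data is made up, and od replaces xxd so that no process substitution is needed.

```shell
# count_hex (hypothetical): count_hex "ef be ad de" file
count_hex() {
    od -An -v -tx1 "$2" | tr '\n' ' ' | tr -s ' ' | grep -o "$1" | wc -l
}

# ascii_to_hex (hypothetical): turn a text string into the same spaced hex form
ascii_to_hex() {
    printf '%s' "$1" | od -An -v -tx1 | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

tmp=$(mktemp)
printf '112211221122\n' > "$tmp"    # made-up sample data
pattern=$(ascii_to_hex '1122')
n=$(count_hex "$pattern" "$tmp")
echo "$pattern -> $n"
rm -f "$tmp"
```

As with the commands above, grep -o reports non-overlapping matches only.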


Buffering

There is a concern that grep will buffer the whole file and, if the file is big, put a heavy load on the computer. In that case, we may use an unbuffered sed solution:

a='ef be ad de'
hexdump -v -e '/1 "%02x "' infile.bin  | 
    sed -ue 's/\('"$a"'\)/\n\1\n/g' | 
        sed -n '/^'"$a"'$/p' |
            wc -l

The first sed is unbuffered (-u) and is used only to inject two newlines on the stream per matching string. The second sed will only print the (short) matching lines. The wc -l will count the matching lines.

This will buffer only some short lines: the matching string(s) in the second sed. Resource usage should therefore stay quite low.
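To see what the two stages do, the pipeline can be run on a tiny input. The sketch below assumes GNU sed (for -u and for \n in the replacement) and uses od instead of hexdump; the sample bytes are made up for illustration.

```shell
a='ef be ad de'
# Made-up input: the sequence twice, separated by the byte 0x41
sample() { printf '\357\276\255\336A\357\276\255\336'; }

# Stage 1 alone: every match ends up on a line of its own
sample | od -An -v -tx1 | tr '\n' ' ' | tr -s ' ' |
    sed -ue 's/\('"$a"'\)/\n\1\n/g'

# Both stages plus the count
matches=$(sample | od -An -v -tx1 | tr '\n' ' ' | tr -s ' ' |
    sed -ue 's/\('"$a"'\)/\n\1\n/g' |
    sed -n '/^'"$a"'$/p' |
    wc -l)
echo "$matches"
```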

Or, somewhat more complex to understand, but the same idea in one sed:

a='ef be ad de'
hexdump -v -e '/1 "%02x "' infile.bin  |
    sed -u '/\n/P;//!s/'"$a"'/\n&\n/;D' |
        wc -l