Bash – How to count the number of times a byte sequence occurs in a file


I want to count how many times a certain sequence of bytes occurs inside a file. For example, I want to find out how many times the number 0xdeadbeef occurs inside an executable file. Right now I am doing that using grep:

#!/usr/bin/fish
grep -c \Xef\Xbe\Xad\Xde my_executable_file

(The bytes are written in reverse order because my CPU is little-endian)

However, I have two problems with my approach:

Those \Xnn escape sequences only work in the fish shell.
grep is actually counting the number of lines that contain my magic number. If the pattern occurs twice in the same line it will only count once.

Is there a way to fix these problems? How can I make this one-liner run in the Bash shell and accurately count the number of times the pattern occurs inside the file?

Best Answer

This is the one-liner solution requested (for recent shells that have "process substitution"):

grep -o "ef be ad de" <(hexdump -v -e '/1 "%02x "' infile.bin) | wc -l

If no "process substitution" <(…) is available, just use grep as a filter:

hexdump -v -e '/1 "%02x "' infile.bin  | grep -o "ef be ad de" | wc -l

Below is the detailed description of each part of the solution.

Byte values from hex numbers:

Your first problem is easy to resolve:

Those \Xnn escape sequences only work in the fish shell.

Change the uppercase X to a lowercase x and use printf (this works in most shells):

$ printf -- '\xef\xbe\xad\xde'

Or use:

$ /usr/bin/printf -- '\xef\xbe\xad\xde'

This helps in shells whose built-in printf chooses not to implement the '\x' escape, provided the external printf (for example the GNU coreutils one) does.

Of course, translating hex to octal will work on (almost) any shell:

$ "$sh" -c 'printf '\''%b'\'' "$(printf '\''\\0%o'\'' $((0xef)) $((0xbe)) $((0xad)) $((0xde)) )"'

Where "$sh" is any (reasonable) shell. But it is quite difficult to keep it correctly quoted.
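The octal route can be wrapped in a small function so that the quoting only has to be right once. This is a minimal sketch, assuming a POSIX-style printf that accepts 0x constants for numeric arguments and \0ooo escapes with %b; hex_to_bytes is a hypothetical helper name, not something from the answer above:

```shell
# hex_to_bytes (hypothetical helper): write the raw bytes named by hex arguments.
hex_to_bytes() {
    for h in "$@"; do
        # The inner printf turns one hex pair into three octal digits (ef -> 357);
        # the outer printf '%b' expands the resulting \0357 escape into one byte.
        printf '%b' "\\0$(printf '%03o' "0x$h")"
    done
}

# Usage: dump the generated bytes back as hex to check the round trip.
hex_to_bytes ef be ad de | od -An -tx1
```

The output of hex_to_bytes can be piped into any of the hex dump commands used later to build the search string.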

Binary files.

The most robust solution is to transform both the file and the byte sequence to an encoding that has no issues with odd byte values such as 0x0A (newline) or 0x00 (null byte). Both are quite difficult to manage correctly with tools designed and adapted to process "text files".

A transformation like base64 may seem valid, but base64 encodes each group of three input bytes (24 bits) as four output characters, so every input byte may have up to three output representations, depending on whether it is the first, second, or third byte of its group.

$ echo "abc" | base64
YWJjCg==

$ echo "-abc" | base64
LWFiYwo=

$ echo "--abc" | base64
LS1hYmMK

$ echo "---abc" | base64        # Note that YWJj repeats.
LS0tYWJjCg==
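The same effect can be checked mechanically. The sketch below is only an illustration (b64_grep is a hypothetical helper, and it assumes a base64 command such as the coreutils one): it encodes the needle and the haystack separately and reports whether the encoded needle shows up in the encoded haystack.

```shell
# b64_grep NEEDLE HAYSTACK (hypothetical helper): does the base64 form of
# NEEDLE appear inside the base64 form of HAYSTACK?
b64_grep() {
    needle=$(printf '%s' "$1" | base64)
    if printf '%s' "$2" | base64 | grep -qF -- "$needle"; then
        echo found
    else
        echo missed
    fi
}

# Shift "abc" by 0, 1, 2 and 3 bytes; only offsets that are a multiple of
# three keep the encoded needle intact.
for prefix in '' '-' '--' '---'; do
    b64_grep abc "${prefix}abc"
done
```

The loop prints found, missed, missed, found: the needle is always present in the raw data, but its base64 form only survives when it starts on a three-byte boundary.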

Hex transform.

That's why the most robust transformation should be one that starts on each byte boundary, like the simple hex representation.
We can get a file with the hex representation of the input with any of these tools:

$ od -vAn -tx1 infile.bin | tr -d '\n'   > infile.hex
$ hexdump -v -e '/1 "%02x "' infile.bin  > infile.hex
$ xxd -c1 -p infile.bin | tr '\n' ' '    > infile.hex

The byte sequence to search for is already in hex in this case:

$ var="ef be ad de"

But it could also be transformed. An example of a round trip hex-bin-hex follows:

$ echo "ef be ad de" | xxd -p -r | od -vAn -tx1
ef be ad de

The search string may also be built from the binary representation. Any of the three tools presented above (od, hexdump, or xxd) will do. Just make sure to include the spaces so that the match falls on byte boundaries (no nibble shift allowed):

$ a="$(printf "\xef\xbe\xad\xde" | hexdump -v -e '/1 "%02x "')"
$ echo "$a"
ef be ad de

If the binary file looks like this:

$ cat infile.bin | xxd
00000000: 5468 6973 2069 7320 efbe adde 2061 2074  This is .... a t
00000010: 6573 7420 0aef bead de0a 6f66 2069 6e70  est ......of inp
00000020: 7574 200a dead beef 0a66 726f 6d20 6120  ut ......from a 
00000030: 6269 0a6e 6172 7920 6669 6c65 2e0a 3131  bi.nary file..11
00000040: 3232 3131 3232 3131 3232 3131 3232 3131  2211221122112211
00000050: 3232 3131 3232 3131 3232 3131 3232 3131  2211221122112211
00000060: 3232 0a

Then, a simple grep search lists the matched sequences, and wc -l counts them:

$ grep -o "$a" infile.hex | wc -l
2
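The steps so far can be collected into one function. This is a sketch rather than the answer's exact commands: it uses od (specified by POSIX) for the dump, assumes a grep that supports -o, and count_hex_seq is a hypothetical name. The needle must be given as lowercase hex pairs.

```shell
# count_hex_seq FILE BYTE... (hypothetical helper): count non-overlapping
# occurrences of the byte sequence given as lowercase hex pairs.
count_hex_seq() {
    file=$1
    shift
    # Build "ef be ad de ": every pair is followed by one space, as in the dump.
    needle=$(printf '%s ' "$@")
    # od prints the hex pairs on several lines; tr joins them into one line
    # and squeezes runs of spaces into a single space.
    od -An -v -tx1 "$file" | tr -s '\n' ' ' |
        grep -oF -- "$needle" | wc -l | tr -d ' '
}
```

For the question's case the call would be count_hex_seq my_executable_file ef be ad de. The final tr only strips the padding that some wc implementations add around the number.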

One Line?

It all may be performed in one line:

$ grep -o "ef be ad de" <(xxd -c 1 -p infile.bin | tr '\n' ' ') | wc -l

For example, searching for 11221122 in the same file takes these two steps:

$ a="$(printf '11221122' | hexdump -v -e '/1 "%02x "')"
$ grep -o "$a" <(xxd -c1 -p infile.bin | tr '\n' ' ') | wc -l
4

To "see" the matches:

$ grep -o "$a" <(xxd -c1 -p infile.bin | tr '\n' ' ')
31 31 32 32 31 31 32 32
31 31 32 32 31 31 32 32
31 31 32 32 31 31 32 32
31 31 32 32 31 31 32 32

$ grep "$a" <(xxd -c1 -p infile.bin | tr '\n' ' ')

… 0a 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 31 31 32 32 0a
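Note that grep -o reports non-overlapping matches only, which is why 11221122 is counted 4 times in a run of nine 1122 groups. If overlapping occurrences should be counted as well, one possible approach (my own sketch, not part of the answer; count_overlapping is a hypothetical name) is to walk the hex string with awk's index():

```shell
# count_overlapping FILE BYTE... (hypothetical helper): count occurrences of a
# byte sequence (lowercase hex pairs), including ones that overlap each other.
count_overlapping() {
    file=$1
    shift
    needle=$(printf '%s ' "$@")
    od -An -v -tx1 "$file" | tr -s '\n' ' ' |
        awk -v n="$needle" '
            { s = s $0 }                 # collect the whole hex dump
            END {
                while ((i = index(s, n)) > 0) {
                    c++
                    # Resume one byte (two hex digits and a space) past the
                    # start of this match, so overlapping hits are seen too.
                    s = substr(s, i + 3)
                }
                print c + 0
            }'
}
```

This keeps the whole dump in memory, so it suits small and medium files.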

Buffering

There is a concern that grep will buffer the whole file, and, if the file is big, create a heavy load for the computer. For that, we may use an unbuffered sed solution:

a='ef be ad de'
hexdump -v -e '/1 "%02x "' infile.bin  | 
    sed -ue 's/\('"$a"'\)/\n\1\n/g' | 
        sed -n '/^'"$a"'$/p' |
            wc -l

The first sed is unbuffered (-u, a GNU sed option) and is used only to inject two newlines into the stream per matching string. The second sed prints only the (short) matching lines, and wc -l counts them.

The second sed only ever buffers short lines (the matching strings), so resource usage should be quite low.

Or, somewhat more complex to understand, but the same idea in one sed:

a='ef be ad de'
hexdump -v -e '/1 "%02x "' infile.bin  |
    sed -u '/\n/P;//!s/'"$a"'/\n&\n/;D' |
        wc -l
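A different technique from the sed commands above, shown here only as a sketch under my own assumptions: put one byte per line and let awk compare a sliding window of the last few bytes against the needle. No tool in the pipeline then has to hold a long line. The name count_streaming is hypothetical, and this variant counts overlapping occurrences too.

```shell
# count_streaming FILE BYTE... (hypothetical helper): count occurrences of a
# byte sequence (lowercase hex pairs) with a fixed-size sliding window.
count_streaming() {
    file=$1
    shift
    # tr turns every space into a newline, giving one hex pair per line.
    od -An -v -tx1 "$file" | tr -s ' ' '\n' |
        awk -v n="$*" '
            BEGIN { len = split(n, parts, " "); target = n " " }
            NF {
                window = window $1 " "
                # Keep only the last len bytes (3 characters per byte).
                if (++seen > len) window = substr(window, 4)
                if (window == target) c++
            }
            END { print c + 0 }'
}
```

Usage follows the same pattern as before, for example count_streaming infile.bin ef be ad de.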

Using grep

Why can't you just use the -r switch to grep to recurse the filesystem instead of making use of find? There are 2 additional switches I'd use too, instead of the -n switch.

$ grep -rHn PATTERN <DIR> | cut -d":" -f1-2

Example #1

$ grep -rHn PATH ~/.bashrc | cut -d":" -f1-2
/home/saml/.bashrc:25

Details

-r - recursively search through files + directories
-H - prints the name of the file if it matches (less restrictive than -l) i.e. it works with grep's other switches
-n - display the line number of the match
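The three switches and the cut step can be combined into a small function. This is a sketch; list_matches is a hypothetical name, and it assumes a grep with -r support and file names that contain no colon (cut would otherwise split them in the wrong place).

```shell
# list_matches PATTERN DIR (hypothetical helper): print "file:line" for
# every line under DIR that matches PATTERN.
list_matches() {
    # -r recurses, -H always prints the file name, -n prints the line number;
    # cut keeps only the first two colon-separated fields.
    grep -rHn -- "$1" "$2" | cut -d: -f1-2
}
```

For example, list_matches PATH ~/.bashrc corresponds to Example #1.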

Example #2

$ grep -rHn PATH ~/.bash* | cut -d":" -f1-2
/home/saml/.bash_profile:10
/home/saml/.bash_profile:12
/home/saml/.bash_profile_askapache:99
/home/saml/.bash_profile_askapache:101
/home/saml/.bash_profile_askapache:118
/home/saml/.bash_profile_askapache:166
/home/saml/.bash_profile_askapache:218
/home/saml/.bash_profile_askapache:250
/home/saml/.bash_profile_askapache:314
/home/saml/.bash_profile_askapache:2317
/home/saml/.bash_profile_askapache:2323
/home/saml/.bashrc:25

Using find

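$ find . -exec sh -c 'grep -Hn PATTERN "$@" | cut -d":" -f1-2' sh {}  +

(The extra sh after the script fills $0, so that "$@" receives every file name; without it the first file of each batch is silently skipped.)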

Example

$ find ~/.bash* -exec sh -c 'grep -Hn PATH "$@" | cut -d":" -f1-2' sh {}  +
/home/saml/.bash_profile:10
/home/saml/.bash_profile:12
/home/saml/.bash_profile_askapache:99
/home/saml/.bash_profile_askapache:101
/home/saml/.bash_profile_askapache:118
/home/saml/.bash_profile_askapache:166
/home/saml/.bash_profile_askapache:218
/home/saml/.bash_profile_askapache:250
/home/saml/.bash_profile_askapache:314
/home/saml/.bash_profile_askapache:2317
/home/saml/.bash_profile_askapache:2323
/home/saml/.bashrc:25

If you truly want to use find, you can have it exec grep on the files it finds, as in the commands above.
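When nothing else has to run per batch, the sh -c wrapper can be left out and the pipe placed after find itself. A minimal sketch under that assumption (find_matches is a hypothetical name; -type f is added here so that grep is never handed a directory):

```shell
# find_matches PATTERN DIR (hypothetical helper): print "file:line" for every
# matching line in the regular files under DIR.
find_matches() {
    # "{} +" passes many file names to each grep invocation; -H keeps the
    # file name in the output even when a batch holds a single file.
    find "$2" -type f -exec grep -Hn -- "$1" {} + | cut -d: -f1-2
}
```

Here cut runs once over the combined output instead of once per batch, which gives the same result.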

Match pattern \\\" using grep

To look for `\\\"` anywhere on a line:

grep -F '\\\"'

That is, use -F for a fixed string search as opposed to a regular expression match (where backslash is special). And use strong quotes ('...') inside which backslash is not special.

Without -F, you'd need to double the backslashes:

grep '\\\\\\"'

Or use:

grep '\\\{3\}"'
grep -E '\\{3}"'
grep -E '[\]{3}"'
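To check that these spellings are equivalent, they can be run against a small sample. The sketch below is only an illustration; count_lines and the sample lines are made up for the purpose.

```shell
# count_lines FILE GREP_ARGS... (hypothetical helper): count matching lines.
count_lines() {
    f=$1
    shift
    grep -c "$@" "$f"
}

sample=$(mktemp)
# Only the first line holds three backslashes followed by a double quote.
printf '%s\n' 'x \\\" y' 'plain "quote"' 'two \\" only' > "$sample"

count_lines "$sample" -F '\\\"'      # fixed string
count_lines "$sample"    '\\\\\\"'   # BRE, each backslash doubled
count_lines "$sample"    '\\\{3\}"'  # BRE interval
count_lines "$sample" -E '\\{3}"'    # ERE interval
count_lines "$sample" -E '[\]{3}"'   # ERE bracket expression

rm -f "$sample"
```

Each of the five calls prints 1, since all of them select only the first sample line.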

Within double quotes, you'd need another level of backslashes and also escape the " with backslash:

#              1
#     1234567890123
grep "\\\\\\\\\\\\\""

Backslash is another shell quoting operator, so you can also quote those backslash and " characters with backslash:

\g\r\e\p \\\\\\\\\\\\\"

I've even quoted the characters of grep above, though that's not necessary, as none of g, r, e, p are special to the shell (except in the Bourne shell if they appear in $IFS). The only character I've not quoted is the space character, as we do need its special meaning in the shell: separating arguments.

To look for `\\\"` provided it's not preceded by another backslash:

grep -e '^\\\\\\"' -e '[^\]\\\\\\"'

That is, look for \\\" at the beginning of the line, or following a character other than backslash.

This time, we have to use a regular expression; a fixed-string search won't do.

grep returns the lines that match any of those expressions. You can also write it with one expression per line:

grep '^\\\\\\"
[^\]\\\\\\"'

Or with only one expression:

grep '^\(.*[^\]\)\{0,1\}\\\{3\}"' # BRE
grep -E '^(.*[^\])?\\{3}"'        # ERE equivalent
grep -E '(^|[^\])\\{3}"'
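A quick way to see the "not preceded by another backslash" rule at work is to feed the expressions a line with four backslashes. The sketch below uses made-up sample lines and hypothetical helper names (count_bre, count_ere):

```shell
# Two-expression BRE form and single-expression ERE form from the text above.
count_bre() { grep -c -e '^\\\\\\"' -e '[^\]\\\\\\"' "$1"; }
count_ere() { grep -cE '(^|[^\])\\{3}"' "$1"; }

sample=$(mktemp)
# Line 1: match at the start of the line. Line 2: match after other text.
# Line 3: four backslashes before the quote, so it must not match.
printf '%s\n' '\\\" at line start' 'after text \\\"' 'four \\\\" here' > "$sample"

count_bre "$sample"
count_ere "$sample"

rm -f "$sample"
```

Both calls print 2: the third line is rejected because its three-backslash run is itself preceded by a backslash.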

With GNU grep built with PCRE support, you can use a negative look-behind assertion:

grep -P '(?<!\\)\\{3}"'

Get a match count

To get a count of the lines that match the pattern (that is, lines that have one or more occurrences of \\\"), add the -c option to grep. If, however, you want the number of occurrences, you can use the -o option (a GNU extension, though now also supported by a few other implementations) to print all the matches one per line, and then pipe to wc -l to get a line count:

grep -Po '(?<!\\)\\{3}"' | wc -l

Or, to stay within POSIX, use awk instead:

awk '{n+=gsub(/(^|[^\\])\\{3}"/,"")};END{print 0+n}'

(awk's gsub() substitutes and returns the number of substitutions).
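The difference between counting lines and counting occurrences can be seen on a sample where one line holds two matches. This sketch uses the ERE form rather than -P, so it does not depend on PCRE support; the helper names and the sample lines are hypothetical, and a grep with -o is assumed.

```shell
# Count lines that match versus individual matches.
count_match_lines() { grep -cE '(^|[^\])\\{3}"' "$1"; }
count_match_occurrences() {
    grep -oE '(^|[^\])\\{3}"' "$1" | wc -l | tr -d ' '
}

sample=$(mktemp)
# The first line contains the pattern twice, the second once, the third never.
printf '%s\n' '\\\" and x\\\"' 'y\\\"' 'none' > "$sample"

count_match_lines "$sample"        # lines with at least one match
count_match_occurrences "$sample"  # every match, one per output line

rm -f "$sample"
```

The first call prints 2 and the second prints 3. With the ERE form each match also consumes the character before the backslashes, which the look-behind version avoids.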