Linux – how to find offset of one binary file inside another

binary, files, large files, linux

I have two binary files.
One is a few hundred kilobytes and the other is a few gigabytes.
I want to know whether the whole smaller file is contained within the larger one and, if so, what its offset is from the start of the larger file.
I am interested only in exact matches, i.e. whether the whole file is contained in the other.
Both files are binary.
Is there an existing tool or one-liner that does that?

Best Answer

I could not come up with an existing tool.

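grep -F --binary --byte-offset --only-matching comes close, but with -F a newline in the pattern separates alternative patterns and cannot be escaped, so a pattern that contains newline bytes cannot be matched as a whole. cmp only lets you skip a fixed number of bytes before comparing, and diff does not seem to be of much help either.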

But it takes only a few lines in a programming language with a decent library. For example, as a C++ program using Boost:

#include <boost/algorithm/string/find.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <cassert>
#include <iostream>
using namespace boost;
using namespace boost::algorithm;
using namespace boost::iostreams;
using namespace std;

int main(int argc, char **argv)
{
  if (argc != 3) {
    cerr << "Call: " << argv[0] << " PATTERN_FILE SRC_FILE\n";
    return 3;
  }
  mapped_file_source pattern(argv[1]);
  mapped_file_source src(argv[2]);
  iterator_range<const char*> p_range(pattern.data(),
      pattern.data() + pattern.size());
  iterator_range<const char*> s_range(src.data(), src.data() + src.size());
  iterator_range<const char*> result = find_first(s_range, p_range);
  if (result) {
    size_t pos = result.begin()-s_range.begin();
    cout << pos << '\n';
    return 0;
  }
  return 1;
}

You can compile it like this (when the program source is saved as searchb.cc, so that make's implicit rule finds it):

$ make CXXFLAGS="-Wall -g" LDLIBS="-lboost_iostreams" searchb

To test it:

$ dd if=WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3 of=t skip=232323 bs=1 count=4K
$ ls -l t
-rw-r--r-- 1 juser users 4096 2012-05-31 15:24 t
$ ./searchb t WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3
232323

The output is the matching position in the source file.

If the pattern file is not contained in the source file, the exit status is 1.

Update: In the meantime I've implemented this simple tool in several languages (C/C++/Python/Rust/Go) and have included those implementations in my utility repository. Look for searchb*. The Python implementation is the shortest one and doesn't require any external dependencies.
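
To give an idea of how short a Python version can be, here is a minimal sketch that uses only the standard library. It is not the code from the repository mentioned above; the function name find_offset is made up for illustration. Like the Boost program, it memory-maps the large file instead of reading it into memory, and it returns -1 where the C++ program exits with status 1.

```python
import mmap
import os


def find_offset(pattern_path, source_path):
    """Return the byte offset of the first occurrence of the pattern file's
    contents inside the source file, or -1 if it does not occur."""
    with open(pattern_path, "rb") as f:
        pattern = f.read()  # the small file is assumed to fit in memory
    if os.path.getsize(source_path) == 0:
        # mmap cannot map an empty file; only an empty pattern matches it
        return 0 if not pattern else -1
    with open(source_path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return m.find(pattern)
```

Called as find_offset("t", "WTF_-_EPISODE_277_RACHAEL_HARRIS.mp3") on the files from the test above, it should report the same offset that searchb prints, since both return the first match.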

Related Solutions

Grep – Split Binary Data by Fixed Byte Offset

You can operate on the binary file without needing to go through xxd. I ran your data back through xxd and used grep -b to show me the byte offsets of your pattern (converted from hex to chars \xfa) in the binary file.

I used sed to strip the matched characters from the output, leaving just the numbers. I then set the shell's positional parameters to the resulting offsets (set -- ...)

xxd -r -p <data26.6.2015.txt >/tmp/f1
set -- $(grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//')

You now have a list of offsets in $1, $2, ... You can then extract the part that interests you with dd, setting a block size to 1 (bs=1) so that it reads byte by byte. skip= says how many bytes to skip in the input, and count= the number of bytes to copy.

start=$1 end=$2
let count=$end-$start
dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f2

The above extracts from the start of the 1st pattern to just before the 2nd pattern. To not include the pattern, you can add 4 to start (and count reduces by 4).

If you want to extract all parts, use a loop around this same code, and add starting offset 0 and ending offset size-of-file to the list of numbers:

xxd -r -p <data26.6.2015.txt >/tmp/f1
size=$(stat -c '%s' /tmp/f1)
set -- 0 $(grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//') $size
i=2
while [ $# -ge 2 ]
do start=$1 end=$2
   let count=$end-$start
   dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f$i
   let i=i+1
   shift
done
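
The same splitting logic can be sketched in Python, which sidesteps any doubt about how grep treats binary input. This is an illustration only, not part of the original answer: the names marker_offsets and split_at_markers are hypothetical, and the sketch assumes the raw file is small enough to read into memory.

```python
def marker_offsets(data, marker):
    """Offsets of non-overlapping occurrences of marker, scanning left to
    right (the same matches that grep -o reports)."""
    offsets = []
    pos = data.find(marker)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(marker, pos + len(marker))
    return offsets


def split_at_markers(data, marker):
    """Cut data at the start of every marker. Each piece after the first
    begins with the marker, like the files written by the dd loop."""
    bounds = [0] + marker_offsets(data, marker) + [len(data)]
    return [data[start:end] for start, end in zip(bounds, bounds[1:])]
```

With data read from /tmp/f1 in binary mode, split_at_markers(data, b"\xfa\xfa\xfa\xfa") yields the pieces in order; writing them to numbered files is then a short loop.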

If grep doesn't manage to work with the binary data, you can use the xxd hex dump data. First remove all the newlines to have one enormous line, then do the grep using the unescaped hex values, but then divide all the offsets by 2, and do the dd with the raw file:

xxd -r -p <data26.6.2015.txt >r328.raw
tr -d '\n' <data26.6.2015.txt >f1
size2=$(stat -c '%s' f1)   # f1 holds two hex digits per raw byte already
set -- 0 $(grep -b -a -o -P 'fafafafa' f1 | sed 's/:.*//') $size2
i=2
while [ $# -ge 2 ]
do  let start=$1/2
    let end=$2/2
    let count=$end-$start
    dd bs=1 count=$count skip=$start <r328.raw  >f$i
    let i=i+1
    shift
done
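
One caveat with searching the hex text: a match can begin at an odd character position, that is, in the middle of a byte, and halving such an offset gives a wrong position in the raw file. The Python sketch below illustrates one way to keep only byte-aligned matches. It is an assumption-laden illustration rather than part of the original answer: the function name is hypothetical, and the hex text is assumed to be lowercase with all newlines already removed.

```python
def byte_offsets_from_hex(hex_text, hex_marker):
    """Find hex_marker in a newline-free hex dump and return the offsets
    of byte-aligned matches, expressed in raw bytes."""
    offsets = []
    pos = hex_text.find(hex_marker)
    while pos != -1:
        if pos % 2 == 0:
            # even position: the match starts on a byte boundary
            offsets.append(pos // 2)
            pos = hex_text.find(hex_marker, pos + len(hex_marker))
        else:
            # odd position: the match straddles bytes, so skip it
            pos = hex_text.find(hex_marker, pos + 1)
    return offsets
```

For example, the bytes 0f af af af af a0 dump as "0fafafafafa0", which contains "fafafafa" at character position 1 even though the raw data has no run of four fa bytes.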

Replace string in a huge (70GB), one line, text file

The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).

If there's an ASCII character that appears frequently in the file and doesn't appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don't allow custom record separators, swap between that character and newlines. tr processes bytes, not lines, so it doesn't care about any record size. Supposing that ; works:

<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new

You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.

<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new

Alternatively, use the last character.

<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new

Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble.
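
If no suitable separator character exists, a different technique is to read the file in fixed-size chunks and carry over the last few bytes so that a match spanning two chunks is not missed. The Python sketch below illustrates that idea. It is not the tr and sed approach described above, and the name replace_stream and its chunk_size parameter are made up for this example.

```python
def replace_stream(src, dst, old, new, chunk_size=1 << 20):
    """Copy binary stream src to dst, replacing every occurrence of the
    bytes old with new, without holding the whole input in memory."""
    if not old:
        raise ValueError("old must not be empty")
    keep = len(old) - 1  # longest tail that could start an unfinished match
    buf = b""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        out = []
        pos = 0
        while True:
            i = buf.find(old, pos)
            if i == -1:
                break
            out.append(buf[pos:i])
            out.append(new)
            pos = i + len(old)
        # Hold back the last `keep` unmatched bytes: they may be the start
        # of a match that is completed by the next chunk.
        cut = max(pos, len(buf) - keep)
        out.append(buf[pos:cut])
        dst.write(b"".join(out))
        buf = buf[cut:]
    dst.write(buf)  # whatever is left cannot contain a full match
```

To use it, open corpus.txt with mode "rb" and corpus.txt.new with mode "wb" and call replace_stream(src, dst, b"<unk>", b"<raw_unk>"). Only original input bytes are ever rescanned, so text produced by a replacement is never matched again, which is also how sed's s///g behaves.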
