You can operate on the binary file without needing to go through xxd. I ran your data back through xxd and used grep -b to show me the byte offsets of your pattern (converted from hex to chars \xfa) in the binary file. I used sed to remove the matched characters from the output, leaving just the numbers. I then set the shell positional args to the resulting offsets (set -- ...)
xxd -r -p <data26.6.2015.txt >/tmp/f1
set -- $(grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//')
You now have a list of offsets in $1, $2, ...
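To see what the offsets look like, here is a minimal sketch on a made-up sample file (the file contents and the tmp directory are assumptions for illustration, not the data from the question). It also sets LC_ALL=C, on the assumption that in a UTF-8 locale grep -P and sed may treat \xfa as a character rather than a raw byte:

```shell
# Hypothetical sample: 3 data bytes, a 4-byte 0xfa marker, 5 data bytes,
# another marker, 2 data bytes. \372 is the octal escape for byte 0xfa.
tmp=$(mktemp -d)
printf 'abc\372\372\372\372defgh\372\372\372\372ij' >"$tmp/f1"

# LC_ALL=C makes both tools work on raw bytes instead of locale characters.
set -- $(LC_ALL=C grep -b -a -o -P '\xfa\xfa\xfa\xfa' "$tmp/f1" |
         LC_ALL=C sed 's/:.*//')

offsets="$*"
echo "found $# markers at byte offsets: $offsets"
```

Each offset is the zero-based position of the first byte of a marker, which is exactly the value dd expects for skip=.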
You can then extract the part that interests you with dd, setting a block size of 1 (bs=1) so that it reads byte by byte. skip= says how many bytes to skip in the input, and count= the number of bytes to copy.
start=$1 end=$2
let count=$end-$start
dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f2
The above extracts from the start of the 1st pattern to just before the 2nd
pattern. To not include the pattern, you can add 4 to start (and count
reduces by 4).
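The adjustment for leaving out the pattern can be sketched like this, again on a made-up sample file; the offsets 3 and 12 and the variable patlen are assumptions that match this sample only:

```shell
# Hypothetical sample: markers start at byte offsets 3 and 12.
tmp=$(mktemp -d)
printf 'abc\372\372\372\372defgh\372\372\372\372ij' >"$tmp/f1"

start=3 end=12 patlen=4
skip=$((start + patlen))           # begin just after the first marker
count=$((end - start - patlen))    # stop just before the second marker

dd bs=1 skip="$skip" count="$count" <"$tmp/f1" >"$tmp/f2" 2>/dev/null
payload=$(cat "$tmp/f2")
echo "$payload"                    # the bytes between the two markers
```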
If you want to extract all parts, use a loop around this same code, and add
starting offset 0 and ending offset size-of-file to the list of numbers:
xxd -r -p <data26.6.2015.txt >/tmp/f1
size=$(stat -c '%s' /tmp/f1)
set -- 0 $(grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//') $size
i=2
while [ $# -ge 2 ]
do start=$1 end=$2
let count=$end-$start
dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f$i
let i=i+1
shift
done
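With bs=1, dd issues one read and one write per byte, which gets slow on large files. As an alternative (not what the loop above uses), the same slice can be cut with tail -c and head -c. This is a sketch on a made-up sample, and it assumes a head that supports -c, as the GNU and BSD versions do:

```shell
# Hypothetical sample with a marker at byte offsets 3 and 12.
tmp=$(mktemp -d)
printf 'abc\372\372\372\372defgh\372\372\372\372ij' >"$tmp/f1"
start=3 end=12

# Reference result: byte-by-byte copy with dd.
dd bs=1 skip="$start" count="$((end - start))" <"$tmp/f1" >"$tmp/part_dd" 2>/dev/null

# tail -c +N starts at byte N counting from 1, hence start + 1.
tail -c +"$((start + 1))" "$tmp/f1" | head -c "$((end - start))" >"$tmp/part_th"

size=$(($(wc -c <"$tmp/part_th")))
cmp "$tmp/part_dd" "$tmp/part_th" && echo "both methods wrote the same $size bytes"
```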
If grep doesn't manage to work with the binary data, you can search the xxd hex dump instead. First remove all the newlines to get one enormous line, then grep for the unescaped hex values. Each byte is two hex characters, so divide all the offsets by 2, and do the dd with the raw file:
xxd -r -p <data26.6.2015.txt >r328.raw
tr -d '\n' <data26.6.2015.txt >f1
let size2=2*$(stat -c '%s' r328.raw)
set -- 0 $(grep -b -a -o -P 'fafafafa' f1 | sed 's/:.*//') $size2
i=2
while [ $# -ge 2 ]
do let start=$1/2
let end=$2/2
let count=$end-$start
dd bs=1 count=$count skip=$start <r328.raw >f$i
let i=i+1
shift
done
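One thing to watch with the hex-dump approach: a match at an odd character offset starts in the middle of a byte, so it is not a real match in the binary data, and integer division by 2 would silently hide that. A minimal sketch of filtering those out, on a made-up hex string (the file name hex and the variable aligned are assumptions):

```shell
# Hypothetical hex text for the bytes 0f af af af af a0 fa fa fa fa.
# "fafafafa" appears at character offset 1 (mid-byte) and at offset 12.
tmp=$(mktemp -d)
printf '0fafafafafa0fafafafa' >"$tmp/hex"

aligned=""
for off in $(grep -b -o 'fafafafa' "$tmp/hex" | cut -d: -f1); do
    # Keep only even character offsets, which fall on a byte boundary.
    if [ $((off % 2)) -eq 0 ]; then
        aligned="$aligned $((off / 2))"
    fi
done
aligned=${aligned# }
echo "byte-aligned matches at byte offsets: $aligned"
```

Because grep -o reports non-overlapping matches, a misaligned match can in principle also mask an aligned one that overlaps it, so this filter is a sanity check rather than a complete fix.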
The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).
If there's an ASCII character that appears frequently in the file and doesn't appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don't allow custom record separators, swap between that character and newlines. tr processes bytes, not lines, so it doesn't care about any record size. Supposing that ; works:
<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new
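To check the swap on something small before running it on the real corpus, here is a sketch with a made-up one-line sample (the contents and the tmp directory are assumptions). It relies on sed not adding a final newline to the last partial record, which GNU sed does not:

```shell
# Hypothetical one-line corpus in which ';' occurs regularly.
tmp=$(mktemp -d)
printf 'a <unk> b; c <unk> d; e\n' >"$tmp/corpus.txt"

# After the first tr, sed sees three short records instead of one long line.
<"$tmp/corpus.txt" tr '\n;' ';\n' |
  sed 's/<unk>/<raw_unk>/g' |
  tr '\n;' ';\n' >"$tmp/corpus.txt.new"

result=$(cat "$tmp/corpus.txt.new")
echo "$result"
```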
You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.
<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new
Alternatively, use the last character.
<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new
Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble.
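A quick way to find out how a given sed treats input without a final newline is to feed it a single byte and count what comes out. This is only a sketch; the variable name out_bytes is made up:

```shell
# One byte in, no trailing newline. A sed that preserves the missing
# newline writes 1 byte; one that appends a newline writes 2.
out_bytes=$(($(printf 'x' | sed 's/x/y/' | wc -c)))

if [ "$out_bytes" -eq 1 ]; then
    echo "sed preserved the missing final newline"
else
    echo "sed changed the end of the stream ($out_bytes bytes out)"
fi
```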
Best Answer
I could not come up with an existing tool.
grep -F --binary --byte-offset --only-matching seems to be close enough, but you can't escape newlines with -F. And cmp only allows skipping characters. diff also does not seem to be of much help.
But it is a few-liner in a programming language with a decent library. For example as a C++ program using Boost:
You can compile it like this (when the program source is saved as find.cc):
To test it:
The output is the matching position in the source file. If the file is not contained, the exit status is 1.
Update: In the meantime I've implemented this simple tool in several languages (C/C++/Python/Rust/Go) and have included those implementations in my utility repository. Look for searchb*. The Python implementation is the shortest one and doesn't require any external dependencies.