The file is encoded in ISO-8859-1, not in UTF-8:
$ hd 0606461.txt | grep -B1 '^0002c520'
0002c510 64 75 6d 20 66 65 72 69 65 6e 74 20 72 75 69 6e |dum ferient ruin|
0002c520 e6 0d 0a 2d 2d 48 6f 72 61 63 65 2e 0d 0a 0d 0a |...--Horace.....|
And the byte "e6" alone is not a valid UTF-8 sequence.
So, convert it with iconv:
iconv -f latin1 -t ascii//TRANSLIT file
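As a quick check, the sketch below fabricates a one-line sample containing the same 0xe6 byte (a stand-in, not the original 0606461.txt), asks iconv whether it is valid UTF-8, and then re-encodes it from ISO-8859-1 to UTF-8. Re-encoding is an alternative to the ASCII transliteration if you would rather keep the æ:

```shell
# Build a tiny sample file containing the Latin-1 byte 0xE6 (æ).
sample=$(mktemp)
printf 'ruin\346\n' >"$sample"

# iconv exits non-zero when the input is not valid in the source encoding.
if iconv -f UTF-8 -t UTF-8 "$sample" >/dev/null 2>&1; then
    valid_utf8=yes
else
    valid_utf8=no
fi
echo "valid UTF-8: $valid_utf8"

# Re-encode from ISO-8859-1 to UTF-8 and show the resulting bytes as hex.
converted=$(iconv -f ISO-8859-1 -t UTF-8 "$sample" | od -An -tx1 | tr -d ' \n')
echo "$converted"
rm -f "$sample"
```

In the hex output, the single byte e6 has become the two-byte sequence c3 a6.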
You can operate on the binary file directly, without working on the xxd hex dump.
I converted your data back to binary with xxd -r and used grep -b to show me the byte offsets of your pattern (converted from hex to chars, \xfa) in the binary file.
I used sed to strip the matched characters from the output, leaving just the numbers.
I then set the shell positional args to the resulting offsets (set -- ...):
xxd -r -p <data26.6.2015.txt >/tmp/f1
set -- $(grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//')
You now have a list of offsets in $1, $2, ...
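To see what that pipeline produces, here is a small sketch on a fabricated sample rather than your data. I am assuming GNU grep with -P support. The LC_ALL=C prefix is my addition: in a UTF-8 locale, \xfa may be treated as a character rather than a raw byte, and sed's . may refuse to match invalid bytes.

```shell
# Sample: "abc", marker, "hello", marker, "xy" (marker = four 0xfa bytes).
f=$(mktemp)
printf 'abc\372\372\372\372hello\372\372\372\372xy' >"$f"

# -b prints the byte offset, -o prints only the match, -a treats binary as text.
offsets=$(LC_ALL=C grep -b -a -o -P '\xfa\xfa\xfa\xfa' "$f" | LC_ALL=C sed 's/:.*//')

# Word splitting on the unquoted variable is intentional here.
set -- $offsets
found="$*"
echo "found $# markers at offsets: $found"
rm -f "$f"
```

The first marker sits after the 3 bytes of "abc", the second after 3 + 4 + 5 = 12 bytes.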
You can then extract the part that interests you with dd, setting a block size of 1 (bs=1) so that it reads byte by byte. skip= says how many bytes to skip in the input, and count= is the number of bytes to copy.
start=$1 end=$2
let count=$end-$start
dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f2
The above extracts from the start of the 1st pattern to just before the 2nd pattern. To exclude the pattern itself, add 4 (the pattern length in bytes) to start, which reduces count by 4.
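A minimal sketch of that adjustment, on a fabricated sample with the two marker offsets written in by hand (marker_len is a name I made up for the pattern length):

```shell
f=$(mktemp); out=$(mktemp)
printf 'abc\372\372\372\372hello\372\372\372\372xy' >"$f"

start=3 end=12   # offsets of the two markers in this sample
marker_len=4     # length of \xfa\xfa\xfa\xfa in bytes

# Begin after the first marker and stop just before the second one.
dd bs=1 skip=$((start + marker_len)) count=$((end - start - marker_len)) <"$f" >"$out" 2>/dev/null

payload=$(cat "$out")
echo "$payload"
rm -f "$f" "$out"
```

Here dd skips 7 bytes and copies 5, which is exactly the text between the two markers.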
If you want to extract all parts, use a loop around this same code, and add
starting offset 0 and ending offset size-of-file to the list of numbers:
xxd -r -p <data26.6.2015.txt >/tmp/f1
size=$(stat -c '%s' /tmp/f1)
set -- 0 $(grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//') $size
i=2
while [ $# -ge 2 ]
do start=$1 end=$2
let count=$end-$start
dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f$i
let i=i+1
shift
done
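dd with bs=1 issues one read and one write per byte, which can be slow on large files. One possible alternative is tail -c piped into head -c. It is sketched here with a hypothetical helper, extract_range, that is not part of the loop above:

```shell
# extract_range FILE START END: print bytes START..END-1 of FILE (0-based offsets).
extract_range() {
    # tail -c +N begins output at byte number N counted from 1, hence the +1.
    tail -c "+$(($2 + 1))" "$1" | head -c "$(($3 - $2))"
}

f=$(mktemp)
printf 'abc\372\372\372\372hello\372\372\372\372xy' >"$f"
first=$(extract_range "$f" 0 3)
middle=$(extract_range "$f" 7 12)
echo "$first $middle"
rm -f "$f"
```

Inside the loop, the dd line could then become something like extract_range /tmp/f1 $start $end >/tmp/f$i.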
If grep doesn't manage to work with the binary data, you can use the xxd hex dump instead. First remove all the newlines to get one enormous line, then grep for the plain hex digits. Each byte is two hex characters, so divide all the offsets by 2, and run dd on the raw file:
xxd -r -p <data26.6.2015.txt >r328.raw
tr -d '\n' <data26.6.2015.txt >f1
# f1 is hex text, so its size is already twice the raw size; the loop halves it
size2=$(stat -c '%s' f1)
set -- 0 $(grep -b -a -o -P 'fafafafa' f1 | sed 's/:.*//') $size2
i=2
while [ $# -ge 2 ]
do let start=$1/2
let end=$2/2
let count=$end-$start
dd bs=1 count=$count skip=$start <r328.raw >f$i
let i=i+1
shift
done
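One thing to watch with the hex-text approach: a match can begin on an odd character offset, in the middle of a byte, so it is worth keeping only even offsets before halving. A small sketch with a made-up hex string (not taken from data26.6.2015.txt):

```shell
hex=$(mktemp)
# Bytes: 61 1f af af af af a0 fa fa fa fa 7a. The real marker starts at byte 7.
printf '611fafafafafa0fafafafa7a' >"$hex"

# Character offsets of every textual match, aligned or not.
all=$(grep -b -o 'fafafafa' "$hex" | cut -d: -f1 | tr '\n' ' ')

# Keep even offsets only and convert them to byte offsets.
aligned=$(grep -b -o 'fafafafa' "$hex" | cut -d: -f1 | awk '$1 % 2 == 0 { print $1 / 2 }')

echo "hex offsets: $all"
echo "byte offsets: $aligned"
rm -f "$hex"
```

The match at character offset 3 straddles the bytes 1f af af af af, so it is not a real marker. Only the match at offset 14 (byte 7) is.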
Best Answer
grep is a text processing tool. It expects its input to be text files. It seems that the same goes for tr on macOS (even though tr is supposed to support binary files).
Computers store data as sequences of bytes. A text is a sequence of characters. There are several ways to encode characters as bytes, called character encodings. The de facto standard character encoding in most of the world, especially on OSX, is UTF-8, which is an encoding for the Unicode character set. There are only 256 possible bytes, but over a million possible Unicode characters, so most characters are encoded as multiple bytes. UTF-8 is a variable-length encoding: depending on the character, it can take from one to four bytes to encode a character. Some sequences of bytes do not represent any character in UTF-8. Therefore, there are sequences of bytes which are not valid UTF-8 text files.
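To make the variable length concrete, this sketch counts the bytes of four characters. The characters are my own picks, written as octal escapes so the snippet does not depend on the terminal's encoding:

```shell
# Octal escapes spell out the UTF-8 bytes so the script itself stays ASCII.
ascii_len=$(printf 'a' | wc -c | tr -d ' ')                 # U+0061
latin_len=$(printf '\303\251' | wc -c | tr -d ' ')          # U+00E9 (é)
euro_len=$(printf '\342\202\254' | wc -c | tr -d ' ')       # U+20AC (€)
emoji_len=$(printf '\360\237\230\200' | wc -c | tr -d ' ')  # U+1F600
echo "$ascii_len $latin_len $euro_len $emoji_len"
```

Each character is a single code point, yet the byte counts run from one to four.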
tr is complaining because it encountered such a byte sequence. It expects to see a text file encoded in UTF-8, but it sees binary data which is not valid UTF-8.
A Microsoft Word document is not a text file: it's a word processing document. Word processing document formats encode not only text, but also formatting, embedded images, etc. The Word format, like most word processing formats, is not a text format.
You can instruct text processing tools to operate on bytes by changing the locale. Specifically, select the “C” locale, which basically means “nothing fancy”. On the command line, you can choose locale settings with environment variables such as LC_CTYPE or LC_ALL, set in front of the command (for example LC_ALL=C).
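For instance, a pipeline along the following lines should run without tr complaining. This is only a sketch: target-file and search-string are placeholders, the printf line merely fabricates a small non-UTF-8 input, and -a is there so that GNU grep counts the matching line instead of reporting a binary file.

```shell
# Placeholder input: CR-terminated lines plus one byte (0xfa) that is not valid UTF-8.
printf 'first line\rsearch-string \372\rlast line\r' >target-file

# LC_ALL=C makes tr and grep treat their input as plain bytes.
matches=$(LC_ALL=C tr '\r' '\n' <target-file | LC_ALL=C grep -a -c 'search-string')
echo "$matches matching line(s)"
rm -f target-file
```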
This will not emit any error, but it won't do anything useful either since target-file is still a binary file which is unlikely to contain most search strings that you'll specify.
Incidentally, tr '\r' '\n' is not a very useful command unless you have text files left over from Mac OS 9 or older. \r (carriage return) was the newline separator in Mac OS before Mac OS X. Since OSX, the newline separator is \n (line feed, the unix standard) and text files do not contain carriage returns. Windows uses the two-character sequence CR-LF to represent line breaks; tr -d '\r' would convert a Windows text file into a Unix/Linux/OSX text file.
So how can you search in a Word document from the command line? A .docx Word document is actually a zip archive containing several files, the main ones being in XML.
Mac OS X includes the zipgrep utility to search inside zip files.
The result is not going to be very readable because XML files in the docx format mostly consist of one huge line. If you want to search inside the main body text of the document, extract the file word/document.xml from the archive. Note that in addition to the document text, this file contains XML markup which represents the structure of the document. You can massage the XML markup a bit with sed to split it into manageable lines.
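As a rough sketch of that last step, the snippet below runs sed over a made-up fragment that imitates the shape of word/document.xml. The sample string and the choice to break after each closing w:p tag are my assumptions, not something the docx format requires.

```shell
# Made-up sample imitating the single-line XML found in word/document.xml.
xml='<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p><w:p><w:r><w:t>World</w:t></w:r></w:p></w:body>'

# First expression: end each paragraph (</w:p>) with a newline.
# Second expression: drop whatever tags remain, leaving the text.
text=$(printf '%s\n' "$xml" | sed -e 's/<\/w:p>/\
/g' -e 's/<[^>]*>//g')
echo "$text"
```

On a real document the input could come from unzip -p file.docx word/document.xml (file.docx being a placeholder) instead of the sample string, and the output can then be piped to grep.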