tr complains of “Illegal byte sequence”

Tags: binary, character encoding, grep, text processing, tr

I'm brand new to UNIX and I am using Kirk McElhearn's "The Mac OS X Command Line" to teach myself some commands.

I am attempting to use tr and grep so that I can search for text strings in a regular MS-Office Word Document.

$ tr '\r' '\n' < target-file | grep search-string

But all it returns is:

Illegal byte sequence.

robomechanoid:Position-Paper-Final-Draft robertjralph$ tr '\r' '\n' < Position-Paper-Final-Version.docx | grep DeCSS
tr: Illegal byte sequence
robomechanoid:Position-Paper-Final-Draft robertjralph$

I've actually run the same line on a script that I created in vi and it does the search correctly.

Best Answer

grep is a text processing tool: it expects its input to be text. The same seems to go for tr on macOS (even though tr is supposed to support binary files).

Computers store data as sequences of bytes. A text is a sequence of characters. There are several ways to encode characters as bytes, called character encodings. The de facto standard character encoding in most of the world, especially on OSX, is UTF-8, which is an encoding for the Unicode character set. There are only 256 possible bytes, but over a million possible Unicode characters, so most characters are encoded as multiple bytes. UTF-8 is a variable-length encoding: depending on the character, it can take from one to four bytes to encode a character. Some sequences of bytes do not represent any character in UTF-8. Therefore, there are sequences of bytes which are not valid UTF-8 text files.

tr is complaining because it encountered such a byte sequence. It expects to see a text file encoded in UTF-8, but it sees binary data which is not valid UTF-8.
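One way to check whether a file is valid UTF-8 before handing it to text tools is to ask iconv to convert it from UTF-8 to UTF-8, which fails at the first invalid sequence. This is only a sketch: is_valid_utf8 is a hypothetical helper name, not a standard command.

```shell
# Hypothetical helper: succeed only if the file decodes cleanly as UTF-8.
# iconv exits with a non-zero status when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}
```

For example, the following should print a message for the Word document from the question:

is_valid_utf8 Position-Paper-Final-Version.docx || echo "not valid UTF-8"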

A Microsoft Word document is not a text file: it's a word processing document. Word processing formats encode not only text, but also formatting, embedded images, etc., so the bytes on disk are not plain text.

You can instruct text processing tools to operate on bytes by changing the locale. Specifically, select the “C” locale, which basically means “nothing fancy”. On the command line, you can choose locale settings with environment variables.

export LC_CTYPE=C
tr '\r' '\n' < target-file | grep search-string

This will not emit any error, but it won't do anything useful either since target-file is still a binary file which is unlikely to contain most search strings that you'll specify.
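The export changes the locale for the rest of the shell session. A variant that limits the change to a single command is to put the assignment in front of that command. The sketch below uses LC_ALL, which takes precedence over LC_CTYPE and LANG; bytes_cr_to_lf is a hypothetical name chosen for illustration.

```shell
# Hypothetical helper: translate CR to LF byte by byte.
# The LC_ALL=C prefix applies to this one tr invocation only,
# so the rest of the session keeps its usual locale.
bytes_cr_to_lf() {
    LC_ALL=C tr '\r' '\n'
}
```

It reads standard input, so it drops into the same pipeline:

bytes_cr_to_lf < target-file | grep search-string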

Incidentally, tr '\r' '\n' is not a very useful command unless you have text files left over from Mac OS 9 or older. \r (carriage return) was the newline separator in Mac OS before Mac OS X. Since OSX, the newline separator is \n (line feed, the unix standard) and text files do not contain carriage returns. Windows uses the two-character sequence CR-LF to represent line breaks; tr -d '\r' would convert a Windows text file into a Unix/Linux/OSX text file.
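As a small illustration of the Windows case, the sketch below wraps tr -d '\r' and adds a check for whether a file contains carriage returns at all. Both function names are hypothetical.

```shell
# Hypothetical helper: convert Windows (CR-LF) line endings on stdin to LF.
# It deletes every carriage return, including any stray CR not followed by LF.
crlf_to_lf() {
    tr -d '\r'
}

# Hypothetical helper: count the lines of a file that contain a carriage
# return, a quick way to tell whether the conversion is needed.
count_cr_lines() {
    LC_ALL=C grep -c "$(printf '\r')" < "$1"
}
```

A count of 0 from count_cr_lines means the file already uses Unix line endings.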

So how can you search in a Word document from the command line? A .docx Word document is actually a zip archive containing several files, the main ones being in XML.

unzip -l Position-Paper-Final-Version.docx

Mac OS X includes the zipgrep utility to search inside zip files.

zipgrep DeCSS Position-Paper-Final-Version.docx

The result is not going to be very readable because XML files in the docx format mostly consist of one huge line. If you want to search inside the main body text of the document, extract the file word/document.xml from the archive. Note that in addition to the document text, this file contains XML markup which represents the structure of the document. You can massage the XML markup a bit with sed to split it into manageable lines.

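# Backslash-newline is the portable way to put a newline in a sed replacement;
# not every sed understands \n there.
unzip -p Position-Paper-Final-Version.docx word/document.xml |
sed -e 's/></>\
</g' |
grep DeCSS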
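If the markup itself gets in the way, a rough alternative is to strip the tags before searching. Word can split a single word across several runs, so removing the tags also rejoins text that the markup interrupts. The sketch below assumes the usual w: prefix, with paragraphs closed by </w:p>. It does not decode XML entities such as &amp;amp;. The name docx_xml_to_text is hypothetical.

```shell
# Hypothetical helper: reduce WordprocessingML on stdin to rough plain text.
docx_xml_to_text() {
    # First expression: end each paragraph (</w:p>) with a newline.
    # Second expression: drop every remaining tag.
    sed -e 's/<\/w:p>/\
/g' -e 's/<[^>]*>//g'
}
```

It would slot into the pipeline above in place of the sed step:

unzip -p Position-Paper-Final-Version.docx word/document.xml | docx_xml_to_text | grep DeCSS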

Related Solutions

Iconv illegal input sequence- why

The file is encoded in ISO-8859-1, not in UTF-8:

$ hd 0606461.txt | grep -B1 '^0002c520'
0002c510  64 75 6d 20 66 65 72 69  65 6e 74 20 72 75 69 6e  |dum ferient ruin|
0002c520  e6 0d 0a 2d 2d 48 6f 72  61 63 65 2e 0d 0a 0d 0a  |...--Horace.....|

And the byte "e6" alone is not a valid UTF-8 sequence.

So, use iconv -f latin1 -t ascii//TRANSLIT file.
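The ascii//TRANSLIT target approximates characters that ASCII cannot represent, and the exact result depends on the iconv implementation. If the goal is to keep the original characters instead, converting to UTF-8 is an alternative. A minimal sketch, with latin1_to_utf8 as a hypothetical name:

```shell
# Hypothetical helper: re-encode ISO-8859-1 (Latin-1) text on stdin as UTF-8.
# The byte 0xe6 from the hex dump above is U+00E6 in Latin-1, which UTF-8
# encodes as the two bytes c3 a6.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}
```

Usage, writing the converted copy to a new file (the output name is arbitrary):

latin1_to_utf8 < 0606461.txt > 0606461.utf8.txt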

Grep – Split Binary Data by Fixed Byte Offset

You can operate on the binary file directly instead of working on the xxd hex dump. I converted your data back to binary with xxd -r and used grep -b to show the byte offsets of your pattern (written as \xfa bytes rather than hex digits) in the binary file.

I then used sed to strip the matched characters from the output, leaving just the numbers, and set the shell positional args to the resulting offsets (set -- ...)

xxd -r -p <data26.6.2015.txt >/tmp/f1
set -- $(grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//')

You now have a list of offsets in $1, $2, ... You can then extract the part that interests you with dd, setting a block size to 1 (bs=1) so that it reads byte by byte. skip= says how many bytes to skip in the input, and count= the number of bytes to copy.

start=$1 end=$2
let count=$end-$start
dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f2

The above extracts from the start of the 1st pattern to just before the 2nd pattern. To not include the pattern, you can add 4 to start (and count reduces by 4).
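The offset arithmetic for leaving out the pattern can be sketched as a small function. The name extract_between and its argument order are assumptions made for this example. It uses POSIX $(( )) arithmetic rather than let.

```shell
# Hypothetical helper: print the bytes that lie between two pattern matches.
#   $1 = file
#   $2 = byte offset of the first match
#   $3 = byte offset of the second match
#   $4 = pattern length in bytes (4 for \xfa\xfa\xfa\xfa)
extract_between() {
    start=$(( $2 + $4 ))        # skip past the first pattern
    count=$(( $3 - start ))     # stop just before the second pattern
    dd bs=1 skip="$start" count="$count" < "$1" 2>/dev/null
}
```

With the offsets already in the positional args, the call would be:

extract_between /tmp/f1 "$1" "$2" 4 >/tmp/f2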

If you want to extract all parts, use a loop around this same code, and add starting offset 0 and ending offset size-of-file to the list of numbers:

xxd -r -p <data26.6.2015.txt >/tmp/f1
size=$(stat -c '%s' /tmp/f1)
set -- 0 $(grep -b -a -o -P '\xfa\xfa\xfa\xfa' /tmp/f1 | sed 's/:.*//') $size
i=2
while [ $# -ge 2 ]
do start=$1 end=$2
   let count=$end-$start
   dd bs=1 count=$count skip=$start </tmp/f1 >/tmp/f$i
   let i=i+1
   shift
done
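With bs=1, dd reads and writes one byte at a time, which can be slow on large files. A possible alternative, assuming your tail and head support the -c option (the GNU and BSD versions do), is to slice with those two tools. slice_bytes is a hypothetical name.

```shell
# Hypothetical helper: print count bytes of a file, starting at a byte offset.
#   $1 = file, $2 = start offset (0-based), $3 = count
slice_bytes() {
    # tail -c +N is 1-based: +1 means "from the first byte".
    tail -c "+$(( $2 + 1 ))" "$1" | head -c "$3"
}
```

Inside the loop, the dd line could then be replaced by:

slice_bytes /tmp/f1 "$start" "$count" >/tmp/f$i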

If grep doesn't manage to work with the binary data, you can work on the xxd hex dump instead. First remove all the newlines to get one enormous line, then grep for the pattern as plain hex digits. Each byte is two hex digits, so divide all the offsets by 2 before running dd on the raw file:

xxd -r -p <data26.6.2015.txt >r328.raw
tr -d '\n' <data26.6.2015.txt >f1
let size2=$(stat -c '%s' f1)   # f1 has 2 hex digits per byte; the loop halves this
set -- 0 $(grep -b -a -o -P 'fafafafa' f1 | sed 's/:.*//') $size2
i=2
while [ $# -ge 2 ]
do  let start=$1/2
    let end=$2/2
    let count=$end-$start
    dd bs=1 count=$count skip=$start <r328.raw  >f$i
    let i=i+1
    shift
done
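One caveat with searching the hex text: a match can start in the middle of a hex pair (for example inside the bytes 0f af af af a0), and such a match does not correspond to the byte pattern. A hedged sketch of a filter that keeps only even offsets and halves them follows. The name hex_to_byte_offsets is hypothetical.

```shell
# Hypothetical helper: read hex-digit offsets (one per line) on stdin and
# print the corresponding byte offsets. Odd offsets are dropped, because a
# match that starts mid-pair does not begin on a byte boundary.
hex_to_byte_offsets() {
    while read -r off; do
        if [ $(( off % 2 )) -eq 0 ]; then
            echo $(( off / 2 ))
        fi
    done
}
```

It would sit after the sed step, in which case the offsets no longer need dividing by 2 inside the loop:

grep -b -a -o -P 'fafafafa' f1 | sed 's/:.*//' | hex_to_byte_offsets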
