Tr complains of “Illegal byte sequence”

Tags: binary, character encoding, grep, text processing, tr

I'm brand new to UNIX and I am using Kirk McElhearn's "The Mac OS X Command Line" to teach myself some commands.

I am attempting to use tr and grep so that I can search for text strings in a regular MS-Office Word Document.

$ tr '\r' '\n' < target-file | grep search-string

But all it returns is:

Illegal byte sequence.

robomechanoid:Position-Paper-Final-Draft robertjralph$ tr '\r' '\n' < Position-Paper-Final-Version.docx | grep DeCSS
tr: Illegal byte sequence
robomechanoid:Position-Paper-Final-Draft robertjralph$ 

I've actually run the same line on a script that I created in vi and it does the search correctly.

Best Answer

grep is a text processing tool: it expects its input to be text. The same seems to go for tr on macOS (even though tr is supposed to support binary files).

Computers store data as sequences of bytes. A text is a sequence of characters. There are several ways to encode characters as bytes, called character encodings. The de facto standard character encoding in most of the world, especially on OSX, is UTF-8, which is an encoding for the Unicode character set. There are only 256 possible bytes, but over a million possible Unicode characters, so most characters are encoded as multiple bytes. UTF-8 is a variable-length encoding: depending on the character, it can take from one to four bytes to encode a character. Some sequences of bytes do not represent any character in UTF-8. Therefore, there are sequences of bytes which are not valid UTF-8 text files.

tr is complaining because it encountered such a byte sequence. It expects to see a text file encoded in UTF-8, but it sees binary data which is not valid UTF-8.
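The two points above (characters take a varying number of bytes, and some byte sequences are not valid UTF-8 at all) can be checked from the shell. This is only a sketch: byte_len and is_valid_utf8 are hypothetical helper names, and it assumes iconv is installed and exits with a non-zero status on invalid input. Octal escapes are used so that the example does not depend on the terminal's own encoding.

```shell
# Hypothetical helper: print how many bytes the given printf escape string occupies.
byte_len() {
    printf "$1" | wc -c | tr -d ' '
}

# Hypothetical helper: succeed only if standard input is valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

byte_len 'A'                  # prints 1 (U+0041, plain ASCII)
byte_len '\303\251'           # prints 2 (U+00E9, e with acute accent)
byte_len '\342\202\254'       # prints 3 (U+20AC, euro sign)
byte_len '\360\237\230\200'   # prints 4 (U+1F600, an emoji)

# C3 A9 is a well-formed two-byte sequence; FF and FE never occur in UTF-8.
printf 'caf\303\251\n' | is_valid_utf8 && echo "valid UTF-8"
printf '\377\376\n'    | is_valid_utf8 || echo "not valid UTF-8"
```

A compressed .docx file contains plenty of bytes like the last example, which is presumably what tr tripped over.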

A Microsoft Word document is not a text file: it's a word processing document. Word processing formats encode not only text but also formatting, embedded images, and so on, so the bytes on disk are not plain text in any character encoding.

You can instruct text processing tools to operate on bytes by changing the locale. Specifically, select the “C” locale, which basically means “nothing fancy”. On the command line, you choose locale settings through environment variables.

export LC_CTYPE=C
tr '\r' '\n' < target-file | grep search-string

This will not emit any error, but it won't do anything useful either since target-file is still a binary file which is unlikely to contain most search strings that you'll specify.
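If you would rather not change the locale for the rest of the shell session, the variable can be set for a single command instead. The sketch below uses LC_ALL rather than LC_CTYPE because LC_ALL, when set, overrides every other LC_* variable; cr_to_lf is a hypothetical helper name, and the sample input is made up to include a byte (FF) that is not valid UTF-8.

```shell
# Hypothetical helper: translate CR to LF byte for byte, whatever the encoding.
# The assignment in front of tr applies to that one command only.
cr_to_lf() {
    LC_ALL=C tr '\r' '\n'
}

# Input with two carriage returns and one invalid UTF-8 byte (octal 377).
printf 'one\rtwo\377\rthree\n' | cr_to_lf | wc -l   # counts 3 lines
```

With a real file this would be used as cr_to_lf < target-file | grep search-string.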

Incidentally, tr '\r' '\n' is not a very useful command unless you have text files left over from Mac OS 9 or older. \r (carriage return) was the newline separator in Mac OS before Mac OS X. Since OSX, the newline separator is \n (line feed, the unix standard) and text files do not contain carriage returns. Windows uses the two-character sequence CR-LF to represent line breaks; tr -d '\r' would convert a Windows text file into a Unix/Linux/OSX text file.
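As a small illustration of the Windows case, here is a sketch with made-up sample input; crlf_to_lf is a hypothetical helper name. Note that tr -d '\r' removes every carriage return, including any that are not part of a CR-LF pair.

```shell
# Hypothetical helper: strip carriage returns so that CR-LF line endings become LF.
crlf_to_lf() {
    LC_ALL=C tr -d '\r'
}

# Dump the bytes before and after; the \r entries disappear in the second dump.
printf 'first line\r\nsecond line\r\n' | od -c
printf 'first line\r\nsecond line\r\n' | crlf_to_lf | od -c
```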

So how can you search in a Word document from the command line? A .docx Word document is actually a zip archive containing several files, the main ones being in XML.

unzip -l Position-Paper-Final-Version.docx
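One quick way to confirm that a file really is a zip archive is to look at its first two bytes: zip files start with the ASCII letters PK. The following is a minimal sketch, assuming a head that supports the -c option (as on macOS and Linux); looks_like_zip is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the file starts with the zip signature "PK".
looks_like_zip() {
    [ "$(LC_ALL=C head -c 2 "$1" 2>/dev/null)" = "PK" ]
}

looks_like_zip Position-Paper-Final-Version.docx && echo "zip container"
```

This only checks the signature, so it says nothing about whether the archive is intact or actually a Word document.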

Mac OS X includes the zipgrep utility to search inside zip files.

zipgrep DeCSS Position-Paper-Final-Version.docx

The result is not going to be very readable because XML files in the docx format mostly consist of one huge line. If you want to search inside the main body text of the document, extract the file word/document.xml from the archive. Note that in addition to the document text, this file contains XML markup which represents the structure of the document. You can massage the XML markup a bit with sed to split it into manageable lines.

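# A backslash followed by a literal newline is the portable way to insert a line break;
# not every BSD sed (as shipped with macOS) expands \n in the replacement text.
unzip -p Position-Paper-Final-Version.docx word/document.xml |
sed -e 's/></>\
</g' |
grep DeCSS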
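If you only care about the text and not the markup, a rough alternative is to turn each paragraph end tag (</w:p> in the docx XML) into a line break and then delete all remaining tags. This is a sketch rather than a real XML parser: docx_xml_to_text is a hypothetical helper name, and XML entities such as &amp; are left encoded.

```shell
# Hypothetical helper: read document.xml on stdin, print one paragraph per line.
docx_xml_to_text() {
    sed -e 's/<\/w:p>/\
/g' -e 's/<[^>]*>//g'
}

# Made-up sample markup standing in for word/document.xml.
printf '<w:p><w:r><w:t>About DeCSS</w:t></w:r></w:p><w:p><w:r><w:t>The end</w:t></w:r></w:p>' |
docx_xml_to_text |
grep DeCSS

# With a real document:
#   unzip -p Position-Paper-Final-Version.docx word/document.xml | docx_xml_to_text | grep DeCSS
```

Removing the tags also helps when Word has split a phrase across several runs, in which case the phrase does not appear contiguously in the raw XML.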