POSIX Tools – Process Unicode Files with BOM Correctly

grep, posix, regular expression, unicode

Trying to use grep today, I ran into the familiar problem of the Byte Order Mark (BOM) in a Unicode file (UTF-8, in this case). Specifically, I was trying to find files whose first line begins with XYZ using grep '^XYZ', but of course grep treated the BOM as ordinary characters at the start of the line, so the pattern did not match the first line even when that line started with XYZ. I even tried updating the regular expression to skip leading whitespace ('^[[:space:]]*XYZ'), but to no avail.
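A minimal reproduction of the problem might look like the sketch below. The scratch file and the variable names are made up for illustration, and mktemp is assumed to be available (it is common but not specified by POSIX).

```shell
# Hypothetical scratch file; its first three bytes are the UTF-8 BOM (EF BB BF).
demo=$(mktemp)
printf '\357\273\277XYZ first line\nXYZ second line\n' > "$demo"

# Count matching lines: line 1 is missed because the BOM sits between ^ and XYZ.
bom_match_count=$(grep -c '^XYZ' "$demo")
echo "$bom_match_count"

# The BOM is not whitespace, so [[:space:]]* does not absorb it either.
space_match_count=$(grep -c '^[[:space:]]*XYZ' "$demo")
echo "$space_match_count"

rm -f "$demo"
```

Both counts come out as 1 rather than 2, since only the second line matches.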

Other questions have dealt with converting files or targeting the BOM specifically, but I want to know if POSIX tools have a general option to handle Unicode files correctly. If grep handled the Unicode file correctly, it would consider the file contents to start after the BOM and match XYZ on the first line just like any other line.

Best Answer

The Unicode Consortium has an FAQ that includes the question "How should I deal with BOMs?" The answer includes:

Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

and

Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.

Note that UTF-8 is always of known endianness, because it has no endianness. So as long as you know the text is UTF-8, "the BOM should not be used."

Even cat produces incorrect results when files carry unnecessary BOMs: after concatenation, the BOM of every file but the first ends up in the middle of the stream, where it is treated as a zero-width no-break space. But the power of UNIX lies in filters.

For operations on a single file or stream, sed "1s/^$(printf '\357\273\277')//" in a pipeline will strip a BOM if present, leaving streams without a BOM untouched.
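As a rough sketch of how that filter fits into the original grep problem: the helper name strip_bom and the scratch file are hypothetical, and mktemp is assumed to be available.

```shell
# strip_bom: hypothetical helper name; removes a leading UTF-8 BOM from stdin.
strip_bom() {
    sed "1s/^$(printf '\357\273\277')//"
}

# Scratch file whose first line starts with a BOM followed by XYZ.
sample=$(mktemp)
printf '\357\273\277XYZ first line\nother\n' > "$sample"

# With the BOM stripped, ^XYZ can match the first line.
stripped_count=$(strip_bom < "$sample" | grep -c '^XYZ')
echo "$stripped_count"

# Input without a BOM passes through unchanged.
plain=$(printf 'XYZ\n' | strip_bom)
echo "$plain"

rm -f "$sample"
```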

For operations with multiple files, a shell with process substitution (like Bash, but unfortunately not POSIX shell) is useful:

sb() { sed "1s/^$(printf '\357\273\277')//" "$@" ; }
cat <(sb file1) <(sb file2) …
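Where process substitution is unavailable, one possible POSIX-only alternative is to loop over the files and run the filter once per file. This is only a sketch: the function name cat_nobom and the scratch files are hypothetical, mktemp is assumed to exist, and the variables bom and f are global because POSIX sh has no local.

```shell
# cat_nobom: hypothetical helper; concatenates files, dropping each leading BOM.
cat_nobom() {
    bom=$(printf '\357\273\277')   # the three UTF-8 BOM bytes
    for f in "$@"; do
        # A separate sed per file, so that "line 1" means the first line of
        # this file rather than of the whole concatenated input.
        sed "1s/^$bom//" "$f" || return 1
    done
}

# Usage with two scratch files that both start with a BOM.
dir=$(mktemp -d)
printf '\357\273\277one\n' > "$dir/file1"
printf '\357\273\277two\n' > "$dir/file2"
joined=$(cat_nobom "$dir/file1" "$dir/file2")
printf '%s\n' "$joined"
rm -rf "$dir"
```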

Related Solutions

How to do a regex search in a UTF-16LE file while in a UTF-8 locale

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
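If the file has CRLF line endings, a sketch along these lines strips the carriage returns before grep sees them. The sample file is generated here only for illustration; tr -d '\r' is one common way to do the line ending conversion, not necessarily what the original answer had in mind.

```shell
# Build a hypothetical UTF-16LE sample with CRLF line endings.
sample=$(mktemp)
printf 'XYZ first\r\nother\r\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Convert to UTF-8, drop the carriage returns, then search as usual.
# Without the tr step, 'first$' would not match because of the trailing CR.
utf16_matches=$(iconv -f UTF-16LE -t UTF-8 "$sample" | tr -d '\r' | grep -c 'first$')
echo "$utf16_matches"

rm -f "$sample"
```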

How to make the login shell xterm use utf-8

At the time the sshd process on the remote computer forks to run /usr/bin/xterm, very few environment variables are set. In particular, LANG is not set, so the xterm process does not know that it should display characters in UTF-8 and falls back to xterm's defaults, whatever those may be.

However, the subshell running inside the xterm runs all of its setup scripts and the like, including those that set the LANG environment variable.

One needs to understand the difference between the remote xterm process and the shell process running inside of xterm.

The solution is to run the remote xterm process like this:

/usr/bin/env LANG=en_US.UTF-8 /usr/bin/xterm

env(1) is a utility to run a program in a modified environment.

Setting LANG will make the remote xterm display UTF-8 characters properly.
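To see what env does without involving xterm, a small sketch can use sh as a stand-in for the child program. The variable name child_lang is made up for illustration.

```shell
# env sets LANG only in the environment of the command it starts;
# sh stands in for xterm here and simply reports the value it inherited.
child_lang=$(env LANG=en_US.UTF-8 sh -c 'printf "%s\n" "$LANG"')
echo "$child_lang"
```

The calling shell's own LANG is left as it was; only the child process sees the new value.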

Eskil... :-)

P.S.: Reading the xterm manual page, I also found an easier way to achieve this:

xterm -en en_US.UTF-8

P.P.S.: I do not think setting resources in ~/.Xresources will take effect unless you merge them in with xrdb. The xterm process on the Linux computer will query the X server running on your Windows computer. At the time xterm starts, it is very unlikely that your X-Win32 server has the xterm* resources set. But you might be able to set resources in X-Win32 if it supports that.
