How to Make tr Aware of Non-ASCII (Unicode) Characters?

Tags: linux, text-processing, tr, unicode

I'm trying to remove some characters from a UTF-8 file. I'm using tr for this purpose:

tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat

The file contains some non-ASCII characters (like "Латвийская" or "àé"). tr doesn't seem to understand them: it treats them as non-alphabetic and removes them too.

I've tried changing some of my locale settings:

LC_CTYPE=C LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=ru_RU.UTF-8 tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat

Unfortunately, none of these worked.

How can I make tr understand Unicode?

Best Answer

That's a known (1, 2, 3, 4, 5, 6) limitation of the GNU implementation of tr.

It's not so much that it doesn't support foreign, non-English or non-ASCII characters, but that it doesn't support multi-byte characters.

Those Cyrillic characters would be handled correctly if they were written in the ISO-8859-5 character set (one byte per character) and your locale used that charset. Your problem is that you're using UTF-8, where non-ASCII characters are encoded in 2 or more bytes.
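
To see the multi-byte encoding for yourself, you can count the bytes behind a few characters. This is only an illustrative sketch: the octal escapes are the UTF-8 encodings of a, é and Л, and the variable names are made up for the example.

```shell
# wc -c counts bytes, not characters, whatever the locale is.
# The octal escapes spell out the UTF-8 encoding explicitly, so the
# result does not depend on how this snippet itself is encoded.
ascii_bytes=$(( $(printf 'a' | wc -c) ))            # a  (U+0061)
latin_bytes=$(( $(printf '\303\251' | wc -c) ))     # é  (U+00E9)
cyrillic_bytes=$(( $(printf '\320\233' | wc -c) ))  # Л  (U+041B)

echo "a: $ascii_bytes byte(s), é: $latin_bytes, Л: $cyrillic_bytes"
```

A tr that works one byte at a time sees the two bytes of é or Л as two separate characters rather than as one letter.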

GNU has a plan (see also) to fix that, and work is under way but not finished yet.

The FreeBSD and Solaris implementations of tr don't have the problem.

In the meantime, for most use cases of tr, you can use GNU sed or GNU awk, which do support multi-byte characters.

For instance, your:

tr -cs '[[:alpha:][:space:]]' ' '

could be written:

gsed -E 's/( |[^[:space:][:alpha:]])+/ /g'

or:

gawk -v RS='( |[^[:space:][:alpha:]])+' '{printf "%s", sep $0; sep=" "}'

To convert between lower and upper case (tr '[:upper:]' '[:lower:]'):

gsed 's/[[:upper:]]/\l&/g'

(that l is a lowercase L, not the 1 digit).

or:

gawk '{print tolower($0)}'

For portability, perl is another alternative:

perl -Mopen=locale -pe 's/([^[:space:][:alpha:]]| )+/ /g'
perl -Mopen=locale -pe '$_=lc$_'

If you know the data can be represented in a single-byte character set, then you can process it in that charset:

(export LC_ALL=ru_RU.iso88595
 iconv -f utf-8 |
   tr -cs '[:alpha:][:space:]' ' ' |
   iconv -t utf-8) < Russian-file.utf8
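
Before relying on this, it may be worth checking that the data really fits the single-byte charset. A rough sketch, assuming an iconv that accepts the ISO-8859-5 name (the fits_charset helper and the sample text are hypothetical):

```shell
# Succeeds if everything on stdin can be converted from UTF-8 to the
# charset given as $1. GNU iconv exits non-zero when it meets a character
# the target charset cannot represent; other implementations may differ.
fits_charset() {
  iconv -f UTF-8 -t "$1" > /dev/null 2>&1
}

if printf 'Латвийская\n' | fits_charset ISO-8859-5; then
  echo "fits in ISO-8859-5"
fi

# In ISO-8859-5 each Cyrillic letter takes one byte instead of two.
utf8_bytes=$(( $(printf 'Латвийская' | wc -c) ))
single_bytes=$(( $(printf 'Латвийская' | iconv -f UTF-8 -t ISO-8859-5 | wc -c) ))
echo "UTF-8: $utf8_bytes bytes, ISO-8859-5: $single_bytes bytes"
```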

Related Solutions

How to do a regex search in a UTF-16LE file while in a UTF-8 locale

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
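
A small self-contained sketch of that pipeline, including the line ending conversion. The sample text and the temporary file are made up for the example, and stripping carriage returns with tr is just one way to do the conversion:

```shell
# Build a small UTF-16LE sample with CRLF line endings.
sample=$(mktemp)
printf 'first line\r\nneedle here\r\n' | iconv -f UTF-8 -t UTF-16LE > "$sample"

# Convert to UTF-8, drop the carriage returns, then search as usual.
# Deleting the \r byte is safe in UTF-8, since it never occurs inside a
# multi-byte sequence.
matches=$(iconv -f UTF-16LE -t UTF-8 "$sample" | tr -d '\r' | grep 'needle')
echo "$matches"

rm -f "$sample"
```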

How to make the login shell xterm use utf-8

At the time the sshd process on the remote computer forks to run /usr/bin/xterm, very few environment variables are set. In particular, the LANG variable is not set, so the xterm process does not know that it should display characters in UTF-8 and falls back to xterm's defaults, whatever those might be.

However, the subshell running inside the xterm runs all the usual setup scripts, which includes setting the LANG environment variable.

One needs to understand the difference between the remote xterm process and the shell process running inside of xterm.

The solution is to run the remote xterm process like this:

/usr/bin/env LANG=en_US.UTF-8 /usr/bin/xterm

env(1) is a utility to run a program in a modified environment.

Setting LANG will make the remote xterm display UTF-8 characters properly.
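
To illustrate what env does here, this sketch runs a throwaway child shell in place of xterm (the variable names are made up for the example):

```shell
# env sets LANG only for the command it launches; the calling shell's
# own environment is left untouched.
before=${LANG-unset}
child_lang=$(env LANG=en_US.UTF-8 sh -c 'printf "%s\n" "$LANG"')
after=${LANG-unset}

echo "child saw LANG=$child_lang"
echo "parent LANG before: $before, after: $after"
```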

Eskil... :-)

P.S. Reading the xterm manual page, I also found an easier way to achieve this:

xterm -en en_US.UTF-8

P.P.S. I do not think setting resources in ~/.Xresources will take effect unless you merge them in with xrdb. The xterm process on the Linux computer will query the X server running on your Windows computer. At the time xterm starts, it is very unlikely that your X-Win32 server has the xterm* resources set. But you might be able to set resources in X-Win32 if it supports that.
