How to Make tr Aware of Non-ASCII (Unicode) Characters?

I'm trying to remove some characters from a UTF-8 file. I'm using tr for this purpose:

tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat 

The file contains some non-ASCII characters (like "Латвийская" or "àé"). tr doesn't seem to understand them: it treats them as non-alphabetic and removes them too.

I've tried changing some of my locale settings:

LC_CTYPE=C LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=ru_RU.UTF-8 tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat

Unfortunately, none of these worked.

How can I make tr understand Unicode?

Best Answer

That's a known limitation of the GNU implementation of tr.

It's not so much that it doesn't support foreign, non-English or non-ASCII characters as that it doesn't support multi-byte characters.

Those Cyrillic characters would be handled correctly if they were written in the ISO 8859-5 character set (one byte per character) and your locale used that charset. Your problem is that you're using UTF-8, where non-ASCII characters are encoded in 2 or more bytes.
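
To see the byte counts for yourself, here is a small sketch that counts the bytes in the UTF-8 encoding of three characters. The variable names are only illustrative, and the bytes are written as octal escapes so the result doesn't depend on how the script itself is encoded:

```shell
# wc -c counts bytes, whatever the locale.
# $(( ... )) strips the padding some wc implementations add around the number.
a_bytes=$(( $(printf 'a' | wc -c) ))                # U+0061, plain ASCII
e_acute_bytes=$(( $(printf '\303\251' | wc -c) ))   # U+00E9 (é) is C3 A9 in UTF-8
el_bytes=$(( $(printf '\320\233' | wc -c) ))        # U+041B (Л) is D0 9B in UTF-8
echo "a=$a_bytes é=$e_acute_bytes Л=$el_bytes"      # prints: a=1 é=2 Л=2
```

A tr that works byte by byte sees each of those two-byte characters as two separate non-alphabetic bytes.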

GNU has a plan to fix that, and work is under way, but it's not there yet.

The FreeBSD and Solaris implementations of tr don't have this problem.


In the meantime, for most use cases of tr, you can use GNU sed or GNU awk, which do support multi-byte characters (they are called gsed and gawk below; on most Linux systems GNU sed is installed as plain sed).

For instance, your:

tr -cs '[[:alpha:][:space:]]' ' '

could be written:

gsed -E 's/( |[^[:space:][:alpha:]])+/ /g'

or:

gawk -v RS='( |[^[:space:][:alpha:]])+' '{printf "%s", sep $0; sep=" "}'

To convert upper case to lower case (tr '[:upper:]' '[:lower:]'):

gsed 's/[[:upper:]]/\l&/g'

(that l is a lowercase L, not the digit 1).

or:

gawk '{print tolower($0)}'

For portability, perl is another alternative:

perl -Mopen=locale -pe 's/([^[:space:][:alpha:]]| )+/ /g'
perl -Mopen=locale -pe '$_=lc$_'

If you know the data can be represented in a single-byte character set, then you can process it in that charset:

(export LC_ALL=ru_RU.iso88595
 iconv -f utf-8 |
   tr -cs '[:alpha:][:space:]' ' ' |
   iconv -t utf-8) < Russian-file.utf8
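
One way to check the "can be represented in a single-byte character set" condition first is to run the data through iconv and look at the exit status. This is only a sketch: fits_charset is a made-up name, and it assumes an iconv that exits non-zero when a character has no equivalent in the target charset (as glibc's iconv does when neither -c nor //TRANSLIT is used); other implementations may behave differently.

```shell
# Hypothetical helper: succeeds only if all of the UTF-8 input on stdin
# can be converted to the single-byte charset named in $1.
fits_charset() {
  iconv -f UTF-8 -t "$1" >/dev/null 2>&1
}

if printf 'Латвийская\n' | fits_charset ISO-8859-5; then
  cyrillic_fits=yes
else
  cyrillic_fits=no
fi

# à and é have no code point in ISO 8859-5.
if printf 'àé\n' | fits_charset ISO-8859-5; then
  accents_fit=yes
else
  accents_fit=no
fi

echo "Cyrillic: $cyrillic_fits, accented Latin: $accents_fit"
```

If the check fails for your file, the iconv + tr pipeline above is not safe for it and one of the sed, awk or perl approaches is the better choice.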