I'm trying to remove some characters from a file (UTF-8). I'm using tr for this purpose:
tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
The file contains some foreign characters (like "Латвийская" or "àé"). tr doesn't seem to understand them: it treats them as non-alpha and removes them too.
I've tried changing some of my locale settings:
LC_CTYPE=C LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=C tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
LC_CTYPE=ru_RU.UTF-8 LC_COLLATE=ru_RU.UTF-8 tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat
Unfortunately, none of these worked.
How can I make tr understand Unicode?
Best Answer
That's a known (1, 2, 3, 4, 5, 6) limitation of the GNU implementation of tr. It's not so much that it doesn't support foreign, non-English or non-ASCII characters, but that it doesn't support multi-byte characters.
Those Cyrillic characters would be treated OK if they were written in the iso8859-5 (single byte per character) character set (and your locale used that charset), but your problem is that you're using UTF-8, where non-ASCII characters are encoded in 2 or more bytes.
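You can see the multi-byte encoding at the byte level; as a rough illustration (the hex output may be formatted slightly differently on your system):
printf 'é' | od -An -tx1    # shows c3 a9: two bytes for one character
GNU tr looks at each of those bytes separately, so neither byte looks like a letter to it.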
GNU has a plan (see also) to fix that, and work is under way, but it's not there yet.
The FreeBSD and Solaris implementations of tr don't have the problem. In the meantime, for most use cases of tr, you can use GNU sed or GNU awk, which do support multi-byte characters. For instance, your:
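tr -cs '[[:alpha:][:space:]]' ' ' <testdata.dat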
could be written:
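with GNU sed, for example, as something like:
sed -E 's/([^[:alpha:][:space:]]| )+/ /g' <testdata.dat
(each run of characters that are neither letters nor whitespace, together with any adjacent spaces, is replaced by a single space, which is roughly what your tr -cs does)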
or:
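with GNU awk, something along these lines:
awk '{gsub(/([^[:alpha:][:space:]]| )+/, " "); print}' <testdata.dat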
To convert between lower and upper case (tr '[:upper:]' '[:lower:]'):
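With GNU sed, something like:
sed 's/[[:upper:]]/\l&/g' <testdata.dat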
(that l is a lowercase L, not the 1 digit).
or:
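with GNU awk (its tolower() should handle multi-byte characters in a UTF-8 locale):
awk '{print tolower($0)}' <testdata.dat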
For portability, perl is another alternative:
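A sketch, assuming the input and output are UTF-8 (the -CSD switches tell perl to treat the standard streams and the files it opens as UTF-8):
perl -CSD -pe 's/([[:upper:]])/\l$1/g' <testdata.dat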
If you know the data can be represented in a single-byte character set, then you can process it in that charset:
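For instance, assuming a ru_RU.iso88595 locale is available on your system (and that every character in the file actually exists in iso8859-5, otherwise iconv will complain):
(export LC_ALL=ru_RU.iso88595
 iconv -f utf-8 | tr -cs '[[:alpha:][:space:]]' ' ' | iconv -t utf-8) <testdata.dat
Here the first iconv converts from UTF-8 to the locale's single-byte charset, tr does its job on single-byte data, and the second iconv converts the result back to UTF-8.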