That's a known (1, 2, 3, 4, 5, 6) limitation of the GNU implementation of tr.
It's not so much that it doesn't support foreign, non-English or non-ASCII characters, but that it doesn't support multi-byte characters.
Those Cyrillic characters would be treated OK if written in the iso8859-5 (single byte per character) character set (and your locale was using that charset), but your problem is that you're using UTF-8, where non-ASCII characters are encoded in 2 or more bytes.
GNU's got a plan (see also) to fix that, and work is under way, but it's not there yet.
FreeBSD or Solaris tr don't have the problem.
In the meantime, for most use cases of tr, you can use GNU sed or GNU awk, which do support multi-byte characters.
For instance, your:
tr -cs '[[:alpha:][:space:]]' ' '
could be written:
gsed -E 's/( |[^[:space:][:alpha:]])+/ /g'
or:
gawk -v RS='( |[^[:space:][:alpha:]])+' '{printf "%s", sep $0; sep=" "}'
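For example, here's an illustrative run of the sed alternative above (a sketch of my own; the input is ASCII only, but the point of using GNU sed is that the same command also handles multi-byte letters in a UTF-8 locale):

```shell
# Squeeze every run of characters that are neither letters nor whitespace
# (together with adjacent spaces) into a single space, like tr -cs would.
printf 'foo,,42  bar\n' | sed -E 's/( |[^[:space:][:alpha:]])+/ /g'   # -> foo bar
```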
To convert between lower and upper case (tr '[:upper:]' '[:lower:]'):
gsed 's/[[:upper:]]/\l&/g'
(that l is a lowercase L, not the digit 1).
or:
gawk '{print tolower($0)}'
For portability, perl is another alternative:
perl -Mopen=locale -pe 's/([^[:space:][:alpha:]]| )+/ /g'
perl -Mopen=locale -pe '$_=lc$_'
If you know the data can be represented in a single-byte character set, then you can process it in that charset:
(export LC_ALL=ru_RU.iso88595
iconv -f utf-8 |
tr -cs '[:alpha:][:space:]' ' ' |
iconv -t utf-8) < Russian-file.utf8
I would think AutoKey would do the job as described here. The idea is that AutoKey is always running and can accept arbitrary strings as triggers.
As the post describes, you just need to set up AutoKey to paste Unicode characters, and as a trigger it could accept something like /delta, which it then replaces with a δ.
Best Answer
GNU sed does work with multi-byte characters. It's not so much that GNU tr hasn't been internationalised but that it doesn't support multi-byte characters (like the non-ASCII ones in UTF-8 locales). GNU tr would work with Æ, Œ as long as they were single-byte, as in the iso8859-15 character set. More on that at How to make tr aware of non-ascii (unicode) characters?
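For example (a sketch of my own, not the original example; it assumes GNU sed and that a UTF-8 locale such as C.UTF-8 is available):

```shell
# sed's y command transliterates character by character, so multi-byte
# UTF-8 characters like AE (Æ) and o-slash (ø) map correctly in a UTF-8 locale.
printf 'Ærø\n' | LC_ALL=C.UTF-8 sed 'y/Æø/Œö/'   # -> Œrö
```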
In any case, that has nothing to do with Linux; it's about the tr implementation on the system. Whether that system uses Linux as a kernel, or tr is built for Linux or uses the Linux kernel API, is not relevant, as that part of tr's functionality takes place in user space. busybox tr and GNU tr are the ones most commonly found on distributions of software built for Linux, and neither supports multi-byte characters, but others that have been ported to Linux do, like the tr of the heirloom toolchest (ported from OpenSolaris) or of ast-open.

Note that sed's y doesn't support ranges like a-z. Also note that if a script containing sed 'y/é½Æ/ABŒ/' is written in the UTF-8 charset, it will no longer work as expected if called in a locale where UTF-8 is not the charset. An alternative could be to use
perl. The perl code is expected to be in UTF-8, but it will process the input in the locale's encoding (and output in that same encoding): if called in a UTF-8 locale, it will transliterate a UTF-8 Æ (0xc3 0x86) to a UTF-8 Œ (0xc5 0x92), and in an ISO8859-15 locale it will do the same, but as 0xc6 -> 0xbc.

In most shells, having those UTF-8 characters inside single quotes should be OK even if the script is called in a locale where UTF-8 is not the charset (an exception is yash, which would complain if those bytes don't form valid characters in the locale). If you're using quoting other than single quotes, however, it could cause problems. For instance, a double-quoted version of such a command would fail in a locale where the charset is BIG5-HKSCS, because the encoding of \ (0x5c) also happens to be contained in some other characters there (like α: 0xa3 0x5c, and the UTF-8 encoding of ♣ happens to end in 0xa3).

In any case, don't expect ranges in y
to work at removing acute accents. The range is based on the Unicode codepoints, so ranges won't be useful outside of very well defined sequences that happen to be in the "right" order in Unicode, like A-Z or 0-9. If you want to remove acute accents, you'd have to use more advanced tools.
That is, use Unicode normalisation forms to decompose characters, remove the acute accents (here the combining form U+0301), and recompose. Another useful tool to transliterate Unicode is
uconv from ICU. For instance, the above could also be written with uconv's transliteration rules, though that would only work on UTF-8 data; you'd need to convert from and to the locale's charset with iconv to be able to process data in the user's locale.
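A sketch of what that pipeline could look like (the exact transliteration rule here is my assumption; uconv ships with ICU):

```shell
# NFD-decompose, delete U+0301 (the combining acute accent), NFC-recompose.
# uconv works on UTF-8, so iconv converts from and to the locale's charset.
iconv -t utf-8 | uconv -f utf-8 -t utf-8 -x '::NFD; \u0301 > ; ::NFC;' | iconv -f utf-8
```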