That's a known (1, 2, 3, 4, 5, 6) limitation of the GNU implementation of tr.
It's not so much that it doesn't support foreign, non-English or non-ASCII characters, but that it doesn't support multi-byte characters.
Those Cyrillic characters would be treated OK if written in the iso8859-5 (single byte per character) character set (and your locale was using that charset), but your problem is that you're using UTF-8, where non-ASCII characters are encoded in 2 or more bytes.
GNU's got a plan (see also) to fix that, and work is under way, but it's not there yet.
FreeBSD or Solaris tr don't have the problem.
In the meantime, for most use cases of tr, you can use GNU sed or GNU awk, which do support multi-byte characters.
For instance, your:
tr -cs '[[:alpha:][:space:]]' ' '
could be written:
gsed -E 's/( |[^[:space:][:alpha:]])+/ /g'
or:
gawk -v RS='( |[^[:space:][:alpha:]])+' '{printf "%s", sep $0; sep=" "}'
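For example, here's an illustrative run of the sed alternative above (a sketch of my own; the input is ASCII only, but the point of using GNU sed is that the same command also handles multi-byte letters in a UTF-8 locale):

```shell
# Squeeze every run of characters that are neither letters nor whitespace
# (together with adjacent spaces) into a single space, like tr -cs would.
printf 'foo,,42  bar\n' | sed -E 's/( |[^[:space:][:alpha:]])+/ /g'   # -> foo bar
```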
To convert between lower and upper case (tr '[:upper:]' '[:lower:]'):
gsed 's/[[:upper:]]/\l&/g'
(that l is a lowercase L, not the digit 1).
or:
gawk '{print tolower($0)}'
For portability, perl is another alternative:
perl -Mopen=locale -pe 's/([^[:space:][:alpha:]]| )+/ /g'
perl -Mopen=locale -pe '$_=lc$_'
If you know the data can be represented in a single-byte character set, then you can process it in that charset:
(export LC_ALL=ru_RU.iso88595
iconv -f utf-8 |
tr -cs '[:alpha:][:space:]' ' ' |
iconv -t utf-8) < Russian-file.utf8
I would think AutoKey would do the job as described here. The idea is that AutoKey is always running and can accept arbitrary strings as triggers.
As the post describes, you just need to set up AutoKey to paste Unicode characters, and as a trigger it could accept something like /delta, which it then replaces with a δ.
Best Answer
GNU sed does work with multi-byte characters. It's not so much that GNU tr hasn't been internationalised but that it doesn't support multi-byte characters (like the non-ASCII ones in UTF-8 locales). GNU tr would work with Æ, Œ as long as they were single-byte, as in the iso8859-15 character set. More on that at How to make tr aware of non-ascii (unicode) characters?
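For example (a sketch of my own, not the original example; it assumes GNU sed and that a UTF-8 locale such as C.UTF-8 is available):

```shell
# sed's y command transliterates character by character, so multi-byte
# UTF-8 characters like AE (Æ) and o-slash (ø) map correctly in a UTF-8 locale.
printf 'Ærø\n' | LC_ALL=C.UTF-8 sed 'y/Æø/Œö/'   # -> Œrö
```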
In any case, that has nothing to do with Linux; it's about the tr implementation on the system. Whether that system uses Linux as a kernel, or tr is built for Linux or uses the Linux kernel API, is not relevant, as that part of tr's functionality takes place in user space. busybox tr and GNU tr are the ones most commonly found on distributions of software built for Linux, and neither supports multi-byte characters, but others that have been ported to Linux do, like the tr of the heirloom toolchest (ported from OpenSolaris) or of ast-open.

Note that sed's y doesn't support ranges like a-z. Also note that if a script containing sed 'y/é½Æ/ABŒ/' is written in the UTF-8 charset, it will no longer work as expected if called in a locale where UTF-8 is not the charset. An alternative could be to use
perl. The perl code is expected to be in UTF-8, but it will process the input in the locale's encoding (and output in that same encoding): if called in a UTF-8 locale, it will transliterate a UTF-8 Æ (0xc3 0x86) to a UTF-8 Œ (0xc5 0x92), and in an ISO8859-15 locale it will do the same, but as 0xc6 -> 0xbc.

In most shells, having those UTF-8 characters inside single quotes should be OK even if the script is called in a locale where UTF-8 is not the charset (an exception is yash, which would complain if those bytes don't form valid characters in the locale). If you're using quoting other than single quotes, however, it could cause problems. For instance, a double-quoted version of such a command would fail in a locale where the charset is BIG5-HKSCS, because the encoding of \ (0x5c) also happens to be contained in some other characters there (like α: 0xa3 0x5c, and the UTF-8 encoding of ♣ happens to end in 0xa3).

In any case, don't expect ranges in y
to work at removing acute accents. The range is based on the Unicode codepoints, so ranges won't be useful outside of very well defined sequences that happen to be in the "right" order in Unicode, like A-Z or 0-9. If you want to remove acute accents, you'd have to use more advanced tools.
That is, use Unicode normalisation forms to decompose characters, remove the acute accents (here the combining form U+0301), and recompose. Another useful tool to transliterate Unicode is
uconv from ICU. For instance, the above could also be written with uconv's transliteration rules, though that would only work on UTF-8 data; you'd need to convert from and to the locale's charset with iconv to be able to process data in the user's locale.
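A sketch of what that pipeline could look like (the exact transliteration rule here is my assumption; uconv ships with ICU):

```shell
# NFD-decompose, delete U+0301 (the combining acute accent), NFC-recompose.
# uconv works on UTF-8, so iconv converts from and to the locale's charset.
iconv -t utf-8 | uconv -f utf-8 -t utf-8 -x '::NFD; \u0301 > ; ::NFC;' | iconv -f utf-8
```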