How to convert UTF-8 txt files to all uppercase in bash


I have some UTF-8 .txt files which I would like to convert to all uppercase. If they were just ASCII, I could use:

tr [:lower:] [:upper:]

But since I'm working with diacritics and stuff, it doesn't seem to work. I guess it might work if I set the appropriate locale, but I need this script to be portable.

Best Answer

All of:

tr '[:lower:]' '[:upper:]'

(don't forget the quotes, otherwise that won't work if there's a file called :, l, ... or r in the current directory) or:

awk '{print toupper($0)}'

or:

dd conv=ucase

are meant to convert characters to uppercase according to the rules defined in the current locale. However, even where locales use UTF-8 as the character set and clearly define the conversion from lowercase to uppercase, at least GNU dd, GNU tr and mawk (the default awk on Ubuntu for instance) don't follow them. Also, there's no standard way to specify locales other than C or POSIX, so if you want to convert UTF-8 files to uppercase portably regardless of the current locale, you're out of luck with the standard toolchest.
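The quoting caveat mentioned above is easy to reproduce. The following is only a sketch, assuming a POSIX-like shell with default globbing options and a mktemp that supports -d; the variable names (demo_dir, unquoted, quoted) are made up for the illustration:

```shell
# Work in a scratch directory so no real files are touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/l"     # a file named "l", one of the characters in [:lower:]

# Unquoted: the shell treats [:lower:] as a glob matching any one of : l o w e r
unquoted=$(cd "$demo_dir" && echo tr [:lower:] [:upper:])

# Quoted: the character classes reach tr untouched
quoted=$(cd "$demo_dir" && echo tr '[:lower:]' '[:upper:]')

printf '%s\n' "unquoted: $unquoted" "quoted:   $quoted"
rm -rf "$demo_dir"
```

With only the file l present, the unquoted form should expand to tr l [:upper:], which is not the command that was intended.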

As is often the case, for portability your best bet may be perl:

$ echo lľsšcčtťzž | PERLIO=:utf8 perl -pe '$_=uc'
LĽSŠCČTŤZŽ

Note, however, that not everybody agrees on what the uppercase version of a given character is.

For instance, in Turkish locales, the uppercase of i is not I but İ (<U0130>). Here with the heirloom toolchest tr instead of GNU tr:

$ echo ií | LC_ALL=C.UTF-8 tr '[:lower:]' '[:upper:]'
IÍ
$ echo ií | LC_ALL=tr_TR.UTF-8 tr '[:lower:]' '[:upper:]'
İÍ

On my system, the perl to-upper conversion is defined in /usr/share/perl/5.14/unicore/To/Upper.pl, and it behaves differently on a few characters from the GNU libc toupper() in the C.UTF-8 locale, perl being the more accurate of the two. For example, perl correctly converts ɀ to Ɀ, whereas GNU libc (2.17) doesn't.
