How to convert UTF-8 txt files to all uppercase in bash

I have some UTF-8 .txt files which I would like to convert to all uppercase. If it was just ASCII, I could use:

tr [:lower:] [:upper:]

But since I'm working with diacritics and stuff, it doesn't seem to work. I guess it might work if I set the appropriate locale, but I need this script to be portable.

Best Answer

All of:

tr '[:lower:]' '[:upper:]'

(don't forget the quotes, otherwise that won't work if there's a file called :, l, ... or r in the current directory) or:

awk '{print toupper($0)}'

or:

dd conv=ucase

are meant to convert characters to uppercase according to the rules defined in the current locale. However, even where locales use UTF-8 as the character set and clearly define the conversion from lowercase to uppercase, at least GNU dd, GNU tr and mawk (the default awk on Ubuntu for instance) don't follow them. Also, there's no standard way to specify locales other than C or POSIX, so if you want to convert UTF-8 files to uppercase portably regardless of the current locale, you're out of luck with the standard toolchest.

As often, for portability, your best bet may be perl:

$ echo lľsšcčtťzž | PERLIO=:utf8 perl -pe '$_=uc'
LĽSŠCČTŤZŽ

Now, you need to beware that not everybody agrees on what the uppercase version of a specific character is.

For instance, in Turkish locales, the uppercase i is not I, but İ (<U0130>). Here with the heirloom toolchest tr instead of GNU tr:

$ echo ií | LC_ALL=C.UTF-8 tr '[:lower:]' '[:upper:]'
IÍ
$ echo ií | LC_ALL=tr_TR.UTF-8 tr '[:lower:]' '[:upper:]'
İÍ

On my system, the perl to-upper conversion is defined in /usr/share/perl/5.14/unicore/To/Upper.pl, and it behaves differently on a few characters from the GNU libc toupper() in the C.UTF-8 locale, perl being the more accurate of the two. For instance, perl correctly converts ɀ to Ɀ, whereas GNU libc (2.17) doesn't.

Related Solutions

How to do a regex search in a UTF-16LE file while in a UTF-8 locale

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
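
The line ending conversion mentioned above could be folded into the same pipeline. A minimal sketch, assuming the file uses CRLF line endings and that dropping every carriage return is acceptable; grep_utf16le is a hypothetical name, not an existing tool:

```shell
# Hypothetical helper: grep_utf16le PATTERN FILE
# Converts FILE from UTF-16LE to UTF-8, drops carriage returns (CRLF -> LF),
# then hands the result to grep. The exit status is grep's.
grep_utf16le() {
  iconv -f UTF-16LE -t UTF-8 "$2" | tr -d '\r' | grep -e "$1"
}
```

Called as grep_utf16le pattern myfile.txt, it prints the matching lines in UTF-8.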

UTF-8 in Terminator

I still want to know why my locale settings got so weird after only enabling the en_US locales during installation, but I was able to resolve the issue by adding

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

to my ~/.bashrc
