I have some UTF-8 .txt files which I would like to convert to all uppercase. If it was just ASCII, I could use:
tr [:lower:] [:upper:]
But since I'm working with diacritics and stuff, it doesn't seem to work. I guess it might work if I set the appropriate locale, but I need this script to be portable.
Best Answer
All of the standard candidates, namely tr '[:lower:]' '[:upper:]' (don't forget the quotes, otherwise that won't work if there's a file called :, l, ... or r in the current directory), awk with its toupper() function, and dd with conv=ucase, are meant to convert characters to uppercase according to the rules defined in the current locale. However, even where locales use UTF-8 as the character set and clearly define the conversion from lowercase to uppercase, at least GNU dd, GNU tr and mawk (the default awk on Ubuntu for instance) don't follow them. Also, there's no standard way to specify locales other than C or POSIX, so if you want to convert UTF-8 files to uppercase portably regardless of the current locale, you're out of luck with the standard toolchest.
As often, for portability, your best bet may be perl.
Now, you need to beware that not everybody agrees on what the uppercase version of a specific character is.
For instance, in Turkish locales, the uppercase i is not I, but İ (<U0130>). This can be seen with the heirloom toolchest tr instead of GNU tr.
On my system, the perl to-upper conversion is defined in /usr/share/perl/5.14/unicore/To/Upper.pl, and I find that it behaves differently on a few characters from the GNU libc toupper() in the C.UTF8 locale for instance, perl being more accurate. For instance, perl correctly converts ɀ to Ɀ; the GNU libc (2.17) doesn't.