Unix Command-Line – Convert Between Unicode Normalization Forms

Tags: command line, conversion, text processing, unicode

In Unicode, some character combinations have more than one representation.

For example, the character ä can be represented as

  • "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
  • "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).
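The two byte sequences can be inspected directly in a shell. This is only a small sketch: it writes the bytes with octal escapes, so the result does not depend on how the terminal or editor happened to encode the typed character.

```shell
# U+00E4 (precomposed): octal 303 244 is hex c3 a4
printf '\303\244' | od -An -tx1

# U+0061 U+0308 (letter a plus combining diaeresis): octal 314 210 is hex cc 88
printf 'a\314\210' | od -An -tx1
```

The first command prints the two bytes c3 a4, the second the three bytes 61 cc 88.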

According to the Unicode standard, the two representations are canonically equivalent but are in different "normalization forms" (NFC and NFD, respectively); see UAX #15: Unicode Normalization Forms.

The Unix toolbox has all kinds of text transformation tools; sed, tr, iconv, and Perl come to mind. How can I do quick and easy normalization form (NF) conversion on the command line?

Best Answer

You can use the uconv utility from ICU. Normalization is done through its transliteration option (-x), which takes a transliterator ID such as any-nfd or any-nfc.

$ uconv -x any-nfd <<<ä | hd
00000000  61 cc 88 0a                                       |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000  c3 a4 0a                                          |...|
00000003

On Debian, Ubuntu and other derivatives, uconv is in the icu-devtools package (older releases shipped it in libicu-dev). On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.
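For repeated use, the uconv call can be wrapped in a small shell function. The following is a sketch, not part of ICU: the function name normalize is made up for illustration, the input is assumed to be UTF-8, and the fallback assumes a Perl installation that includes the Unicode::Normalize module, for systems where uconv is not installed.

```shell
# Hypothetical helper: read UTF-8 text on stdin, write it normalized to stdout.
# The argument is the target form: nfc, nfd, nfkc or nfkd.
normalize() {
    form=$1
    if command -v uconv >/dev/null 2>&1; then
        # Same transliterator IDs as above (any-nfc, any-nfd, ...);
        # -f and -t pin the input and output encoding to UTF-8.
        uconv -f utf-8 -t utf-8 -x "any-$form"
    else
        # Fallback (assumption: perl with Unicode::Normalize available).
        # -CS treats stdin/stdout as UTF-8; the form name is passed via the
        # environment and upper-cased to NFC, NFD, NFKC or NFKD.
        FORM="$form" perl -CS -MUnicode::Normalize \
            -pe '$_ = Unicode::Normalize::normalize(uc($ENV{FORM}), $_)'
    fi
}
```

It could then be used as a filter, for example `normalize nfc < input.txt > output.txt`, where the file names are placeholders.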
