In Unicode, some character combinations have more than one representation.
For example, the character ä can be represented as
- "ä", that is the codepoint U+00E4 (two bytes c3 a4 in UTF-8 encoding), or as
- "ä", that is the two codepoints U+0061 U+0308 (three bytes 61 cc 88 in UTF-8).
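The byte counts above can be checked from a shell. This is only a small sketch, assuming a POSIX shell with printf and od available; the helper name hex is made up for illustration, and the characters are written as octal escapes so the result does not depend on how the terminal or editor stores "ä".

```shell
# Hypothetical helper: print stdin as a run of hex bytes on one line
hex() { od -An -tx1 | tr -d ' \n'; echo; }

# Precomposed form, U+00E4: two bytes (c3 a4)
printf '\303\244' | hex

# Decomposed form, U+0061 U+0308: three bytes (61 cc 88)
printf 'a\314\210' | hex
```

Both strings render as the same glyph, but the first dumps as two bytes and the second as three.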
According to the Unicode standard, the two representations are equivalent but in different "normalization forms", see UAX #15: Unicode Normalization Forms.
The Unix toolbox has all kinds of text transformation tools: sed, tr, iconv and Perl come to mind. How can I do quick and easy normalization form (NF) conversion on the command line?
Best Answer
You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).
On Debian, Ubuntu and other derivatives, uconv is in the libicu-dev package. On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.
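As a rough sketch of the -x approach, the following converts between NFC and NFD and dumps the resulting bytes. It assumes uconv is installed and that the input is UTF-8; the function names to_nfd and to_nfc are made up for illustration, and -f/-t are added here only so the result does not depend on the current locale.

```shell
# Hypothetical wrappers around uconv's transliteration option (-x)
to_nfd() { uconv -f utf-8 -t utf-8 -x any-nfd; }   # decompose
to_nfc() { uconv -f utf-8 -t utf-8 -x any-nfc; }   # compose

if command -v uconv >/dev/null 2>&1; then
    # Precomposed U+00E4 (c3 a4) in, should come out as 61 cc 88
    printf '\303\244' | to_nfd | od -An -tx1

    # Decomposed U+0061 U+0308 (61 cc 88) in, should come out as c3 a4
    printf 'a\314\210' | to_nfc | od -An -tx1
else
    echo "uconv not found; install the ICU package for your system" >&2
fi
```

For files, the same filters can be used with redirection, e.g. to_nfc < input.txt > output.txt (file names are placeholders).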