Is the `utf8` in `en_US.utf8` a canonical character set?

locale unicode

The output of `locale -a` lists the codeset in lowercase and without a hyphen:

% locale -a 
C
en_AU.utf8
en_US.utf8
POSIX

More commonly, I've seen the hyphenated, uppercase form UTF-8.

What is the canonical name for utf8 / UTF-8?

Best Answer

TL;DR: Nope.

  • utf8 doesn't refer to an IANA character set since it drops the - character.
  • IANA character set names are case-insensitive.
  • Therefore, the following all refer to RFC3629: UTF-8, a transformation format of ISO 10646 (note that all have a hyphen):
    • UTF-8
    • utf-8
    • uTf-8
  • There is a case-sensitive alias of the above name: csUTF8

The details

POSIX.1-2017, section 8.2 Internationalization Variables

If the locale value has the form:

language[_territory][.codeset]

it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.
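
To make the quoted form concrete, here is a minimal sketch that pulls the codeset out of a locale name such as en_US.utf8. The function name locale_codeset is hypothetical (it is not part of POSIX or glibc), and the handling of an optional @modifier suffix after the codeset is an assumption that goes beyond the form quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the codeset part of a locale name of the form
   language[_territory][.codeset] into out.
   Returns 1 if a codeset was found and fits in out, 0 otherwise. */
int locale_codeset(const char *name, char *out, size_t out_size)
{
    const char *dot = strchr(name, '.');
    if (dot == NULL || out_size == 0)
        return 0;                       /* no ".codeset" part present */

    const char *start = dot + 1;
    /* Assumption: stop at an optional "@modifier" suffix. */
    size_t len = strcspn(start, "@");
    if (len == 0 || len >= out_size)
        return 0;                       /* empty codeset or buffer too small */

    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Called with "en_US.utf8" it yields "utf8", and called with "C" it reports that no codeset is present. The string comes back exactly as the implementation spelled it, which is the point of the passage above: POSIX does not say what that spelling must be.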

But while POSIX.1 leaves the details implementation defined, IANA has something to say about it.

RFC2978 IANA Charset Registration Procedures

Section 2.3, Naming Requirements, defines the syntax of a character set's primary name:

 mime-charset = 1*mime-charset-chars
 mime-charset-chars = ALPHA / DIGIT /
            "!" / "#" / "$" / "%" / "&" /
            "'" / "+" / "-" / "^" / "_" /
            "`" / "{" / "}" / "~"
 ALPHA        = "A".."Z"    ; Case insensitive ASCII Letter
 DIGIT        = "0".."9"    ; Numeric digit

Note the comment on ALPHA: "Case insensitive ASCII Letter".

Interestingly, this means that ^-^ is a happy but valid character set name.
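
The grammar is small enough to check by hand. The following is a rough sketch of a syntax check; is_mime_charset is a hypothetical name, and the function tests only the grammar quoted above, not whether a name is actually registered with IANA.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if name matches the mime-charset grammar
   (one or more ASCII letters, digits, or the listed special characters),
   0 otherwise. */
int is_mime_charset(const char *name)
{
    static const char specials[] = "!#$%&'+-^_`{}~";

    if (*name == '\0')
        return 0;                       /* "1*" requires at least one character */

    for (const char *p = name; *p != '\0'; ++p) {
        char c = *p;
        int alpha = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        int digit = (c >= '0' && c <= '9');
        if (!alpha && !digit && strchr(specials, c) == NULL)
            return 0;                   /* character outside the grammar */
    }
    return 1;
}
```

Both "UTF-8" and "^-^" pass, and so does "utf8". Being syntactically valid is therefore not the same as being a registered name.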

IANA Character Sets

These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation.

The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters. [emphasis mine]

IANA lists the character set as UTF-8.

While utf-8 (or uTf-8) is an official IANA character set name, utf8 (sans hyphen) is not an IANA character set name.
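
As an illustration of that rule, a comparison that ignores ASCII case but keeps every other character treats UTF-8 and uTf-8 as the same name and utf8 as a different one. This is only a sketch: charset_name_equal and ascii_lower are hypothetical names, and the case folding is deliberately restricted to ASCII so the result does not depend on the current locale.

```c
/* Hypothetical helper: fold an ASCII uppercase letter to lowercase and
   leave every other character unchanged. */
static int ascii_lower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/* Hypothetical helper: returns 1 if the two names are equal when ASCII
   case is ignored, 0 otherwise. Hyphens and other characters must match. */
int charset_name_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (ascii_lower((unsigned char)*a) != ascii_lower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == *b;                    /* equal only if both ended together */
}
```

With this comparison, "utf-8" matches the registered name "UTF-8", while "utf8" does not, because the hyphen is part of the name.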

Note that there is also a case-sensitive alias for the name UTF-8, namely: csUTF8.

The "cs" stands for character set and is provided for applications that need a lower case first letter but want to use mixed case thereafter that cannot contain any special characters, such as underbar ("_") and dash ("-").

If it's not IANA, where does utf8 likely come from?

glibc's _nl_normalize_codeset() does the following:

  • Only passes letters and digits (goodbye, hyphen)

  • Converts letters to lowercase

    for (cnt = 0; cnt < name_len; ++cnt)
      if (__isalpha_l ((unsigned char) codeset[cnt], locale))
        *wp++ = __tolower_l ((unsigned char) codeset[cnt], locale);
      else if (__isdigit_l ((unsigned char) codeset[cnt], locale))
        *wp++ = codeset[cnt];
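
The quoted loop depends on glibc internals (__isalpha_l, __tolower_l, and a locale argument), so it cannot be compiled on its own. The following is a simplified, ASCII-only sketch of the same two steps, not glibc's actual code. The name normalize_codeset_ascii and the buffer handling are assumptions, and anything else the real function does is left out.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the quoted loop: keep ASCII letters
   (lowercased) and digits, drop everything else.
   Writes a NUL-terminated result to out and returns its length. */
size_t normalize_codeset_ascii(const char *codeset, char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return 0;

    for (const char *p = codeset; *p != '\0' && n + 1 < out_size; ++p) {
        char c = *p;
        if (c >= 'A' && c <= 'Z')
            out[n++] = (char)(c - 'A' + 'a');   /* lowercase the letter */
        else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
            out[n++] = c;                       /* keep as-is */
        /* anything else, such as '-', is dropped */
    }
    out[n] = '\0';
    return n;
}
```

Feeding it "UTF-8" produces "utf8", which matches the spelling that locale -a prints.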
    

The code comment incorrectly says:

There is no standard for the codeset names.

This comment doesn't seem cognisant of RFC2978 IANA Charset Registration Procedures, 2.3. Naming Requirements.
