Which Chinese locales are needed to avoid missing or mismatched characters

character-encoding, language, locale, unicode

Arch Linux lists the following different Chinese locales in /etc/locale.gen:

#zh_CN.GB18030 GB18030
#zh_CN.GBK GBK
#zh_CN.UTF-8 UTF-8
#zh_CN GB2312
#zh_HK.UTF-8 UTF-8
#zh_HK BIG5-HKSCS
#zh_SG.UTF-8 UTF-8
#zh_SG.GBK GBK
#zh_SG GB2312
#zh_TW.EUC-TW EUC-TW
#zh_TW.UTF-8 UTF-8
#zh_TW BIG5

These locales are region specific (representing mainland China, Hong Kong, Singapore and Taiwan, and this doesn’t even include Japan and Korea), but there still are multiple locales for each region.

Due to the unsystematic nature of the characters and their spread over such a large area of use, incorporating Chinese characters into Unicode has not been trivial: usage varies regionally, there are numerous variants of the same characters, political and cultural factors play a role, and even within a single region people may prefer certain character variants over the official ones, not only in handwriting.

Some of the technical problems are explained here in layman’s terms.

The way I understand it is that characters that are the same in traditional and simplified Chinese (like 你) as well as different variations of the same character within the same category of “simplified” OR “traditional” get the same codepoint, and character variants are implemented using different fonts.

In contrast, sufficiently different versions (e. g. different simplified and traditional characters like 从 and 從) get different codepoints, and therefore, multiple versions of the same character could be included in the same font.

This interweaving (simplified, traditional and character variants are all incorporated in Unicode) raises the question of why all these different locales are necessary for Chinese, and whether, as a user, I need to install them all.

On a system with sufficient fonts installed (a glyph exists on the system for every character to be displayed):

Which of these locales do I really need to correctly display most characters?

Which Chinese encodings are already incorporated in another encoding (e. g. is UTF-8 backward compatible with Big5 or other Chinese encodings, as it is with ASCII)?

Best Answer

I'll be using Western examples to avoid Chinese politics and to avoid exposing my lack of knowledge of the Chinese language.

I'll be using some characters that may not be rendered correctly on your computer if you lack a font that covers them. That is not a limitation of Unicode or of this site's HTML, but a limitation of your computer's fonts ^[a] (see the end of this post).

Missing Codepoints

Your description:

These locales are region specific (representing mainland China, Hong Kong, Singapore and Taiwan, and this doesn’t even include Japan and Korea), but there still are multiple locales for each region.

Yes, but it doesn't matter character-wise. All the same characters^[b] are available in all countries.

^[b] Well, technically, a character is identified by the same codepoint whichever encoding stores it: UTF (all of them: 8, 16, 32, etc.), GB18030, GBK, GB2312, BIG5-HKSCS, EUC-TW and BIG5. The older encodings (GB2312, GBK, BIG5 and so on) each cover only a subset of those codepoints, while the UTF encodings cover them all. That is similar to the differences between code page 437 and code page 862 (to pick two random ones). Both share the whole ASCII range 0-127, and each tries to cover a specific language. Code page 437 is the character set of the original IBM PC (personal computer) and tries to cover generic Latin-script use (without being good at any one language). Code page 862 is a code page used under DOS for Hebrew. So every GB2312 locale uses the same byte values for the same characters, and those characters are available whatever the country.

Understand that all codepoints (and I do mean all) are available in UTF (any version).
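The point about codepoints versus encodings can be checked with a small sketch. It uses only Python's built-in codecs (EUC-TW is left out because the standard library does not ship a codec for it), and the helper name `encodings_of` is made up for this illustration. It suggests that the bytes differ from charset to charset, so UTF-8 is not byte-compatible with Big5 or GB2312 the way it is with ASCII, while decoding always leads back to the same codepoint.

```python
# Encode one character in several of the charsets listed in /etc/locale.gen.
# Codec names are Python's own; EUC-TW is omitted (no stdlib codec).
CODECS = ["utf-8", "utf-16-be", "gb18030", "gbk", "gb2312", "big5hkscs", "big5"]


def encodings_of(char):
    """Map codec name -> encoded bytes, skipping codecs that lack the character."""
    result = {}
    for codec in CODECS:
        try:
            result[codec] = char.encode(codec)
        except UnicodeEncodeError:
            # This charset has no slot for the character at all.
            continue
    return result


for char in ("你", "從"):
    print(f"{char} U+{ord(char):04X}")
    for codec, raw in encodings_of(char).items():
        # Decoding always leads back to the same Unicode codepoint.
        assert raw.decode(codec) == char
        print(f"  {codec:10} {raw.hex(' ')}")
```

A codec missing from the output for a given character (GB2312 for the traditional 從, for instance) means that charset cannot represent it, which is the practical argument for choosing a UTF-8 locale.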

The reason for having several countries is to deal with date formats, weekday names, currency name and all those other elements. Not characters.
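To make the structure of the locale names visible, here is a rough sketch that splits the entries from the question's /etc/locale.gen excerpt into language, territory and charset. The function names `parse_entry` and `charsets_by_territory` are invented for this example, and the parsing assumes the simple `language_TERRITORY[.CODESET] CHARSET` shape seen in that excerpt.

```python
from collections import defaultdict

# Entries copied from the /etc/locale.gen excerpt in the question.
LOCALE_GEN = """\
#zh_CN.GB18030 GB18030
#zh_CN.GBK GBK
#zh_CN.UTF-8 UTF-8
#zh_CN GB2312
#zh_HK.UTF-8 UTF-8
#zh_HK BIG5-HKSCS
#zh_SG.UTF-8 UTF-8
#zh_SG.GBK GBK
#zh_SG GB2312
#zh_TW.EUC-TW EUC-TW
#zh_TW.UTF-8 UTF-8
#zh_TW BIG5
"""


def parse_entry(line):
    """Split '#zh_CN.GBK GBK' into (language, territory, charset)."""
    name, charset = line.lstrip("#").split()
    base = name.split(".")[0]  # drop the optional '.CODESET' suffix
    language, territory = base.split("_")
    return language, territory, charset


def charsets_by_territory(text):
    """Group the charsets offered for each territory."""
    table = defaultdict(list)
    for line in text.splitlines():
        if line.strip():
            _, territory, charset = parse_entry(line)
            table[territory].append(charset)
    return dict(table)


for territory, charsets in charsets_by_territory(LOCALE_GEN).items():
    print(territory, charsets)
```

Every territory offers a UTF-8 variant. The territory part selects the regional conventions (dates, currency and so on), while the charset part only selects how text is stored as bytes.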

That covers the first part of your title:

Which Chinese locales are needed to avoid missing ...

Pick any country you like (even US, GB, FR, etc.; any will do) that uses UTF (probably UTF-8 these days) and forget about missing characters or, again, codepoints.

Mismatched images

Note that I did not use the word characters, but images, the reason is long to explain:

Graphemes

Codepoints, as explained above, are not the same as the images used to express a sound or idea. Human language is far more complex than that simplistic description. To have a working description of what a grapheme is, we may cite Wikipedia (emphasis mine):

Grapheme: In linguistics, a grapheme is the smallest unit of a writing system of any given language. An individual grapheme may or may not carry meaning by itself, and may or may not correspond to a single phoneme of the spoken language. Graphemes include alphabetic letters, typographic ligatures, Chinese characters, numerical digits, punctuation marks, and other individual symbols. A grapheme can also be construed as a graphical sign that independently represents a portion of linguistic material.

So: somewhat similar to saying: graphical symbol.

But the symbol might not be in a one-to-one relation to sound (in western scripts), as explained by:

The Word Burger

There is one sound for two o.

Nor is it connected to a single idea (meaning) as the image of the word spoon, that is: Spoon may be connected to several ideas.

Morphemes

And Chinese is even more copious (in character count):

... the Chinese writing system is made up of an unlimited set of characters or logographs that represent a unit of meaning or morpheme (i.e., a word). Like any other language, Chinese has thousands of words. Thus, the Chinese writing system requires thousands of characters to represent each of its unique morphemes.

Glyphs

And each logograph (which represents a unit of meaning) may have several images.

Various glyphs representing the lower case letter "a"; they are allographs of the grapheme "a"

So, the answer to the second part of your title is inherently more complex than the simplistic "list of character numbers" (codepoint).

What seems to be a simple "R" may be a Latin character, a mathematical symbol (the Mathematical Alphanumeric Symbols block alone holds bold, italic, script and Fraktur forms of R, to name just some) or even an accented or modified letter (Ŕ, Ř, Ȑ, Ȓ or Ɍ and more). So, in fact, there is not one simple "R", but several of them.
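The different kinds of "R" can be inspected with Python's `unicodedata` module. This is only a sketch: the sample list is my own choice (U+211D and U+1D411 stand in for the mathematical forms), and `describe` is a helper name invented here.

```python
import unicodedata

# A few codepoints that all look like some kind of "R".
SAMPLES = [
    "R",           # U+0052, the plain Latin letter
    "\u0154",      # R with acute
    "\u0158",      # R with caron
    "\u0210",      # R with double grave
    "\u0212",      # R with inverted breve
    "\u024C",      # R with stroke
    "\u211D",      # double-struck R used in mathematics
    "\U0001D411",  # mathematical bold R
]


def describe(char):
    """Return (official name, canonical decomposition, compatibility form)."""
    name = unicodedata.name(char)
    nfd = unicodedata.normalize("NFD", char)    # base letter + combining marks
    nfkc = unicodedata.normalize("NFKC", char)  # folds font-style variants
    return name, nfd, nfkc


for char in SAMPLES:
    name, nfd, nfkc = describe(char)
    parts = " ".join(f"U+{ord(c):04X}" for c in nfd)
    print(f"U+{ord(char):04X} {name}: NFD = {parts}, NFKC = {nfkc}")
```

The accented letters decompose into a plain R plus a combining mark, and the mathematical forms fold back to a plain R under compatibility normalization, yet each one is a separate codepoint that a font has to cover.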

Fonts

^[a] To be able to present all these different codepoints you need fonts.

Fonts have a concept called coverage: how many codepoints they include.

An extreme (very big file) example is the coverage of all (visible) codepoints in the BMP by Unifont.

For example, 𑄌 (U+1110C CHAKMA LETTER CAA, from the Chakma alphabet) is not a letter you would expect to see, but it is included in Unifont.
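Coverage can be thought of as a set of codepoints. The sketch below uses a made-up coverage set (`TOY_COVERAGE`, not taken from any real font) and two hypothetical helpers to show how a text is checked against it. Note in passing that U+1110C lies outside the BMP, so it lives in one of the supplementary planes.

```python
def missing_codepoints(text, coverage):
    """Return the characters of `text` whose codepoints are not in `coverage`."""
    return sorted({ch for ch in text if ord(ch) not in coverage}, key=ord)


def in_bmp(char):
    """True if the character lies in the Basic Multilingual Plane (U+0000..U+FFFF)."""
    return ord(char) <= 0xFFFF


# Hypothetical font: printable ASCII plus the CJK Unified Ideographs block.
TOY_COVERAGE = set(range(0x20, 0x7F)) | set(range(0x4E00, 0xA000))

sample = "你 R \U0001110C"
for ch in missing_codepoints(sample, TOY_COVERAGE):
    plane = "BMP" if in_bmp(ch) else "supplementary plane"
    print(f"no glyph for U+{ord(ch):04X} ({plane})")
```

With a real font, the coverage set would come from the font's character map; a missing entry is what shows up on screen as an empty box or a question mark.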

There are other open (free) fonts with excellent coverage, like Code2000 and Noto.

An example of different fonts in Chinese is on this Wikipedia page.
