This question was prompted by my earlier question, "Chromium browser does not allow setting the default paper size for 'Print to File'", and by a conversation with @Gilles on chat. As pointed out by @don_crissti, and as I verified, changing the locale (at least LC_PAPER) changes which paper size is selected.
I had never given much thought to which locale to select, and had always gone with en_US.UTF-8 because it seemed like a reasonable default choice.
However, @Gilles disagreed on chat (see the conversation starting at http://chat.stackexchange.com/transcript/message/17017095#17017095). Extracts:

Gilles: LC_PAPER defaults to $LANG
Gilles: You must have LANG=en_US.UTF-8. That's a bad idea: it sets LC_COLLATE and that's almost always a bad thing
Gilles: LC_COLLATE doesn't describe correct collation, it's too restrictive (it goes character by character) remove LANG and instead set LC_CTYPE and LC_PAPER
Gilles: plus LC_MESSAGES if you want messages in a language other than English
Clearly, there are issues here that I am not aware of, and I am sure the same goes for many others. So, what issues should you consider when setting locales, and how should you set them? I've always just run dpkg-reconfigure locales in Debian, and not thought twice about it.
Specific question: Should I set my locale to en_IN.UTF-8? Are there any drawbacks of doing so?
Best Answer
Locale settings are user preferences that relate to your culture.
Locale names
On all current unix variants that I know of (but not on a few antiques), locale names follow the same pattern:
- A lowercase ISO 639 language code: en for English, de for German, ja for Japanese, uk for Ukrainian, ber for Berber, …
- Optionally, an underscore (_) followed by an ISO 3166 uppercase two-letter country code. Thus: en_US for US English, en_GB for British English, fr_CA for Canadian (Québec) French, de_DE for German of Germany, de_AT for German of Austria, ja_JP for Japanese (of Japan), etc.
- Optionally, a dot (.) followed by the name of a character encoding such as UTF-8, ISO-8859-1, KOI8-U, GB2312, Big5, etc. With GNU libc at least (I don't know how widespread this is), case and punctuation are ignored in encoding names. For example, zh_CN.UTF-8 is Mandarin (simplified) Chinese encoded in UTF-8, while zh_CN is Mandarin Chinese encoded in GB2312, and zh_TW is Taiwanese (traditional) Chinese encoded in Big5.
- Optionally, an at sign (@) followed by the name of a variant. The meaning of variants is locale-dependent. For example, many European countries have an @euro locale variant where the currency sign is € and where the encoding is one that includes this character (ISO 8859-15 or ISO 8859-16), as opposed to the unadorned variant with the older currency sign. For example, en_IE (English, Ireland) uses the latin1 (ISO 8859-1) encoding and £ as the currency symbol while en_IE@euro uses the latin9 (ISO 8859-15) encoding and € as the currency symbol.

In addition, there are two locale names that exist on all unix-like systems: C and POSIX. These names are synonymous and mean computerese, i.e. default settings that are appropriate for data that is parsed by a computer program.

Locale settings
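Since every setting discussed below takes a locale name as its value, it may help to see the naming pattern from the previous section taken apart. The following is only a sketch: parse_locale_name is a hypothetical helper (not a standard utility), and it assumes the name follows the language[_COUNTRY][.encoding][@variant] pattern described above.

```sh
# parse_locale_name NAME
# Split NAME into language, country, encoding and variant using only POSIX
# parameter expansion. Hypothetical helper for illustration.
# The function body is a subshell, so its variables do not leak out.
parse_locale_name() (
    name=$1
    variant=
    encoding=
    country=
    # Strip the parts from right to left: @variant, then .encoding, then _COUNTRY.
    case $name in
        *@*) variant=${name#*@}; name=${name%%@*} ;;
    esac
    case $name in
        *.*) encoding=${name#*.}; name=${name%%.*} ;;
    esac
    case $name in
        *_*) country=${name#*_}; name=${name%%_*} ;;
    esac
    printf 'language=%s country=%s encoding=%s variant=%s\n' \
        "$name" "$country" "$encoding" "$variant"
)

parse_locale_name zh_CN.UTF-8   # language=zh country=CN encoding=UTF-8 variant=
parse_locale_name en_IE@euro    # language=en country=IE encoding= variant=euro
parse_locale_name C             # language=C country= encoding= variant=
```

An empty field simply means that the optional part was absent from the name.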
The following locale categories are defined by POSIX:
- LC_CTYPE: the character set used by terminal applications: classification data (which characters are letters, punctuation, spaces, invalid, etc.) and case conversion. Text utilities typically heed LC_CTYPE to determine character boundaries.
- LC_COLLATE: collation (i.e. sorting) order. This setting is of very limited use for several reasons. Notably, LC_COLLATE can have nasty side effects, in particular because it causes the sort order A < a < B < …, which makes “between A and Z” include the lowercase letters a through y. As a result, very common regular expressions like [A-Z] break some applications.
- LC_MESSAGES: the language of informational and error messages.
- LC_NUMERIC: number formatting: decimal and thousands separator. Many applications hard-code "." as a decimal separator. This makes LC_NUMERIC not very useful and potentially dangerous: locale-formatted output can end up being read by an application that expects "." to be the decimal point, or "," to be a field separator.
- LC_MONETARY: like LC_NUMERIC, but for amounts of local currency. Very few applications use this.
- LC_TIME: date and time formatting: weekday and month names, 12 or 24-hour clock, order of date parts, punctuation, etc.

GNU libc, which you'll find on non-embedded Linux, defines additional locale categories:

- LC_PAPER: the default paper size (defined by height and width).
- LC_NAME, LC_ADDRESS, LC_TELEPHONE, LC_MEASUREMENT, LC_IDENTIFICATION: I don't know of any application that uses these.

Environment variables
Applications that use locale settings determine them from environment variables.
- The value of the LANG environment variable is used unless overridden by another setting. If LANG is not set, the default locale is C.
- The LC_xxx names can be used as environment variables; each one overrides LANG for its own category.
- If LC_ALL is set, then all other values are ignored; this is primarily useful to set LC_ALL=C to run applications that need to produce the same output regardless of where they are run.
- On GNU systems, LANGUAGE can be set to define fallbacks for LC_MESSAGES (e.g. LANGUAGE=fr_BE:fr_FR:en to prefer Belgian French, or if unavailable France French, or if unavailable English).

Installing locales
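Before installing anything, it can help to know which value actually wins for a given category. The precedence among LC_ALL, the LC_xxx variables and LANG can be sketched as a small shell function. This is an illustration under assumptions, not how libc is implemented: effective_locale is a hypothetical helper, it treats empty values as unset, and it does not check that the resulting locale is installed.

```sh
# effective_locale CATEGORY
# Print the locale name that applies to CATEGORY (e.g. LC_PAPER), using the
# precedence LC_ALL > LC_xxx > LANG > C. Hypothetical helper for illustration.
effective_locale() {
    case $1 in
        LC_CTYPE|LC_COLLATE|LC_MESSAGES|LC_NUMERIC|LC_MONETARY|LC_TIME) ;;
        LC_PAPER|LC_NAME|LC_ADDRESS|LC_TELEPHONE|LC_MEASUREMENT|LC_IDENTIFICATION) ;;
        *) echo "unknown category: $1" >&2; return 2 ;;
    esac
    # Read the variable whose name is in $1 (safe: $1 was validated above).
    eval "_el_specific=\${$1-}"
    if [ -n "${LC_ALL-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$_el_specific" ]; then
        printf '%s\n' "$_el_specific"
    elif [ -n "${LANG-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf 'C\n'
    fi
}

# Demonstration in a subshell so the calling shell's settings are untouched.
# Some shells may print a warning on stderr if these example locales are
# not installed; the function itself only compares strings.
(
    unset LC_ALL LC_PAPER LC_TIME
    LANG=en_US.UTF-8
    LC_PAPER=en_GB.UTF-8
    effective_locale LC_PAPER    # en_GB.UTF-8 (LC_PAPER overrides LANG)
    effective_locale LC_TIME     # en_US.UTF-8 (falls back to LANG)
    LC_ALL=C
    effective_locale LC_PAPER    # C (LC_ALL overrides everything)
)
```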
Locale data can be large, so some distributions don't ship them in a usable form and instead require an additional installation step.
- On Debian, run dpkg-reconfigure locales and select from the list in the dialog box, or edit /etc/locale.gen and then run locale-gen.
- On some other distributions (Ubuntu, for example), run locale-gen with the names of the locales as arguments.

You can define your own locale.
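To check whether a locale is already usable before (or after) running the commands above, the list printed by locale -a can be searched. The sketch below assumes the locale utility is available; locale_installed is a hypothetical helper. Because GNU libc ignores case and punctuation in encoding names (and may list en_US.UTF-8 as en_US.utf8), the comparison lowercases both sides and drops hyphens.

```sh
# locale_installed NAME
# Succeed if NAME appears in the output of `locale -a`, comparing
# case-insensitively and ignoring hyphens. Hypothetical helper.
locale_installed() {
    _li_wanted=$(printf '%s\n' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]' | sed 's/-//g')
    locale -a 2>/dev/null |
        LC_ALL=C tr '[:upper:]' '[:lower:]' |
        sed 's/-//g' |
        grep -qxF -- "$_li_wanted"
}

# The result depends on what is installed on the machine.
for l in C en_US.UTF-8 en_IN.UTF-8; do
    if locale_installed "$l"; then
        echo "$l: installed"
    else
        echo "$l: not installed"
    fi
done
```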
Recommendation
The useful settings are:
- Set LC_CTYPE to the language and encoding that you encode your text files in. Ensure that your terminals use that encoding. For most languages, only the encoding matters. There are a few exceptions; for example, an uppercase i is I in most languages but İ in Turkish (tr_TR).
- Set LC_MESSAGES to the language that you want to see messages in.
- Set LC_PAPER to en_US if you want US Letter to be the default paper size and just about anything else (e.g. en_GB) if you want A4.
- Set LC_TIME to your favorite time format.

As explained above, avoid setting LC_COLLATE and LC_NUMERIC. If you use LANG, explicitly override these two categories by setting them to C.
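As a sketch, the variant that keeps LANG could look like this in a shell startup file such as ~/.profile. The locale values are examples only (substitute ones that are installed on your system), and the last line is just a quick check that sorting follows byte order once LC_COLLATE is C.

```sh
# Example values only; pick locales that exist on your system.
unset LC_ALL                    # LC_ALL would override everything below
export LANG=en_US.UTF-8         # default for categories not set explicitly
export LC_COLLATE=C             # avoid locale-specific sort order
export LC_NUMERIC=C             # keep "." as the decimal separator
export LC_PAPER=en_GB.UTF-8     # anything other than en_US selects A4

# Quick check: with LC_COLLATE=C, uppercase ASCII sorts before lowercase.
printf '%s\n' b A a B | sort | tr '\n' ' '; echo    # A B a b
```

To follow the advice from the chat instead, leave LANG unset and export only LC_CTYPE, LC_PAPER and, if wanted, LC_MESSAGES and LC_TIME.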