I am a little confused over the difference between encoding and locale as it pertains to the postgresql database.
When initializing the database, you can specify both an encoding and a locale. Correct me if I am wrong but I assume the encoding defines what the database is actually stored as on the computer.
So if I specify UTF8
, all the characters in the UTF8
character set would be valid characters. If I specified WIN1252
, all those characters would be valid characters. What I dont get is where the locale comes into play. If I specify my encoding as UTF8
and then specify my locale as English_United States.1252
, what does that exactly mean?
I think the WIN1252
character set is a subset of UTF8
so is the locale just specifying the subset of characters to be used from the UTF8
character set?
From what I have read is that UTF8 can be used with ANY locale so what is the point in specifying different encodings if UTF8 is so ubiquitous and the locale is really the one specifying the specific character set to be used?
Also, on Linux, the locale can be specified like so: en_us.utf8
. So the database encoding is specified in the locale? If the encoding is specified in the locale, why even have a -encoding
flag when initializing the database?
Best Answer
The
encoding
defines the very basic rules how characters are represented in binary format (like @a_horse explains in his comment). It should be mentioned that the server encoding has to match the client encoding for successful communication. Postgres can translate if necessary, there is a dedicated settingclient_encoding
for this.The
locale
is a superset of settings, which can be split up for PostgreSQL intoThe settings of particular interest for you are
LC_COLLATE
(defines how strings are sorted) andLC_TYPE
(defines the type of characters).In older versions, these two settings could not be changed after a database had been initialized. Since Postgres 9.1 you can at least override the collation setting when needed.