PostgreSQL – Impact of LC_CTYPE on Database

collation postgresql

So, I have a few Debian servers with PostgreSQL on them. Historically, those servers and PostgreSQL were localized with the Latin 9 charset, and back then that was fine. Now we have to handle things like Polish, Greek or Chinese, so changing it has become a growing issue.

When I tried to create a UTF8 database, I got this message:

ERROR: encoding UTF8 does not match locale fr_FR
DETAIL: The chosen LC_CTYPE setting requires encoding LATIN9.
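
For reference, the statement that triggers it looks something like this (a minimal reproduction; the real database names differ):

-- On a cluster initialized with fr_FR (LATIN9), asking for UTF8
-- without overriding the locale is refused:
CREATE DATABASE my_utf8_db
  WITH ENCODING='UTF8'
       TEMPLATE=template0;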

A few times I did some research on the subject with my old pal Google, and all I could find were over-complicated procedures: updating the Debian LANG, recompiling PostgreSQL with the correct charset, editing all the LC_* system variables, and other obscure solutions. So for the time being, we set the issue aside.

Recently, it came back again: the Greeks want their stuff and Latin 9 can't oblige. While I was looking into the issue again, a colleague came to me and said, “Nah, it's easy, look.”

He edited nothing and did no magic tricks; he just ran this SQL query:

CREATE DATABASE my_utf8_db
  WITH ENCODING='UTF8'
       OWNER=admin
       TEMPLATE=template0
       LC_COLLATE='C'
       LC_CTYPE='C'
       CONNECTION LIMIT=-1
       TABLESPACE=pg_default;

And it worked fine.
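
For the record, the resulting settings can be checked in pg_database (expected values in the comments):

SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,  -- UTF8
       datcollate,                                 -- C
       datctype                                    -- C
  FROM pg_database
 WHERE datname = 'my_utf8_db';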

I actually didn't know about LC_CTYPE='C', and I was surprised that it wasn't among the first solutions on Google, or even on Stack Overflow. Looking around, all I found was a mention in the PostgreSQL documentation:

When LC_CTYPE is C or POSIX, any character set is allowed, but for other settings of LC_CTYPE there is only one character set that will work correctly. Since the LC_CTYPE setting is frozen by initdb, the apparent flexibility to use different encodings in different databases of a cluster is more theoretical than real, except when you select C or POSIX locale (thus disabling any real locale awareness).
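
Incidentally, that explains the TEMPLATE=template0 in my colleague's query: template1 carries the frozen cluster locale, so cloning it with a different LC_CTYPE is refused with something like:

CREATE DATABASE my_utf8_db
  WITH ENCODING='UTF8'
       LC_COLLATE='C'
       LC_CTYPE='C';
-- ERROR: new LC_CTYPE (C) is incompatible with the LC_CTYPE of the
--        template database (fr_FR)
-- HINT: Use the same LC_CTYPE as in the template database, or use
--       template0 as template.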

So it made me wonder: this is too easy, too perfect, what are the downsides? And I'm having a hard time finding an answer. So here I am, posting:

tl;dr: What are the downsides of using LC_CTYPE='C' over a specific localization? Is it bad to do so? What should I expect to break?

Best Answer

What are the downsides of using LC_CTYPE='C' over a specific localization

The documentation mentions the relationship between locales and SQL features in Locale Support:

The locale settings influence the following SQL features:

  • Sort order in queries using ORDER BY or the standard comparison operators on textual data

  • The upper, lower, and initcap functions

  • Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions

  • The to_char family of functions

  • The ability to use indexes with LIKE clauses

The first item (sort order) is about LC_COLLATE; the others all seem to be about LC_CTYPE.
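
The last item in that list is worth a quick illustration: with a non-C collation, an ordinary btree index cannot serve LIKE 'prefix%' searches, and you need the text_pattern_ops operator class (table and index names below are made up):

-- Hypothetical table to show the LIKE/index interaction:
CREATE TABLE people (firstname text);

-- With a non-C collation, this index is NOT usable for
-- WHERE firstname LIKE 'b%' prefix searches:
CREATE INDEX people_name_idx ON people (firstname);

-- Declaring text_pattern_ops (or using a C collation) makes it usable:
CREATE INDEX people_name_pattern_idx
    ON people (firstname text_pattern_ops);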

LC_COLLATE

LC_COLLATE affects comparisons between strings. In practice, the most visible effect is the sort order. LC_COLLATE='C' (or POSIX, which is a synonym) means that byte order drives comparisons, whereas a locale in the language_REGION form means that cultural rules drive them.

An example with French names, executed from inside a UTF-8 database:

select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris'))
 AS l(firstname)
order by firstname collate "fr_FR";

Result:

 firstname 
-----------
 béatrice
 bérénice
 bernard
 boris

béatrice comes before boris because the accented E compares against O as if it were unaccented. It's a cultural rule.

This differs from what happens with a C locale:

select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris')) 
 AS l(firstname)
order by firstname collate "C";

Result:

 firstname 
-----------
 bernard
 boris
 béatrice
 bérénice

Now the names with an accented E are pushed to the end of the list. The UTF-8 byte representation of é is the hexadecimal C3 A9, while o is 6F. Since C3 is greater than 6F, under the C locale 'béatrice' > 'boris'.
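
These byte values can be checked directly from SQL:

SELECT convert_to('é', 'UTF8') AS e_acute,  -- \xc3a9
       convert_to('o', 'UTF8') AS o;        -- \x6f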

It's not just accents. There are more complex rules for hyphenation, punctuation, and unusual characters like œ. Quirky cultural rules are to be expected in every locale.

Now, if the strings being compared mix different languages, as with a firstname column for people from all over the world, it may be that no particular locale should dominate anyway, because the alphabets of different languages were not designed to be sorted against each other.

In this case C is a rational choice, and it has the advantage of being faster, because nothing can beat pure byte comparisons.
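
Note that picking C for the database is not all-or-nothing: as the queries above show, a collation can be applied per expression, and it can also be declared per column (assuming the fr_FR collation exists in the cluster):

CREATE TABLE clients (
    firstname text COLLATE "fr_FR"  -- French rules for this column only
);

-- ORDER BY on this column now uses French rules by default:
-- SELECT firstname FROM clients ORDER BY firstname;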

LC_CTYPE

Having LC_CTYPE set to 'C' implies that C functions like isupper(c) or tolower(c) give expected results only for characters in the US-ASCII range (that is, up to codepoint 0x7F in Unicode).

Because SQL functions like upper(), lower() or initcap() are implemented in Postgres on top of these libc functions, they're affected as soon as strings contain non-US-ASCII characters.

Example:

test=> show lc_ctype;
  lc_ctype   
-------------
 fr_FR.UTF-8
(1 row)

-- Good result
test=> select initcap('élysée');
 initcap 
---------
 Élysée
(1 row)

-- Wrong result
-- collate "C" is the same as if the db has been created with lc_ctype='C'
test=> select initcap('élysée' collate "C");
 initcap 
---------
 éLyséE
(1 row)

For the C locale, é is treated as an uncategorizable character.

Similar wrong results also show up with regular expressions:

test=> select 'élysée' ~ '^\w+$';
 ?column? 
----------
 t
(1 row)

test=> select 'élysée' COLLATE "C" ~ '^\w+$';
 ?column? 
----------
 f
(1 row)
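
The flip side is that, in a database created with LC_CTYPE='C', the same per-expression COLLATE clause lets you opt back in to locale-aware behavior where you need it, provided the corresponding OS locale is installed and the collation is visible in pg_collation (an assumption about your setup, of course):

-- In a lc_ctype='C' database, reclaim French rules per expression:
SELECT initcap('élysée' COLLATE "fr_FR");   -- Élysée
SELECT 'élysée' COLLATE "fr_FR" ~ '^\w+$';  -- t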