Postgresql – Collations and codesets in PostgreSQL 9.x

character-setcollationpostgresql

In Windows 7, the collation is set to "English_United States.1252" when I select the "English, United States" locale in the installer. In Linux, the collation is set to "en_US.UTF-8" by default.

I haven't found any way to set the collation's codeset to UTF-8 in Windows so I'm just wondering if the databases will behave differently in these example cases? Or in general, what's the impact of the codeset part of a collation to begin with?

I have the encoding set to UTF-8 in both databases and the question is whether the difference in codesets of collations will cause difference in behavior.

Best Answer

The documentation is not very clear with respect to the relationship between the encoding / character set of the database and the ctype / codeset of the collation. All that it mentions are the following statements (all found on the 22.3. Character Set Support documentation page):

  • each database's character set must be compatible with the database's LC_CTYPE (character classification) and LC_COLLATE (string sort order) locale settings

  • On Windows, ..., UTF-8 encoding can be used with any locale.

  • Important: On most modern operating systems, PostgreSQL can determine which character set is implied by the LC_CTYPE setting, and it will enforce that only the matching database encoding is used. On older systems it is your responsibility to ensure that you use the encoding expected by the locale you have selected. A mistake in this area is likely to lead to strange behavior of locale-dependent operations such as sorting.

There is an implication here that the LC_CTYPE value only has rules for the characters in its Character Set. A value of 1252 indicates an Extended ASCII Code Page for Windows Latin1. All of those characters can be encoded into UTF-8 (you current encoding), but that doesn't necessarily mean that locale-aware functions such as upper, lower, initcap, etc will behave as expected when operating on characters that exist outside of the Code Page. This should be testable by running one of those functions on a character that is not in the Windows Latin1 / Code Page 1252 character set. For example, Latin Small Letter Nj U+01CC:

nj

should capitalize to:

NJ

So, if (sorry, I don't have PostgreSQL at the moment to test with) the following;

SELECT upper('nj'), lower('NJ');

returns:

NJ   nj

then it looks very positive that the "1252" LC_CTYPE value is not adversely affecting anything. It would still be good to try a SELECT with an ORDER BY since sorting was mentioned a few times in the documentation as an area that would likely be affected if these values -- encoding and LC_CTYPE -- were in conflict.


You cannot change either LC_COLLATE or LC_CTYPE once a database has been created, so you could try creating a new database to see if you can get the desired settings, regardless of what the installer thinks you should have:

CREATE DATABASE my_db_name WITH
            ENCODING 'UTF8' 
            LC_COLLATE='English_United States.UTF8'
            LC_CTYPE='English_United States.UTF8'
            TEMPLATE=template0;

You might need to look in the pg_collation system catalog to see what is available.