Postgresql – Collations and codesets in PostgreSQL 9.x

character-setcollationpostgresql

In Windows 7, the collation is set to "English_United States.1252" when I select the "English, United States" locale in the installer. In Linux, the collation is set to "en_US.UTF-8" by default.

I haven't found any way to set the collation's codeset to UTF-8 in Windows so I'm just wondering if the databases will behave differently in these example cases? Or in general, what's the impact of the codeset part of a collation to begin with?

I have the encoding set to UTF-8 in both databases and the question is whether the difference in codesets of collations will cause difference in behavior.

Best Answer

The documentation is not very clear with respect to the relationship between the encoding / character set of the database and the ctype / codeset of the collation. All that it mentions are the following statements (all found on the 22.3. Character Set Support documentation page):

each database's character set must be compatible with the database's LC_CTYPE (character classification) and LC_COLLATE (string sort order) locale settings
On Windows, ..., UTF-8 encoding can be used with any locale.
Important: On most modern operating systems, PostgreSQL can determine which character set is implied by the LC_CTYPE setting, and it will enforce that only the matching database encoding is used. On older systems it is your responsibility to ensure that you use the encoding expected by the locale you have selected. A mistake in this area is likely to lead to strange behavior of locale-dependent operations such as sorting.

There is an implication here that the LC_CTYPE value only has rules for the characters in its Character Set. A value of 1252 indicates an Extended ASCII Code Page for Windows Latin1. All of those characters can be encoded into UTF-8 (you current encoding), but that doesn't necessarily mean that locale-aware functions such as upper, lower, initcap, etc will behave as expected when operating on characters that exist outside of the Code Page. This should be testable by running one of those functions on a character that is not in the Windows Latin1 / Code Page 1252 character set. For example, Latin Small Letter Nj U+01CC:

ǌ

should capitalize to:

Ǌ

So, if ^{(sorry, I don't have PostgreSQL at the moment to test with)} the following;

SELECT upper('ǌ'), lower('Ǌ');

returns:

Ǌ   ǌ

then it looks very positive that the "1252" LC_CTYPE value is not adversely affecting anything. It would still be good to try a SELECT with an ORDER BY since sorting was mentioned a few times in the documentation as an area that would likely be affected if these values -- encoding and LC_CTYPE -- were in conflict.

You cannot change either LC_COLLATE or LC_CTYPE once a database has been created, so you could try creating a new database to see if you can get the desired settings, regardless of what the installer thinks you should have:

CREATE DATABASE my_db_name WITH
            ENCODING 'UTF8' 
            LC_COLLATE='English_United States.UTF8'
            LC_CTYPE='English_United States.UTF8'
            TEMPLATE=template0;

You might need to look in the pg_collation system catalog to see what is available.

Related Solutions

Sql-server – the SQL Server equivalent of MySQL’s utf8mb4 character set

First: the particular "edition" of SQL Server (i.e. Express, Standard, Enterprise, etc) does not matter. All editions for a particular version will behave the same.

When I switch to "Standard" it chooses the Latin1_General_CI_AS collation.

Well, that is better than a collation starting with SQL_, but still not ideal. If you are using SQL Server 2008 or newer, then you should be using a version 100 collation (or a version 140 collation IF using SQL Server 2017 or newer and specifying a Japanese collation). AND, if using SQL Server 2012 or newer, then you should be using a collation that supports supplementary characters, which means your choices are:

version 100 collations with names ending in _SC, or
version 140 collations (only Japanese collations have a version 140 set, but none of these end in _SC since supplementary character support is built into all version 140 collations)

In your case, most likely you want: Latin1_General_100_CI_AI_SC

Technically speaking, the closest equivalent to utf8mb4_general_ci (there is no utf8mb4_Standard, and your screen shot even shows utf8mb4_general_ci) is Latin1_General_CI_AI. Reason being:

The utf8mb4 character set allows you to store supplementary characters (which NVARCHAR also does, regardless of collation),
the general part of the MySQL collation means that supplementary characters all have the same weight. This pre-level 100 SQL Server collation is similar in that all supplementary characters have the same weight, it's just that they have no weight at all.
the ci in the MySQL collation implies ai since as isn't specified.

Still, you want to stick with: Latin1_General_100_CI_AI_SC.

I do not know which character encoding this is and how it would affect the scenario if we would like to take over some data from utf7mb8 MySQL tables into SQL Server.

The encoding is determined by a combination of the datatype and the collation:

NVARCHAR (and NCHAR / NTEXT) is always UTF-16 LE (Little Endian).
VARCHAR (and CHAR / TEXT) is always an 8-bit encoding, the specific encoding determined by the code page associated with the collation being used.

That said, it shouldn't matter what the source encoding is as long as the destination encoding can handle all of the incoming characters (and behave in a similar manner, of course, which is where the culture and sensitivities come in). Assuming you will be storing everything in NVARCHAR (maybe occasionally NCHAR, but never NTEXT as that has been deprecated since SQL Server 2005), then the data transfer tool will handle any necessary conversions.

what does the collation Latin1_General_CI_AS mean?

It means:

Because the name does not start with SQL_, this is a Windows collation, not a SQL Server collation (and this is a good thing as the SQL Server collations — those starting with SQL_ — are mainly for pre-SQL Server 2000 compatibility, although quite unfortunately SQL_Latin1_General_CP1_CI_AS is very common due to it being the default when installing on an OS using US English as its language)
Latin1_General is the culture / locale.
- For NVARCHAR data this determines the linguistic rules used for sorting and comparison.
- For VARCHAR data this determines the:
  - linguistic rules used for sorting and comparison.
  - code page used to encode the characters. For example, Latin1_General collations use code page 1252, Hebrew collations use code page 1255, and so on.
{version}, while not present in this collation name, refers to the SQL Server version in which the collation was introduced (for the most part). Windows collations with no version number in the name are version 80 (meaning SQL Server 2000 as that is version 8.0). Not all versions of SQL Server come with new collations, so there are gaps in the version numbers. There are some that are 90 (for SQL Server 2005, which is version 9.0), most are 100 (for SQL Server 2008, version 10.0), and a small set has 140 (for SQL Server 2017, version 14.0).

I said "for the most part" because the collations ending in _SC were introduced in SQL Server 2012 (version 11.0), but the underlying data wasn't new, they merely added support for supplementary characters for the built-in functions. So, those endings exist for version 90 and 100 collations, but only starting in SQL Server 2012.
Next you have the sensitivities, that can be in any combination of the following, but always specified in this order:
- CS = case-sensitive or CI = case-insensitive
- AS = accent-sensitive or AI = accent-insensitive
- KS = Kana type-sensitive or missing = Kana type-insensitive
- WS = width-sensitive or missing = width insensitive
- VSS = variation selector sensitive (only available in the version 140 collations) or missing = variation selector insensitive
Optional last piece:
- _SC at the end means "Supplementary Character support". The "support" only affects how the built-in functions interpret surrogate pairs (which are how supplementary characters are encoded in UTF-16). Without _SC at the end (or _140_ in the middle), built-in functions don't see a single supplementary character, but instead see two meaningless code points that make up the surrogate pair. This ending can be added to any non-binary, version 90 or 100 collation.
- _BIN or _BIN2 at the end means "binary" sorting and comparison. Data is still stored the same, but there are no linguistic rules. This ending is never combined with any of the 5 sensitivities or _SC. _BIN is the older style, and _BIN2 is the newer, more accurate style. If using SQL Server 2005 or newer, use _BIN2. For details on the differences between _BIN and _BIN2, please see: Differences Between the Various Binary Collations (Cultures, Versions, and BIN vs BIN2).
- _UTF8 is a new option as of SQL Server 2019. It's an 8-bit encoding that allows for Unicode data to be stored in VARCHAR and CHAR datatypes (but not the deprecated TEXT datatype). This option can only be used on collations that support supplementary characters (i.e. version 90 or 100 collations with _SC in their name, and version 140 collations). There is also a single binary _UTF8 collation (_BIN2, not _BIN).
  
  PLEASE NOTE: UTF-8 was designed / created for compatibility with environments / code that are set up for 8-bit encodings yet want to support Unicode. Even though there are a few scenarios where UTF-8 can provide up to 50% space savings as compared to NVARCHAR, that is a side-effect and has a cost of a slight hit to performance in many / most operations. If you need this for compatibility, then the cost is acceptable. If you want this for space-savings, you had better test, and TEST AGAIN. Testing includes all functionality, and more than just a few rows of data. Be warned that UTF-8 collations work best when ALL columns, and the database itself, are using VARCHAR data (columns, variables, string literals) with a _UTF8 collation. This is the natural state for anyone using this for compatibility, but not for those hoping to use it for space-savings. Be careful when mixing VARCHAR data using a _UTF8 collation with either VARCHAR data using non-_UTF8 collations or NVARCHAR data, as you might experience odd behavior / data loss. For more details on the new UTF-8 collations, please see: Native UTF-8 Support in SQL Server 2019: Savior or False Prophet?

Especially how would this behave if you had Chinese characters, for example?

If you are storing those characters in a VARCHAR column or variable, you would most likely lose those characters as they get converted to ?. There are a few Chinese locale collations that use a double-byte character set (DBCS) for VARCHAR that can store more than 256 different characters, but it's still not even close to what is available in Unicode.
If you are storing those characters in an NVARCHAR column or variable, there would be no data loss. However, with the Latin1_General culture / locale (Western European / US English), you won't get any Chinese-specific linguistic rules, so sorting and comparisons of the Chinese characters (anything different than what is in the default definition) will likely not behave as would be appropriate for that language. In which case you simply use a Chinese collation, version 100, and with _SC in the name.

SQL Server Collation – Where to Find Exact Rules

Unfortunately no, no such documentation exists. The most likely reason for this is that it would be much more difficult to document than most people are aware of, due to the actual complexity of the rules. While there is a default weight to each character (even if that weight is 0), how that weight is used is determined by several layers of rules. Some of those rules are dependent on the Collation (e.g. case, accent, Kana, and width sensitivity) and some are simply hard-coded (e.g. sorting lower-case before upper-case in most, but not all, cases). But those types of rules are just some of the types of rules that are processed.

Yes, there are simple mappings of æ = ae, but there are more complex mappings involving combining characters (characters that are not displayed on their own, but are super-imposed over base characters). For example:

SELECT NCHAR(252), NCHAR(0x0308), N'u' + NCHAR(0x0308)
-- ü    ̈   ü

IF (NCHAR(252) = N'u' + NCHAR(0x0308)) SELECT 'Same!' ELSE SELECT 'Nope.';
-- Same!

IF (NCHAR(252) = N'u') SELECT 'Same!' ELSE SELECT 'Nope.';
-- Nope.

IF (NCHAR(252) = N'u' COLLATE Latin1_General_100_CI_AI) SELECT 'Same!' ELSE SELECT 'Nope.';
-- Same!

Of course, some "characters" are comprised of a base character and multiple combining characters.

Also, some weights depend on context (i.e. where the character is placed in relation to other characters). For example, a hyphen (i.e. == minus sign == - ) by itself sorts before letters. A simple test shows this behavior:

SELECT * FROM (VALUES ('a'), ('c'), ('-')) t(c)
ORDER BY t.c COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

SELECT * FROM (VALUES ('a'), ('c'), ('-')) t(c)
ORDER BY t.c COLLATE Latin1_General_CI_AS ASC;

Both Collations have the same ordering here:

-
a
c

Now let's place the hyphen between some letters. First we will test with a SQL Server Collation (which uses the older, simpler "string sort"):

SELECT * FROM (VALUES ('a'), ('c'), ('-'), ('aaa'), ('aca'), ('a-b'), ('a-d')) t(c)
ORDER BY t.c COLLATE SQL_Latin1_General_CP1_CI_AS ASC;

That returns:

-
a
a-b
a-d
aaa
aca
c

Here there is no difference in that the - still comes before the letters, which is why both a- rows came before both the aa and ac rows. Now use a Windows Collation:

SELECT * FROM (VALUES ('a'), ('c'), ('-'), ('aaa'), ('aca'), ('a-b'), ('a-d')) t(c)
ORDER BY t.c COLLATE Latin1_General_CI_AS ASC;

That returns:

-
a
aaa
a-b
aca
a-d
c

What you are seeing here is that the hyphen, due to being between letters, is ignored (and yes, it is ignored only when sorting; when doing a comparison, it is not ignored, which can be seeing by running: SELECT 1 WHERE 'a-b' = 'ab' COLLATE Latin1_General_CI_AS;). This is why a-b sorts between aaa and aca as it is seen as being ab, and a-d comes after aca as it is seen as being ad.

Two other complicating factors are:

What versions of the Unicode Character Database (UCD) and Common Locale Data Repository (CLDR) are being used. It is not as straight-forward as one (or many) would hope, and the older versions of the Unicode standard either do not seem to have all of the base info, or at least they certainly don't make it easy to find
How precise is Microsoft's implementation of the Unicode standard? After all, it is just a standard and can be implemented differently by different vendors.

All that being said, you can still probably get pretty close to what you are looking for by examining some of the base data files from Unicode.org. But first, we will need to know which locale-specific customization file to look at. We can find this by getting the LCID of the Collation in question:

SELECT COLLATIONPROPERTY('Finnish_Swedish_CI_AS', 'LCID'),
       CONVERT(VARBINARY(3), COLLATIONPROPERTY('Finnish_Swedish_CI_AS', 'LCID'));
-- 1035       0x00040B

Now we can look for 040B on the following page: LCID Structure.

We see that 0x040B refers to fi-FI, which means that we will be looking for a file having either fi-FI or just fi in its name.

Explanations (some "light" reading ;-):
Data:
- Transformations (æ → ae, etc.)
- Remapped Unicode Element Table for use with CLDR
- Root collation order
- Finnish customizations (based on LCID of 1035 / 0x040B meaning "fi")

Best Answer

ǌ

Ǌ

Related Solutions

Sql-server – the SQL Server equivalent of MySQL’s utf8mb4 character set

SQL Server Collation – Where to Find Exact Rules

Related Question