To check what CLUSTER does, I took a table of mine from an earlier experiment which basically contained the first 10 million positive integers. I had already deleted some rows and there is another column as well, but these only affect the actual table size, so they are not that interesting.
First, having run VACUUM FULL on the table fka, I took its size:
\dt+ fka
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+----------+--------+-------------
public | fka | table | test | 338 MB |
Then let's see the physical order of the data from the very beginning of the table:
SELECT *, ctid FROM fka ORDER BY ctid LIMIT 5;
id | col1 | ctid
-----+------+---------
2 | 2 | (0,1)
3 | 3 | (0,2)
4 | 4 | (0,3)
5 | 5 | (0,4)
6 | 6 | (0,5)
Now let's delete some rows:
DELETE FROM fka WHERE id % 10 = 5;
--DELETE 1000000
After this, the reported table size did not change. So let's see now what CLUSTER does:
CLUSTER fka USING fka_pkey;
SELECT *, ctid FROM fka ORDER BY ctid LIMIT 5;
id | col1 | ctid
-----+------+---------
2 | 2 | (0,1)
3 | 3 | (0,2)
4 | 4 | (0,3)
6 | 6 | (0,4)
7 | 7 | (0,5)
After the operation, the table size changed from 338 MB to 296 MB. From the ctid column, which describes the physical position of the tuple within its page, you can also see that there is no gap where the row matching id = 5 used to be.
As the tuples were reordered, the indexes had to be rebuilt so that they point to the correct locations.

So the difference appears to be that VACUUM FULL does not reorder the rows. As far as I know there is some difference in the mechanisms the two commands use, but from a practical point of view this seems to be the main (only?) difference.
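To see the contrast directly, one could repeat the experiment with VACUUM FULL instead of CLUSTER (session sketch; the exact sizes and row layout will depend on your data):

VACUUM FULL fka;
SELECT *, ctid FROM fka ORDER BY ctid LIMIT 5;

VACUUM FULL also rewrites the table and reclaims the space of deleted rows, but it keeps the tuples in their existing physical order instead of sorting them by an index.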
The encoding defines the very basic rules for how characters are represented in binary format (like @a_horse explains in his comment). It should be mentioned that the server encoding does not have to match the client encoding: Postgres can translate between them when necessary, and there is a dedicated setting, client_encoding, for this.
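A quick way to inspect and change this in a psql session (standard commands; LATIN1 is just an example target):

SHOW server_encoding;
SHOW client_encoding;
SET client_encoding TO 'LATIN1';

After the SET, the server converts outgoing text to LATIN1 and raises an error for any character that has no representation in it.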
The locale is a collection of language- and culture-specific settings, which PostgreSQL splits up into:
LC_COLLATE
LC_CTYPE
LC_MESSAGES
LC_MONETARY
LC_NUMERIC
LC_TIME
The settings of particular interest for you are LC_COLLATE (defines how strings are sorted) and LC_CTYPE (defines character classification, i.e. which characters count as letters, digits, and so on).
In older versions, these two settings could not be changed after a database had been initialized. Since Postgres 9.1 you can at least override the collation setting when needed.
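For example, a per-expression override with the COLLATE clause (table and column names here are hypothetical):

SELECT name FROM customers ORDER BY name COLLATE "C";

This sorts by byte value for this one query, regardless of the database's default LC_COLLATE.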
Best Answer
The PostgreSQL documentation leaves a lot to be desired (just sayin').
To start with, there is only one encoding for a particular database, so C and C.UTF-8 in your UTF-8 database are both using the UTF-8 encoding.

For libc collations: typically collation names, by convention, are truly two-part names of the following structure:

{locale_name}.{encoding_name}
A "locale" (i.e. "culture") is the set of language-specific rules for sorting (LC_COLLATE) and capitalization (LC_CTYPE). Even though there is sometimes overlap, this really doesn't have anything to do with how the data is stored.

An "encoding" is how the data is stored (i.e. what byte sequence equates to which character). Even though there is sometimes overlap, this really doesn't have anything to do with the sorting and capitalization rules of any particular language that uses the encoding (some encodings can be used by multiple languages that have quite different rules in one or both of those areas).
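Locale and encoding come together when a database is created; a sketch (the database name and the platform-specific locale spellings are illustrative):

CREATE DATABASE korean_db
    ENCODING 'UTF8'
    LC_COLLATE 'ko_KR.utf8'
    LC_CTYPE 'ko_KR.utf8'
    TEMPLATE template0;

TEMPLATE template0 is required here because a locale or encoding differing from the template's cannot be combined with the default template1.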
To illustrate, consider storing Korean data: ko_KR is the locale, while the encoding could be any of:

EUC_KR (Extended UNIX Code-KR)
JOHAB
UHC (Unified Hangul Code / Windows949)
UTF8 (Unicode's 8-bit encoding)

Also consider the "Collation Support: libc collations" documentation, which explains that a collation name without an encoding suffix is taken to imply the encoding of the current database.
Meaning, in a database that uses the UTF-8 encoding, en_US and en_US.UTF8 are equivalent. BUT, between that database and a database that uses the LATIN1 encoding, the en_US collations are not equivalent.

So, does this mean that C and C.UTF-8 are the same?

NO, that would be too easy!!! The C collation is an exception to the above-stated behavior. The C collation is a simple set of rules that is available regardless of the database's encoding, and its behavior should be consistent across encodings (which is made possible by only recognizing the US English alphabet, "a-z" and "A-Z", as "letters", and by sorting by byte value, which should be the same for the encodings available to you).

The C.UTF-8 collation is actually a slightly enhanced set of rules, as compared to the base C rules. This difference can actually be seen in pg_collation, since the values of the collcollate and collctype columns differ between the rows for C and C.UTF-8.

I put together a set of test queries to illustrate some of the similarities and differences between these two collations, as well as compared to en_GB (and implicitly en_GB.utf8). I started with the queries provided in Daniel Vérité's answer, enhanced them to hopefully be clearer about what is and is not being shown, and added a few queries. The results show us that:

C and C.UTF-8 are actually different sets of rules, even if only slightly different, based on their respective values in the collcollate and collctype columns in pg_collation (final query)
C.UTF-8 expands the set of characters that are considered "letters"
C.UTF-8, unlike C (but like en_GB), recognizes invalid Unicode code points (i.e. U+0378) and sorts them towards the top
C.UTF-8, like C (but unlike en_GB), sorts non-US-English-letter characters by code point
ucs_basic appears to be equivalent to C (which is stated in the documentation)

You can find, and execute, the queries on: db<>fiddle
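The pg_collation comparison mentioned above can be reproduced with a catalog query along these lines (the exact set of collation names available depends on the platform):

SELECT collname, collcollate, collctype
FROM pg_collation
WHERE collname IN ('C', 'C.UTF-8', 'ucs_basic');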