Postgresql – Postgres – Encoding, Collation and CType


                                                        List of databases
           Name           |  Owner   | Encoding |          Collate           |           Ctype
--------------------------+----------+----------+----------------------------+----------------------------
 MyDatabase               | postgres | UTF8     | English_United States.1252 | English_United States.1252
Can someone explain how Encoding, Collate, and Ctype relate to one another?

I know that encoding affects how the information is actually stored (i.e. whether 'A' requires one byte or multiple bytes, and what values those bytes have, depends on the encoding).

I've been told that collate specifies rules for comparing characters. If you were to sort a bunch of strings, the collate type would dictate the order.

I've struggled to find what Ctype is; possibly it's related to concepts like uppercase and lowercase (given 'a', knowing that 'A' is the uppercase form?).

I don't understand how (as in my example) I can have a UTF8-encoded database and use a collate value of English 1252. UTF8 has many characters that win1252 does not; what happens if I attempt to sort or compare them? Is my current setup nonsensical? It seems like I would always want the Encoding/Collate/Ctype to agree.

Best Answer

I know that encoding affects how the information is actually stored (i.e. whether 'A' requires one byte or multiple bytes, and what values those bytes have, depends on the encoding).

Yes. Encoding is the character-to-byte conversion algorithm. For a single-byte encoding such as LATIN1, it's trivially byte value = character number, but for UTF-8 it's more complicated.
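A quick illustration (in Python, used here only as a neutral way to inspect bytes): the same character can occupy a different number of bytes, with different values, depending on the encoding.

```python
# 'A' is one byte in both LATIN1 and UTF-8, but 'é' differs:
print('A'.encode('latin-1'))   # b'A'        (1 byte)
print('A'.encode('utf-8'))     # b'A'        (1 byte)
print('é'.encode('latin-1'))   # b'\xe9'     (1 byte)
print('é'.encode('utf-8'))     # b'\xc3\xa9' (2 bytes)
```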

I've been told that collate specifies rules for comparing characters. If you were to sort a bunch of strings, the collate type would dictate the order.

Yes, collate specifies how strings are compared. The collation service is provided by the operating system or, optionally since PostgreSQL 10, by the ICU library. Sorting arbitrary strings according to linguistic rules is a complicated business: the Unicode standard and ISO provide rules, but they're heavily customizable, and not all C libraries implement them fully anyway. See for instance https://en.wikipedia.org/wiki/Unicode_collation_algorithm for more.
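To see why the collation matters, here is a minimal Python sketch contrasting plain code-point order (which is what Postgres's "C" collation gives you) with a simplified case-insensitive, linguistic-style order. This is only an illustration of the principle, not of Postgres internals.

```python
words = ['apple', 'Banana']

# Plain code-point order, like the "C" collation:
# 'B' (0x42) sorts before 'a' (0x61), so uppercase wins.
print(sorted(words))                    # ['Banana', 'apple']

# A simplified linguistic-style order, folding case first:
print(sorted(words, key=str.casefold))  # ['apple', 'Banana']
```

A real linguistic collation does far more than case-folding (accents, digraphs, locale tailoring), which is exactly why it is slower than byte comparison.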

I've struggled to find what Ctype is; possibly related to concepts like uppercase and lowercase (given 'a' knowing that 'A' is the uppercase form?).

LC_CTYPE in POSIX is related to the functions in ctype.h (character classification: isalpha(), isdigit(), toupper(), and so on). Postgres is strongly influenced by POSIX in this matter and uses a per-database lc_ctype more or less like POSIX uses LC_CTYPE.

Aside from upper() and lower(), it's also relevant to regular expressions and full-text search (case-folding of tokens).
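A short Python sketch of what "ctype-style" behavior covers, namely case mapping and character classification; again this only illustrates the concept, not Postgres's own implementation:

```python
# Case mapping is not a trivial byte operation; it depends on
# character semantics, which is what LC_CTYPE governs.
print('Straße'.upper())     # 'STRASSE' — German 'ß' uppercases to 'SS'
print('Straße'.casefold())  # 'strasse' — case-folding for comparisons

# Classification, the domain of ctype.h's isalpha()/isdigit():
print('é'.isalpha())        # True
print('٣'.isdigit())        # True — Arabic-Indic digit three
```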

I think that Postgres could do without an independent lc_ctype (reusing the value of lc_collate), except for one thing: it's useful to set lc_ctype to some <lang>_<country>.UTF-8 locale so that the full range of characters is properly supported, while setting lc_collate to C, because the C collation is much faster than any linguistic sort implied by lc_collate=<lang>_<country>.UTF-8.

Robert Haas (postgres committer) wrote an interesting post about essentially that subject when the COLLATE support was enhanced in Postgres: "The Perils of Collation-Aware Comparisons"

I don't understand how (as in my example) I can have a UTF8-encoded database and use a collate value of English 1252. UTF8 has many characters that win1252 does not

This is a Windows-specific thing, due to Windows not following the POSIX model for locales and collations. On Windows, with non-ICU collations, Postgres converts the strings from the database encoding to UTF-16 (wchar_t) and calls wcscoll_l(). This is why the encoding is decoupled from the collations.

With ICU collations, Postgres either passes UTF-8 content directly when it can, or converts the strings to UTF-16, so again collations are not tied to a particular database encoding, contrary to the POSIX model and its strcoll family of functions.