PostgreSQL – Thai and English in a PostgreSQL database

collation, encoding, postgresql

I have a PostgreSQL database with Thai and English strings in several tables.

Currently I have the following settings

| Encoding | Collate     | Ctype       |
| UTF8     | en_US.UTF-8 | en_US.UTF-8 |

If I order by a column with Thai values, the sort order is not correct.

What settings should I change to have PostgreSQL sort correctly?

Best Answer

On PostgreSQL 9.1 and newer, you can use the COLLATE qualifier on an operation to override the database's default collation. See the manual for information on collation support.

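E.g. (the collation name is an identifier, so it takes double quotes, not single quotes; the names actually available depend on the operating system's locales and are listed in pg_collation):

SELECT a, b FROM mytable ORDER BY c COLLATE "th_TH.UTF-8";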

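Note that PostgreSQL cannot choose a collation dynamically per value based on detected language. Each comparison or sort uses exactly one collation, so Thai and English strings in the same column are all ordered by the collation you name.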

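On versions before 9.1 you must use a single database-wide collation. So you'd have to dump your database, run CREATE DATABASE ... TEMPLATE template0 ENCODING 'UTF8' LC_COLLATE 'th_TH.UTF-8', and reload. (TEMPLATE template0 is needed because the default template only accepts its own collation.)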

LC_COLLATE

LC_COLLATE affects comparisons between strings. In practice, the most visible effect is the sort order. LC_COLLATE='C' (or POSIX which is a synonym) means that it's the byte order that drives comparisons, whereas a locale in the language_REGION form means that cultural rules will drive the comparisons.

An example with French names, executed from inside a UTF-8 database:

select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris'))
 AS l(firstname)
order by firstname collate "fr_FR";

Result:

 firstname 
-----------
 béatrice
 bérénice
 bernard
 boris

béatrice comes before boris because the accented é compares against o as if it were unaccented. It's a cultural rule.
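The effect of such a rule can be roughly imitated outside PostgreSQL. The Python sketch below is only an approximation and not how the fr_FR collation is actually implemented: it assumes that comparing with accents removed, and using the original string as a tie-breaker, is close enough for these four names. The helper name strip_accents is made up for this illustration.

```python
import unicodedata

def strip_accents(s):
    # Decompose to NFD, then drop the combining marks (the accents).
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["bernard", "b\u00e9r\u00e9nice", "b\u00e9atrice", "boris"]

# Primary key: the unaccented form; tie-breaker: the original string.
approx_cultural = sorted(names, key=lambda s: (strip_accents(s), s))
print(approx_cultural)
```

For these names it yields the same order as the fr_FR query above; a real collation has more levels and rules than this.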

This differs from what happens with a C locale:

select firstname from (values ('bernard'), ('bérénice'), ('béatrice'), ('boris')) 
 AS l(firstname)
order by firstname collate "C";

Result:

 firstname 
-----------
 bernard
 boris
 béatrice
 bérénice

Now the names with an accented é are pushed to the end of the list. The byte representation of é in UTF-8 is the hexadecimal C3 A9, and for o it's 6F. C3 is greater than 6F, so under the C locale, 'béatrice' > 'boris'.
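The byte-order argument can be checked outside the database. This is a small Python sketch, assuming the strings are stored as precomposed (NFC) UTF-8; it sorts by raw encoded bytes, which is what a C collation amounts to.

```python
names = ["bernard", "b\u00e9r\u00e9nice", "b\u00e9atrice", "boris"]

# UTF-8 bytes of the two characters being compared.
print("\u00e9".encode("utf-8").hex(), "o".encode("utf-8").hex())  # c3a9 6f

# Sorting on the encoded bytes mimics a plain byte-wise comparison.
byte_order = sorted(names, key=lambda s: s.encode("utf-8"))
print(byte_order)
```

The resulting order matches the collate "C" query: bernard, boris, béatrice, bérénice.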

It's not just accents. There are more complex rules for hyphenation, punctuation, and unusual characters like œ. Odd-looking cultural rules are to be expected in every locale.

Now if the strings to compare mix different languages, as with a firstname column for people from all over the world, it may be that no particular locale should dominate anyway, because the alphabets of different languages were not designed to be sorted against each other.

In this case C is a rational choice, and it has the advantage of being faster, because nothing can beat pure byte comparisons.

LC_CTYPE

Having LC_CTYPE set to 'C' implies that C functions like isupper(c) or tolower(c) give expected results only for characters in the US-ASCII range (that is, up to codepoint 0x7F in Unicode).

Because SQL functions like upper(), lower() or initcap() are implemented in Postgres on top of these libc functions, they're affected by this as soon as strings contain non-US-ASCII characters.

Example:

test=> show lc_ctype;
  lc_ctype   
-------------
 fr_FR.UTF-8
(1 row)

-- Good result
test=> select initcap('élysée');
 initcap 
---------
 Élysée
(1 row)

-- Wrong result
-- collate "C" is the same as if the db has been created with lc_ctype='C'
test=> select initcap('élysée' collate "C");
 initcap 
---------
 éLyséE
(1 row)

For the C locale, é is treated as an uncategorizable character.
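To see where 'éLyséE' comes from, here is a rough Python emulation. It is not PostgreSQL's actual code; the function initcap and the predicate ascii_alnum are hypothetical names for this sketch. The assumption is that a word starts after any character the classifier does not recognise as alphanumeric, which is how é ends up acting as a word separator under ASCII-only rules.

```python
def initcap(text, is_alnum):
    # Upper-case the first alphanumeric character of each word and
    # lower-case the rest; anything else is copied and ends the word.
    out = []
    in_word = False
    for ch in text:
        if is_alnum(ch):
            out.append(ch.lower() if in_word else ch.upper())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    return "".join(out)

def ascii_alnum(ch):
    # ASCII-only classification, similar in spirit to lc_ctype='C'.
    return ch.isascii() and ch.isalnum()

word = "\u00e9lys\u00e9e"
print(initcap(word, str.isalnum))  # Unicode-aware classification
print(initcap(word, ascii_alnum))  # ASCII-only classification
```

With the Unicode-aware predicate the result is Élysée; with the ASCII-only one, each é breaks the word and the following letter is capitalised.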

Similarly wrong results are also obtained with regular expressions:

test=> select 'élysée' ~ '^\w+$';
 ?column? 
----------
 t
(1 row)

test=> select 'élysée' COLLATE "C" ~ '^\w+$';
 ?column? 
----------
 f
(1 row)
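Python's re module has a comparable switch, which makes the difference easy to reproduce without a database. This is only an analogy: the re.ASCII flag restricts \w to ASCII characters, much as a C lc_ctype does for the regular expressions above.

```python
import re

word = "\u00e9lys\u00e9e"

# Default (Unicode) matching: \w includes accented letters.
unicode_match = re.fullmatch(r"\w+", word) is not None

# ASCII-only matching: the accented letter is no longer a word character.
ascii_match = re.fullmatch(r"\w+", word, flags=re.ASCII) is not None

print(unicode_match, ascii_match)  # True False
```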

MySQL Cluster node utf8

Show us (1) how you use utf8 in the client(s), (2) how you established the charset for the connection from client to server, and (3) the output of SHOW CREATE TABLE.

If you left out any one of those, then that is likely to be the problem.

Do not use utf16, only use utf8. (Unless you have some good argument to the contrary.)

If you have INSERTed some data, let's see SELECT col, HEX(col)... to see if it was garbled on the way in.
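As a rough guide to reading that HEX() output, the Python sketch below shows what a single é looks like when stored correctly as UTF-8 and when it has been "double encoded". It assumes the common failure mode where UTF-8 bytes are misread as latin1 and converted to UTF-8 again; other kinds of garbling produce different patterns.

```python
original = "\u00e9"

# Correctly stored: the UTF-8 encoding of the character.
correct_hex = original.encode("utf-8").hex().upper()

# Double encoding: the UTF-8 bytes are misread as latin1, then re-encoded.
double_encoded = original.encode("utf-8").decode("latin-1").encode("utf-8")
garbled_hex = double_encoded.hex().upper()

print(correct_hex, garbled_hex)  # C3A9 C383C2A9
```

Seeing C3A9 for é suggests the data went in cleanly; seeing C383C2A9 suggests it was converted twice on the way in.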
