PostgreSQL – Equivalent of UTF8_UNICODE_CI Collation

Tags: case-sensitive, collation, encoding, pattern-matching, postgresql

I would like a column in a table inside a PostgreSQL database (I am using version 9.6) to be matched case-insensitively. I know of the UTF8_UNICODE_CI collation in MySQL, so I tried:

CREATE TABLE thing (
    id    BIGINT PRIMARY KEY
   ,name  VARCHAR(120) NOT NULL COLLATE "UTF8_UNICODE_CI"
);

but I get:

ERROR: collation "UTF8_UNICODE_CI" for encoding "UTF8" does not exist

Reading around, I found that the pg_collation table displays collations, which showed:

=# SELECT * from pg_collation;
 collname | collnamespace | collowner | collencoding | collcollate | collctype
----------+---------------+-----------+--------------+-------------+-----------
 default  |            11 |        10 |           -1 |             |
 C        |            11 |        10 |           -1 | C           | C
 POSIX    |            11 |        10 |           -1 | POSIX       | POSIX
(3 rows)

So it's clearly not there… Is there any way to add it, or obtain the behavior I want? (I want a search WHERE name LIKE '%lala%' to match 'lalá', 'LÂLÄ', etc.)

Best Answer

In Postgres, you would typically solve this differently.

For starters, just use ILIKE for case insensitive matching. The manual:

The key word ILIKE can be used instead of LIKE to make the match case-insensitive according to the active locale. This is not in the SQL standard but is a PostgreSQL extension.

WHERE name ILIKE '%lala%'

You do not need a different collation for this. Trigram indexes support both case-sensitive and case-insensitive pattern matching.

Or use the more portable lower(col) LIKE:

WHERE lower(name) LIKE '%lala%'  -- pattern must be lower case, too.

Generic Ruby solution for SQLite3 “LIKE” or PostgreSQL “ILIKE”?

Or you could use the special data type citext provided by the additional module citext:

Index on column with data type citext not used
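As a rough sketch of the citext route, assuming the citext contrib module can be installed on the server and reusing the thing table from the question (the sample row is made up for illustration):

```sql
-- Assumes the contrib module citext is available on the server.
CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE thing (
    id    bigint PRIMARY KEY
  , name  citext NOT NULL   -- comparisons on this column ignore case
);

-- Hypothetical sample row, for illustration only.
INSERT INTO thing (id, name) VALUES (1, 'LALA Land');

-- With citext, plain = and LIKE compare case-insensitively,
-- so both queries find the sample row.
SELECT id, name FROM thing WHERE name = 'lala land';
SELECT id, name FROM thing WHERE name LIKE '%lala%';
```

Note that citext only folds case. It does not ignore accents, and it has no length modifier like VARCHAR(120), so a length limit would need a separate CHECK constraint.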

If you insist on a special collation, you can create your own:

Add a new collation to a Postgres database

I would not do that.

If you want to take string "normalization" one step further (remove all diacritic signs), look to unaccent():

Does PostgreSQL support “accent insensitive” collations?

You can combine both; instructions are in the linked answer. More here:

PostgreSQL accent + case insensitive search
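A minimal sketch of combining unaccent() with ILIKE, assuming the unaccent and pg_trgm extensions are installed in schema public and the thing table from the question exists. The function name f_unaccent and the index name are hypothetical, chosen for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Ad hoc: strip diacritics on both sides, then match case-insensitively.
SELECT id, name
FROM   thing
WHERE  unaccent(name) ILIKE unaccent('%lala%');

-- unaccent() is only STABLE, so it cannot be used in an index expression.
-- Hypothetical IMMUTABLE wrapper; assumes unaccent lives in schema public.
CREATE OR REPLACE FUNCTION f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE AS
$func$
SELECT public.unaccent('public.unaccent', $1)
$func$;

-- Trigram index on the unaccented expression.
CREATE INDEX thing_name_unaccent_trgm_idx
ON thing USING gin (f_unaccent(name) gin_trgm_ops);

-- The query must use the same expression for the index to be considered.
SELECT id, name
FROM   thing
WHERE  f_unaccent(name) ILIKE f_unaccent('%lala%');
```

Declaring the wrapper IMMUTABLE is a promise to Postgres: if the unaccent dictionary is changed later, the index would need to be rebuilt.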

Overview:

Pattern matching with LIKE, SIMILAR TO or regular expressions in PostgreSQL

I would just use ILIKE and support it with a trigram index for big tables. Preferably use a current version: Postgres 9.6 has several improvements for trigram GIN indexes.
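For illustration, a trigram GIN index for the thing table from the question might look like this. It assumes the pg_trgm contrib module is available; the index name is hypothetical:

```sql
-- pg_trgm provides the gin_trgm_ops operator class.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX thing_name_trgm_idx
ON thing USING gin (name gin_trgm_ops);

-- The same index can serve LIKE and ILIKE, including patterns that are
-- not left-anchored. EXPLAIN shows whether the planner picks it.
EXPLAIN
SELECT id, name
FROM   thing
WHERE  name ILIKE '%lala%';
```

On small tables the planner may still prefer a sequential scan, which is why the index mainly pays off for big tables.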

Related Solutions

Postgresql – PL/pgSQL – Indexes vs Collation vs Pattern Ops

Consider using the citext data type available in contrib; you no longer have to deal with the upper / lower nightmare. For regular expressions, you can use a GiST index with gist_trgm_ops, which speeds up regular expression matches nicely.
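A sketch of the GiST variant, assuming pg_trgm is installed and a hypothetical table thing with a text column name:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GiST index using the trigram operator class (index name is hypothetical).
CREATE INDEX thing_name_gist_trgm_idx
ON thing USING gist (name gist_trgm_ops);

-- Case-insensitive regular expression match that the index can support.
SELECT id, name
FROM   thing
WHERE  name ~* 'l[aá]l[aá]';
```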

SQL Server Collation – Using COLLATE: When to Use and When Not to Use

If you are going to use custom collations for specific databases then yes, you'll need to make the collations match whenever you are joining or unioning data from the two databases.

In fact you will need to do this with many metadata queries anyway. Just look at catalog views like sys.tables:

SELECT c.name, c.collation_name
FROM sys.all_columns AS c
INNER JOIN sys.all_views AS v
ON c.[object_id] = v.[object_id]
INNER JOIN sys.schemas AS s
ON v.[schema_id] = s.[schema_id]
WHERE s.name = N'sys' AND v.name = N'tables'
AND c.collation_name IS NOT NULL;

Results:

name                    collation_name
--------------------    ----------------------------
name                    SQL_Latin1_General_CP1_CI_AS
type                    Latin1_General_CI_AS_KS_WS
type_desc               Latin1_General_CI_AS_KS_WS
lock_escalation_desc    Latin1_General_CI_AS_KS_WS
durability_desc         Latin1_General_CI_AS_KS_WS

So here we have columns with two different collations inside a single system object. No database can have a DATABASE_DEFAULT that matches both...

You don't have any control over the collation of your customer's columns or databases, or the server collation. So really the only ways to resolve the conflict are to:

use a method that doesn't hard-code a specific collation (like DATABASE_DEFAULT)
hard-code some specific, compatible collation on both sides

Since the latter is more work, makes for more complex queries, and introduces more opportunities to turn seeks into scans, I think the former is really your best option.
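The first option could look roughly like this in T-SQL. The tables dbo.Customers and VendorDb.dbo.Accounts are hypothetical names used only for illustration:

```sql
-- Coerce both sides to the current database's default collation,
-- rather than hard-coding a collation name.
SELECT c.name
FROM   dbo.Customers AS c
INNER JOIN VendorDb.dbo.Accounts AS a
        ON c.name COLLATE DATABASE_DEFAULT = a.name COLLATE DATABASE_DEFAULT;
```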
