Postgresql – Supporting non-ascii characters in Postgres full text (english)

full-text-searchpostgresqlunicode

We're implementing Postgres full-text search. While the bulk of the text is normal English, there are a number of place names which include diacriticals (ñ,á,ó, etc)

More specifically, there are words like cañon — I want to make sure that users who type canon also get matches for cañon.

Is there specific support for this? Or is the solution simply to manually manage the specific special cases using a thesaurus?

Best Answer

You can use unaccent extension, either by pre-processing the text with unaccent function, or creating your own text search configuration. For instance (based on the example on unaccent's doc):

CREATE TEXT SEARCH CONFIGURATION my_conf ( COPY = english );
ALTER TEXT SEARCH CONFIGURATION my_conf
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, english_stem;

So you can match it:

mydb=> SELECT to_tsvector('english','cañon'), to_tsvector('my_conf', 'cañon');
 to_tsvector | to_tsvector 
-------------+-------------
 'cañon':1   | 'canon':1
(1 row)

mydb=> SELECT
mydb->     to_tsvector('english','cañon') @@ to_tsquery('english', 'cañon'),
mydb->     to_tsvector('english','canon') @@ to_tsquery('english', 'canon'),
mydb->     to_tsvector('english','cañon') @@ to_tsquery('english', 'canon'),
mydb->     to_tsvector('english','canon') @@ to_tsquery('english', 'cañon'),
mydb->     to_tsvector('my_conf','cañon') @@ to_tsquery('my_conf', 'cañon'),
mydb->     to_tsvector('my_conf','canon') @@ to_tsquery('my_conf', 'canon'),
mydb->     to_tsvector('my_conf', 'cañon') @@ to_tsquery('my_conf', 'canon'),
mydb->     to_tsvector('my_conf', 'canon') @@ to_tsquery('my_conf', 'cañon');
 ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? | ?column? 
----------+----------+----------+----------+----------+----------+----------+----------
 t        | t        | f        | f        | t        | t        | t        | t
(1 row)

See the last query, I'm matching combinations of cañon and canon, when using english configuration (first 4 columns) will only match if both are the same (first 2 ones), but my_conf configuration (last 4 columns) will match in all of them.