Postgresql – Why is the Mac install of Postgres not tokenising Thai words

full-text-searchpostgresql

On my Mac install of PG:

=# select to_tsvector('english', 'abcd สวัสดี');
 to_tsvector
-------------
 'abcd':1
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols   |  สวัสดี | {}             |              |
(2 rows)

On my Linux install of PG:

=# select to_tsvector('english', 'abcd สวัสดี');
    to_tsvector
-------------------
 'abcd':1 'สวัสดี':2
(1 row)

=# select * from ts_debug('hello สวัสดี');
   alias   |    description    | token |  dictionaries  |  dictionary  | lexemes
-----------+-------------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII   | hello | {english_stem} | english_stem | {hello}
 blank     | Space symbols     |       | {}             |              |
 word      | Word, all letters | สวัสดี  | {english_stem} | english_stem | {สวัสดี}
(3 rows)

So something is clearly different about the way the tokenisation is defined in PG. My question is, how do I figure out what is different and how do I make my mac install of PG work like the Linux one?

On both installs:

# SHOW default_text_search_config;
 default_text_search_config
----------------------------
 pg_catalog.english
(1 row)

# show lc_ctype;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

So somehow this mac install thinks that thai letters are spaces… how do I debug this and fix the "Space Symbol" definition here.

Interestingly this install works with Armenian, but falls over when we reach Hebrew

=# select * from ts_debug('ԵԵԵ');
 alias |    description    | token |  dictionaries  |  dictionary  | lexemes
-------+-------------------+-------+----------------+--------------+---------
 word  | Word, all letters | ԵԵԵ   | {english_stem} | english_stem | {եեե}
(1 row)

=# select * from ts_debug('אאא');
 alias |  description  | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
 blank | Space symbols | אאא   | {}           |            |
(1 row)

Only significant diff I am seeing is that one is compiled with clang and one with gcc

PostgreSQL 11.2 on x86_64-apple-darwin18.2.0, compiled by Apple LLVM version 10.0.0 (clang-1000.11.45.5), 64-bit

VS 

 PostgreSQL 11.2 (Ubuntu 11.2-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit

Best Answer

My question is, how do I figure out what is different

show lc_ctype;

in both databases

and how do I make my mac install of PG work like the Linux one?

Recreate the database on the mac with a proper utf-8 lc_ctype.

Related: Postgres to_tsvector is case sensitive on mac os