On my Mac install of PG:
=# select to_tsvector('english', 'abcd สวัสดี');
to_tsvector
-------------
'abcd':1
(1 row)
=# select * from ts_debug('hello สวัสดี');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-----------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
blank | Space symbols | สวัสดี | {} | |
(2 rows)
On my Linux install of PG:
=# select to_tsvector('english', 'abcd สวัสดี');
to_tsvector
-------------------
'abcd':1 'สวัสดี':2
(1 row)
=# select * from ts_debug('hello สวัสดี');
alias | description | token | dictionaries | dictionary | lexemes
-----------+-------------------+-------+----------------+--------------+---------
asciiword | Word, all ASCII | hello | {english_stem} | english_stem | {hello}
blank | Space symbols | | {} | |
word | Word, all letters | สวัสดี | {english_stem} | english_stem | {สวัสดี}
(3 rows)
So something is clearly different about how tokenisation is defined in the two installs. My question is: how do I figure out what is different, and how do I make my Mac install of PG behave like the Linux one?
On both installs:
# SHOW default_text_search_config;
default_text_search_config
----------------------------
pg_catalog.english
(1 row)
# show lc_ctype;
lc_ctype
-------------
en_US.UTF-8
(1 row)
So somehow this Mac install thinks that Thai letters are spaces. How do I debug this and fix the "Space symbols" definition here?
Interestingly, this install works with Armenian but falls over when we reach Hebrew:
=# select * from ts_debug('ԵԵԵ');
alias | description | token | dictionaries | dictionary | lexemes
-------+-------------------+-------+----------------+--------------+---------
word | Word, all letters | ԵԵԵ | {english_stem} | english_stem | {եեե}
(1 row)
=# select * from ts_debug('אאא');
alias | description | token | dictionaries | dictionary | lexemes
-------+---------------+-------+--------------+------------+---------
blank | Space symbols | אאא | {} | |
(1 row)
The only significant difference I can see is that one is compiled with clang and the other with gcc:
PostgreSQL 11.2 on x86_64-apple-darwin18.2.0, compiled by Apple LLVM version 10.0.0 (clang-1000.11.45.5), 64-bit
VS
PostgreSQL 11.2 (Ubuntu 11.2-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0, 64-bit
Best Answer
in both databases
Recreate the database on the Mac with a proper UTF-8 lc_ctype.
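As a rough sketch of what that could look like: the first query lists the encoding and locale settings stored per database, so the two installs can be compared side by side; the second creates a fresh database with an explicit LC_CTYPE. The database name `mydb_utf8` is hypothetical, and `en_US.UTF-8` is only a placeholder for whichever locale name your operating system actually provides and that classifies these characters correctly; the answer does not say which value that is on macOS.

```sql
-- Compare per-database encoding and locale settings on both installs.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database;

-- Recreate with an explicit locale. template0 is required when the
-- locale or encoding differs from the one used by template1.
-- ASSUMPTION: "mydb_utf8" is a made-up name and 'en_US.UTF-8' is a
-- placeholder locale; substitute values valid on your system.
CREATE DATABASE mydb_utf8
    TEMPLATE template0
    ENCODING 'UTF8'
    LC_COLLATE 'en_US.UTF-8'
    LC_CTYPE 'en_US.UTF-8';
```

After restoring the data into the new database, rerunning the `ts_debug` calls from the question should show whether the Thai and Hebrew tokens are now classified as `word` rather than `blank`.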
Related: Postgres to_tsvector is case sensitive on mac os