PostgreSQL – How to Use to_tsquery and to_tsvector

Tags: full-text-search, postgresql

I was quite surprised to find that TO_TSVECTOR('english', '平仮名ひらがな') results in a vector of '平仮名ひらがな':1. Since the system omitted every non-alphanumeric character and emoji I threw at it, I expected it to omit all foreign characters as well; i.e., the desired behavior is to index only alphanumeric characters.
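
For reference, here is the exact call and its result:

SELECT TO_TSVECTOR('english', '平仮名ひらがな');
-- '平仮名ひらがな':1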

How does it decide which characters to omit (is it a whitelist/blacklist, a range of character codes, etc.)? I don't really mind it automatically converting accented characters to their plain alphanumeric counterparts, but I would like to efficiently omit all characters that don't have an alphanumeric representation.


Edit: based on jjane's answer, this is the final solution I'm going with:

CREATE TEXT SEARCH CONFIGURATION basic (parser = pg_catalog.default);
ALTER TEXT SEARCH CONFIGURATION basic ADD MAPPING FOR asciiword, asciihword, hword_asciipart WITH english_stem;
-- optionally: ALTER TEXT SEARCH CONFIGURATION basic ADD MAPPING FOR email, host WITH english_stem;
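
A quick sanity check of the new configuration (same input as in the accepted answer below, which shows the same result for its equivalent ascii_only configuration):

SELECT to_tsvector('basic', '平仮名ひらがな dogs');
-- 'dog':1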

Best Answer

They are alphanumeric in some alphabet, just not the one you want to use.

You can alter your text search configuration to drop the token types you do not want. You should probably copy the configuration and alter the copy, rather than altering the built-in one in place.

CREATE TEXT SEARCH CONFIGURATION ascii_only ( COPY = pg_catalog.english );
ALTER TEXT SEARCH CONFIGURATION ascii_only DROP MAPPING FOR word;
SELECT * FROM to_tsvector('ascii_only', '平仮名ひらがな dogs');

Yields:

"'dog':1"

This is just a minimal example; you will probably need to drop the mappings for other token types as well, chosen from the default parser's list of token types (you can query the full list, as shown below). You might not be able to get exactly what you want: for example, there is no asciinumword type, so there is no way to keep 'dog1' but reject '平仮名ひらがな1'. In that case you would have to implement your own non-default parser, or use functions to pre-filter the text (hopefully making them inefficient is optional; we might be able to help with that).
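
The full list of token types the default parser emits can be queried directly with ts_token_type:

SELECT alias, description FROM ts_token_type('default');
-- abridged output:
--  asciiword | Word, all ASCII
--  word      | Word, all letters
--  numword   | Word, letters and digits
--  ...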

You can use ts_debug to see what is going on. A word consisting entirely of ASCII letters [a-zA-Z] is tokenized as 'asciiword', not as 'word', so a better name for 'word' might be 'word excluding asciiword'. By dropping the mapping for 'word', then, you are excluding words other than asciiwords. This means dropping not only words that are entirely foreign, but also words containing accented characters. It sounded like you were OK with either dropping the accent or dropping the entire word, but maybe not.
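
For example (selecting just the two relevant columns and filtering out the blank tokens between words):

SELECT alias, token FROM ts_debug('english', 'dogs café')
WHERE alias <> 'blank';
--    alias   | token
-- -----------+-------
--  asciiword | dogs
--  word      | café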

But something with a mixture of letters and numbers is tokenized as neither 'word' nor 'asciiword', but rather as 'numword', so it would not be excluded unless you also dropped 'numword' in addition to 'word'.
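
Continuing the ascii_only example above, dropping numword as well would look like this (note that, per the caveat above, it throws out 'dog1' along with the foreign token):

ALTER TEXT SEARCH CONFIGURATION ascii_only DROP MAPPING FOR numword;
SELECT * FROM to_tsvector('ascii_only', 'dog1 平仮名ひらがな1 dogs');
-- both 'dog1' and '平仮名ひらがな1' are numword tokens, so neither is indexed;
-- only 'dog' (from 'dogs') remains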

If these mapping changes don't do what you want and you need to use a function, I don't know precisely what you want the function to do, so it is hard to give an example. If you go that route, implement the function that does what you want, and then find it is not efficient enough, you can start a new post on that topic.
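
Purely as a sketch of the function route, assuming (hypothetically) that the goal is to strip every character outside [a-zA-Z0-9] before indexing, a regexp pre-filter could look like this; note that it reduces accented words such as 'café' to 'caf', which may or may not be acceptable:

SELECT to_tsvector('english',
       regexp_replace('平仮名ひらがな dog1 café', '[^a-zA-Z0-9 ]', '', 'g'));
-- the Japanese word is removed before parsing; 'dog1' and 'caf' are what get indexed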