Postgresql – Full text search in PostgreSQL with search term including unrelated hyphen and ending

full-text-searchpattern matchingpostgresql

I've a problem in my DB (PGSQL 9.6) that I've not been able to resolve and I don't really even understand what options I might have (if any). Simplified I have a DB of multi-language news articles where I've created full text search index (tsvector) on headlines and text. I need to search through the DB to find articles with a certain game name.

Example of an index that would match what I do:

SELECT to_tsvector('simple', 'Yakumo Shionome featured in new Code Vein-trailer');

This results in the following full text search index:

'code':6 'features':3 'in':4 'new':5 'shionome':2 'trailer':9 'vein':8
'vein-trailer':7 'yakumo':1

The challenge come when I want to search for "Code Vein" as game title. In this case I would normally use "phraseto_tsquery". A raw test query could be similar to this:

SELECT to_tsvector('simple','Yakumo Shionome featured in new Code Vein-trailer') 
  @@ phraseto_tsquery('simple','Code Vein');

This would result in FALSE in this case. I would have to search for "Code Vein-Trailer" to get a TRUE match. So the "hyphen" followed by "some word" in essence is causing me trouble and I don't really get what options I might have as far as creating a way around this (there can be many different ending words, many I also might not know ahead of time).

Normally the fact that my Boolean searches are word boundary based is good as it avoid potential false positives, but this seems to be a challenging drawback. I use "simple" due to multiple languages and that game names and elements I sometime search for included stop words – which I felt ended up being problematic or more prune to issues if not taking this approach. I also do require exact matches and not ranked results.

Any thoughts and help welcome.

Best Answer

The problem is that the token 'vein-trailer' gets emitted and is located before the bare token 'vein' is, so is between 'code' and 'vein'.

One way to fix this is to make a new text search configuration which doesn't emit hyphenated words:

create text search configuration simple2 (copy=simple);
alter text search configuration simple2 drop mapping for asciihword;

Then the following returns true:

select 
  to_tsvector('simple2','Yakumo Shionome featured in new Code Vein-trailer') 
  @@
  phraseto_tsquery('simple2','Code Vein');

Whether this will fix all your problems, or create new problems, is hard to tell based on your description and also hard to intuit, but it does fix the literal problem you presented. You probably need to do something with non-ascii hwords as well.