Postgresql – How to search hyphenated words in PostgreSQL full text search

full-text-searchpattern matchingpostgresql

I have to search for hyphenated words like 'good-morning', 'good-evening', etc.

My query is:

select id, ts_headline(content,
                       to_tsquery('english','good-morning'),
                       'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') 
from table 
where ts_content @@ to_tsquery('english','good-morning');

When executing this query I also get results of 'good' and 'morning' separately. But I want exactly matching words and fragments.
(For ts_content I used the same default config english to create the tsvector.)

How can I search such hyphenated words in PostgreSQL full text search?

Best Answer

The key word here is phrase search, introduced with Postgres 9.6.

Use the tsquery FOLLOWED BY operator <-> or one of the related <N> operators. Or better yet, use the function phraseto_tsquery() to generate your tsquery.
Quoting the manual, it ...

produces tsquery that searches for a phrase, ignoring punctuation

And:

phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <-> (FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting <N> operators rather than <-> operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order not just the presence of all the lexemes.

Your query would work like this:

select id
     , ts_headline(content, phraseto_tsquery('english', 'good-morning')
                          , 'HighlightAll=true MaxFragments=100 FragmentDelimiter=$') 
from   tbl 
where  ts_content @@ phraseto_tsquery('english','good-morning');

phraseto_tsquery('english', 'good-morning') generates this tsquery:

'good-morn' <-> 'good' <-> 'morn'

Since "good-morning" is identified as asciihword (hyphenated ASCII word), the stemmed complete word is added before the components. The manual:

It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component: (followed by an example)

to_tsvector() basically does the same on the other end, so everything matches up. This allows for fine-grained options with hyphenated words. The above only finds "good-morning" with a hyphen (or variants stemming to the same). To find all strings with "good" followed by "morn" (or variants stemming to the same) use phraseto_tsquery('english','good morning') generating this tsquery: 'good' <-> 'morn'

OTOH, you can enforce exact matches by adding another filter like:

...
AND content ~* 'good-morning'  -- case insensitive regexp match

Or:

...
AND content ILIKE '%good-morning%'

Seems a bit redundant to the human eye, but this way you get fast full text index support and exact matches.

The latter is mostly equivalent, but different (fewer) characters have special meaning in the LIKE pattern and might need escaping. Related:

Example to demonstrate the operator <N>:

phraseto_tsquery('english', 'Juliet and the Licks') generates this tsquery:

'juliet' <3> 'lick'

<3> meaning that lick must be the third lexeme after juliet.