Postgresql – finding keywords in text using full text search in PostgreSQL

full-text-searchpostgresql

I'd like to determine if a text input contains any keywords from a known list, but also show which ones are found.

E.g with Keywords: ['fox', 'jump', 'dog', 'foo', 'bar']

'the quick brown foxes jumped over the lazy dogs'

yields

['fox','jump','dog']

I appreciate that I can do boolean matching, but I can't work out if I can return the matched keywords, as opposed to a highlighted input snippet.

I'd like to use FTS so i can benefit from finding some synonyms etc.

Is this possible in Postgres?

Best Answer

You have two components of this problem

  1. getting your array to a ts_query,
  2. getting the keywords highlighted
  3. getting the keywords out of the returned string

Now first and foremost, this is FTS. All words in the search string get tokenized as a ts_query. From that point on the original search string is lost. You no longer have words you have lexemes -- tokens which have been stemmed and normalized.

Getting your array to a ts_query,

Essentially, a ts_query looks like, 'fox' & 'jump' & 'dog' & 'foo' & 'bar' You should review plainto_tsquery which does slightly more voodoo to get text to a tsquery, and to_tsquery which assumes the string is already properly formatted.

To get the array to a ts_query, you can do something like this

SELECT to_tsquery(array_to_string(myarray, ' & '))
FROM ( VALUES (ARRAY['fox','jump','dog','foo','bar']) )
  AS t(myarray);

Highlighting

To get highlighting working you need something like this,

SELECT ts_headline(
  'quick brown foxes jumped over the lazy dogs',
  to_tsquery(array_to_string(myarray, ' & '))
)
FROM ( VALUES (ARRAY['fox','jump','dog','foo','bar']) )
  AS t(myarray);

That returns

                           ts_headline                            
------------------------------------------------------------------
 quick brown <b>foxes</b> <b>jumped</b> over the lazy <b>dogs</b>
(1 row)

There are quite a few options on ts_headline. All of the docs for this are found in Controlling Text Search

Getting out the words

The only other thing you can do here is to pull out the words which triggered those tokens

SELECT array_agg(rm[1]) AS WORDS
FROM ( 
  SELECT ts_headline(
    'quick brown foxes jumped over the lazy dogs',
    to_tsquery(array_to_string(myarray, ' & '))
  ) 
  FROM ( VALUES (ARRAY['fox','jump','dog','foo','bar']) ) 
    AS t(myarray)
)
  AS t(s)
CROSS JOIN LATERAL regexp_matches(s,'<b>(.*?)</b>', 'g') AS rm(matches);

        words        
---------------------
 {foxes,jumped,dogs}
(1 row)