Postgresql – Order by exact matches (jsonb array), then lexeme similarity

full-text-searchjsonpostgresql

Say I have a table like this in Postgres 9.5:

CREATE TABLE public.posts
(
    content text,
    tags jsonb,
)

I'd like to design a single query that:

  • Finds results based on exact tag OR matches (tags ?! array['tag1','tag2']) AND on matched free-form text (to_tsvector(content) @@ plainto_tsquery('some phrase'))

  • Orders by tags matches first, based on the # of matches – i.e. if row A has [apple, orange] and row B has just [apple], then a search for array['orange', 'apple'] would yield row A higher than row B (but they'd both be returned)

  • Orders by content second, based on the weight/similarity of the returned result. So a search for the keyword 'french hello' would yield a row containing content of "how do I say 'hello' in French?' higher than 'what's the weather like in the French Riviera?'

How would I go about combining the above in a single query, so that both exact tags matches and/or fuzzy content matches yield results using the weighting above?

Best Answer

Check if this is what you're looking for [I didn't fully understand your first OR / AND condition. I assumed it was just an OR].

WITH posts AS
(
SELECT 
    * 
FROM
    (VALUES
        ('how do I say ''hello'' in French?', '{"orange":1, "apple":2}'::jsonb),
        ('what''s the weather like in the French Riviera?', '{"peach":3, "lemon":4}'::jsonb),
        ('awful weather in England', '{"peach":5, "lemon":6}'::jsonb),
        ('awful weather in England', '{"pineapple":5, "strawberry":6}'::jsonb),
        ('doubtful french fries', '{"blueberry":5, "pear":6}'::jsonb),
        ('the rain, in Spain, is mainly in the plain', '{"melon":7, "watermelon":8, "banana":9}'::jsonb)
    ) AS posts(content, tags)
) 

SELECT 
    *, 
    /* Use ts_rank to compare level of full text search coincidence */
    ts_rank(to_tsvector(content), plainto_tsquery('french') ||
          plainto_tsquery('hello')) AS rank,
    /* Subquery to count number of tag matches */
    (SELECT 
          count(case when tags ? a then 1 end) 
     FROM 
          unnest(array['melon', 'banana', 'lemon']) AS a
    ) AS number_of_matching_tags
FROM 
    posts 
WHERE
    /* Check for any of the tags */
    tags ?| array['melon', 'banana', 'lemon']
    OR
    /* Check for any of the search terms. You have to || tsqueries */
    to_tsvector(content) @@ 
        (plainto_tsquery('french') || plainto_tsquery('hello'))
ORDER BY
    number_of_matching_tags desc nulls last,
    rank desc ;

(The inclusion of number_of_matching_tags and rank columns is only to clarify results)