Postgresql – Order by exact matches (jsonb array), then lexeme similarity

full-text-search json postgresql

Say I have a table like this in Postgres 9.5:

CREATE TABLE public.posts
(
    content text,
    tags jsonb
);

I'd like to design a single query that:

Finds results based on exact tag OR matches (tags ?| array['tag1','tag2']) AND on matched free-form text (to_tsvector(content) @@ plainto_tsquery('some phrase'))
Orders by tag matches first, based on the number of matches: if row A has [apple, orange] and row B has just [apple], then a search for array['orange', 'apple'] would rank row A higher than row B (but both would be returned)
Orders by content second, based on the weight/similarity of the returned result. So a search for the keywords 'french hello' would rank a row whose content is "how do I say 'hello' in French?" higher than "what's the weather like in the French Riviera?"

How would I go about combining the above in a single query, so that both exact tags matches and/or fuzzy content matches yield results using the weighting above?

Best Answer

Check whether this is what you're looking for. (I didn't fully understand your first OR / AND condition, so I assumed it was just an OR.)

WITH posts AS
(
SELECT 
    * 
FROM
    (VALUES
        ('how do I say ''hello'' in French?', '{"orange":1, "apple":2}'::jsonb),
        ('what''s the weather like in the French Riviera?', '{"peach":3, "lemon":4}'::jsonb),
        ('awful weather in England', '{"peach":5, "lemon":6}'::jsonb),
        ('awful weather in England', '{"pineapple":5, "strawberry":6}'::jsonb),
        ('doubtful french fries', '{"blueberry":5, "pear":6}'::jsonb),
        ('the rain, in Spain, is mainly in the plain', '{"melon":7, "watermelon":8, "banana":9}'::jsonb)
    ) AS posts(content, tags)
) 

SELECT 
    *, 
    /* Use ts_rank to compare level of full text search coincidence */
    ts_rank(to_tsvector(content), plainto_tsquery('french') ||
          plainto_tsquery('hello')) AS rank,
    /* Subquery to count number of tag matches */
    (SELECT 
          count(case when tags ? a then 1 end) 
     FROM 
          unnest(array['melon', 'banana', 'lemon']) AS a
    ) AS number_of_matching_tags
FROM 
    posts 
WHERE
    /* Check for any of the tags */
    tags ?| array['melon', 'banana', 'lemon']
    OR
    /* Check for any of the search terms. You have to || tsqueries */
    to_tsvector(content) @@ 
        (plainto_tsquery('french') || plainto_tsquery('hello'))
ORDER BY
    number_of_matching_tags desc nulls last,
    rank desc ;

(The number_of_matching_tags and rank columns are included only to make the results easier to read.)
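
The sample data above stores tags as jsonb objects, while the question mentions a jsonb array. The ? and ?| operators also match string elements of a top-level array, so the same approach should carry over. A minimal sketch, assuming made-up sample rows and the 'english' text search configuration (neither comes from the question):

```sql
-- Sketch only: same ranking idea, with tags stored as jsonb arrays.
-- The sample rows and the 'english' configuration are assumptions.
WITH posts(content, tags) AS (
    VALUES
        ('how do I say ''hello'' in French?', '["orange", "apple"]'::jsonb),
        ('french apple pie',                  '["apple"]'::jsonb),
        ('awful weather in England',          '["peach"]'::jsonb)
)
SELECT
    content,
    tags,
    /* Count how many of the searched tags appear in the array */
    (SELECT count(*)
     FROM   unnest(array['orange', 'apple']) AS t(tag)
     WHERE  p.tags ? t.tag) AS number_of_matching_tags,
    /* Rank the text match; || combines the two tsqueries with OR */
    ts_rank(to_tsvector('english', content),
            plainto_tsquery('english', 'french') ||
            plainto_tsquery('english', 'hello')) AS rank
FROM   posts AS p
WHERE  tags ?| array['orange', 'apple']
   OR  to_tsvector('english', content) @@
       (plainto_tsquery('english', 'french') || plainto_tsquery('english', 'hello'))
ORDER  BY number_of_matching_tags DESC, rank DESC;
```

With these rows, the post tagged [orange, apple] sorts above the one tagged only [apple], and the third row is filtered out because it matches neither a tag nor a search term.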

Postgresql – Optimize a trigram search with custom sort order

Test setup

Based on your table definition and example:

CREATE TABLE search (id int PRIMARY KEY, search_on text, comment text);
    
INSERT INTO search (id, search_on, comment) VALUES
   ( 1, 'abc123456789', 'leading')
 , ( 2, '123abc456789', 'nested')
 , ( 3, '123456789abc', 'trailing')
 , ( 4, 'abc123abc456', 'leading, nested 1x')
 , ( 5, '123abc456abc', 'trailing,nested 1x')
 , ( 6, 'abcabcabc123', 'leading, nested 2x')
 , ( 7, '123abcabcabc', 'trailing nested 2x')
 , ( 8, '1abcabcabc23', 'nested 3x')
 , (10, 'abc12'       , 'leading short')
 , (11, '12abc'       , 'trailing short')
 , (12, '1abc2'       , 'nested short');

CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- provides gist_trgm_ops and the % and <-> operators

CREATE INDEX index_search_search_on ON search USING gist (search_on gist_trgm_ops);

I'm not using your odd type bpchar (blank-padded character) for id, and I suggest you don't either. text or varchar should serve better:

Any downsides of using data type “text” for storing strings?

Queries

We need a low threshold for the demo:

SET pg_trgm.similarity_threshold = .01;  -- show weak matches, too

Demonstrating the built-in bias in your favor:

SELECT *, search_on <-> 'abc' AS distance
FROM   search
WHERE  search_on % 'abc'
ORDER  BY search_on <-> 'abc';

id | search_on    | comment            | distance
-: | :----------- | :----------------- | :-------
10 | abc12        | leading short      | 0.571429
 6 | abcabcabc123 | leading, nested 2x | 0.7     
11 | 12abc        | trailing short     | 0.75    
 4 | abc123abc456 | leading, nested 1x | 0.769231
 1 | abc123456789 | leading            | 0.785714
 7 | 123abcabcabc | trailing nested 2x | 0.818182
 5 | 123abc456abc | trailing,nested 1x | 0.857143
 3 | 123456789abc | trailing           | 0.866667
12 | 1abc2        | nested short       | 0.888889
 8 | 1abcabcabc23 | nested 3x          | 0.916667
 2 | 123abc456789 | nested             | 0.9375

As you can see, leading matches have more weight. But it's still just a relative bias.
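
One way to see where the bias likely comes from is to look at the trigrams pg_trgm extracts. As far as I know, each word is padded with two spaces in front and one space behind before trigrams are taken, so the start of a string contributes extra trigrams that a nested occurrence does not. A small sketch using show_trgm() from the pg_trgm extension:

```sql
-- Sketch: inspect the trigrams behind the similarity scores.
-- Requires the pg_trgm extension from the test setup.
SELECT show_trgm('abc')   AS search_term_trigrams
     , show_trgm('abc12') AS leading_match_trigrams
     , show_trgm('1abc2') AS nested_match_trigrams;
```

The leading match shares the space-padded trigrams at the start of the search term, while the nested match only shares the inner trigram, which is consistent with the distances in the table above.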

To make this absolute:

... at the beginning of my column's text, I'd want that factored into the ordering to come higher ...

SELECT *, search_on <-> 'abc' AS distance, search_on ILIKE 'abc%' AS prefix_match
FROM   search
WHERE  search_on % 'abc'
ORDER  BY search_on NOT ILIKE 'abc%'  -- prefix matches first
        , search_on <-> 'abc';        -- then sort by distance

id | search_on    | comment            | distance | prefix_match
-: | :----------- | :----------------- | :------- | :-----------
10 | abc12        | leading short      | 0.571429 | t           
 6 | abcabcabc123 | leading, nested 2x | 0.7      | t           
 4 | abc123abc456 | leading, nested 1x | 0.769231 | t           
 1 | abc123456789 | leading            | 0.785714 | t           
11 | 12abc        | trailing short     | 0.75     | f           
 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f           
 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f           
 3 | 123456789abc | trailing           | 0.866667 | f           
12 | 1abc2        | nested short       | 0.888889 | f           
 8 | 1abcabcabc23 | nested 3x          | 0.916667 | f           
 2 | 123abc456789 | nested             | 0.9375   | f

I chose the expression search_on NOT ILIKE 'abc%' to still sort NULL values last. Equivalent: search_on ILIKE 'abc%' DESC NULLS LAST. Related:
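
To illustrate the NULL handling, here is a minimal sketch with three made-up inline rows (not part of the test table). The boolean sort key is false for prefix matches, true for other rows, and NULL when search_on is NULL:

```sql
-- Sketch: how the ORDER BY expression treats a NULL search_on.
-- The three inline rows are invented for illustration.
WITH demo(search_on) AS (
    VALUES ('abc12'), ('12abc'), (NULL)
)
SELECT search_on
     , search_on NOT ILIKE 'abc%' AS sort_key  -- false, true, NULL
FROM   demo
ORDER  BY search_on NOT ILIKE 'abc%';          -- false < true; NULLs last by default in ASC
```

Swapping the last line for ORDER BY search_on ILIKE 'abc%' DESC NULLS LAST should give the same row order.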

PostgreSQL sort by datetime asc, null first?

You could sort trailing matches in a similar fashion or combine both:

...
ORDER  BY search_on NOT ILIKE 'abc%'  -- prefix matches first
        , search_on NOT ILIKE '%abc'  -- suffix matches next
        , search_on <-> 'abc';        -- then sort by distance


BTW 1: Full Text Search also supports prefix matching.
BTW 2: The "C" collation COLLATE "C" allows plain btree index support for prefix matches.
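
Both remarks can be sketched against the test table above. The index name is hypothetical, and the 'simple' text search configuration is my assumption so that no stemming interferes:

```sql
-- BTW 1: full text search expresses prefix matching with :* in to_tsquery().
SELECT to_tsvector('simple', 'abcdef 123') @@ to_tsquery('simple', 'abc:*') AS fts_prefix_match;

-- BTW 2: a plain btree index with the "C" collation can support left-anchored LIKE.
-- The index name is hypothetical.
CREATE INDEX search_search_on_c_idx ON search (search_on COLLATE "C");

SELECT *
FROM   search
WHERE  search_on COLLATE "C" LIKE 'abc%';  -- case-sensitive LIKE, not ILIKE
```

Note that this applies to case-sensitive LIKE with a left-anchored pattern; the ILIKE expressions used for sorting above would not benefit from such an index.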
