PostgreSQL – Optimize Trigram Search with Custom Sort Order

nearest neighbororder-bypattern matchingpostgresqlpostgresql-9.6

I'm new to typeahead / full text searching but I'm trying to figure out an algorithm that yields the best results for my search use-case. The additional module pg_trgm is installed. I have a table search that looks like:

  Column    |         Type          | Modifiers
------------+-----------------------+-----------
id          | character varying(36) | collate C
user_id     | integer               |
type        | text                  |
search_on   | text                  | collate C
data        | json                  |
Indexes:
 "index_searchables_on_id" UNIQUE, btree (id)
 "index_searchables_on_search_on" gist (search_on gist_trgm_ops)
 "index_searchables_on_type" btree (type)
 "index_searchables_on_user_id" btree (user_id)

I'm querying rows using something like:

SELECT * FROM searchables
  WHERE search_on % 'mysearch'
ORDER BY search_on <-> 'mysearch'

And I get ordered lists which roughly match what I want, but I have certain patterns that I'd like to match higher based on how the matching worked.

For instance, I'd like to weight prefix matches higher than the normal trigram matching algorithm. So if, for instance, the string abc is at the beginning of my column's text, I'd want that factored into the ordering to come higher than something who might have more trigram matches on abc but doesn't start with that string. The equivalent of doing a WHERE col like 'mysearch%'

Is there any way I can incorporate a prefix matching to alter the order clause that'd still yield an efficient query but prioritize prefixes before other matches?

Additional Info

So i've gone down the path of adding the prefix ordering, but i'm noticing significant performance issues. Real user monitoring shows some of these queries taking many seconds each (I have some that take 15+ seconds for certain terms). My examples below show queries in the 100's of ms upwards of 1s, but in my monitoring system, for a db under slight load, it's hitting 15+s for a single query.

The database itself isn't huge:

=> select count(*) from searchables;
 count
--------
 622398

So we're not even in the millions. search_on is typically just company and stock names, so for example "alphabet inc. class c goog", which is a combination of the security name and ticker.

What I'm finding is that the addition of the prefix seems to add a LOT of overhead. And the speed of the queries I'm getting are much lower than I'd have expected. I realize a lot of factors go into how quickly a query returns but I'm just surprised at how slow this is. My setup is this:

Postgres: 9.6.6
AWS RDS: m4.large
500GB General Purpose SSD
Standard AWS Parameter Group

Here's a few sample EXPLAINs. Am I missing something here? Is 100's of ms the expected performance to search small strings over a few hundred thousand rows? It makes it pretty difficult to produce a decent user experience when you add in hte overhead of my web server + network. 300ms of a direct access query to the DB is not really cutting it for my use-case.

Query 1: Prefix Preference

EXPLAIN ANALYZE SELECT *
FROM "searchables"
WHERE (search_on % 'partners value inves')
ORDER BY search_on NOT LIKE 'partners value inves%', search_on <-> 'partners value inves' LIMIT 50;
                                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2243.61..2243.73 rows=50 width=241) (actual time=1497.445..1497.479 rows=50 loops=1)
   ->  Sort  (cost=2243.61..2245.16 rows=622 width=241) (actual time=1497.443..1497.454 rows=50 loops=1)
         Sort Key: ((search_on !~~ 'partners value inves%'::text)), ((search_on <-> 'partners value inves'::text))
         Sort Method: top-N heapsort  Memory: 38kB
         ->  Bitmap Heap Scan on searchables  (cost=77.23..2222.94 rows=622 width=241) (actual time=340.089..1433.072 rows=100430 loops=1)
               Recheck Cond: (search_on % 'partners value inves'::text)
               Heap Blocks: exact=20161
               ->  Bitmap Index Scan on index_searchables_on_search_on  (cost=0.00..77.08 rows=622 width=0) (actual time=336.245..336.245 rows=100430 loops=1)
                     Index Cond: (search_on % 'partners value inves'::text)
 Planning time: 0.092 ms
 Execution time: 1497.529 ms

Query 2 – Non prefix preference

EXPLAIN ANALYZE SELECT *
FROM "searchables"
WHERE (search_on % 'partners value inves')
ORDER BY search_on <-> 'partners value inves' LIMIT 50;
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..204.43 rows=50 width=240) (actual time=330.725..331.394 rows=50 loops=1)
   ->  Index Scan using index_searchables_on_search_on on searchables  (cost=0.41..2538.40 rows=622 width=240) (actual time=330.721..331.369 rows=50 loops=1)
         Index Cond: (search_on % 'partners value inves'::text)
         Order By: (search_on <-> 'partners value inves'::text)
 Planning time: 0.092 ms
 Execution time: 331.446 ms

I'm wondering if there's any tuning of RDS I should be paying attention to, or a better type of index I can use to do this prefix preferred, similarity matching for a typeahead?

Best Answer

Trigram similarity and distance operators put more weight on leading matches (prefix) automatically and to a lesser extent on trailing matches (suffix), due to the way trigrams are extracted from strings. The manual:

Each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.

There is more, including examples. Read the manual. And consider the demos in my fiddle at the bottom of this answer.

Test setup

Based on your table definition and example:

CREATE TABLE search (id int PRIMARY KEY, search_on text, comment text);
    
INSERT INTO search (id, search_on, comment) VALUES
   ( 1, 'abc123456789', 'leading')
 , ( 2, '123abc456789', 'nested')
 , ( 3, '123456789abc', 'trailing')
 , ( 4, 'abc123abc456', 'leading, nested 1x')
 , ( 5, '123abc456abc', 'trailing,nested 1x')
 , ( 6, 'abcabcabc123', 'leading, nested 2x')
 , ( 7, '123abcabcabc', 'trailing nested 2x')
 , ( 8, '1abcabcabc23', 'nested 3x')
 , (10, 'abc12'       , 'leading short')
 , (11, '12abc'       , 'trailing short')
 , (12, '1abc2'       , 'nested short');

CREATE INDEX index_search_search_on ON search USING gist (search_on gist_trgm_ops);

Not using your odd type bpchar (blank padded character type) for id - and I suggest you don't either. text or varchar should serve better:

Queries

We need a low threshold for the demo:

SET pg_trgm.similarity_threshold = .01;  -- show weak matches, too

Demonstrating the built-in bias in your favor:

SELECT *, search_on <-> 'abc' AS distance
FROM   search
WHERE  search_on % 'abc'
ORDER  BY search_on <-> 'abc';
id | search_on    | comment            | distance
-: | :----------- | :----------------- | :-------
10 | abc12        | leading short      | 0.571429
 6 | abcabcabc123 | leading, nested 2x | 0.7     
11 | 12abc        | trailing short     | 0.75    
 4 | abc123abc456 | leading, nested 1x | 0.769231
 1 | abc123456789 | leading            | 0.785714
 7 | 123abcabcabc | trailing nested 2x | 0.818182
 5 | 123abc456abc | trailing,nested 1x | 0.857143
 3 | 123456789abc | trailing           | 0.866667
12 | 1abc2        | nested short       | 0.888889
 8 | 1abcabcabc23 | nested 3x          | 0.916667
 2 | 123abc456789 | nested             | 0.9375  

As you can see, leading matches have more weight. But it's still just a relative bias.

To make this absolute:

... at the beginning of my column's text, I'd want that factored into the ordering to come higher ...

SELECT *, search_on <-> 'abc' AS distance, search_on ILIKE 'abc%' AS prefix_match
FROM   search
WHERE  search_on % 'abc'
ORDER  BY search_on NOT ILIKE 'abc%'  -- prefix matches first
        , search_on <-> 'abc';        -- then sort by distance
id | search_on    | comment            | distance | prefix
-: | :----------- | :----------------- | :------- | :-----
10 | abc12        | leading short      | 0.571429 | t           
 6 | abcabcabc123 | leading, nested 2x | 0.7      | t           
 4 | abc123abc456 | leading, nested 1x | 0.769231 | t           
 1 | abc123456789 | leading            | 0.785714 | t           
11 | 12abc        | trailing short     | 0.75     | f           
 7 | 123abcabcabc | trailing nested 2x | 0.818182 | f           
 5 | 123abc456abc | trailing,nested 1x | 0.857143 | f           
 3 | 123456789abc | trailing           | 0.866667 | f           
12 | 1abc2        | nested short       | 0.888889 | f           
 8 | 1abcabcabc23 | nested 3x          | 0.916667 | f           
 2 | 123abc456789 | nested             | 0.9375   | f           

I chose the expression search_on NOT ILIKE 'abc%' to still sort NULL values last. Equivalent: search_on ILIKE 'abc%' DESC NULLS LAST. Related:

You could sort trailing matches in a similar fashion or combine both:

...
ORDER  BY search_on NOT ILIKE 'abc%'  -- prefix matches first
        , search_on NOT ILIKE '%abc'  -- suffix matches next
        , search_on <-> 'abc';        -- then sort by distance

db<>fiddle here

BTW 1: Full Text Search also supports prefix matching.
BTW 2: The "C" collation COLLATE "C" allows plain btree index support for prefix matches.

Related: