PostgreSQL Optimization – FTS and Trigram-Similarity Query Optimization

full-text-searchindexoptimizationpattern matchingpostgresql

I have recently started working on PostgreSQL and I have around 12M rows to handle in which I want to apply Full Text Search. I don't have any prior experience in handling such databases. I have tried to optimize the query but I doubt that it has been fully optimized.

Right now I'm using GIST Index as I read that updates are slower in GIN Index and my database will be updated regularly.

I need to focus on only two columns of my database right now merchant varchar(80) and product varchar(400).

I need to find the product using FTS and also I'm trying to get the product even if the merchant is misspelled.

I ran some queries on a sample Database of about 30K rows to get the following results:

First I ran the basic FTS query to analyse the results.

explain analyze
select count(*) from products
where to_tsvector('english', product) @@ to_tsquery('hat');

Aggregate  (cost=2027.27..2027.28 rows=1 width=0) (actual time=349.032..349.032 rows=1 loops=1)  
->  Seq Scan on products  (cost=0.00..2026.90 rows=147 width=0) (actual time=43.322..348.961 rows=307 loops=1)
 Filter: (to_tsvector((product)::text) @@ to_tsquery('hat'::text))
Total runtime: 349.140 ms

Then I created the GIST index and ran the same query to see the improvement. The results were quite good. At least for me.
```
create index product_gist on products using gist(to_tsvector('english', product));
```

Aggregate  (cost=447.17..447.18 rows=1 width=0) (actual time=12.911..12.911 rows=1 loops=1)
->  Bitmap Heap Scan on products  (cost=9.40..446.80 rows=147 width=0) (actual time=2.256..12.776 rows=307 loops=1)
 Recheck Cond: (to_tsvector('english'::regconfig, (product)::text) @@ to_tsquery('hat'::text))
 ->  Bitmap Index Scan on pn  (cost=0.00..9.37 rows=147 width=0) (actual time=2.111..2.111 rows=307 loops=1)
       Index Cond: (to_tsvector('english'::regconfig, (product)::text) @@ to_tsquery('hat'::text))
Total runtime: 13.051 ms

I also tested a GIN Index and the result was astonishing. Total Runtime: 0.583ms
But I can't use GIN Index, so lets get back to GIST Index.

Now, I'm using pg_trgm module in addition to find the similarity between two words (using it for misspelled merchant).

create index merchant_trgm on products using gist(merchant gist_trgm_ops);

select count(*) from products
where to_tsvector('english', product) @@ to_tsquery('hat')
AND   similarity(merchant,'fashion') > 0.2;

Aggregate  (cost=447.64..447.65 rows=1 width=0) (actual time=14.644..14.645 rows=1 loops=1)
->  Bitmap Heap Scan on products  (cost=9.38..447.51 rows=49 width=0) (actual time=2.187..14.635 rows=12 loops=1)
 Recheck Cond: (to_tsvector('english'::regconfig, (product)::text) @@ to_tsquery('hat'::text))
 Filter: (similarity((merchant)::text, 'fashion'::text) > 0.2::double precision)
 ->  Bitmap Index Scan on product_gist  (cost=0.00..9.37 rows=147 width=0) (actual time=2.055..2.055 rows=307 loops=1)
       Index Cond: (to_tsvector('english'::regconfig, (product)::text) @@ to_tsquery('hat'::text))
Total runtime: 14.705 ms

When I run these queries on my database with 12M rows. Obviously it takes more time. Can anyone help me to further reduce the total runtime.

~~Some more questions in my mind right now:~~

How can I search for a query like 'WALMART BAGS' which will first return me product BAG with merchant WALMART and then BAGS from other merchants.

Can I have both GIN and GIST index working for me?

Edit:

I also ran this query last night and got the following results. I already have the GIST index created and I've checked it is being called. Still the performance is not up to my expectations.

select count(*) from products where (setweight(to_tsvector('english', merchant || ' ' || product), 'A') || setweight(to_tsvector('english', product), 'B') || setweight(to_tsvector('english', merchant), 'C')) @@ to_tsquery('hat') AND similarity(merchant,'fashion') > 0.2;

Aggregate (cost=450.97..450.98 rows=1 width=0) (actual time=18.228..18.228 rows=1 loops=1) -> Bitmap Heap Scan on products (cost=9.40..450.84 rows=49 width=0) (actual time=2.399..18.220 rows=12 loops=1) Recheck Cond: (((setweight(to_tsvector('english'::regconfig, (((merchant)::text || ' '::text) || (product)::text)), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (product)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, (merchant)::text), 'C'::"char")) @@ to_tsquery('hat'::text)) Filter: (similarity((merchant)::text, 'fashion'::text) > 0.2::double precision) -> Bitmap Index Scan on products_weighted_index (cost=0.00..9.39 rows=147 width=0) (actual time=2.206..2.206 rows=307 loops=1) Index Cond: (((setweight(to_tsvector('english'::regconfig, (((merchant)::text || ' '::text) || (product)::text)), 'A'::"char") || setweight(to_tsvector('english'::regconfig, (product)::text), 'B'::"char")) || setweight(to_tsvector('english'::regconfig, (merchant)::text), 'C'::"char")) @@ to_tsquery('hat'::text)) Total runtime: 18.289 ms (7 rows)

Best Answer

Assessment

In your last query, the bitmap index scan looking for 'hat' produces 307 hits.
Postgres then runs a bitmap heap scan to filter merchants similar enough ( similarity(...) > 0.2), producing 12 rows. Your test is with 30K rows, so your real life query will produce around 300 times as many hits, 90k / 3.5k for the test case at hand. An additional index on merchant will help.

Advice

I suggest you create an additional trigram index for the similarity search. Be sure to read the chapter in the manual about trigram index support. We need the additional module pg_trgminstalled (like you obviously have).

For your first request:

How can I search for a query like 'WALMART BAGS' which will first return me product BAG with merchant WALMART and then BAGS from other merchants.

I suggest this query using the similarity operator %:

-- SELECT set_limit(0.2)  -- Adjust similarity operator only if needed

SELECT *
FROM   products
WHERE  to_tsvector('english', product) @@ to_tsquery('bag')
AND    merchant % 'walmart'
ORDER  BY merchant <-> 'walmart'
--    LIMIT  n; -- possibly limit to top n results

Again, you can choose between GiST and GIN, but this time GiST carries a decisive advantage:

This can be implemented quite efficiently by GiST indexes, but not by GIN indexes. It will usually beat the first formulation when only a small number of the closest matches is wanted.

Therefore, I suggest this index:

CREATE INDEX prod_merchant_trgm_idx ON products USING gist (merchant gist_trgm_ops);

As for your second request:

Can I have both GIN and GIST index working for me?

Yes, you can. It would hardly make sense to have both types for the same (combination of) column(s), but Postgres can combine GiST and GIN indices very well in the same query. I quote the excellent manual yet again, on Combining Multiple Indexes:

To combine multiple indexes, the system scans each needed index and prepares a bitmap in memory giving the locations of table rows that are reported as matching that index's conditions. The bitmaps are then ANDed and ORed together as needed by the query. Finally, the actual table rows are visited and returned. The table rows are visited in physical order, because that is how the bitmap is laid out; this means that any ordering of the original indexes is lost, and so a separate sort step will be needed if the query has an ORDER BY clause. For this reason, and because each additional index scan adds extra time, the planner will sometimes choose to use a simple index scan even though additional indexes are available that could have been used as well.

Best Answer

Assessment

Advice

Related Solutions

Postgresql – Optimizing ORDER BY in a full text search query

PostgreSQL Indexing – How to Index WHERE (start_date >= ‘2013-12-15’)

Related Question