I have a table like this:
```sql
CREATE TABLE products (
    id serial PRIMARY KEY,
    category_ids integer[],
    published boolean NOT NULL,
    score integer NOT NULL,
    title varchar NOT NULL);
```
A product can belong to multiple categories. The category_ids column holds the ids of all of a product's categories.
A typical query looks like this (always searching for a single category):
```sql
SELECT * FROM products WHERE published
  AND category_ids @> ARRAY[23465]
ORDER BY score DESC, title
LIMIT 20 OFFSET 8000;
```
To speed it up I use the following index:
```sql
CREATE INDEX idx_test1 ON products
  USING GIN (category_ids gin__int_ops) WHERE published;
```
This one helps a lot unless there are too many products in one category. It quickly finds the products that belong to that category, but then there is a sort operation that has to be done the hard way (without an index).
I have installed the btree_gin extension, which allows me to build a multi-column GIN index like this:
```sql
CREATE INDEX idx_test2 ON products USING GIN (
  category_ids gin__int_ops, score, title) WHERE published;
```
But Postgres does not want to use it for sorting, even when I remove the DESC specifier from the query.
Any alternative approaches to optimize the task are very welcome.
Additional information:
- PostgreSQL 9.4, with the intarray extension
- the total number of products is currently 260k but is expected to grow significantly (up to 10M; this is a multi-tenant e-commerce platform)
- products per category: 1..10,000 (can grow up to 100k); the average is below 100, but categories with a large number of products tend to attract many more requests
The following query plan was obtained from a smaller test system (4680 products in the selected category, 200k products total in the table):
```
Limit  (cost=948.99..948.99 rows=1 width=72) (actual time=82.330..82.341 rows=20 loops=1)
  ->  Sort  (cost=948.37..948.99 rows=245 width=72) (actual time=80.231..81.337 rows=4020 loops=1)
        Sort Key: score, title
        Sort Method: quicksort  Memory: 928kB
        ->  Bitmap Heap Scan on products  (cost=13.90..938.65 rows=245 width=72) (actual time=1.919..16.044 rows=4680 loops=1)
              Recheck Cond: ((category_ids @> '{292844}'::integer[]) AND published)
              Heap Blocks: exact=3441
              ->  Bitmap Index Scan on idx_test2  (cost=0.00..13.84 rows=245 width=0) (actual time=1.185..1.185 rows=4680 loops=1)
                    Index Cond: (category_ids @> '{292844}'::integer[])
Planning time: 0.202 ms
Execution time: 82.404 ms
```
Note #1: 82 ms might not look that scary, but that is only because the sort buffer fits into memory. Once I select all columns from the products table (SELECT * FROM ..., and in real life there are about 60 columns) it switches to Sort Method: external merge  Disk: 5696kB, doubling the execution time. And that is for only 4680 products.
Action point #1 (follows from Note #1): to reduce the memory footprint of the sort operation and therefore speed it up a little, it would be wise to fetch, sort, and limit product ids first, then fetch the full records:
```sql
SELECT * FROM products WHERE id IN (
  SELECT id FROM products WHERE published AND category_ids @> ARRAY[23465]
  ORDER BY score DESC, title LIMIT 20 OFFSET 8000
) ORDER BY score DESC, title;
```
This brings us back to Sort Method: quicksort  Memory: 903kB and ~80 ms for 4680 products. It can still be slow when the number of products in a category grows to 100k.
Best Answer
I've done a lot of experimenting and here are my findings.
GIN and sorting
A GIN index currently (as of version 9.4) cannot assist with ordering.
work_mem
Thanks to Chris for pointing out this configuration parameter. It defaults to 4MB, and in case your recordset is larger, increasing work_mem to a proper value (which can be found from the EXPLAIN ANALYSE output) can significantly speed up sort operations. Reload the server configuration for the change to take effect (a full restart is not required for work_mem), then double check:
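A minimal sketch of the change and the check; the 64MB value is an illustrative assumption, to be sized from whatever EXPLAIN ANALYSE reports for the sort:

```sql
-- Per-session override; a server-wide value would instead go into
-- postgresql.conf (or ALTER SYSTEM) followed by a configuration reload.
SET work_mem = '64MB';   -- illustrative value, not a recommendation
SHOW work_mem;           -- verify the setting actually took effect
```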
Original query
I've populated my database with 650k products, with some categories holding up to 40k products. I've simplified the query a bit by removing the published clause:
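Presumably this was the original query minus the published predicate, roughly like the following (the category id, limit, and offset are illustrative guesses):

```sql
-- The GIN index still handles the category filter; the sort is the problem.
SELECT * FROM products
WHERE category_ids @> ARRAY[23465]
ORDER BY score DESC, title
LIMIT 10 OFFSET 10000;
```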
As we can see, work_mem was not enough, so we had Sort Method: external merge  Disk: 29656kB (the number here is approximate; it needs slightly more than 32MB for an in-memory quicksort).
Reduce memory footprint
Don't select full records for sorting; use ids, apply sort, offset, and limit, then load just the 10 records we need:
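This is the same pattern as Action point #1 in the question, adapted to the test data (category id and offset are again illustrative):

```sql
-- Sort only narrow (id, score, title) tuples, then fetch the wide rows.
SELECT * FROM products WHERE id IN (
  SELECT id FROM products
  WHERE category_ids @> ARRAY[23465]
  ORDER BY score DESC, title LIMIT 10 OFFSET 10000
) ORDER BY score DESC, title;
```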
Note Sort Method: quicksort  Memory: 7396kB. The result is much better.
JOIN and additional B-tree index
As Chris advised, I've created an additional index:
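Presumably a plain B-tree over the sort keys; something like the following (the name and the DESC ordering are assumptions consistent with the query):

```sql
-- B-tree matching the ORDER BY clause of the query.
CREATE INDEX idx_test3 ON products (score DESC, title);
```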
First I tried joining like this:
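A plausible reconstruction of that attempt: filter ids through the GIN index in a subquery, then join back and sort, hoping the planner would pick up the new B-tree for the outer ORDER BY (ids, limit, and offset are illustrative):

```sql
SELECT p.*
FROM products p
JOIN (
  -- GIN-backed category filter
  SELECT id FROM products
  WHERE category_ids @> ARRAY[23465]
) c USING (id)
ORDER BY p.score DESC, p.title
LIMIT 10 OFFSET 10000;
```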
The query plan differs slightly, but the result is the same.
Playing with various offsets and product counts, I could not make PostgreSQL use the additional B-tree index.
So I went the classical way and created a junction table:
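One plausible shape for it (table and column names are assumptions), populated by unnesting the existing array column, plus the equivalent query:

```sql
CREATE TABLE prodcats (
  product_id integer NOT NULL REFERENCES products (id),
  category_id integer NOT NULL,
  -- category_id first, so the PK's B-tree also serves category lookups
  PRIMARY KEY (category_id, product_id));

INSERT INTO prodcats (product_id, category_id)
SELECT id, unnest(category_ids) FROM products;

SELECT p.*
FROM products p
JOIN prodcats pc ON pc.product_id = p.id
WHERE pc.category_id = 23465
ORDER BY p.score DESC, p.title
LIMIT 10 OFFSET 10000;
```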
Still not using the B-tree index; the resultset did not fit into work_mem, hence poor results. But under some circumstances, with a large number of products and a small offset, PostgreSQL now decides to use the B-tree index.
This is in fact quite logical, as the B-tree index here does not produce a direct result; it is only used as a guide for the sequential scan.
Let's compare with the GIN query (the reduced-footprint query from above). GIN's result is much better. I checked various combinations of the number of products and offset; under no circumstances was the junction table approach any better.
The power of a real index
In order for PostgreSQL to fully utilize an index for sorting, all of the query's WHERE parameters as well as its ORDER BY parameters must reside in a single B-tree index. To do this I have copied the sort fields from products to the junction table, as sketched below. Even the worst scenario, with a large number of products in the chosen category and a large offset, is fast this way; when offset=300 the execution time is just 0.5 ms.
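What that could look like, as a sketch (the copied columns would need to be kept in sync by the application or by triggers; all names here are assumptions):

```sql
-- Denormalize the sort keys onto the junction table.
ALTER TABLE prodcats
  ADD COLUMN score integer,
  ADD COLUMN title varchar;

UPDATE prodcats pc
SET score = p.score, title = p.title
FROM products p
WHERE p.id = pc.product_id;

-- Single B-tree covering the filter (category_id) and the sort keys.
CREATE INDEX idx_prodcats_sort
  ON prodcats (category_id, score DESC, title);

SELECT p.*
FROM products p
JOIN (
  SELECT product_id FROM prodcats
  WHERE category_id = 23465
  ORDER BY score DESC, title LIMIT 10 OFFSET 10000
) top ON top.product_id = p.id
ORDER BY p.score DESC, p.title;
```

With the equality on category_id as the leading index column, the ORDER BY matches the index order exactly, so the offset is skipped inside a pure index scan and no sort node is needed at all.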
Unfortunately, maintaining such a junction table requires extra effort. It could be accomplished via indexed materialized views, but that is only useful when your data updates rarely, because refreshing such a materialized view is quite a heavy operation.
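For illustration, a hypothetical indexed materialized view playing the same role (names are assumptions):

```sql
CREATE MATERIALIZED VIEW prodcats_mv AS
SELECT id AS product_id,
       unnest(category_ids) AS category_id,
       score, title
FROM products
WHERE published;

CREATE INDEX idx_prodcats_mv
  ON prodcats_mv (category_id, score DESC, title);

-- The heavy part: the whole view is recomputed on every refresh.
REFRESH MATERIALIZED VIEW prodcats_mv;
```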
So I am staying with the GIN index for now, with increased work_mem and the reduced-memory-footprint query.