I'm trying to squeeze some more performance out of this query in Postgres 9.5. I'm running it over 400,000 rows.
In playing around with it, I've noticed that the CASE statements add quite a bit to the query cost: if I replace them with a simple sum over an existing column, the execution time halves. Is there a more efficient way to calculate these sums?
SELECT sum("tag1"), sum("tag2"), sum("total_tags")
FROM (
    SELECT people.data->'recruiter_id' AS recruiter_id,
           (CASE WHEN people.data->'tags' ? 'tag1' THEN 1 END) AS "tag1",
           (CASE WHEN people.data->'tags' ? 'tag2' THEN 1 END) AS "tag2",
           ((CASE WHEN people.data->'tags' ? 'tag1' THEN 1 ELSE 0 END) +
            (CASE WHEN people.data->'tags' ? 'tag2' THEN 1 ELSE 0 END)) AS total_tags
    FROM people
    WHERE people.data->'tags' ?| ARRAY['tag1','tag2']
) AS target
GROUP BY recruiter_id
Output of EXPLAIN ANALYSE:
HashAggregate (cost=1076.30..1078.22 rows=550 width=202) (actual time=7043.115..7043.208 rows=449 loops=1)
Group Key: (people.data -> 'recruiter_id'::text)
-> Bitmap Heap Scan on people (cost=12.85..1072.72 rows=550 width=202) (actual time=13.908..2619.878 rows=48492 loops=1)
Recheck Cond: ((data -> 'tags'::text) ?| '{tag1,tag2}'::text[])
Heap Blocks: exact=26114
-> Bitmap Index Scan on index_people_on_data_tags (cost=0.00..12.82 rows=550 width=0) (actual time=9.219..9.219 rows=48493 loops=1)
Index Cond: ((data -> 'tags'::text) ?| '{tag1,tag2}'::text[])
Planning time: 0.139 ms
Execution time: 7043.291 ms
Running on:
PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit
The inner query and outer query are generated by separate portions of the application. Is it possible to optimise without restructuring?
Best Answer
This should be faster and simpler overall:
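For example, a query along these lines (a sketch, assuming the people table and jsonb column data from the question, and that recruiter_id holds an integer):

```sql
SELECT recruiter_id, tag1, tag2, tag1 + tag2 AS total_tags
FROM  (
   SELECT (data->>'recruiter_id')::int AS recruiter_id   -- group by plain integer
        , count(*) FILTER (WHERE data->'tags' ? 'tag1') AS tag1  -- FILTER: Postgres 9.4+
        , count(*) FILTER (WHERE data->'tags' ? 'tag2') AS tag2
   FROM   people
   WHERE  data->'tags' ?| ARRAY['tag1','tag2']
   GROUP  BY 1
   ) sub;  -- count each tag once, add the two counts in the outer SELECT
```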
Assuming that recruiter_id is an integer value, it's cheaper to group by the integer than by the jsonb object containing an integer value. I also assume you'd rather get the integer value in the result anyway.

- Only count once in the subquery, then add the counts for the total sum in the outer SELECT.
- Use the aggregate FILTER clause for conditional counts.

If you want shorter syntax, this gets you the same result and performance:
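One way that shorter variant could look (again a sketch under the same assumptions; count() ignores NULL, and a boolean expression OR NULL yields NULL instead of false, so only matching rows are counted):

```sql
SELECT (data->>'recruiter_id')::int AS recruiter_id
     , count(data->'tags' ? 'tag1' OR NULL) AS tag1   -- false OR NULL -> NULL, not counted
     , count(data->'tags' ? 'tag2' OR NULL) AS tag2
     , count(data->'tags' ? 'tag1' OR NULL)
     + count(data->'tags' ? 'tag2' OR NULL) AS total_tags
FROM   people
WHERE  data->'tags' ?| ARRAY['tag1','tag2']
GROUP  BY 1;
```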
Index and missing statistics for jsonb

Typically, indexes are the deciding factor for performance with big tables. But since your query retrieves 48,493 of 400,000 rows, i.e. more than 12 %, the index is not helping this query at all.
Why the poor decision? The query planner has no statistics for values inside a json/jsonb object and has to choose query plans based on generic selectivity estimates. It expects to find rows=550, while the query actually finds ~90 times as many (rows=48493). The query plan using bitmap index scans is a poor decision; a sequential scan would be faster (not using the index at all).

The index may still be instrumental for less frequent tags (if you have those). An expression index on data->'tags' should serve best, maybe even a jsonb_path_ops index combined with an adapted query. More:

However, for this and other reasons, while working with common tags, a plain Postgres array or a fully normalized schema will beat the performance of your jsonb object by a long shot.

This discussion in the postgresql-performance list is about your problem exactly:
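For reference, the two index variants mentioned above might be created like this (a sketch; the index names are made up, and the adapted predicate assumes tags is stored as a jsonb array):

```sql
-- Expression index; supports ?, ?| and ?& on data->'tags':
CREATE INDEX people_data_tags_idx ON people USING gin ((data->'tags'));

-- Smaller, faster alternative; only supports the containment operator @>:
CREATE INDEX people_data_path_idx ON people USING gin (data jsonb_path_ops);

-- Adapted predicate for the jsonb_path_ops index:
-- WHERE data @> '{"tags": ["tag1"]}' OR data @> '{"tags": ["tag2"]}'
```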