PostgreSQL – faster way to count JSONB tags

execution-plan, json, performance, postgresql, postgresql-performance, statistics

I'm trying to squeeze some more performance out of this query in Postgres 9.5. I'm running it over 400,000 rows.

In playing around with it, I've noticed that the CASE expressions add quite a bit to the query cost: if I replace them with a simple sum over an existing column, the execution time is halved. Is there a more efficient way to calculate these sums?

SELECT sum("tag1"), sum("tag2"), sum("total_tags")
FROM (
    SELECT people.data->'recruiter_id' AS recruiter_id,
           (CASE WHEN people.data->'tags' ? 'tag1' THEN 1 END) AS "tag1",
           (CASE WHEN people.data->'tags' ? 'tag2' THEN 1 END) AS "tag2",
           ((CASE WHEN people.data->'tags' ? 'tag1' THEN 1 ELSE 0 END) +
            (CASE WHEN people.data->'tags' ? 'tag2' THEN 1 ELSE 0 END)) AS total_tags
    FROM people
    WHERE people.data->'tags' ?| ARRAY['tag1','tag2']
) AS target
GROUP BY recruiter_id

Output of EXPLAIN ANALYSE:

HashAggregate  (cost=1076.30..1078.22 rows=550 width=202) (actual time=7043.115..7043.208 rows=449 loops=1)
  Group Key: (people.data -> 'recruiter_id'::text)
  ->  Bitmap Heap Scan on people  (cost=12.85..1072.72 rows=550 width=202) (actual time=13.908..2619.878 rows=48492 loops=1)
        Recheck Cond: ((data -> 'tags'::text) ?| '{tag1,tag2}'::text[])
        Heap Blocks: exact=26114
        ->  Bitmap Index Scan on index_people_on_data_tags  (cost=0.00..12.82 rows=550 width=0) (actual time=9.219..9.219 rows=48493 loops=1)
              Index Cond: ((data -> 'tags'::text) ?| '{tag1,tag2}'::text[])
Planning time: 0.139 ms
Execution time: 7043.291 ms

Running on:

PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit

The inner query and outer query are generated by separate portions of the application. Is it possible to optimise without restructuring?

Best Answer

This should be faster and simpler overall:

SELECT *, tag1 + tag2 AS total_tags
FROM  (
   SELECT (data->>'recruiter_id')::int AS recruiter_id  -- cheaper to group by int
        , count(*) FILTER (WHERE data->'tags' ? 'tag1') AS tag1
        , count(*) FILTER (WHERE data->'tags' ? 'tag2') AS tag2
   FROM   people
   WHERE  data->'tags' ?| ARRAY['tag1','tag2']
   GROUP  BY 1
   ) target;
  • Assuming that recruiter_id holds an integer value, it's cheaper to group by the extracted integer than by the jsonb object wrapping it. I also assume you'd rather get the integer value in the result anyway.

  • Only count once in the subquery, then add counts for the total sum in the outer SELECT.

  • Use the aggregate FILTER clause (available since Postgres 9.4) for conditional counts.

  • If you want shorter syntax, this gets you the same result and performance:

    count(data->'tags' ? 'tag1' OR NULL) AS tag1

    This works because count() only counts non-null input: true OR NULL evaluates to true, while false OR NULL evaluates to NULL.
    

Index and missing statistics for jsonb

Typically, indexes are the deciding factor for performance with big tables. But since your query retrieves 48,493 of 400,000 rows, i.e. more than 12 %, the index is not helping this query at all.

Why the poor decision? The query planner has no statistics for values inside a json / jsonb column and has to choose a query plan based on generic selectivity estimates. It expects to find rows=550 while the query actually finds ~ 90× as many (rows=48493). On that estimate, the bitmap index scan looks attractive, but a sequential scan (not using the index at all) would actually be faster.
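If you want to confirm that on your own data, you can locally discourage index use and compare plans. A minimal sketch using the standard planner toggles (SET LOCAL reverts with the transaction, so nothing sticks; the count(*) query is just a stand-in for your real one):

BEGIN;
SET LOCAL enable_bitmapscan = off;  -- steer the planner away from the bitmap scan
SET LOCAL enable_indexscan  = off;  -- ... and away from a plain index scan
EXPLAIN (ANALYZE, BUFFERS)          -- compare actual time against the default plan
SELECT count(*)
FROM   people
WHERE  data->'tags' ?| ARRAY['tag1','tag2'];
ROLLBACK;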

The index may still be instrumental for less frequent tags (if you have those). An expression index on just data->'tags' should serve best, maybe even a jsonb_path_ops index combined with an adapted query.
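A minimal sketch of both index variants; the index names are made up, and the jsonb_path_ops version assumes you can rewrite the query to use the containment operator @>, the only operator that operator class supports:

-- Expression index over just the 'tags' object (supports ?, ?|, ?& and @>):
CREATE INDEX people_data_tags_idx ON people USING gin ((data->'tags'));

-- Smaller and faster alternative, but it only supports @>:
CREATE INDEX people_data_tags_path_idx ON people USING gin ((data->'tags') jsonb_path_ops);

-- Adapted query for the jsonb_path_ops index, assuming 'tags' is a jsonb array:
SELECT count(*) FROM people WHERE data->'tags' @> '["rare_tag"]';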

However, for this and other reasons, when working with common tags, a plain Postgres array or a fully normalized schema will beat the performance of your jsonb object by a long shot.
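For illustration, a sketch of the plain-array alternative with assumed column and index names. Unlike jsonb, a text[] column gets per-element statistics from ANALYZE (most_common_elems in pg_stats), so the planner can estimate queries like yours much better, and a GIN index supports the overlap (&&) and containment (@>) operators directly. This assumes 'tags' is stored as a jsonb array of strings:

-- One-time migration sketch (assumes data->'tags' is a jsonb array of strings):
ALTER TABLE people ADD COLUMN tags text[];
UPDATE people SET tags = ARRAY(SELECT jsonb_array_elements_text(data->'tags'));
CREATE INDEX people_tags_gin_idx ON people USING gin (tags);
ANALYZE people;  -- collect element statistics for the new column

-- The equivalent conditional counts against the array column:
SELECT (data->>'recruiter_id')::int AS recruiter_id
     , count(*) FILTER (WHERE tags @> '{tag1}') AS tag1
     , count(*) FILTER (WHERE tags @> '{tag2}') AS tag2
FROM   people
WHERE  tags && '{tag1,tag2}'
GROUP  BY 1;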

A discussion on the pgsql-performance mailing list addresses exactly this problem.