PostgreSQL – Index on JSONB not improving query speed

Tags: json, postgresql

table structure

   Column   |              Type              | Collation | Nullable |              Default
------------+--------------------------------+-----------+----------+------------------------------------
 id         | integer                        |           | not null | nextval('events_id_seq'::regclass)
 event_name | character varying(255)         |           | not null |
 data       | jsonb                          |           | not null |

Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_expr_idx" btree ((data -> 'program_id'::text))
    "idxginp" gin (data)

what 'data' might look like

{  
   "label":"Take Survey Button",
   "sent_at":"2018-07-25",
   "user_id":123456,
   "order_id":7654321,
   "event_time":"2018-07-25 00:09:41",
   "program_id":4,
   "user_agent":"Mozilla/5.0 ...",
   "destination":"http://www.google.com",
   "communication_kind":"email",
   "communication_name":"welcome"
}

Query that I need to run:

select label, count(*) as clicks
from (select distinct "data" ->> 'label' as "label", "data" ->> 'user_id' as "user_id"
      from "events"
      where "event_name" = 'communication_link_clicked'
        and data @> '{"communication_kind": "email"}'
        and data @> '{"communication_name": "program_welcome"}'
        and CAST ((data ->> 'sent_at') as date ) between '2018-07-25' and '2019-05-18'
        and CAST ((data ->> 'program_id') as int) = 4) as sub
group by "label";

This query on my small VM takes around 1 minute.

Records in this table: ~4 million.
Records where data ->> 'program_id' = 4: ~1 million.

I've tried creating individual indexes for the columns used in the WHERE clause, but I didn't see major improvements.

Should I convert the JSON attributes used in the WHERE clause to regular columns? Or am I just missing indexes and need to tune my query?

EXPLAIN:

EXPLAIN (ANALYZE, BUFFERS) select label, count(*) as clicks
from (select distinct "data" ->> 'label' as "label", "data" ->> 'user_id' as "user_id"
      from "events"
      where "event_name" = 'communication_link_clicked'
        and data @> '{"communication_kind": "email"}'
        and data @> '{"communication_name": "program_welcome"}'
        and CAST ((data ->> 'sent_at') as date ) between '2018-07-25' and '2019-05-18'
        and CAST ((data ->> 'program_id') as int) = 4) as sub
group by "label";




GroupAggregate  (cost=104.24..104.28 rows=1 width=40) (actual time=43735.844..43735.844 rows=1 loops=1)
  Group Key: ((events.data ->> 'label'::text))
  Buffers: shared hit=839 read=143489, temp read=210 written=210
  ->  Unique  (cost=104.24..104.25 rows=1 width=64) (actual time=43719.789..43731.616 rows=46772 loops=1)
        Buffers: shared hit=839 read=143489, temp read=210 written=210
        ->  Sort  (cost=104.24..104.25 rows=1 width=64) (actual time=43719.787..43724.865 rows=55219 loops=1)
              Sort Key: ((events.data ->> 'label'::text)), ((events.data ->> 'user_id'::text))
              Sort Method: external merge  Disk: 1680kB
              Buffers: shared hit=839 read=143489, temp read=210 written=210
              ->  Bitmap Heap Scan on events  (cost=84.05..104.23 rows=1 width=64) (actual time=448.903..43328.010 rows=55219 loops=1)
                    Recheck Cond: ((data @> '{"communication_kind": "email"}'::jsonb) AND (data @> '{"communication_name": "program_welcome"}'::jsonb))
                    Rows Removed by Index Recheck: 1434250
                    Filter: (((event_name)::text = 'communication_link_clicked'::text) AND (((data ->> 'sent_at'::text))::date >= '2018-07-25'::date) AND (((data ->> 'sent_at'::text))::date <= '2019-05-18'::date) AND ((((data ->> 'program_id'::text))::integer = 4)))
                    Rows Removed by Filter: 264849
                    Heap Blocks: exact=42839 lossy=99688
                    Buffers: shared hit=839 read=143489
                    ->  Bitmap Index Scan on idxginp  (cost=0.00..84.05 rows=5 width=0) (actual time=437.235..437.235 rows=320068 loops=1)
                          Index Cond: ((data @> '{"communication_kind": "email"}'::jsonb) AND (data @> '{"communication_name": "program_welcome"}'::jsonb))
                          Buffers: shared hit=839 read=962
Planning time: 0.890 ms
Execution time: 43736.182 ms

Best Answer

So it is using the jsonb index. But your work_mem setting is not large enough to hold the full bitmap, so the bitmap goes "lossy", resulting in extra work. See:

Rows Removed by Index Recheck: 1434250

Heap Blocks: exact=42839 lossy=99688

Increasing the size of your work_mem setting should improve this.
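A quick way to test this, before changing anything globally, is to raise work_mem for just your session and re-run the EXPLAIN; the 256MB value below is an arbitrary assumption, so size it to what your VM can spare per sort/hash/bitmap operation:

```sql
-- Assumed value for illustration; each sort, hash, and bitmap
-- operation in a query may use up to this much memory.
SET work_mem = '256MB';

-- Re-run EXPLAIN (ANALYZE, BUFFERS) on the query; the bitmap heap
-- scan should now report far fewer (ideally zero) lossy heap blocks
-- and no "Rows Removed by Index Recheck".
```

If that fixes the lossy blocks, you can make it permanent per role or in postgresql.conf rather than per session.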

Your functional index cannot be used, because of a mismatch between how you define it and how you try to use it.

((data -> 'program_id'::text))

CAST ((data ->> 'program_id') as int) = 4

One uses "->>" and the other uses "->"; one casts the output to int, while the other leaves it as jsonb. If you want the index to be usable, you must define it over the same expression you use in the query. But since 1/4 of the table meets that criterion, the index is probably not going to be helpful in the first place.
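For completeness, a matching expression index would have to reproduce the query's expression exactly, ->> and cast included (the index name here is illustrative):

```sql
-- Matches the query's expression: ->> (text extraction) plus the int cast.
-- Only worth it if the predicate is selective, which at ~1 in 4 rows it is not.
CREATE INDEX events_program_id_int_idx
    ON events (CAST(data ->> 'program_id' AS int));
```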

To maximize the efficiency here, you might want a compound index over event_name and program_id:

create index ON events (event_name, CAST ((data ->> 'program_id') as int));

That way you get the 1/4 selectivity for free whenever the index is used for event_name. It is hard to say whether that is worthwhile, as we don't know how selective event_name = 'communication_link_clicked' is. Your current query plan is not using that index, but improving the selectivity fourfold might make it worthwhile.

But the best solution is probably not to use JSONB in the first place. Your JSON structure appears to be flat (based on the one example), so couldn't you just map those fields to real PostgreSQL columns? You suggest that yourself; the question is what it would cost in application complexity. The planner would then have much better statistics to work with, because top-level JSON attributes do not get the same kind of per-column statistics that real columns do.
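A sketch of that migration for the attributes your query filters on, assuming the new column names simply mirror the JSON keys (on PostgreSQL 12+ you could use generated columns instead, keeping the JSONB as the source of truth):

```sql
-- Promote the filtered JSON attributes to real columns.
ALTER TABLE events
    ADD COLUMN program_id         int,
    ADD COLUMN sent_at            date,
    ADD COLUMN communication_kind text,
    ADD COLUMN communication_name text;

-- Backfill from the existing JSONB. A single UPDATE rewrites all
-- ~4M rows; consider batching by id range in production.
UPDATE events
   SET program_id         = CAST(data ->> 'program_id' AS int),
       sent_at            = CAST(data ->> 'sent_at' AS date),
       communication_kind = data ->> 'communication_kind',
       communication_name = data ->> 'communication_name';

-- Refresh planner statistics for the new columns.
ANALYZE events;
```

After that, plain btree indexes on (event_name, program_id, sent_at) become straightforward, and the planner's row estimates should stop collapsing to rows=1 as they do in your current plan.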