Table structure:
```
   Column   |          Type          | Collation | Nullable |              Default
------------+------------------------+-----------+----------+------------------------------------
 id         | integer                |           | not null | nextval('events_id_seq'::regclass)
 event_name | character varying(255) |           | not null |
 data       | jsonb                  |           | not null |
Indexes:
    "events_pkey" PRIMARY KEY, btree (id)
    "events_expr_idx" btree ((data -> 'program_id'::text))
    "idxginp" gin (data)
```
What `data` might look like:
```json
{
  "label": "Take Survey Button",
  "sent_at": "2018-07-25",
  "user_id": 123456,
  "order_id": 7654321,
  "event_time": "2018-07-25 00:09:41",
  "program_id": 4,
  "user_agent": "Mozilla/5.0 ...",
  "destination": "http://www.google.com",
  "communication_kind": "email",
  "communication_name": "welcome"
}
```
Query that I need to run:
```sql
select label, count(*) as clicks
from (
    select distinct "data" ->> 'label'   as "label",
                    "data" ->> 'user_id' as "user_id"
    from "events"
    where "event_name" = 'communication_link_clicked'
      and data @> '{"communication_kind": "email"}'
      and data @> '{"communication_name": "program_welcome"}'
      and cast((data ->> 'sent_at') as date) between '2018-07-25' and '2019-05-18'
      and cast((data ->> 'program_id') as int) = 4
) as sub
group by "label";
```
This query takes around 1 minute on my small VM.
Records in this table: ~4 million.
Records where `data ->> 'program_id' = 4`: ~1 million.
I've tried creating individual indexes for the columns included in the WHERE condition, but I didn't see major improvements.
Should I convert the JSON attributes used in the WHERE clause to regular columns? Or am I just missing more indexes and need to tune my query?
EXPLAIN:
```
EXPLAIN (ANALYZE, BUFFERS)
select label, count(*) as clicks
from (
    select distinct "data" ->> 'label'   as "label",
                    "data" ->> 'user_id' as "user_id"
    from "events"
    where "event_name" = 'communication_link_clicked'
      and data @> '{"communication_kind": "email"}'
      and data @> '{"communication_name": "program_welcome"}'
      and cast((data ->> 'sent_at') as date) between '2018-07-25' and '2019-05-18'
      and cast((data ->> 'program_id') as int) = 4
) as sub
group by "label";

GroupAggregate  (cost=104.24..104.28 rows=1 width=40) (actual time=43735.844..43735.844 rows=1 loops=1)
  Group Key: ((events.data ->> 'label'::text))
  Buffers: shared hit=839 read=143489, temp read=210 written=210
  ->  Unique  (cost=104.24..104.25 rows=1 width=64) (actual time=43719.789..43731.616 rows=46772 loops=1)
        Buffers: shared hit=839 read=143489, temp read=210 written=210
        ->  Sort  (cost=104.24..104.25 rows=1 width=64) (actual time=43719.787..43724.865 rows=55219 loops=1)
              Sort Key: ((events.data ->> 'label'::text)), ((events.data ->> 'user_id'::text))
              Sort Method: external merge  Disk: 1680kB
              Buffers: shared hit=839 read=143489, temp read=210 written=210
              ->  Bitmap Heap Scan on events  (cost=84.05..104.23 rows=1 width=64) (actual time=448.903..43328.010 rows=55219 loops=1)
                    Recheck Cond: ((data @> '{"communication_kind": "email"}'::jsonb) AND (data @> '{"communication_name": "program_welcome"}'::jsonb))
                    Rows Removed by Index Recheck: 1434250
                    Filter: (((event_name)::text = 'communication_link_clicked'::text) AND ((((data ->> 'sent_at'::text))::date >= '2018-07-25'::date) AND (((data ->> 'sent_at'::text))::date <= '2019-05-18'::date)) AND ((((data ->> 'program_id'::text))::integer = 4)))
                    Rows Removed by Filter: 264849
                    Heap Blocks: exact=42839 lossy=99688
                    Buffers: shared hit=839 read=143489
                    ->  Bitmap Index Scan on idxginp  (cost=0.00..84.05 rows=5 width=0) (actual time=437.235..437.235 rows=320068 loops=1)
                          Index Cond: ((data @> '{"communication_kind": "email"}'::jsonb) AND (data @> '{"communication_name": "program_welcome"}'::jsonb))
                          Buffers: shared hit=839 read=962
Planning time: 0.890 ms
Execution time: 43736.182 ms
```
Best Answer
So it is using the jsonb index. But your `work_mem` is not large enough to hold the full bitmap, so it goes "lossy" and every row on the lossy pages has to be rechecked, resulting in extra work. You can see this in the plan: `Heap Blocks: exact=42839 lossy=99688` and `Rows Removed by Index Recheck: 1434250`.
Increasing the size of your work_mem setting should improve this.
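For example, you could raise it for the current session and re-run the EXPLAIN to see whether the lossy heap blocks disappear (the `64MB` value here is illustrative, not a recommendation; size it to what your VM can spare per sort/bitmap):

```sql
-- Session-level override; the PostgreSQL default is only 4MB.
SET work_mem = '64MB';

-- Or persist it cluster-wide (takes effect after a config reload):
-- ALTER SYSTEM SET work_mem = '64MB';
-- SELECT pg_reload_conf();
```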
Your functional index cannot be used because of a mismatch between how you define it and how you try to use it. The index uses `->` while the query uses `->>`, and the query casts the result to `int` while the index stores it as `jsonb`. You should define the index over the same expression as you use in the query if you want it to be usable. But since 1/4 of the table meets that criterion, the index is probably not going to be helpful in the first place.
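For illustration, an expression index that *would* match the query's predicate looks like this (index name is made up; and again, given the 1/4 selectivity it may not be worth keeping at all):

```sql
-- Same expression as the WHERE clause: ->> extraction plus a cast to int.
DROP INDEX events_expr_idx;
CREATE INDEX events_program_id_idx
    ON events (((data ->> 'program_id')::int));
```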
To maximize the efficiency here, you might want a compound index over event_name and program_id:
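A sketch of that compound index (name is hypothetical):

```sql
-- Leading column event_name, then program_id extracted and cast
-- exactly the way the query casts it.
CREATE INDEX events_name_program_idx
    ON events (event_name, ((data ->> 'program_id')::int));
```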
That way you get the 1/4 selectivity for free if you are already using the index on event_name. It is hard to say whether that is worthwhile, as we don't know how selective
event_name = 'communication_link_clicked'
is. Your current query plan is not using that index, but maybe improving the selectivity fourfold would make it worthwhile.

But the best solution is probably not to use JSONB in the first place. Your JSON structure appears to be flat (based on the one example), so couldn't you just map those fields into real PostgreSQL columns? You kind of suggest that yourself, but what is the cost of doing so in terms of application complexity? With real columns the planner would have better statistics to work with; top-level JSON attributes do not get the same kind of statistics that real columns do.