I have two tables in Postgres 9.4.1, `events` and `event_refs`, with the following schemas:

`events` table:
```sql
CREATE TABLE events (
    id serial NOT NULL PRIMARY KEY,
    event_type text NOT NULL,
    event_path jsonb,
    event_data jsonb,
    created_at timestamp with time zone NOT NULL
);

-- Index on type and created time
CREATE INDEX events_event_type_created_at_idx
    ON events (event_type, created_at);
```
`event_refs` table:
```sql
CREATE TABLE event_refs (
    event_id integer NOT NULL,
    reference_key text NOT NULL,
    reference_value text NOT NULL,
    CONSTRAINT event_refs_pkey PRIMARY KEY (event_id, reference_key, reference_value),
    CONSTRAINT event_refs_event_id_fkey FOREIGN KEY (event_id)
        REFERENCES events (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
);
```
Both tables hold 2M rows. This is the query I'm running:

```sql
SELECT
    EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) AS funnel_time
FROM
    events
INNER JOIN
    event_refs
ON
    event_refs.event_id = events.id AND
    event_refs.reference_key = 'project'
WHERE
    events.event_type = 'event1' OR
    events.event_type = 'event2' AND
    events.created_at >= '2015-07-01 00:00:00+08:00' AND
    events.created_at < '2015-12-01 00:00:00+08:00'
GROUP BY event_refs.reference_value
HAVING COUNT(*) > 1;
```
I am aware of the operator precedence in the `WHERE` clause; it is intentional, and only events of type `'event2'` are supposed to be filtered by date.
This is the `EXPLAIN ANALYZE` output:
```
GroupAggregate  (cost=116503.86..120940.20 rows=147878 width=14) (actual time=3970.530..4163.041 rows=53532 loops=1)
  Group Key: event_refs.reference_value
  Filter: (count(*) > 1)
  Rows Removed by Filter: 41315
  ->  Sort  (cost=116503.86..116873.56 rows=147878 width=14) (actual time=3970.509..4105.316 rows=153766 loops=1)
        Sort Key: event_refs.reference_value
        Sort Method: external merge  Disk: 3904kB
        ->  Hash Join  (cost=24302.26..101275.04 rows=147878 width=14) (actual time=101.667..1394.281 rows=153766 loops=1)
              Hash Cond: (event_refs.event_id = events.id)
              ->  Seq Scan on event_refs  (cost=0.00..37739.00 rows=2000000 width=10) (actual time=0.007..368.661 rows=2000000 loops=1)
                    Filter: (reference_key = 'project'::text)
              ->  Hash  (cost=21730.79..21730.79 rows=147878 width=12) (actual time=101.524..101.524 rows=153766 loops=1)
                    Buckets: 16384  Batches: 2  Memory Usage: 3315kB
                    ->  Bitmap Heap Scan on events  (cost=3761.23..21730.79 rows=147878 width=12) (actual time=23.139..75.814 rows=153766 loops=1)
                          Recheck Cond: ((event_type = 'event1'::text) OR ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone)))
                          Heap Blocks: exact=14911
                          ->  BitmapOr  (cost=3761.23..3761.23 rows=150328 width=0) (actual time=21.210..21.210 rows=0 loops=1)
                                ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..2349.42 rows=102533 width=0) (actual time=12.234..12.234 rows=99864 loops=1)
                                      Index Cond: (event_type = 'event1'::text)
                                ->  Bitmap Index Scan on events_event_type_created_at_idx  (cost=0.00..1337.87 rows=47795 width=0) (actual time=8.975..8.975 rows=53902 loops=1)
                                      Index Cond: ((event_type = 'event2'::text) AND (created_at >= '2015-07-01 04:00:00+12'::timestamp with time zone) AND (created_at < '2015-12-01 05:00:00+13'::timestamp with time zone))
Planning time: 0.493 ms
Execution time: 4178.517 ms
```
I am aware that the filter on the `event_refs` table scan is not filtering anything; that is a result of my test data, and there will be different types added later.

Everything under and including the Hash Join seems reasonable given my test data, but I'm wondering if it's possible to speed up the Sort produced by the `GROUP BY` clause.
I've tried adding a b-tree index to the `reference_value` column, but it doesn't seem to be used. If I'm not mistaken (and I very well could be, so please tell me), it's sorting 153766 rows. Wouldn't an index benefit that sorting step?
Best Answer
work_mem

That's what makes your sort expensive: `Sort Method: external merge Disk: 3904kB`.

The sort spills to disk, which kills performance. You need more RAM; in particular, you need to increase the setting for `work_mem`. See the manual for its exact meaning and caveats.

In this particular case, raising the setting by a meager 4MB should do the trick. Generally, since you are going to need a lot more in your full deployment for 60M rows, and since it can backfire if the general setting for `work_mem` is too high (read the manual!), consider setting it high enough locally, just for your query, with `SET LOCAL`.

Be aware that even `SET LOCAL` sticks until the end of the transaction, so you may want to `RESET` it if you put more statements into the same transaction. Alternatively, encapsulate the query in a function with a function-local `SET work_mem` setting.
Indexes

I would also try two different indexes: one on `events` that appends `id` to the existing `(event_type, created_at)` columns, and a partial index on `event_refs`.

Adding `id` as the last column only makes sense if you get index-only scans out of it.

The predicate `WHERE reference_key = 'project'` on the partial index does not help much in your test case (except maybe for query planning), but it should help a lot in your full deployment, where "there will be different types added later". And it should also allow index-only scans.
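The answer's index DDL did not survive in this copy; a sketch of what the two indexes described above would look like (the index names are my own):

```sql
-- Multicolumn index on events, with id appended as the last column
-- in the hope of getting index-only scans.
CREATE INDEX events_type_created_id_idx
    ON events (event_type, created_at, id);

-- Partial index on event_refs, restricted to the 'project' key;
-- covering (event_id, reference_value) can also enable index-only scans.
CREATE INDEX event_refs_project_idx
    ON event_refs (event_id, reference_value)
    WHERE reference_key = 'project';
```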
Possible alternative query

Since you are selecting large parts of `events` anyway, a rewritten query might be faster (this heavily depends on data distribution).

I also suspect a bug in the query: you probably want explicit parentheses in the `WHERE` clause. According to operator precedence, `AND` binds before `OR`.

A rewrite only makes sense if there are many rows per `(event_id, reference_value)` in `event_refs`. Again, the above index would help.
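The answerer's rewritten query was lost from this copy. As a sketch, here is the original query with the parentheses made explicit, matching the asker's stated intent that only `'event2'` rows are date-filtered; this is my reconstruction, not necessarily the answerer's exact rewrite:

```sql
SELECT
    EXTRACT(EPOCH FROM (MAX(events.created_at) - MIN(events.created_at))) AS funnel_time
FROM events
INNER JOIN event_refs
    ON  event_refs.event_id = events.id
    AND event_refs.reference_key = 'project'
WHERE events.event_type = 'event1'
   OR (events.event_type = 'event2'
  AND  events.created_at >= '2015-07-01 00:00:00+08:00'
  AND  events.created_at <  '2015-12-01 00:00:00+08:00')
GROUP BY event_refs.reference_value
HAVING COUNT(*) > 1;
```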