A couple of observations:
1. The query plan for the slow query shows:

Seq Scan on event_page o  (cost=10000000000.00..10000000000.00 rows=1 width=274)

... which indicates that you ran with SET enable_seqscan = off; - and that means your Postgres version positively could not find any other way than a sequential scan.
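If that setting is still active in your session, you can verify and restore the default like this (standard Postgres parameters, nothing assumed):

```sql
SHOW enable_seqscan;     -- displays 'off' if the setting is still active
RESET enable_seqscan;    -- back to the default ('on') for this session
```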
2. Did you enable constraint exclusion like the manual advises here?

- Ensure that the constraint_exclusion configuration parameter is not disabled in postgresql.conf. If it is, queries will not be optimized as desired.
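For reference, a minimal sketch of checking and setting the parameter for the current session (since Postgres 8.4 the default is partition, which applies constraint exclusion only to inheritance children and UNION ALL subqueries):

```sql
SHOW constraint_exclusion;              -- 'partition' is the default
SET  constraint_exclusion = partition;  -- 'on' / 'off' are the other options
```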
3. The day boundaries in your query are not in sync with the partitioning. (But that has no immediate effect for the given query, see comments.)
The CHECK constraint reads:

CHECK (timestamp_ >= '2017-4-01'::timestamp AND timestamp_ < '2017-05-01'::timestamp)
There may be confusion with timestamp vs. timestamptz and/or time zones. Or something got lost in translation and the question is misleading in that respect.
Either way, the expression in your WHERE clause opens up a sneaky corner case (even if your day boundaries were in sync):

where O.timestamp_ between '2017-4-28 04:00:00' and '2017-4-29 03:59:59.999'

Postgres timestamps are implemented as 8-byte integers with microsecond resolution, i.e. up to 6 decimal places. A row with '2017-4-29 03:59:59.9995' would not behave as expected: it falls after your upper BETWEEN bound but before the start of the next "day", so it is silently excluded.

Plus, this does not play well with constraint exclusion. More partitions than necessary may have to be read.
Use instead:

WHERE o.timestamp_ >= '2017-4-28 04:00'  -- include lower bound
AND   o.timestamp_ <  '2017-4-29 04:00'  -- exclude upper bound
You may want to get the "days" in your partitions in sync with your queries - 00:00 vs. 04:00 (or the evil 03:59:59.999) - and maybe use timestamptz to begin with.
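To illustrate, a hypothetical sketch of a partition CHECK constraint aligned with the 04:00 day boundary used in the query (the partition and constraint names are made up):

```sql
-- Hypothetical: partition covering "April 2017" measured in 04:00-to-04:00 days
ALTER TABLE event_page_2017_04 ADD CONSTRAINT event_page_2017_04_timestamp_check
CHECK (timestamp_ >= '2017-04-01 04:00'::timestamp
   AND timestamp_ <  '2017-05-01 04:00'::timestamp);
```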
4. I see for the index-only scan on alias:

Heap Fetches: 1918543

... which is exceptionally high. There may or may not be problems with the visibility map. (See also: Updating The Visibility Map.) Did you run VACUUM FULL on the table(s)? (Compare this thread on pgsql-performance.) Try plain VACUUM, which updates the visibility map properly. But that's all hard to say, since you seem to have been running with enable_seqscan = off. Not sure how this plays out in combination with all the other stuff at work here.
There have been various improvements to VACUUM for Postgres 9.6 (among other things), which may explain the difference - or what you observed in pg 9.6 is coincidence and due to other factors.
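A plain VACUUM, plus a quick look at how much of the table is marked all-visible, might look like this (run against your actual table names):

```sql
VACUUM (VERBOSE, ANALYZE) alias;  -- plain VACUUM updates the visibility map

-- Pages marked all-visible vs. total pages; index-only scans depend on the former
SELECT relpages, relallvisible
FROM   pg_class
WHERE  oid = 'alias'::regclass;
```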
5. If alias.alias is unique (and referential integrity can be assumed), the query can be simplified: just count(DISTINCT o.person_alias) without even joining to table alias at all. Like:

SELECT count(DISTINCT o.person_alias) AS thecount
FROM   event_page o
WHERE  o.timestamp_ >= '2017-4-28 04:00'  -- include lower bound
AND    o.timestamp_ <  '2017-4-29 04:00'  -- exclude upper bound
AND    o.location_host = 'www.foo.com'
AND    o.location_path = '/ca/sale';
(Or use the subquery like you had it, which may be faster than count(DISTINCT ...).)
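A sketch of that subquery variant, assuming the same filters - fold the DISTINCT into a derived table and count its rows:

```sql
SELECT count(*) AS thecount
FROM  (
   SELECT DISTINCT o.person_alias
   FROM   event_page o
   WHERE  o.timestamp_ >= '2017-4-28 04:00'
   AND    o.timestamp_ <  '2017-4-29 04:00'
   AND    o.location_host = 'www.foo.com'
   AND    o.location_path = '/ca/sale'
   ) sub;
```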
If alias.alias is not unique, your query may be wrong / ambiguous.

Why do you have person_alias in table event_page and not person_id to begin with?
6. You use the data type CHAR(24) as PK for your tables, which is almost certainly a bad choice. Consider a serial or bigserial column, or maybe a uuid. Numeric types are smaller and faster and not burdened by collation rules, varying byte length, etc.
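For illustration, hedged sketches of both alternatives (table and column names are made up):

```sql
-- bigserial: 8-byte auto-incrementing integer as surrogate PK
CREATE TABLE person (
   person_id bigserial PRIMARY KEY
 , alias     text UNIQUE NOT NULL
);

-- uuid alternative: needs the pgcrypto extension
-- (gen_random_uuid() is built in since Postgres 13)
-- person_id uuid PRIMARY KEY DEFAULT gen_random_uuid()
```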
Constraint-based exclusion [CBE] is performed at an early stage of query planning, just after the query is parsed, mapped to actual relations, and rewritten. (See the internals chapter, Planner/Optimizer stage.)
The planner cannot assume anything about the contents of the "sensor_sample" table. So unless you have values hardcoded in the query, the planner will not exclude "partitions".
I guess what happens with the CTE variant is that the planner is restricted because you use TABLESAMPLE, and the whole subquery may be treated as volatile even if the literals in the subquery are static. (That's just my guess; I'm not an expert on planner code.)
On the bright side, an index scan with a negative result is blazingly fast (a single page scan at most!), so unless you have over 10,000 partitions, I would not bother.
So, to answer your question directly:

- You cannot improve this data structure much more.
- Regarding index scans: they are cheap.
- Regarding sequential scans: they are avoided when possible, as you can see in your own examples.
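To illustrate the hardcoded-values point, two hedged sketches (the column name ts and the calendar table are made up):

```sql
-- Literal bounds: the planner can compare them against each child's CHECK
-- constraint at plan time and exclude non-matching partitions
SELECT count(*) FROM sensor_sample
WHERE  ts >= '2017-04-28' AND ts < '2017-04-29';

-- Bound produced by a subquery: only known at run time, so no constraint
-- exclusion at plan time - every partition stays in the plan
SELECT count(*) FROM sensor_sample s
WHERE  s.ts >= (SELECT max(day) FROM calendar);
```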
Best Answer
If you have the correct indexes, which in this case would be on events (device_id, event_time), then each scan will only hit a tiny part of each index (the path leading to the leaf page containing the highest event_time for the specified device_id in each partition), and that part will probably be readily cached, depending on how many distinct device_id values exist and are routinely queried.

When version 12 is released, then if you use declarative range partitioning rather than partitioning by inheritance, the system will be smart enough to hit partitions in the appropriate order and not start index scans on more partitions once the LIMIT is already satisfied. However, this might only be a minor improvement in your case. If you have the right index, it should be so fast already that no improvement is necessary, and if you have the wrong index it might be annoyingly slow even with this improvement.
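The index in question would be created roughly like this; note that with inheritance partitioning, indexes are not inherited, so each child table needs its own (the child name below is assumed):

```sql
CREATE INDEX ON events (device_id, event_time);           -- parent (if it holds rows)
CREATE INDEX ON events_2017_04 (device_id, event_time);   -- repeat per child table
```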
If you can't wait for version 12 and refactor your partitioning, and can't build the appropriate indexes, then a set-returning function which manually iterates through the partitions in order might be the way to go, despite being tedious, ugly, and error prone.
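A rough sketch of such a set-returning function, under assumptions: inheritance partitioning with parent events, child names that sort newest-first, and a LIMIT-style interface. Untested against a real schema:

```sql
CREATE OR REPLACE FUNCTION latest_events(_device_id int, _limit int)
  RETURNS SETOF events
  LANGUAGE plpgsql AS
$func$
DECLARE
   part      regclass;
   remaining int := _limit;
   n         int;
BEGIN
   FOR part IN
      SELECT inhrelid::regclass
      FROM   pg_inherits
      WHERE  inhparent = 'events'::regclass
      ORDER  BY inhrelid::regclass::text DESC  -- assumes names encode the range
   LOOP
      RETURN QUERY EXECUTE format(
         'SELECT * FROM %s WHERE device_id = $1
          ORDER BY event_time DESC LIMIT $2', part)
      USING _device_id, remaining;

      GET DIAGNOSTICS n = ROW_COUNT;   -- rows returned from this partition
      remaining := remaining - n;
      EXIT WHEN remaining <= 0;        -- LIMIT satisfied: skip older partitions
   END LOOP;
END
$func$;
```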