PostgreSQL 9.6 – Progressive Child Tables Scan for Performance

inheritanceperformancepostgresqlpostgresql-9.6postgresql-performance

We have append-only events table where events from different devices are being collected.
We use PostgreSQL 9.6 and pg_partman to split the table into monthly partitions.
Data is partitioned using event_time column.
Each monthly table has about 100M rows.

There's a query from application side that is used to show latest events from a device. It looks something like this (simplified):

SELECT event_time, event_data
FROM events
WHERE device_id = 'zzz'
ORDER BY event_time DESC
LIMIT 10

Every search criteria is covered by index, so PostgreSQL does only index scans.

The problem is that it scans all child tables and only then does the LIMIT 10.
Indexes for all these tables are quite big and don't fit into memory, so it takes up to 20 seconds to complete this query.

But in most cases the 10 most recent events are available from one child table (partition) that is named events_p<current_year>_<current_month>.

Is there any way to implement "progressive" scan for child tables inside PostgreSQL without changing application code?

For example:

events
events_p2019_02 <- and so on...
events_p2019_03 <- scan this if less than 10 rows found in _04 and _05
events_p2019_04 <- scan this if less than 10 rows found in events_p2019_05
events_p2019_05 <- scan this first

We don't know in advance the possible date range for that 10 events. Some device may have them during the last day, while others may have them spanned over several months (so in many child tables).

Best Answer

The problem is that it scans all child tables and only then does the LIMIT 10. Indexes for all these tables are quite big and don't fit into memory, so it takes up to 20 seconds to complete this query.

If you have the correct indexes, which in this case would be on events (device_id,event_time), then it will only hit a tiny part of each index (the path leading to the leaf page containing the highest event_time for the specified device_id for each partition), and that part will probably be readily cached, depending on how many distinct device_id exist and are routinely queried.

When version 12 is released, then if you use declarative range partitioning rather than partitioning by inheritance the system will be smart enough to hit partitions in appropriate order and not startup the index scan on more partitions once the LIMIT is already satisfied. However, this might only be a minor improvement in your case. If you have the right index, it should be so fast already that no improvement is necessary, and if you have the wrong index it might be annoyingly slow even with this improvement.

If you can't wait for version 12 and refactor your partitioning, and can't build the appropriate indexes, then a set returning function which manually iterates through the partitions in order might the way to go, despite being tedious, ugly, and error prone.

1.

The query plan for the slow query shows:

Seq Scan on event_page o (cost=10,000,000,000.00..10,000,000,000.00 rows=1 width=274)

... which indicates that you ran with SET enable_seqscan = off;. And that means your Postgres version positively could not find any other way than the sequential scan.

2.

Did you enable constraint_exclusion like the manual advises here:

Ensure that the constraint_exclusion configuration parameter is not disabled in postgresql.conf. If it is, queries will not be optimized as desired.

3.

The day boundaries in your query are not in sync with the partitioning. (But that has no immediate effect for the given query, see comments.)
The CHECK constraint reads:

CHECK ( timestamp_ >= '2017-4-01'::timestamp AND timestamp_ < '2017-05-01'::timestamp )

There may be confusion with timestamp vs. timestamptz and / or time zones. Or something got lost in translation and the question is misleading in that respect.

Either way, the expression in your WHERE clause opens up a sneaky corner case (even if your day boundaries were in sync):

where O.timestamp_ between '2017-4-28 04:00:00' and '2017-4-29 03:59:59.999'

Postgres timestamps are implemented as 8-byte integers allowing 6 decimal places. A row with '2017-4-29 03:59:59.9995' would not behave as expected. Details:

Ignoring timezones altogether in Rails and PostgreSQL

Plus, this does not play well with constraint exclusion. More partitions than necessary may have to be read.

Use instead:

WHERE o.timestamp_ >= '2017-4-28 04:00' -- include lower bound
AND   o.timestamp_ <  '2017-4-29 04:00' -- exclude upper bound

You may want get "days" in your partitions in sync with your queries - 00:00 vs. 04:00 (or the evil 03:59:59.999) - and maybe use timestamptz to begin with.

4.

I see for the index-only scan on alias:

Heap Fetches: 1918543

... which is exceptionally high. There may or may not be problems with the visibility map. (See also: Updating The Visibility Map) Did you run VACUUM FULL on the table(s)? (Compare this thread on plpgsql-performance.) Try plain VACUUM, which updates the visibility map properly. But that's all hard to say, since you seem to have been running with enable_seqscan = off. Not sure how this plays out in combination with all the other stuff at work here.

There have been various improvements to VACUUM for Postgres 9.6 (among other things), this may explain the difference - or what you observed in pg 9.6 is coincidence and due to other factors.

5.

If alias.alias is unique (and referential integrity can be assumed), the query can be simplified. Just count(DISTINCT o.person_alias) without even joining to table alias at all. Like:

SELECT count(DISTINCT o.person_alias) AS thecount
FROM   event_page o
WHERE  o.timestamp_ >= '2017-4-28 04:00' -- include lower bound
AND    o.timestamp_ <  '2017-4-29 04:00' -- exclude upper bound
AND    o.location_host = 'www.foo.com'
AND    o.location_path = '/ca/sale';

(Or use the subquery like you had it, may be faster than count(DISTINCT ...).)

If alias.alias is not unique, your query may be wrong / ambiguous.

Why do you have person_alias in table event_page and not person_id to begin with?

6.

You use the data type CHAR(24) as PK for your tables, which is almost certainly a bad choice. Consider a serial or bigserial column or maybe a uuid. Numeric types are smaller and faster and not burdened by collation rules, varying byte length, etc. Related:

PostgreSQL 9.6 Performance – Partition Constraint Not Used for Joins by Timestamp

Constraint-based exclusion [CBE] is performed on early stage of query planning, just after the query is parsed, mapped to actual relations and rewritten. (internals, Planner/Optimizer stage)

The planner cannot assume any contents of "sensor_sample" table.

So unless you have values hardcoded in the query, the planner will not exclude "partitions".

I guess what happens with the CTE variant... the planner is restricted because you use TABLESAMPLE and the whole subquery may be treated as volatile even if literals in the subquery are static. (that's just my guess, I'm not expert on planner code)

On the bright side, the index scan with negative result is blazingly fast. (single page scan at most!) so unless you have over 10000 partitions, I would not bother.

So, to answer your question directly:

You cannot improve this data structure much more.
Regardin index scans - they are cheap;
Regarding sequential scans - they are avoided when possible, as you see on your own examples.

Best Answer

Related Solutions

Postgresql – Nested Loop with Materialize resulting in very slow query on postgresql 9.5

1.

2.

3.

4.

5.

6.

PostgreSQL 9.6 Performance – Partition Constraint Not Used for Joins by Timestamp

Related Question