PostgreSQL 9.6 – Progressive Child Tables Scan for Performance

inheritanceperformancepostgresqlpostgresql-9.6postgresql-performance

We have append-only events table where events from different devices are being collected.
We use PostgreSQL 9.6 and pg_partman to split the table into monthly partitions.
Data is partitioned using event_time column.
Each monthly table has about 100M rows.

There's a query from application side that is used to show latest events from a device. It looks something like this (simplified):

SELECT event_time, event_data
FROM events
WHERE device_id = 'zzz'
ORDER BY event_time DESC
LIMIT 10

Every search criteria is covered by index, so PostgreSQL does only index scans.

The problem is that it scans all child tables and only then does the LIMIT 10.
Indexes for all these tables are quite big and don't fit into memory, so it takes up to 20 seconds to complete this query.

But in most cases the 10 most recent events are available from one child table (partition) that is named events_p<current_year>_<current_month>.

Is there any way to implement "progressive" scan for child tables inside PostgreSQL without changing application code?

For example:

events
events_p2019_02 <- and so on...
events_p2019_03 <- scan this if less than 10 rows found in _04 and _05
events_p2019_04 <- scan this if less than 10 rows found in events_p2019_05
events_p2019_05 <- scan this first

We don't know in advance the possible date range for that 10 events. Some device may have them during the last day, while others may have them spanned over several months (so in many child tables).

Best Answer

The problem is that it scans all child tables and only then does the LIMIT 10. Indexes for all these tables are quite big and don't fit into memory, so it takes up to 20 seconds to complete this query.

If you have the correct indexes, which in this case would be on events (device_id,event_time), then it will only hit a tiny part of each index (the path leading to the leaf page containing the highest event_time for the specified device_id for each partition), and that part will probably be readily cached, depending on how many distinct device_id exist and are routinely queried.

When version 12 is released, then if you use declarative range partitioning rather than partitioning by inheritance the system will be smart enough to hit partitions in appropriate order and not startup the index scan on more partitions once the LIMIT is already satisfied. However, this might only be a minor improvement in your case. If you have the right index, it should be so fast already that no improvement is necessary, and if you have the wrong index it might be annoyingly slow even with this improvement.

If you can't wait for version 12 and refactor your partitioning, and can't build the appropriate indexes, then a set returning function which manually iterates through the partitions in order might the way to go, despite being tedious, ugly, and error prone.