I have a table station_logs in a PostgreSQL 9.6 database:
         Column    |            Type             | Modifiers
    ---------------+-----------------------------+-----------
     id            | bigint                      | bigserial
     station_id    | integer                     | not null
     submitted_at  | timestamp without time zone |
     level_sensor  | double precision            |
    Indexes:
        "station_logs_pkey" PRIMARY KEY, btree (id)
        "uniq_sid_sat" UNIQUE CONSTRAINT, btree (station_id, submitted_at)
I'm trying to get the last level_sensor value based on submitted_at, for each station_id. There are around 400 unique station_id values, and around 20k rows per day per station_id.
Before creating index:
    EXPLAIN ANALYZE
    SELECT DISTINCT ON (station_id) station_id, submitted_at, level_sensor
    FROM   station_logs
    ORDER  BY station_id, submitted_at DESC;
    Unique  (cost=4347852.14..4450301.72 rows=89 width=20) (actual time=22202.080..27619.167 rows=98 loops=1)
      ->  Sort  (cost=4347852.14..4399076.93 rows=20489916 width=20) (actual time=22202.077..26540.827 rows=20489812 loops=1)
            Sort Key: station_id, submitted_at DESC
            Sort Method: external merge  Disk: 681040kB
            ->  Seq Scan on station_logs  (cost=0.00..598895.16 rows=20489916 width=20) (actual time=0.023..3443.587 rows=20489812 loops=1)
    Planning time: 0.072 ms
    Execution time: 27690.644 ms
Creating index:
    CREATE INDEX station_id__submitted_at ON station_logs(station_id, submitted_at DESC);
After creating index, for the same query:
    Unique  (cost=0.56..2156367.51 rows=89 width=20) (actual time=0.184..16263.413 rows=98 loops=1)
      ->  Index Scan using station_id__submitted_at on station_logs  (cost=0.56..2105142.98 rows=20489812 width=20) (actual time=0.181..1…)
    Planning time: 0.206 ms
    Execution time: 16263.490 ms
Is there a way to make this query faster - say, around 1 second? 16 seconds is still too much.
Best Answer
For only 400 stations, this query will be massively faster:
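A sketch of that style of query - one quick lookup per station via LATERAL (available since Postgres 9.3), assuming the station table discussed below:

    SELECT s.station_id, l.submitted_at, l.level_sensor
    FROM   station s
    CROSS  JOIN LATERAL (
       SELECT submitted_at, level_sensor
       FROM   station_logs
       WHERE  station_id = s.station_id        -- lateral reference
       ORDER  BY submitted_at DESC NULLS LAST
       LIMIT  1                                -- only the latest row per station
       ) l;

Stations without any rows in station_logs drop out of the result; switch to LEFT JOIN LATERAL ... ON true if you need to keep them.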
dbfiddle here (comparing plans for this query, Abelisto's alternative and your original)
Resulting EXPLAIN ANALYZE as provided by the OP:

The only index you need is the one you created: station_id__submitted_at. The UNIQUE constraint uniq_sid_sat basically does the same job. Maintaining both seems like a waste of disk space and write performance.

I added NULLS LAST to ORDER BY in the query because submitted_at isn't defined NOT NULL. Ideally, if applicable(!), add a NOT NULL constraint to the column submitted_at, drop the additional index, and remove NULLS LAST from the query.

If submitted_at can be NULL, create this UNIQUE index to replace both your current index and unique constraint.
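Something along these lines - a sketch; the index name is illustrative:

    CREATE UNIQUE INDEX station_logs_uni
    ON station_logs (station_id, submitted_at DESC NULLS LAST);

The sort order of the index matches the ORDER BY of the query, so Postgres can read rows from the index directly in the order needed.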
This is assuming a separate table station with one row per relevant station_id (typically the PK) - which you should have either way. If you don't have it, create it. Again, very fast with this rCTE technique:
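A sketch of the technique - an emulated loose index scan that reads one station_id per index jump instead of scanning all 20M rows:

    CREATE TABLE station AS
    WITH RECURSIVE cte AS (
       (                                       -- parentheses required around the first leg
       SELECT station_id
       FROM   station_logs
       ORDER  BY station_id
       LIMIT  1
       )
       UNION ALL
       SELECT l.station_id
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT station_id
          FROM   station_logs
          WHERE  station_id > c.station_id     -- jump to next higher station_id
          ORDER  BY station_id
          LIMIT  1
          ) l
       )
    TABLE  cte;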
I use that in the fiddle as well. You could use a similar query to solve your task directly, without a station table - if you can't be convinced to create it.
Optimize index
Your query should be very fast now. Only if you still need to optimize read performance ...
It might make sense to add level_sensor as the last column to the index to allow index-only scans, like joanolo commented.

Con: It makes the index bigger, which adds a little cost to all queries using it.
Pro: If you actually get index-only scans out of it, the query at hand does not have to visit heap pages at all, which makes it about twice as fast. But that may be an insubstantial gain for the very fast query now.
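Such a covering index might look like the sketch below (illustrative name; Postgres 9.6 has no INCLUDE clause yet, so level_sensor is appended as an ordinary trailing key column):

    CREATE INDEX station_id__submitted_at__level_sensor
    ON station_logs (station_id, submitted_at DESC NULLS LAST, level_sensor);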
However, I don't expect that to work for your case, given the write pattern you mentioned: typically, that would indicate unceasing write load (1 row per station_id every 5 seconds). And you are interested in the latest row. Index-only scans only work for heap pages that are visible to all transactions (the bit in the visibility map is set). You would have to run extremely aggressive VACUUM settings for the table to keep up with the write load, and it would still not work most of the time. If my assumptions are correct, index-only scans are out; don't add level_sensor to the index.

OTOH, if my assumptions hold and your table is growing very big, a BRIN index might help.
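A minimal sketch of such an index - assuming submitted_at correlates with the physical order of mostly append-only inserts, which is what makes BRIN effective:

    CREATE INDEX station_logs_submitted_at_brin
    ON station_logs USING brin (submitted_at);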
Or, even more specialized and more efficient: a partial index covering only the latest additions, to cut off the bulk of irrelevant rows:
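A sketch with a purely illustrative cutoff timestamp:

    CREATE INDEX station_id__submitted_at_recent_idx
    ON station_logs (station_id, submitted_at DESC NULLS LAST)
    WHERE  submitted_at > '2017-06-24 00:00';  -- illustrative cutoff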
Choose a timestamp for which you know that younger rows must exist. You have to add a matching WHERE condition to all queries, like the sketch below, and you have to adapt index and query from time to time.
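For example, reusing the LATERAL sketch from above with the matching predicate, so the planner can use the partial index:

    SELECT s.station_id, l.submitted_at, l.level_sensor
    FROM   station s
    CROSS  JOIN LATERAL (
       SELECT submitted_at, level_sensor
       FROM   station_logs
       WHERE  station_id = s.station_id
       AND    submitted_at > '2017-06-24 00:00' -- must match the index predicate
       ORDER  BY submitted_at DESC NULLS LAST
       LIMIT  1
       ) l;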