I want to select last unique rows based on time, so:
SELECT DISTINCT ON (track) *
FROM eco.tracks WHERE id > (SELECT id FROM eco.tracks WHERE time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - INTERVAL '300 seconds') ORDER BY id DESC LIMIT 1)
ORDER BY track, time_track DESC;
It takes 20 seconds, that's too slow.
If I replace id
by the actual value, the query takes only 2 ms:
SELECT DISTINCT ON (track) *
FROM eco.tracks WHERE id > 48000000
ORDER BY track, time_track DESC;
That subquery alone takes only 2 ms.
SELECT id FROM eco.tracks
WHERE time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - INTERVAL '300 seconds')
ORDER BY id DESC
LIMIT 1
Why the big difference?
Best Answer
My crystal ball says the two queries end up using different query plans. You can verify by running
EXPLAIN
on each. More useful details withEXPLAIN (ANALYZE, BUFFERS)
.The subquery gets the greatest
id
withtime_track
more than 5 minutes before the one of the greatestid
overall in tableeco.tracks
. Assuming a strong positive correlation betweenid
andtime_track
, humans get a rough idea of the work to be done here: it only concerns the (probably very few) latest rows within the last 5 minutes in the table. But Postgres has no idea at the time of planning the query.(Aside: It may be a lingering logic bug to assume that serial IDs are in sync with timestamps in the column
time_track
. Not enough information to say more.)For the first (slow) query, Postgres has to prepare a plan that best copes with any value that might come out of the subquery. That may very well be a sequential scan, which turns out to be a bad choice for only the few latest rows.
For the second (fast) query, Postgres knows from the given value what to expect. It can start by identifying the few relevant rows with an index or bitmap index scan on an index on
(id)
which undoubtably exists (again assuming that's the PK). The rest is peanuts.Possible solutions
There are many ways. Anything to get Postgres to pick the appropriate query plan ...
1. Two separate queries
Dead simple. Build the second query with the actual value returned from the first. The downside: two round trips to the db server collecting two times network latency and varying overhead.
2. Dynamic SQL
Call:
Why? The manual about Executing Dynamic Commands:
Downside: You need to create a server-side function introducing a dependency on table
eco.tracks
. And you need to be comfortable with PL/pgSQL, of course.Using a parameter default for conveniencse. See: