PostgreSQL 10 – SELECT Query Optimization Techniques

Tags: optimization, performance, postgresql, query-performance

I want to select the last unique rows based on time, so:

SELECT DISTINCT ON (track) *
FROM   eco.tracks
WHERE  id > (SELECT id
             FROM   eco.tracks
             WHERE  time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - INTERVAL '300 seconds')
             ORDER  BY id DESC
             LIMIT  1)
ORDER  BY track, time_track DESC;

It takes 20 seconds, which is too slow.
If I replace the subquery with its actual value, the query takes only 2 ms:

SELECT DISTINCT ON (track) *
FROM   eco.tracks WHERE id > 48000000
ORDER  BY track, time_track DESC;

That subquery alone takes only 2 ms:

SELECT id FROM eco.tracks
WHERE  time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - INTERVAL '300 seconds')
ORDER  BY id DESC
LIMIT  1;

Why the big difference?

Best Answer

My crystal ball says the two queries end up using different query plans. You can verify by running EXPLAIN on each; EXPLAIN (ANALYZE, BUFFERS) provides more useful detail.
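For example, to see the plan actually chosen for the slow query (note that EXPLAIN ANALYZE executes the statement, so this run will take the full 20 seconds):

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (track) *
FROM   eco.tracks
WHERE  id > (SELECT id
             FROM   eco.tracks
             WHERE  time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - INTERVAL '300 seconds')
             ORDER  BY id DESC
             LIMIT  1)
ORDER  BY track, time_track DESC;

Run the same on the version with the literal 48000000 and compare the two plans.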

The subquery gets the greatest id with a time_track more than 5 minutes before that of the row with the greatest id overall in table eco.tracks. Assuming a strong positive correlation between id and time_track, a human gets a rough idea of the work to be done here: it only concerns the (probably very few) rows from the last 5 minutes of the table. But Postgres has no idea of this at the time it plans the query.

(Aside: It may be a lingering logic bug to assume that serial IDs are in sync with timestamps in the column time_track. Not enough information to say more.)
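As a quick sanity check (a sketch, assuming the columns as shown above), you could count rows whose time_track runs backwards relative to id; a nonzero result means the two orderings disagree:

SELECT count(*) AS out_of_order
FROM  (SELECT time_track
            , lag(time_track) OVER (ORDER BY id) AS prev_time
       FROM   eco.tracks) sub
WHERE  time_track < prev_time;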

For the first (slow) query, Postgres has to prepare a plan that best copes with any value that might come out of the subquery. That may very well be a sequential scan, which turns out to be a bad choice when only the few latest rows qualify.

For the second (fast) query, Postgres knows from the given value what to expect. It can start by identifying the few relevant rows with an index or bitmap index scan on the index on (id), which undoubtedly exists (assuming id is the PK). The rest is peanuts.

Possible solutions

There are many ways. Anything to get Postgres to pick the appropriate query plan ...

1. Two separate queries

Dead simple. Build the second query with the actual value returned from the first. The downside: two round trips to the db server, incurring network latency twice plus varying overhead.
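A minimal sketch in plain SQL (the literal 48000000 stands in for whatever id the first round trip returns; the client would substitute it programmatically):

-- Round trip 1: fetch the cut-off id
SELECT id
FROM   eco.tracks
WHERE  time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - INTERVAL '300 seconds')
ORDER  BY id DESC
LIMIT  1;

-- Round trip 2: inline the value returned above
SELECT DISTINCT ON (track) *
FROM   eco.tracks
WHERE  id > 48000000
ORDER  BY track, time_track DESC;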

2. Dynamic SQL

CREATE OR REPLACE FUNCTION f_latest_tracks(_intv interval = '300 seconds')
  RETURNS SETOF eco.tracks AS
$func$
BEGIN

-- EXECUTE re-plans the statement for the concrete id value,
-- so Postgres can pick an index scan instead of a generic plan
RETURN QUERY EXECUTE
'SELECT DISTINCT ON (track) *
 FROM   eco.tracks
 WHERE  id > $1
 ORDER  BY track, time_track DESC'
USING (
   -- scalar subquery: computes the cut-off id, passed in as $1
   SELECT id
   FROM   eco.tracks
   WHERE  time_track < (SELECT time_track - _intv FROM eco.tracks ORDER BY id DESC LIMIT 1)
   ORDER  BY id DESC
   LIMIT  1
   );

END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_latest_tracks();
SELECT * FROM f_latest_tracks('10 minutes');

Why? The manual about Executing Dynamic Commands:

The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected.

Downside: You need to create a server-side function, introducing a dependency on the table eco.tracks. And you need to be comfortable with PL/pgSQL, of course.

Using a parameter default for convenience. See: