PostgreSQL – Improve Performance on GROUP BY – Large Table

performance, postgresql, query-performance

I have a Postgres table containing the following three columns:

  • origin (INT, B-tree index)
  • destination (INT, B-tree index)
  • minutes (SMALLINT, B-tree index)

The table has ~1.4 billion rows. Each row holds the travel time (minutes) from one place (origin) to another (destination).
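For reference, a minimal sketch of the table as described (the table name and the origin/minutes index names are assumptions; idx_destination appears in the EXPLAIN below):

CREATE TABLE tbl (
    origin      int,       -- place the trip starts from
    destination int,       -- place the trip ends at
    minutes     smallint   -- travel time between the two
);

-- single-column B-tree indexes as described above
CREATE INDEX idx_origin      ON tbl (origin);
CREATE INDEX idx_destination ON tbl (destination);
CREATE INDEX idx_minutes     ON tbl (minutes);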

The table is read-only – we might need to change it, but that will happen once a year or so…

I need to do queries like:

SELECT min(minutes) 
FROM table 
WHERE origin IN([long list]) 
AND destination IN([long list]) 
GROUP BY origin

The performance is really bad. Depending on the number of origins and destinations, the query takes more than 10 minutes. We need < 3 minutes, ideally < 10 seconds. The DBMS is PostgreSQL 9.4 with 16 GB RAM.

Is there a way to improve it? Partitioning the table, optimizing the indexes, etc.? Or another DBMS?

Edited 2018-11-28:

  • Added an index on (origin, destination); see the sketch below the query plan
EXPLAIN ANALYSE
SELECT min(minutes), origin
FROM   table
WHERE  destination = ANY('{100225471, 100236548, 10263073, 10827564, 7435519, 100272388, 10688802, 10823750, 10853634, 10681223, 100213867, 100234761, 100113775, 100229234, 100234067, 100235418, 100229220, 1000053957, 1000059198, 1000028857, 1000057809, 1000058848, 1000059188, 1000057802}')
GROUP  BY origin
"HashAggregate  (cost=228043.99..228044.04 rows=5 width=6) (actual time=10118.146..10126.308 rows=50022 loops=1)"
"  Group Key: origin"
"  ->  Index Scan using idx_destination on table  (cost=0.60..226716.69 rows=265460 width=6) (actual time=0.022..10021.269 rows=208174 loops=1)"
"        Index Cond: (destination = ANY ('{100225471,100236548,10263073,10827564,7435519,100272388,10688802,10823750,10853634,10681223,100213867,100234761,100113775,100229234,100234067,100235418,100229220,1000053957,1000059198,1000028857,1000057809,1000058848,1000 (...)"
"Planning time: 0.306 ms"
"Execution time: 10127.949 ms"

Best Answer

IN can take a list of values (like you have it) or a set. Long value lists don't scale well.

Try:

SELECT origin, min(minutes) -- you'll also want to show origin
FROM   tbl
JOIN   unnest($origin_array) o(origin)      USING (origin)
JOIN   unnest($dest_array)   d(destination) USING (destination)
GROUP  BY origin;

Here, $origin_array can be an array parameter of the appropriate type (int[]) or an array literal like '{100225471,100236548,10263073}'::int[].
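For example, with inline array literals (destination values taken from the EXPLAIN above, origin values are placeholders):

SELECT origin, min(minutes)
FROM   tbl
JOIN   unnest('{101, 102, 103}'::int[]) o(origin) USING (origin)
JOIN   unnest('{100225471, 100236548, 10263073, 10827564, 7435519}'::int[]) d(destination) USING (destination)
GROUP  BY origin;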

The optimal index for this query should be:

CREATE INDEX ON tbl (origin, destination, minutes);

Appending minutes only makes sense if you get index-only scans out of it (introduced with Postgres 9.2, improved in later versions). But since your table is mostly read-only, it seems like the perfect use case.
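A minimal sketch for verifying that you actually get index-only scans, assuming the covering index above is in place (the literal ids are placeholders):

-- Update the visibility map once after (re)loading the read-only table,
-- so index-only scans do not have to visit the heap:
VACUUM ANALYZE tbl;

-- The plan should now report "Index Only Scan" on the new index:
EXPLAIN (ANALYZE, BUFFERS)
SELECT min(minutes)
FROM   tbl
WHERE  origin = 101
AND    destination = ANY ('{100225471, 100236548}'::int[]);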

You later commented:

I want to get the closest (min(minutes)) facility (destination) for each place (origin).

The above query does not achieve that. It only returns the shortest time per origin, not which destination it belongs to. Use instead:

SELECT DISTINCT ON (origin)
       origin, minutes, destination
FROM   tbl
JOIN   unnest($origin_array) o(origin)      USING (origin)
JOIN   unnest($dest_array)   d(destination) USING (destination)
ORDER  BY origin, minutes;

Depending on cardinalities and value frequencies, this alternative may be faster:

SELECT *
FROM   unnest($origin_array) o(origin)
LEFT   JOIN LATERAL (
   SELECT minutes, destination
   FROM   tbl
   WHERE  origin = o.origin
   AND    destination = ANY ($dest_array)  -- equivalent to IN([long list]) 
   ORDER  BY minutes
   LIMIT  1
   ) t ON true;

One subtle difference: the last query also returns origins from the input array with no match in tbl at all (with NULL for minutes and destination). That may or may not be desirable, or even necessary. The last query is my favorite.
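If you do not want those unmatched origins in the result, switch to an inner lateral join (CROSS JOIN LATERAL), which drops origins for which the subquery finds no row; a minimal sketch:

SELECT *
FROM   unnest($origin_array) o(origin)
CROSS  JOIN LATERAL (
   SELECT minutes, destination
   FROM   tbl
   WHERE  origin = o.origin
   AND    destination = ANY ($dest_array)
   ORDER  BY minutes
   LIMIT  1
   ) t;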


Plus, urgently consider upgrading to a current version of Postgres, as a_horse already suggested. Performance for big data has improved substantially since Postgres 9.4.

You mentioned lists of up to 2000 items. Combining two long lists like that is problematic, since it results in a Cartesian product of 2000 x 2000 = 4,000,000 possible combinations. If that can happen, are you sure you need to consider that many combinations? That won't be fast.