I am trying to perform some mathematical operations in PostgreSQL that involve calculating multiple values from each row, then aggregating, without requiring multiple Seq Scans over the whole table. Performance is critical for my application, so I want this to run as efficiently as possible on large data sets. Are there any optimizations I can do to cause PostgreSQL to only use a single Seq Scan?
Here's a simplified example:
Given this test data set:
postgres=> CREATE TABLE values (value int);
postgres=> INSERT INTO values (value) SELECT * from generate_series(1,1000000);
postgres=> SELECT * FROM values;
value
---------
1
2
3
4
5
...
999996
999997
999998
999999
1000000
I want to perform a query that counts 3 instances of each row: once by value % 2, once by value % 3, and once by value % 5. I'm currently accomplishing this with CROSS JOIN LATERAL:
SELECT
CASE idx
WHEN 0 THEN value % 2
WHEN 1 THEN value % 3
WHEN 2 THEN value % 5
END,
COUNT(value)
FROM values
CROSS JOIN LATERAL unnest(ARRAY[0,1,2]) idx
GROUP BY 1;
Here's the EXPLAIN ANALYZE result for this query. Notice the loops=3 in the Materialize line:
postgres=> EXPLAIN ANALYZE SELECT ....
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=128546.28..128561.28 rows=600 width=12) (actual time=1028.875..1028.879 rows=9 loops=1)
Group Key: CASE idx.idx WHEN 0 THEN ("values".value % 2) WHEN 1 THEN ("values".value % 3) WHEN 2 THEN ("values".value % 5) ELSE NULL::integer END
-> Nested Loop (cost=0.00..111620.66 rows=3385125 width=8) (actual time=0.022..672.754 rows=3000003 loops=1)
-> Function Scan on unnest idx (cost=0.00..0.03 rows=3 width=4) (actual time=0.004..0.007 rows=3 loops=1)
-> Materialize (cost=0.00..21350.62 rows=1128375 width=4) (actual time=0.005..106.673 rows=1000001 loops=3)
-> Seq Scan on "values" (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.014..82.969 rows=1000001 loops=1)
Planning Time: 0.077 ms
Execution Time: 1036.536 ms
(8 rows)
I compared this to the case of only using 1 instance of each row rather than 3. The 1-instance query performs a single Seq Scan and runs ~3x faster (as expected):
postgres=> EXPLAIN ANALYZE SELECT
postgres-> value % 2,
postgres-> value % 3,
postgres-> value % 5,
postgres-> COUNT(value)
postgres-> FROM values
postgres-> GROUP BY 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=35455.31..35458.81 rows=200 width=20) (actual time=290.033..290.042 rows=59 loops=1)
Group Key: (value % 2), (value % 3), (value % 5)
-> Seq Scan on "values" (cost=0.00..24171.56 rows=1128375 width=16) (actual time=0.018..130.865 rows=1000001 loops=1)
Planning Time: 0.048 ms
Execution Time: 290.081 ms
(5 rows)
Time: 347.262 ms
Other Attempts
Here are some other approaches I tried:
- Using UNION still performs multiple Seq Scans (three of them, per the plan below). Note also that UNION deduplicates before the outer count, so UNION ALL would be needed to match the original query's semantics:
SELECT
agg,
count(agg)
FROM (
SELECT value % 2 FROM values UNION
SELECT value % 3 FROM values UNION
SELECT value % 5 FROM values
) agg(agg)
GROUP BY 1;
postgres=> EXPLAIN ANALYZE ...
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=199456.88..199458.88 rows=200 width=12) (actual time=883.210..883.211 rows=9 loops=1)
Group Key: (("values".value % 2))
-> HashAggregate (cost=114828.75..148680.00 rows=3385125 width=4) (actual time=873.320..883.192 rows=9 loops=1)
Group Key: (("values".value % 2))
-> Append (cost=0.00..106365.94 rows=3385125 width=4) (actual time=0.027..549.677 rows=3000003 loops=1)
-> Seq Scan on "values" (cost=0.00..18529.69 rows=1128375 width=4) (actual time=0.026..109.382 rows=1000001 loops=1)
-> Seq Scan on "values" values_1 (cost=0.00..18529.69 rows=1128375 width=4) (actual time=0.013..109.137 rows=1000001 loops=1)
-> Seq Scan on "values" values_2 (cost=0.00..18529.69 rows=1128375 width=4) (actual time=0.010..109.372 rows=1000001 loops=1)
Planning Time: 0.064 ms
Execution Time: 927.634 ms
(10 rows)
- Using PL/pgSQL. This performs only 1 Seq Scan, but array operations in PL/pgSQL are very slow, so this actually executes slower than the original:
CREATE TEMP TABLE result (value int, count int);
DO LANGUAGE plpgsql $$
DECLARE
    counts int8[];
    rec record;  -- "row" is a reserved SQL keyword, so use a different name
BEGIN
    -- lower bound 0 so that counts[0]..counts[4] are valid and initialized to 0
    counts = array_fill(0, ARRAY[5], ARRAY[0]);
    FOR rec IN (SELECT value FROM values) LOOP
        counts[rec.value % 2] = counts[rec.value % 2] + 1;
        counts[rec.value % 3] = counts[rec.value % 3] + 1;
        counts[rec.value % 5] = counts[rec.value % 5] + 1;
    END LOOP;
    FOR i IN 0..4 LOOP
        CONTINUE WHEN counts[i] = 0;
        INSERT INTO result (value, count) VALUES (i, counts[i]);
    END LOOP;
END
$$;
SELECT value, count FROM result;
SELECT value, count FROM result;
postgres=> \timing
Timing is on.
postgres=> DO LANGUAGE PLPGSQL $$ ...
DO
Time: 1474.065 ms (00:01.474)
- Tweaking query plan configuration. I tried changing enable_seqscan, enable_nestloop, work_mem, and the planner cost constants, and could not find a configuration that performed better than the original.
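For reference, the settings tweaks were of this form (the values here are illustrative, not the exact ones tried):

SET enable_nestloop = off;  -- discourage the Nested Loop over Materialize
SET enable_seqscan = on;
SET work_mem = '1GB';       -- let the HashAggregate stay in memory
SET cpu_tuple_cost = 0.01;  -- one of the planner cost constants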
Best Answer
You can get PostgreSQL to perform a single sequential scan like this (this is PostgreSQL v13 with a large work_mem). But that doesn't make a substantial difference, because in both of our queries the costly part is not the sequential scan but the aggregate.
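The answer's exact query is not preserved above. One way to get a single Seq Scan for this kind of multi-bucket aggregate (a sketch, not necessarily the answer's query) is GROUPING SETS, which aggregates by each expression separately over one pass of the table:

-- One scan, three grouping sets; in each output row, two of the
-- three key columns are NULL, indicating which set the row belongs to.
SELECT value % 2 AS mod2,
       value % 3 AS mod3,
       value % 5 AS mod5,
       count(*)
FROM values
GROUP BY GROUPING SETS ((value % 2), (value % 3), (value % 5));

EXPLAIN should show a single Seq Scan feeding a MixedAggregate (or a GroupAggregate with sorts), so the table is read only once. The result shape differs from the original CASE-based query: remainders from different moduli land in separate columns rather than being merged into one grouping key.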