PostgreSQL – Aggregate Multiple Instances of Each Row Without Multiple Seq Scans

postgresql postgresql-performance

I am trying to perform some mathematical operations in PostgreSQL that involve calculating multiple values from each row, then aggregating, without requiring multiple Seq Scans over the whole table. Performance is critical for my application, so I want this to run as efficiently as possible on large data sets. Are there any optimizations I can do to cause PostgreSQL to only use a single Seq Scan?

Here's a simplified example:

Given this test data set:

postgres=> CREATE TABLE "values" (value int);
postgres=> INSERT INTO "values" (value) SELECT * FROM generate_series(1,1000000);
postgres=> SELECT * FROM "values";
  value
---------
       1
       2
       3
       4
       5
...
  999996
  999997
  999998
  999999
 1000000

And I want to perform this query, which counts 3 instances of each row: once grouped by value % 2, once by value % 3, and once by value % 5. I'm currently accomplishing this with CROSS JOIN LATERAL:

SELECT
  CASE idx
  WHEN 0 THEN value % 2
  WHEN 1 THEN value % 3
  WHEN 2 THEN value % 5
  END,
  COUNT(value)
FROM "values"
CROSS JOIN LATERAL unnest(ARRAY[0,1,2]) idx
GROUP BY 1;

Here's the EXPLAIN ANALYZE result for this query. Notice the loops=3 in the Materialize line: the table itself is Seq Scanned only once (loops=1), but its materialized output is re-read three times, so the aggregate has to process ~3,000,000 rows:

postgres=> EXPLAIN ANALYZE SELECT ....
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=128546.28..128561.28 rows=600 width=12) (actual time=1028.875..1028.879 rows=9 loops=1)
   Group Key: CASE idx.idx WHEN 0 THEN ("values".value % 2) WHEN 1 THEN ("values".value % 3) WHEN 2 THEN ("values".value % 5) ELSE NULL::integer END
   ->  Nested Loop  (cost=0.00..111620.66 rows=3385125 width=8) (actual time=0.022..672.754 rows=3000003 loops=1)
         ->  Function Scan on unnest idx  (cost=0.00..0.03 rows=3 width=4) (actual time=0.004..0.007 rows=3 loops=1)
         ->  Materialize  (cost=0.00..21350.62 rows=1128375 width=4) (actual time=0.005..106.673 rows=1000001 loops=3)
               ->  Seq Scan on "values"  (cost=0.00..15708.75 rows=1128375 width=4) (actual time=0.014..82.969 rows=1000001 loops=1)
 Planning Time: 0.077 ms
 Execution Time: 1036.536 ms
(8 rows)

I compared this to the case of using only 1 instance of each row rather than 3. The 1-instance query performs a single Seq Scan and runs ~3x faster (as expected):

postgres=> EXPLAIN ANALYZE SELECT
postgres-> value % 2,
postgres-> value % 3,
postgres-> value % 5,
postgres-> COUNT(value)
postgres-> FROM "values"
postgres-> GROUP BY 1, 2, 3;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=35455.31..35458.81 rows=200 width=20) (actual time=290.033..290.042 rows=59 loops=1)
   Group Key: (value % 2), (value % 3), (value % 5)
   ->  Seq Scan on "values"  (cost=0.00..24171.56 rows=1128375 width=16) (actual time=0.018..130.865 rows=1000001 loops=1)
 Planning Time: 0.048 ms
 Execution Time: 290.081 ms
(5 rows)

Time: 347.262 ms
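Back-of-the-envelope, that ratio matches the plans: the lateral version feeds rows=3000003 into its aggregate versus rows=1000001 here, and 1036.536 ms / 290.081 ms ≈ 3.6.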

Other Attempts

Here are some other approaches I tried:

  1. Using UNION still performs 3 Seq Scans. Note that UNION also deduplicates before the outer count, so this would need to be UNION ALL to reproduce the original counts; either way the plan contains 3 Seq Scans:
SELECT
  agg,
  count(agg)
FROM (
  SELECT value % 2 FROM "values" UNION
  SELECT value % 3 FROM "values" UNION
  SELECT value % 5 FROM "values"
) agg(agg)
GROUP BY 1;
postgres=> EXPLAIN ANALYZE ...
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=199456.88..199458.88 rows=200 width=12) (actual time=883.210..883.211 rows=9 loops=1)
   Group Key: (("values".value % 2))
   ->  HashAggregate  (cost=114828.75..148680.00 rows=3385125 width=4) (actual time=873.320..883.192 rows=9 loops=1)
         Group Key: (("values".value % 2))
         ->  Append  (cost=0.00..106365.94 rows=3385125 width=4) (actual time=0.027..549.677 rows=3000003 loops=1)
               ->  Seq Scan on "values"  (cost=0.00..18529.69 rows=1128375 width=4) (actual time=0.026..109.382 rows=1000001 loops=1)
               ->  Seq Scan on "values" values_1  (cost=0.00..18529.69 rows=1128375 width=4) (actual time=0.013..109.137 rows=1000001 loops=1)
               ->  Seq Scan on "values" values_2  (cost=0.00..18529.69 rows=1128375 width=4) (actual time=0.010..109.372 rows=1000001 loops=1)
 Planning Time: 0.064 ms
 Execution Time: 927.634 ms
(10 rows)
  2. Using PL/pgSQL. This performs only 1 Seq Scan, but array operations in PL/pgSQL are very slow, so this actually executes slower than the original:
CREATE TEMP TABLE result (value int, count int);
DO LANGUAGE plpgsql $$
  DECLARE
    counts int8[];
    r record;  -- "row" is a reserved word in SQL, so use a different name
  BEGIN
    -- The third argument sets the array's lower bound to 0 so that indices
    -- run 0..4; with the default lower bound of 1, counts[0] starts out NULL
    -- and NULL + 1 stays NULL.
    counts := array_fill(0::int8, ARRAY[5], ARRAY[0]);

    -- One pass over the table, bumping one (shared) bucket per modulus.
    FOR r IN (SELECT value FROM "values") LOOP
      counts[r.value % 2] := counts[r.value % 2] + 1;
      counts[r.value % 3] := counts[r.value % 3] + 1;
      counts[r.value % 5] := counts[r.value % 5] + 1;
    END LOOP;

    FOR i IN 0..4 LOOP
      CONTINUE WHEN counts[i] = 0;
      INSERT INTO result (value, count) VALUES (i, counts[i]);
    END LOOP;
  END
$$;
SELECT value, count FROM result;
postgres=> \timing
Timing is on.
postgres=> DO LANGUAGE PLPGSQL $$ ...
DO
Time: 1474.065 ms (00:01.474)
  3. Tweaking query planner configuration. I tried changing enable_seqscan, enable_nestloop, work_mem, and the planner cost constants, and could not find a configuration that performed better than the original; a sketch of the kind of settings I mean follows this list.
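
For reference, here is a minimal sketch of the kind of session-level tweaks meant in attempt 3 (the specific values are illustrative, not the exact ones I tested):

SET enable_nestloop = off;      -- discourage the nested-loop join
SET work_mem = '256MB';         -- let the hash aggregate stay in memory
SET cpu_operator_cost = 0.005;  -- example planner cost-constant tweak
EXPLAIN ANALYZE SELECT ...;     -- re-run the query under these settings
RESET ALL;                      -- restore the session defaults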

Best Answer

You can get PostgreSQL to perform a single sequential scan like this (this is PostgreSQL v13 with a large work_mem; the query shown is the variant that expands each row into value and abs(value)):

EXPLAIN (ANALYZE, BUFFERS) SELECT
  CASE idx
  WHEN 0 THEN value
  WHEN 1 THEN abs(value)
  END,
  COUNT(value)
FROM "values"
CROSS JOIN LATERAL (VALUES (0), (1)) idx(idx)
GROUP BY 1;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=64425.09..99425.12 rows=2000002 width=12) (actual time=803.944..1047.598 rows=1000001 loops=1)
   Group Key: CASE "*VALUES*".column1 WHEN 0 THEN "values".value WHEN 1 THEN abs("values".value) ELSE NULL::integer END
   Batches: 1  Memory Usage: 180241kB
   Buffers: shared hit=4425
   ->  Nested Loop  (cost=0.00..54425.08 rows=2000002 width=8) (actual time=0.024..379.624 rows=2000002 loops=1)
         Buffers: shared hit=4425
         ->  Seq Scan on "values"  (cost=0.00..14425.01 rows=1000001 width=4) (actual time=0.014..45.913 rows=1000001 loops=1)
               Buffers: shared hit=4425
         ->  Materialize  (cost=0.00..0.04 rows=2 width=4) (actual time=0.000..0.000 rows=2 loops=1000001)
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=4) (actual time=0.002..0.003 rows=2 loops=1)
 Planning Time: 0.212 ms
 Execution Time: 1153.489 ms
(12 rows)

But that doesn't make a substantial difference, because in both our queries the costly thing is not the sequential scan, but the aggregate.
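
As a side note: if you want the three modulo groupings from the question kept separate instead of merged into a single group key, GROUP BY GROUPING SETS also computes all of them in one pass over the table. A sketch against the same test table (I'd expect a single sequential scan feeding one aggregate node, but verify with EXPLAIN on your version):

SELECT
  value % 2 AS mod2,
  value % 3 AS mod3,
  value % 5 AS mod5,
  count(*)
FROM "values"
GROUP BY GROUPING SETS ((value % 2), (value % 3), (value % 5));

In each output row the two columns that don't belong to the current grouping set come back NULL; the GROUPING() function can distinguish such a NULL from a genuine one.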