Postgresql – How to get the aggregate of a window function in Postgres

aggregatepostgresqlwindow functions

I have a table containing a two columns of permutations/combinations of integer arrays, and a third column containing a value, like so:

CREATE TABLE foo
(
  perm integer[] NOT NULL,
  combo integer[] NOT NULL,
  value numeric NOT NULL DEFAULT 0
);
INSERT INTO foo
VALUES
( '{3,1,2}', '{1,2,3}', '1.1400' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0.9280' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0.8000' )

I want to find out the average and standard deviation for each permutation, as well as for each combination. I can do that with this query:

SELECT
  f1.perm,
  f2.combo,
  f1.perm_average_value,
  f2.combo_average_value,
  f1.perm_stddev,
  f2.combo_stddev,
  f1.perm_count,
  f2.combo_count
FROM
(
  SELECT
    perm,
    combo,
    avg( value ) AS perm_average_value,
    stddev_pop( value ) AS perm_stddev,
    count( * ) AS perm_count
  FROM foo
  GROUP BY perm, combo
) AS f1
JOIN
(
  SELECT
    combo,
    avg( value ) AS combo_average_value,
    stddev_pop( value ) AS combo_stddev,
    count( * ) AS combo_count
  FROM foo
  GROUP BY combo
) AS f2 ON ( f1.combo = f2.combo );

However, that query can get pretty slow when I have a lot of data, because the "foo" table (which in reality, consists of 14 partitions each with roughly 4 million rows) needs to be scanned twice.

Recently, I learned that Postgres supports "Window Functions", which is basically like a GROUP BY for a particular column. I modified my query to use these like so:

SELECT
  perm,
  combo,
  avg( value ) as perm_average_value,
  avg( avg( value ) ) over w_combo AS combo_average_value,
  stddev_pop( value ) as perm_stddev,
  stddev_pop( avg( value ) ) over w_combo as combo_stddev,
  count( * ) as perm_count,
  sum( count( * ) ) over w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS ( PARTITION BY combo );

While this works for the "combo_count" column, the "combo_average_value" and "combo_stddev" columns are no longer accurate. It appears that the average is being taken for each permutation, and then being averaged a second time for each combination, which is incorrect.

How can I fix this? Can window functions even be used as an optimization here?

Best Answer

You can have window functions on the result of aggregate functions in a single query level.

This would all work nicely after a few modifications - except that it fails for the standard deviation on mathematical principal. The involved calculations are not linear, so you cannot simply combine standard deviations of sub-populations.

SELECT perm
      ,combo
      ,avg(value)                 AS perm_average_value
      ,sum(avg(value) * count(*)) OVER w_combo /
       sum(count(*)) OVER w_combo AS combo_average_value
      ,stddev_pop(value)          AS perm_stddev
      ,0                          AS combo_stddev  -- doesn't work!
      ,count(*)                   AS perm_count
      ,sum(count(*)) OVER w_combo AS combo_count
FROM   foo
GROUP  BY perm, combo
WINDOW w_combo  AS (PARTITION BY combo);

For combo_average_value you would need this expression

sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo

Since you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)

This works:

SELECT DISTINCT ON (perm, combo)
       perm
      ,combo
      ,avg(value)        OVER wpc AS perm_average_value
      ,avg(value)        OVER wc  AS combo_average_value
      ,stddev_pop(value) OVER wpc AS perm_stddev
      ,stddev_pop(value) OVER wc  AS combo_stddev
      ,count(*)          OVER wpc AS perm_count
      ,count(*)          OVER wc  AS combo_count
FROM   foo
WINDOW wc  AS (PARTITION BY combo)
      ,wpc AS (PARTITION BY perm, combo);

I am using two different windows here, and reduce the rows with DISTINCT which is applied even after window functions.

But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.

Better performance with altered table layout

Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:

CREATE TABLE combo ( 
  combo_id serial PRIMARY KEY
 ,combo    int[] NOT NULL
);

CREATE TABLE perm ( 
  perm_id  serial PRIMARY KEY
 ,perm     int[] NOT NULL
);

CREATE TABLE value (
  perm_id  int REFERENCES perm(perm_id)
 ,combo_id int REFERENCES combo(combo_id)
 ,value numeric NOT NULL DEFAULT 0
);

If you don't need referential integrity you can omit the foreign key constraints.

The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.

This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int (8 byte), no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)

More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance

Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.

You would first aggregate in a subquery and then join to perm and combo for best performance.