I have a table containing a two columns of permutations/combinations of integer arrays, and a third column containing a value, like so:
CREATE TABLE foo
(
perm integer[] NOT NULL,
combo integer[] NOT NULL,
value numeric NOT NULL DEFAULT 0
);
INSERT INTO foo
VALUES
( '{3,1,2}', '{1,2,3}', '1.1400' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0.9280' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,1,2}', '{1,2,3}', '1.2680' ),
( '{3,1,2}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0' ),
( '{3,2,1}', '{1,2,3}', '0.8000' )
I want to find out the average and standard deviation for each permutation, as well as for each combination. I can do that with this query:
SELECT
f1.perm,
f2.combo,
f1.perm_average_value,
f2.combo_average_value,
f1.perm_stddev,
f2.combo_stddev,
f1.perm_count,
f2.combo_count
FROM
(
SELECT
perm,
combo,
avg( value ) AS perm_average_value,
stddev_pop( value ) AS perm_stddev,
count( * ) AS perm_count
FROM foo
GROUP BY perm, combo
) AS f1
JOIN
(
SELECT
combo,
avg( value ) AS combo_average_value,
stddev_pop( value ) AS combo_stddev,
count( * ) AS combo_count
FROM foo
GROUP BY combo
) AS f2 ON ( f1.combo = f2.combo );
However, that query can get pretty slow when I have a lot of data, because the "foo" table (which in reality, consists of 14 partitions each with roughly 4 million rows) needs to be scanned twice.
Recently, I learned that Postgres supports "Window Functions", which is basically like a GROUP BY for a particular column. I modified my query to use these like so:
SELECT
perm,
combo,
avg( value ) as perm_average_value,
avg( avg( value ) ) over w_combo AS combo_average_value,
stddev_pop( value ) as perm_stddev,
stddev_pop( avg( value ) ) over w_combo as combo_stddev,
count( * ) as perm_count,
sum( count( * ) ) over w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS ( PARTITION BY combo );
While this works for the "combo_count" column, the "combo_average_value" and "combo_stddev" columns are no longer accurate. It appears that the average is being taken for each permutation, and then being averaged a second time for each combination, which is incorrect.
How can I fix this? Can window functions even be used as an optimization here?
Best Answer
You can have window functions on the result of aggregate functions in a single query level.
This would all work nicely after a few modifications - except that it fails for the standard deviation on mathematical principal. The involved calculations are not linear, so you cannot simply combine standard deviations of sub-populations.
For
combo_average_value
you would need this expressionSince you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
This works:
I am using two different windows here, and reduce the rows with
DISTINCT
which is applied even after window functions.But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
Better performance with altered table layout
Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:
If you don't need referential integrity you can omit the foreign key constraints.
The connection to
combo_id
could also be placed in the tableperm
, but in this scenario I would store it (slightly de-normalized) invalue
for better performance.This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int (8 byte), no padding), plus the unknown size of your
numeric
column. (If you don't need extreme precision, adouble precision
or even areal
column might do, too.)More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.
You would first aggregate in a subquery and then join to
perm
andcombo
for best performance.