You can have window functions on the result of aggregate functions in a single query level.
This would all work nicely after a few modifications - except that it fails for the standard deviation on mathematical principle. The involved calculations are not linear, so you cannot simply combine standard deviations of sub-populations.
SELECT perm
,combo
,avg(value) AS perm_average_value
,sum(avg(value) * count(*)) OVER w_combo /
sum(count(*)) OVER w_combo AS combo_average_value
,stddev_pop(value) AS perm_stddev
,0 AS combo_stddev -- doesn't work!
,count(*) AS perm_count
,sum(count(*)) OVER w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);
For combo_average_value you would need this expression:

sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo

since you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
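To see why the weighting matters, here is a minimal self-contained sketch with made-up numbers: one group of 10 rows averaging 1.0 and one group of 2 rows averaging 7.0.

```sql
-- Made-up example: group averages 1.0 (10 rows) and 7.0 (2 rows).
SELECT sum(grp_avg * grp_count) / sum(grp_count) AS weighted_avg  -- (1*10 + 7*2) / 12 = 2
     , avg(grp_avg)                              AS naive_avg     -- (1 + 7) / 2 = 4 (wrong)
FROM  (VALUES (1.0, 10), (7.0, 2)) t(grp_avg, grp_count);
```

The naive average over-weights the small group; the weighted form reproduces the true average over all 12 underlying rows.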
This works:
SELECT DISTINCT ON (perm, combo)
perm
,combo
,avg(value) OVER wpc AS perm_average_value
,avg(value) OVER wc AS combo_average_value
,stddev_pop(value) OVER wpc AS perm_stddev
,stddev_pop(value) OVER wc AS combo_stddev
,count(*) OVER wpc AS perm_count
,count(*) OVER wc AS combo_count
FROM foo
WINDOW wc AS (PARTITION BY combo)
,wpc AS (PARTITION BY perm, combo);
I am using two different windows here and reduce the rows with DISTINCT ON, which is applied even after window functions.
But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
Better performance with altered table layout
Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:
CREATE TABLE combo (
combo_id serial PRIMARY KEY
,combo int[] NOT NULL
);
CREATE TABLE perm (
perm_id serial PRIMARY KEY
,perm int[] NOT NULL
);
CREATE TABLE value (
perm_id int REFERENCES perm(perm_id)
,combo_id int REFERENCES combo(combo_id)
,value numeric NOT NULL DEFAULT 0
);
If you don't need referential integrity you can omit the foreign key constraints.
The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.
This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int: 8 bytes, no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)
More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.
For best performance you would first aggregate in a subquery and then join to perm and combo.
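A sketch of that aggregate-then-join query under the normalized layout above (using the column names from the example tables):

```sql
-- Aggregate per (perm_id, combo_id) on plain integers first,
-- then join to the lookup tables just for display.
SELECT p.perm, c.combo, v.perm_average_value, v.perm_stddev, v.perm_count
FROM  (
   SELECT perm_id, combo_id
        , avg(value)        AS perm_average_value
        , stddev_pop(value) AS perm_stddev
        , count(*)          AS perm_count
   FROM   value
   GROUP  BY perm_id, combo_id
   ) v
JOIN   perm  p USING (perm_id)
JOIN   combo c USING (combo_id);
```

This way the expensive grouping runs on two 4-byte integers instead of two arrays.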
First of all, you have an integer primary key on principal_country_divisions. Use it. It is more efficient than joining via name for multiple reasons: storage size, index size, faster integer arithmetic, no collations involved, fixed length.
CREATE TABLE principal_country_divisions (
country_id int PRIMARY KEY
,name text NOT NULL
,country_code char(2)
);
CREATE TABLE synonyms (
country_id int REFERENCES principal_country_divisions (country_id)
,syn text
,PRIMARY KEY (syn, country_id)
);
syn needs to be the first column of the index (pk); you had that right already. The accompanying index automatically covers equality tests on synonyms.syn.
Be sure to add an index on principal_country_divisions.name:
CREATE INDEX foo ON principal_country_divisions (name);
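For completeness, a sketch of the extra step pattern matching would take - only relevant if you move beyond whole-string equality. With any locale other than C, a left-anchored LIKE cannot use the plain b-tree index and needs one with the text_pattern_ops operator class (the index name is just an example):

```sql
-- Supports left-anchored patterns like: WHERE name LIKE 'New%'
CREATE INDEX pcd_name_pattern_idx
ON principal_country_divisions (name text_pattern_ops);
```

Patterns with a leading wildcard would need yet other tools (e.g. trigram indexes), which is beyond the scope here.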
If you'd want to match patterns, not whole strings, the job would become more complex.
Next, how can you be sure to return ONE record for New York?
Obviously, name and syn can be the same. There is no unique constraint over both columns, and there isn't even one on syn alone. Otherwise your EXISTS query is a good approach - usually fast. You'd just have to avoid multiple rows. The added benefit of EXISTS would be to eliminate duplicates from synonyms alone, but that's ruled out by the pk. This may be faster for your case:
SELECT DISTINCT ON (1)
a.country_id, a.name, a.country_code
FROM principal_country_divisions a
LEFT JOIN synonyms s USING (country_id)
WHERE a.name = 'NY'
OR s.syn = 'NY'
-- ORDER BY 1, <more expressions to pick from peers>
As you commented, a LEFT JOIN is in order to preserve finds in name. In case of multiple finds, you can choose what to pick by adding more ORDER BY expressions. Leading columns have to agree with DISTINCT ON, though. Details in this related answer on SO.
Best Answer
You can use a WITH clause to get the average for every team per month, and then simply add that data to each row as you see fit.
db<>fiddle here
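Since the actual query only lives in the linked fiddle, here is a hedged sketch of the idea; the table and column names (game, team, played_on, score) are assumptions, not the original schema:

```sql
-- CTE computes one average per (team, month);
-- the outer query attaches it to every matching row.
WITH monthly_avg AS (
   SELECT team
        , date_trunc('month', played_on) AS month
        , avg(score)                     AS avg_score
   FROM   game
   GROUP  BY 1, 2
   )
SELECT g.*
     , m.avg_score AS team_month_avg
FROM   game g
JOIN   monthly_avg m ON  m.team  = g.team
                     AND m.month = date_trunc('month', g.played_on);
```

The same result could be had with avg(score) OVER (PARTITION BY team, date_trunc('month', played_on)), but the CTE keeps the aggregate reusable for further joins.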