You can have window functions on the result of aggregate functions in a single query level.
This would all work nicely after a few modifications - except that it fails for the standard deviation on mathematical principle. The involved calculations are not linear, so you cannot simply combine standard deviations of sub-populations.
SELECT perm
,combo
,avg(value) AS perm_average_value
,sum(avg(value) * count(*)) OVER w_combo /
sum(count(*)) OVER w_combo AS combo_average_value
,stddev_pop(value) AS perm_stddev
,0 AS combo_stddev -- doesn't work!
,count(*) AS perm_count
,sum(count(*)) OVER w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);
For combo_average_value you would need this expression:

sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo

since you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
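To see why the weighting matters, here is a minimal self-contained sketch with made-up numbers: one group of 10 rows averaging 1.0 and one group of 2 rows averaging 7.0.

```sql
-- Made-up example: group averages 1.0 (10 rows) and 7.0 (2 rows).
SELECT sum(grp_avg * grp_count) / sum(grp_count) AS weighted_avg  -- (1*10 + 7*2) / 12 = 2
     , avg(grp_avg)                              AS naive_avg     -- (1 + 7) / 2 = 4 (wrong)
FROM  (VALUES (1.0, 10), (7.0, 2)) t(grp_avg, grp_count);
```

The naive average over-weights the small group; the weighted form reproduces the true average over all 12 underlying rows.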
This works:
SELECT DISTINCT ON (perm, combo)
perm
,combo
,avg(value) OVER wpc AS perm_average_value
,avg(value) OVER wc AS combo_average_value
,stddev_pop(value) OVER wpc AS perm_stddev
,stddev_pop(value) OVER wc AS combo_stddev
,count(*) OVER wpc AS perm_count
,count(*) OVER wc AS combo_count
FROM foo
WINDOW wc AS (PARTITION BY combo)
,wpc AS (PARTITION BY perm, combo);
I am using two different windows here and reduce the rows with DISTINCT ON, which is applied even after window functions.
But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
Better performance with altered table layout
Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:
CREATE TABLE combo (
combo_id serial PRIMARY KEY
,combo int[] NOT NULL
);
CREATE TABLE perm (
perm_id serial PRIMARY KEY
,perm int[] NOT NULL
);
CREATE TABLE value (
perm_id int REFERENCES perm(perm_id)
,combo_id int REFERENCES combo(combo_id)
,value numeric NOT NULL DEFAULT 0
);
If you don't need referential integrity you can omit the foreign key constraints.
The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.
This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int: 8 bytes, no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)
More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.
For best performance you would first aggregate in a subquery and then join to perm and combo.
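A sketch of that aggregate-then-join query under the normalized layout above (using the column names from the example tables):

```sql
-- Aggregate per (perm_id, combo_id) on plain integers first,
-- then join to the lookup tables just for display.
SELECT p.perm, c.combo, v.perm_average_value, v.perm_stddev, v.perm_count
FROM  (
   SELECT perm_id, combo_id
        , avg(value)        AS perm_average_value
        , stddev_pop(value) AS perm_stddev
        , count(*)          AS perm_count
   FROM   value
   GROUP  BY perm_id, combo_id
   ) v
JOIN   perm  p USING (perm_id)
JOIN   combo c USING (combo_id);
```

This way the expensive grouping runs on two 4-byte integers instead of two arrays.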
First of all, you have an integer primary key on principal_country_divisions. Use it. It is more efficient than joining via name for multiple reasons: storage size, index size, faster integer arithmetic, no collations involved, fixed length.
CREATE TABLE principal_country_divisions (
country_id int PRIMARY KEY
,name text NOT NULL
,country_code char(2)
);
CREATE TABLE synonyms (
country_id int REFERENCES principal_country_divisions (country_id)
,syn text
,PRIMARY KEY (syn, country_id)
);
syn needs to be the first column of the index (pk); you had that right already. The accompanying index automatically covers equality tests on synonyms.syn.
Be sure to add an index on principal_country_divisions.name:
CREATE INDEX foo ON principal_country_divisions (name);
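For completeness, a sketch of the extra step pattern matching would take - only relevant if you move beyond whole-string equality. With any locale other than C, a left-anchored LIKE cannot use the plain b-tree index and needs one with the text_pattern_ops operator class (the index name is just an example):

```sql
-- Supports left-anchored patterns like: WHERE name LIKE 'New%'
CREATE INDEX pcd_name_pattern_idx
ON principal_country_divisions (name text_pattern_ops);
```

Patterns with a leading wildcard would need yet other tools (e.g. trigram indexes), which is beyond the scope here.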
If you'd want to match patterns, not whole strings, the job would become more complex.
Next, how can you be sure to return ONE record for New York?
Obviously, name and syn can be the same. There is no unique constraint over both columns, and there isn't even one on syn alone. Otherwise your EXISTS query is a good approach - usually fast. You'd just have to avoid multiple rows. The added benefit of EXISTS would be to eliminate duplicates from synonyms alone, but that's ruled out by the pk. This may be faster for your case:
SELECT DISTINCT ON (1)
a.country_id, a.name, a.country_code
FROM principal_country_divisions a
LEFT JOIN synonyms s USING (country_id)
WHERE a.name = 'NY'
OR s.syn = 'NY'
-- ORDER BY 1, <more expressions to pick from peers>
As you commented, a LEFT JOIN is in order to preserve finds in name. In case of multiple finds, you can choose what to pick by adding more ORDER BY expressions. Leading columns have to agree with DISTINCT ON, though. Details in this related answer on SO.
Best Answer
You can use a WITH clause to get the average for every team per month, and then simply add that data to each row as you see fit.
db<>fiddle here
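Since the actual query only lives in the linked fiddle, here is a hedged sketch of the idea; the table and column names (game, team, played_on, score) are assumptions, not the original schema:

```sql
-- CTE computes one average per (team, month);
-- the outer query attaches it to every matching row.
WITH monthly_avg AS (
   SELECT team
        , date_trunc('month', played_on) AS month
        , avg(score)                     AS avg_score
   FROM   game
   GROUP  BY 1, 2
   )
SELECT g.*
     , m.avg_score AS team_month_avg
FROM   game g
JOIN   monthly_avg m ON  m.team  = g.team
                     AND m.month = date_trunc('month', g.played_on);
```

The same result could be had with avg(score) OVER (PARTITION BY team, date_trunc('month', played_on)), but the CTE keeps the aggregate reusable for further joins.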