You can have window functions on the result of aggregate functions in a single query level.
This would all work nicely after a few modifications - except that it fails for the standard deviation on mathematical principle. The involved calculations are not linear, so you cannot simply combine standard deviations of sub-populations.
SELECT perm
,combo
,avg(value) AS perm_average_value
,sum(avg(value) * count(*)) OVER w_combo /
sum(count(*)) OVER w_combo AS combo_average_value
,stddev_pop(value) AS perm_stddev
,0 AS combo_stddev -- doesn't work!
,count(*) AS perm_count
,sum(count(*)) OVER w_combo AS combo_count
FROM foo
GROUP BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);
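That said, the combined standard deviation is not lost entirely: if you carry the raw count, sum and sum of squares per group, the population standard deviation of the whole combo can be derived from those, because stddev_pop(x) = sqrt(avg(x²) - avg(x)²). A sketch of the math in Python with made-up numbers (the SQL equivalent would sum sum(value) and sum(value * value) over w_combo; note the identity can be numerically unstable for values far from zero):

```python
import math

def stddev_pop(xs):
    """Population standard deviation, like PostgreSQL's stddev_pop()."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# Two sub-populations (perms) of one combo
a = [1.0, 3.0, 5.0]
b = [2.0, 8.0]

# Per-group partial aggregates: count, sum, sum of squares
n = len(a) + len(b)
s = sum(a) + sum(b)
sq = sum(x * x for x in a) + sum(x * x for x in b)

# Combined stddev from the partial aggregates
combined = math.sqrt(sq / n - (s / n) ** 2)

# Naively averaging the two stddevs gives a wrong result ...
naive = (stddev_pop(a) + stddev_pop(b)) / 2
# ... while the sum-of-squares formula matches stddev_pop over all rows
exact = stddev_pop(a + b)
```

This is why the 0 placeholder in the query above cannot be replaced by any simple combination of the per-group stddev values alone.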
For combo_average_value you would need this expression:

sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo

since you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
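To see why the weighting matters, a quick sketch with made-up numbers:

```python
# Hypothetical groups of one combo: (group average, group row count)
groups = [(10.0, 10), (4.0, 2)]

# Naive average of averages ignores group sizes
naive = sum(avg for avg, _ in groups) / len(groups)

# Weighted average: sum(avg * count) / sum(count)
weighted = (sum(avg * n for avg, n in groups)
            / sum(n for _, n in groups))

print(naive, weighted)  # 7.0 9.0
```

The group of 10 rows pulls the combined average toward 10, which the plain average of the two group averages would miss.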
This works:
SELECT DISTINCT ON (perm, combo)
perm
,combo
,avg(value) OVER wpc AS perm_average_value
,avg(value) OVER wc AS combo_average_value
,stddev_pop(value) OVER wpc AS perm_stddev
,stddev_pop(value) OVER wc AS combo_stddev
,count(*) OVER wpc AS perm_count
,count(*) OVER wc AS combo_count
FROM foo
WINDOW wc AS (PARTITION BY combo)
,wpc AS (PARTITION BY perm, combo);
I am using two different windows here and reduce the rows with DISTINCT, which is applied even after window functions.
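The mechanics can be reproduced with any engine that has window functions. A small sketch using SQLite through Python's sqlite3 module (SQLite 3.25+; it has no DISTINCT ON or stddev_pop, so this uses plain DISTINCT and only avg/count, over made-up data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE foo (perm TEXT, combo TEXT, value REAL);
INSERT INTO foo VALUES
 ('a','x',1), ('a','x',3), ('b','x',5),
 ('a','y',2), ('a','y',4);
""")

# Window functions are computed per row; DISTINCT then collapses the
# identical rows within each (perm, combo) partition.
rows = con.execute("""
SELECT DISTINCT perm, combo,
       avg(value) OVER (PARTITION BY perm, combo) AS perm_average_value,
       avg(value) OVER (PARTITION BY combo)       AS combo_average_value,
       count(*)   OVER (PARTITION BY combo)       AS combo_count
FROM   foo
ORDER  BY perm, combo
""").fetchall()

print(rows)
# [('a', 'x', 2.0, 3.0, 3), ('a', 'y', 3.0, 3.0, 2), ('b', 'x', 5.0, 3.0, 3)]
```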
But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
Better performance with altered table layout
Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:
CREATE TABLE combo (
combo_id serial PRIMARY KEY
,combo int[] NOT NULL
);
CREATE TABLE perm (
perm_id serial PRIMARY KEY
,perm int[] NOT NULL
);
CREATE TABLE value (
perm_id int REFERENCES perm(perm_id)
,combo_id int REFERENCES combo(combo_id)
,value numeric NOT NULL DEFAULT 0
);
If you don't need referential integrity you can omit the foreign key constraints.
The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.
This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int: 8 bytes, no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)
More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.
You would first aggregate in a subquery and then join to perm and combo for best performance.
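A sketch of that pattern, again in SQLite through sqlite3 (SQLite has no serial or int[], so the keys become INTEGER PRIMARY KEY and the array columns plain text here; names and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE combo (combo_id INTEGER PRIMARY KEY, combo TEXT NOT NULL);
CREATE TABLE perm  (perm_id  INTEGER PRIMARY KEY, perm  TEXT NOT NULL);
CREATE TABLE value (
    perm_id  INTEGER REFERENCES perm(perm_id),
    combo_id INTEGER REFERENCES combo(combo_id),
    value    REAL NOT NULL DEFAULT 0
);
INSERT INTO combo VALUES (1, '{1,2}'), (2, '{3,4}');
INSERT INTO perm  VALUES (1, '{1}'),   (2, '{2}');
INSERT INTO value VALUES (1,1,10), (1,1,20), (2,1,30), (2,2,40);
""")

# Group and aggregate on the cheap integer keys first,
# join to the lookup tables only for the final output.
rows = con.execute("""
SELECT p.perm, c.combo, v.perm_avg, v.perm_count
FROM  (SELECT perm_id, combo_id,
              avg(value) AS perm_avg,
              count(*)   AS perm_count
       FROM   value
       GROUP  BY perm_id, combo_id) v
JOIN   perm  p USING (perm_id)
JOIN   combo c USING (combo_id)
ORDER  BY p.perm, c.combo
""").fetchall()

print(rows)
```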
Answer to question
It depends on what you are actually trying to achieve, which version of Postgres you are using, and the table definition of the underlying table. You chose not to disclose any of that.
This gives you one distinct row per (name, zonedistrict_id), but geom is only the aggregate over (gid, name, zonedistrict_id, zonestyle_id, longname).
In older versions you had to list every non-aggregated column of the SELECT list in the GROUP BY list. This changed somewhat with PostgreSQL 9.1. I quote the release notes:

My guess
However, I suspect you secretly want this: ST_Union(geom) used as an aggregate window function, so that geom is actually the aggregate over (name, zonedistrict_id), which seems more likely. This should be possible, since the PostGIS manual tells us about its aggregate functions:
And SQL aggregate functions can be used as window functions.
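The mechanics are the same for any aggregate; here is a PostGIS-free sketch with the built-in sum standing in for ST_Union, using SQLite via sqlite3 over a hypothetical zones table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE zones (name TEXT, zonedistrict_id INT, area REAL)")
con.executemany("INSERT INTO zones VALUES (?,?,?)",
                [('a', 1, 2.0), ('a', 1, 3.0), ('b', 2, 4.0)])

# Any aggregate becomes a window function by attaching an OVER clause;
# DISTINCT then reduces to one row per (name, zonedistrict_id).
rows = con.execute("""
SELECT DISTINCT name, zonedistrict_id,
       sum(area) OVER (PARTITION BY name, zonedistrict_id) AS total_area
FROM   zones
ORDER  BY name
""").fetchall()

print(rows)  # [('a', 1, 5.0), ('b', 2, 4.0)]
```

In Postgres with PostGIS, ST_Union(geom) OVER (PARTITION BY name, zonedistrict_id) would follow the same pattern.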