Answer to question
SELECT DISTINCT ON (name, zonedistrict_id)
ST_Union(geom) as geom, gid, name, zonedistrict_id, zonestyle_id, longname
FROM zones
GROUP BY gid, name, zonedistrict_id, zonestyle_id, longname
ORDER BY name, zonedistrict_id, zonestyle_id;
It depends on what you are actually trying to achieve and what version of Postgres you are using and the table definition of the underlying table. You chose not to disclose any of that.
This gives you one distinct row per (name, zonedistrict_id)
, but geom
is only the aggregate over (gid, name, zonedistrict_id, zonestyle_id, longname)
In older versions you had to list every non-aggregated column of the SELECT
list in the GROUP BY
list. This changed somewhat with PostgreSQL 9.1. I quote the release notes:
Allow non-GROUP BY
columns in the query target list when the primary
key is specified in the GROUP BY
clause (Peter Eisentraut)
My guess
However, I suspect you secretly want this:
SELECT DISTINCT ON (name, zonedistrict_id)
ST_Union(geom) OVER (PARTITION BY name, zonedistrict_id) AS geom
, gid, name, zonedistrict_id, zonestyle_id, longname
FROM zones
ORDER BY name, zonedistrict_id, zonestyle_id;
Using a ST_Union(geom)
as aggregate window function. So geom
is actually the aggregate over (name, zonedistrict_id)
, which seems more likely.
This should be possible since the PostGis manual tells us about its aggregate functions:
... can be used just like any other sql aggregate function such as sum, average.
And SQL aggregate functions can be used as window functions.
DISTINCT ON()
Just as a side note, this is precisely what DISTINCT ON()
does (not to be confused with DISTINCT
)
SELECT DISTINCT ON ( expression [, ...] )
keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON
expressions are interpreted using the same rules as for ORDER BY
(see above). Note that the "first row" of each set is unpredictable unless ORDER BY
is used to ensure that the desired row appears first. For example
So if you were to write,
SELECT myFirstAgg(z)
FROM foo
GROUP BY x,y;
It's effectively
SELECT DISTINCT ON(x,y) z
FROM foo;
-- ORDER BY z;
In that it takes the first z
. There are two important differences,
You can also select other columns at no cost of further aggregation..
SELECT DISTINCT ON(x,y) z, k, r, t, v
FROM foo;
-- ORDER BY z, k, r, t, v;
Because there is no GROUP BY
you can not use (real) aggregates with it.
CREATE TABLE foo AS
SELECT * FROM ( VALUES
(1,2,3),
(1,2,4),
(1,2,5)
) AS t(x,y,z);
SELECT DISTINCT ON (x,y) z, sum(z)
FROM foo;
-- fails, as you should expect.
SELECT DISTINCT ON (x,y) z, sum(z)
FROM foo;
-- would not otherwise fail.
SELECT myFirstAgg(z), sum(z)
FROM foo
GROUP BY x,y;
Don't forget ORDER BY
Also, while I didn't bold it then I will now
Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. For example
Always use an ORDER BY
with DISTINCT ON
Using an Ordered-Set Aggregate Function
I imagine a lot of people are looking for first_value
, Ordered-Set Aggregate Functions. Just wanted to throw that out there. It would look like this, if the function existed:
SELECT a, b, first_value() WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY a,b;
But, alas you can do this.
SELECT a, b, percentile_disc(0) WITHIN GROUP (ORDER BY z)
FROM foo
GROUP BY a,b;
Best Answer
There are various ways to do this, with different performance, depending on the distribution of data (number of distinct
object_id
values, etc).The easiest query to write - but not necessary the most efficient, is of course using an aggregate,
MIN()
orMAX()
:If you have an index on
(object_id, data2)
this will not be too bad in recent versions of Postgres that can use index-only-scan for the execution plan.Another way would be using
DISTINCT ON
syntax. The same index as above would help:If, compared to the table size, there is a small number of
object_id
values, a different approach would be much more efficient. Provided you also have another tables (sayobjects
) that hasobject_id
as its primary key:The same index would be needed. The
ORDER BY
is not required but with the index in place, it won't hurt efficiency. If you don't have anobjects
table then that part would have to be replaced with:But you'd lose some efficiency, especially in older versions. In that case, you could replace this subquery with a complicated recursive query that traverses the
object_id
index efficiently. See the Posgres docs for more details: Loose Index Scan.Read also these great answers by Erwin in the related questions:
How do I efficiently get “the most recent corresponding row”?.
Optimize GROUP BY query to retrieve latest record per user
Best performance in sampling repeated value from a grouped column
And last but not least, the main reason for the problems is this:
Normalizing the table would lead to much more efficient queries.