Postgresql – Aggregate function to return any (first) encountered value

aggregate, postgresql

How do I tell PostgreSQL to return the first encountered value instead of aggregating over a column?

                          Table "public.cache"
  Column   |  Type   | Modifiers | Storage | Stats target | Description 
-----------+---------+-----------+---------+--------------+-------------
 user_id   | integer |           | plain   |              | 
 object_id | integer |           | plain   |              | 
 data      | integer |           | plain   |              | 
 data2     | boolean |           | plain   |              | 
Indexes:
    "cache_object_id_user_id_key" UNIQUE CONSTRAINT, btree (object_id, user_id)
    "cache_user_id_object_id_key" UNIQUE CONSTRAINT, btree (user_id, object_id)
Has OIDs: no

A query grouping by object_id and data2 results in a hash aggregate, which I want to avoid.

SELECT object_id, data2 FROM cache GROUP BY object_id, data2;

I found bool_or(), but in the worst case it scans all values in a group.

SELECT object_id, bool_or(data2) FROM cache GROUP BY object_id;

Moreover, there is no such function for arbitrary datatypes. What I want is to get any single value from the data2 column, so that the engine doesn't have to iterate over all rows.

What about the data column, which is an integer?

Best Answer

There are various ways to do this, with different performance depending on the distribution of the data (number of distinct object_id values, etc.).

The easiest query to write - but not necessarily the most efficient - is of course using an aggregate, MIN() or MAX():

SELECT object_id, MIN(data2) AS data2 
FROM cache 
GROUP BY object_id ;

If you have an index on (object_id, data2), this will not be too bad in recent versions of Postgres, which can use an index-only scan in the execution plan.
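The table definition in the question only shows the two unique constraints, so such an index would have to be added first; something like this (the index name is just a suggestion):

```sql
CREATE INDEX cache_object_id_data2_idx ON cache (object_id, data2);
```

Note that index-only scans also depend on the visibility map, so the table should be vacuumed regularly for this to pay off.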


Another way would be to use the DISTINCT ON syntax. The same index as above would help:

SELECT DISTINCT ON (object_id)
    object_id, data2 
FROM cache 
ORDER BY object_id ;

If, compared to the table size, there is a small number of distinct object_id values, a different approach would be much more efficient. Provided you also have another table (say objects) that has object_id as its primary key:

SELECT o.object_id, c.data2 
FROM objects AS o
  CROSS JOIN LATERAL 
     ( SELECT data2
       FROM cache AS c 
       WHERE c.object_id = o.object_id
       ORDER BY c.data2 
       LIMIT 1
     ) AS c ;

The same index would be needed. The ORDER BY is not required, but with the index in place it won't hurt efficiency. If you don't have an objects table, that part would have to be replaced with:

---
FROM ( SELECT DISTINCT object_id FROM cache) AS o
  CROSS JOIN LATERAL 
---

But you'd lose some efficiency, especially in older versions. In that case, you could replace this subquery with a more complicated recursive query that traverses the object_id index efficiently. See the Postgres wiki for more details: Loose Index Scan.
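For reference, a sketch of such a recursive query, following the pattern described on the Postgres wiki. It emulates SELECT DISTINCT object_id FROM cache, with each recursion step being a single probe of the object_id index:

```sql
WITH RECURSIVE o AS (
   ( SELECT object_id FROM cache ORDER BY object_id LIMIT 1 )
   UNION ALL
   SELECT ( SELECT c.object_id
            FROM cache AS c
            WHERE c.object_id > o.object_id
            ORDER BY c.object_id
            LIMIT 1 )
   FROM o
   WHERE o.object_id IS NOT NULL
)
SELECT object_id FROM o WHERE object_id IS NOT NULL ;
```

The recursion emits a NULL row once the largest object_id has been passed, which terminates it; the final WHERE filters that row out.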

Read also the great answers by Erwin in the related questions on this site.


And last but not least, the main reason for the problems is this:

I have denormalized data and I want to avoid ... (I know that all values in that group are equal booleans).

Normalizing the table would lead to much more efficient queries.
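For example, if data2 really is functionally dependent on object_id (all rows of a group carry the same value, as the question states), it belongs in a table keyed by object_id alone. A rough sketch, assuming the table name objects:

```sql
CREATE TABLE objects (
    object_id integer PRIMARY KEY,
    data2     boolean
);

-- Populate it once from the existing denormalized data:
INSERT INTO objects (object_id, data2)
SELECT DISTINCT object_id, data2
FROM cache ;
```

After that, data2 could be dropped from cache, and the query in the question becomes a plain scan of the much smaller objects table, with no aggregation at all.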