Postgresql – Aggregate function to return any (first) encountered value

aggregate, postgresql

How do I tell PostgreSQL to return the first encountered value instead of aggregating over a column?

                          Table "public.cache"
  Column   |  Type   | Modifiers | Storage | Stats target | Description 
-----------+---------+-----------+---------+--------------+-------------
 user_id   | integer |           | plain   |              | 
 object_id | integer |           | plain   |              | 
 data      | integer |           | plain   |              | 
 data2     | boolean |           | plain   |              | 
Indexes:
    "cache_object_id_user_id_key" UNIQUE CONSTRAINT, btree (object_id, user_id)
    "cache_user_id_object_id_key" UNIQUE CONSTRAINT, btree (user_id, object_id)
Has OIDs: no

A query grouping by object_id and data2 results in a hash aggregate, which I want to avoid.

SELECT object_id, data2 FROM cache GROUP BY object_id, data2;

I found bool_or(), but in the worst case it scans all values in a group.

SELECT object_id, bool_or(data2) FROM cache GROUP BY object_id;

Moreover, there is no such function for arbitrary datatypes. What I want is to get any single value from the data2 column, so that the engine doesn't have to iterate over all rows.

What about the data column, which is an integer?

Best Answer

There are various ways to do this, with different performance depending on the distribution of the data (number of distinct object_id values, etc.).

The easiest query to write - but not necessarily the most efficient - is of course using an aggregate, MIN() or MAX():

SELECT object_id, MIN(data2) AS data2 
FROM cache 
GROUP BY object_id ;

If you have an index on (object_id, data2), this will not be too bad in recent versions of Postgres, which can use an index-only scan in the execution plan.
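The table definition in the question only shows the two unique constraints, so such an index would have to be added first; something like this (the index name is just a suggestion):

```sql
CREATE INDEX cache_object_id_data2_idx ON cache (object_id, data2);
```

Note that index-only scans also depend on the visibility map, so the table should be vacuumed regularly for this to pay off.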


Another way would be to use the DISTINCT ON syntax. The same index as above would help:

SELECT DISTINCT ON (object_id)
    object_id, data2 
FROM cache 
ORDER BY object_id ;

If, compared to the table size, there is a small number of distinct object_id values, a different approach would be much more efficient. Provided you also have another table (say objects) that has object_id as its primary key:

SELECT o.object_id, c.data2 
FROM objects AS o
  CROSS JOIN LATERAL 
     ( SELECT data2
       FROM cache AS c 
       WHERE c.object_id = o.object_id
       ORDER BY c.data2 
       LIMIT 1
     ) AS c ;

The same index would be needed. The ORDER BY is not required, but with the index in place it won't hurt efficiency. If you don't have an objects table, that part would have to be replaced with:

---
FROM ( SELECT DISTINCT object_id FROM cache) AS o
  CROSS JOIN LATERAL 
---

But you'd lose some efficiency, especially in older versions. In that case, you could replace this subquery with a more complicated recursive query that traverses the object_id index efficiently. See the Postgres wiki for more details: Loose Index Scan.
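For reference, a sketch of such a recursive query, following the pattern described on the Postgres wiki. It emulates SELECT DISTINCT object_id FROM cache, with each recursion step being a single probe of the object_id index:

```sql
WITH RECURSIVE o AS (
   ( SELECT object_id FROM cache ORDER BY object_id LIMIT 1 )
   UNION ALL
   SELECT ( SELECT c.object_id
            FROM cache AS c
            WHERE c.object_id > o.object_id
            ORDER BY c.object_id
            LIMIT 1 )
   FROM o
   WHERE o.object_id IS NOT NULL
)
SELECT object_id FROM o WHERE object_id IS NOT NULL ;
```

The recursion emits a NULL row once the largest object_id has been passed, which terminates it; the final WHERE filters that row out.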

Read also the great answers by Erwin in the related questions on this site.


And last but not least, the main reason for the problems is this:

I have denormalized data and I want to avoid ... (I know that all values in that group are equal booleans).

Normalizing the table would lead to much more efficient queries.
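For example, if data2 really is functionally dependent on object_id (all rows of a group carry the same value, as the question states), it belongs in a table keyed by object_id alone. A rough sketch, assuming the table name objects:

```sql
CREATE TABLE objects (
    object_id integer PRIMARY KEY,
    data2     boolean
);

-- Populate it once from the existing denormalized data:
INSERT INTO objects (object_id, data2)
SELECT DISTINCT object_id, data2
FROM cache ;
```

After that, data2 could be dropped from cache, and the query in the question becomes a plain scan of the much smaller objects table, with no aggregation at all.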