PostgreSQL Aggregate – Squashing Aggregated Rows with PostgreSQL

aggregatepostgresql

Given a table my_data like this:

 id | name | surname | age
----+------+---------+------
 1  | john | smith   | NULL
 1  | NULL | smith   | 32
 1  | NULL | NULL    | NULL
 1  | john | smith   | NULL
 1  | john | NULL    | 32

CREATE TABLE my_data(id,name,surname,age)
AS ( VALUES 
  (1::int, 'john', 'smith' ,NULL::int),
  (1, NULL,   'smith' ,32),
  (1, NULL,   NULL    ,NULL),
  (1, 'john', 'smith' ,NULL),
  (1, 'john', NULL    ,32)
);

For the same id, the values (if present) in the respective columns are always the sames so how can I "squash" them to get:

 id | name | surname | age
----+------+---------+------
 1  | john | smith   | 32

My attempt

A cross join lateral for each column is my only idea so far, but I doubt is good:

select 
distinct column1, c2.value, c3.value, c4.value
from my_data md
cross join lateral (select column2 from my_data where column1 = md.column1 and column2 is not null limit 1) as c2(value)
cross join lateral (select column3 from my_data where column1 = md.column1 and column3 is not null limit 1) as c3(value)
cross join lateral (select column4 from my_data where column1 = md.column1 and column4 is not null limit 1) as c4(value);

Best Answer

Using percentile_disc

I would think something like this would be the fastest,

SELECT id,
  percentile_disc(0) WITHIN GROUP (ORDER BY name NULLS LAST)    AS name,
  percentile_disc(0) WITHIN GROUP (ORDER BY surname NULLS LAST) AS surname,
  percentile_disc(0) WITHIN GROUP (ORDER BY age NULLS LAST)     AS age
FROM my_data
GROUP BY id;

 id | name | surname | age 
----+------+---------+-----
  1 | john | smith   |  32
(1 row)

Here we're using percentile_disc an Ordered-Set Aggregate described as "discrete percentile: returns the first input value whose position in the ordering equals or exceeds the specified fraction". So,

  • if that position is "0", it's essentially first_value over the rows being aggregated.
  • if NULLS are last (NULLS LAST), the first value won't be null, and that's all that matters here (because you said "for the same id, the values (if present) in the respective columns are always the same")

You could also do this, I think, with first_value in a Window Function, and then pull from that with DISTINCT ON.

Using mode

If you want the consensus of the non-values, we can do that too. I was assuming percentile_disc was what the OP wanted. Another option is to use mode() which is a different Ordered-Set Aggregate Function. It's described as, "returns the most frequent input value (arbitrarily choosing the first one if there are multiple equally-frequent results)". It looks like,

SELECT id,
  mode() WITHIN GROUP (ORDER BY name)    AS name,
  mode() WITHIN GROUP (ORDER BY surname) AS surname,
  mode() WITHIN GROUP (ORDER BY age)     AS age
FROM my_data
GROUP BY id;