PostgreSQL Aggregate – Squashing Aggregated Rows with PostgreSQL

aggregate postgresql

Given a table my_data like this:

 id | name | surname | age
----+------+---------+------
 1  | john | smith   | NULL
 1  | NULL | smith   | 32
 1  | NULL | NULL    | NULL
 1  | john | smith   | NULL
 1  | john | NULL    | 32

CREATE TABLE my_data(id,name,surname,age)
AS ( VALUES 
  (1::int, 'john', 'smith' ,NULL::int),
  (1, NULL,   'smith' ,32),
  (1, NULL,   NULL    ,NULL),
  (1, 'john', 'smith' ,NULL),
  (1, 'john', NULL    ,32)
);

For the same id, the values (if present) in the respective columns are always the same, so how can I "squash" them to get:

 id | name | surname | age
----+------+---------+------
 1  | john | smith   | 32

My attempt

A cross join lateral for each column is my only idea so far, but I doubt it is a good one:

select
distinct md.id, c2.value as name, c3.value as surname, c4.value as age
from my_data md
cross join lateral (select name from my_data where id = md.id and name is not null limit 1) as c2(value)
cross join lateral (select surname from my_data where id = md.id and surname is not null limit 1) as c3(value)
cross join lateral (select age from my_data where id = md.id and age is not null limit 1) as c4(value);

Best Answer

Using `percentile_disc`

I would think something like this would be the fastest:

SELECT id,
  percentile_disc(0) WITHIN GROUP (ORDER BY name NULLS LAST)    AS name,
  percentile_disc(0) WITHIN GROUP (ORDER BY surname NULLS LAST) AS surname,
  percentile_disc(0) WITHIN GROUP (ORDER BY age NULLS LAST)     AS age
FROM my_data
GROUP BY id;

 id | name | surname | age 
----+------+---------+-----
  1 | john | smith   |  32
(1 row)

Here we're using percentile_disc, an ordered-set aggregate described as "discrete percentile: returns the first input value whose position in the ordering equals or exceeds the specified fraction". So:

If that position is "0", it's essentially first_value over the rows being aggregated.
If NULLs are sorted last (NULLS LAST), the first value won't be null, and that's all that matters here (because you said "for the same id, the values (if present) in the respective columns are always the same").

You could also do this, I think, with first_value in a Window Function, and then pull from that with DISTINCT ON.
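A minimal sketch of that window-function variant, assuming the my_data table defined in the question. It is meant as an illustration of the idea, not as a benchmarked alternative:

```sql
-- Sketch only: one first_value() per column, each with its own ordering,
-- then DISTINCT ON (id) collapses the identical rows of each partition.
SELECT DISTINCT ON (id)
       id,
       first_value(name)    OVER (PARTITION BY id ORDER BY name    NULLS LAST) AS name,
       first_value(surname) OVER (PARTITION BY id ORDER BY surname NULLS LAST) AS surname,
       first_value(age)     OVER (PARTITION BY id ORDER BY age     NULLS LAST) AS age
FROM   my_data
ORDER  BY id;
```

Because each window sorts its column with NULLS LAST, every row of a partition carries the same non-null value (when one exists), so it does not matter which row DISTINCT ON keeps.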

Using `mode`

If you want the consensus of the non-null values, we can do that too. I was assuming percentile_disc was what the OP wanted. Another option is to use mode(), which is a different ordered-set aggregate function. It's described as "returns the most frequent input value (arbitrarily choosing the first one if there are multiple equally-frequent results)". It looks like this:

SELECT id,
  mode() WITHIN GROUP (ORDER BY name)    AS name,
  mode() WITHIN GROUP (ORDER BY surname) AS surname,
  mode() WITHIN GROUP (ORDER BY age)     AS age
FROM my_data
GROUP BY id;
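The two aggregates only differ when the non-null values for an id disagree. A small sketch with made-up inline rows (the VALUES list and the output column names are assumptions for illustration, not data from the question):

```sql
-- Hypothetical data: 'john' appears twice, 'bob' once, plus one NULL.
SELECT id,
       percentile_disc(0) WITHIN GROUP (ORDER BY name) AS first_sorted,   -- first value in sort order
       mode()             WITHIN GROUP (ORDER BY name) AS most_frequent   -- most common value
FROM  (VALUES (1, 'john'), (1, 'john'), (1, 'bob'), (1, NULL)) AS t(id, name)
GROUP  BY id;
```

Here first_sorted should come back as bob (it sorts first) while most_frequent should be john, which is the "consensus" behaviour described above.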

Related Solutions

PostgreSQL – Return Total Number of Rows and Selected Aggregated Data

Window function?

A window function (count(*) over ()) does not seem to be what you want, since you don't have unaggregated rows.
You could add to the inner subquery:

count(*) OVER ()

.. to get the count of distinct landing_path_id, which is one other possible number that might be of interest. But that doesn't seem to be what you meant by "the total number of rows from that records select".
Or you could add to the inner subquery:

sum(count(*)) OVER ()

.. to get the total count with every landing_path_id redundantly, but that would seem pointless. Just mentioning that to demonstrate it's possible to run a window function over the result of an aggregate function in a single pass. Details for that:
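Separately from the details linked above, a minimal sketch of a window function running over an aggregate result. The inline VALUES list stands in for the real table and is an assumption for illustration only:

```sql
-- Hypothetical rows: three distinct landing_path_id values, six rows in total.
SELECT landing_path_id,
       count(*)              AS rows_in_group,  -- plain aggregate per group
       count(*) OVER ()      AS group_count,    -- number of groups (rows after GROUP BY)
       sum(count(*)) OVER () AS total_rows      -- window sum over the per-group counts
FROM  (VALUES (1), (1), (2), (3), (3), (3)) AS t(landing_path_id)
GROUP  BY landing_path_id;
```

Every output row should carry group_count = 3 and total_rows = 6, since window functions are evaluated after grouping and aggregation.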

Updated question

Your result, just without total_count in the records subquery. Now accounting for the LIMIT in the inner SELECT. Even though a maximum of 10 distinct landing_path_id is selected, all qualifying landing_path_id are counted.

To get both in one scan and reuse count and sum separately I introduce a CTE:

WITH cte AS (
  SELECT sum(entrances) AS entrances
       , count(*) over () AS total_count
  FROM   report_la
  WHERE  profile_id = 3777614
  GROUP  BY landing_path_id
  LIMIT  10
  )
SELECT row_to_json(selected_records)::text AS data
FROM  (   
   SELECT (SELECT total_count FROM cte LIMIT 1) AS total_count
        , array_to_json(array_agg(row_to_json(records))) AS data
   FROM  (SELECT entrances FROM cte) records
   ) selected_records;

If you don't care about the attribute name, you can have that cheaper with a subquery:

SELECT row_to_json(selected_records)::text AS data
FROM  (   
   SELECT min(total_count) AS total_count
        , array_to_json(array_agg(row_to_json(ROW(entrances)))) AS data
   FROM (
      SELECT sum(entrances) AS entrances
           , count(*) over () AS total_count  -- shouldn't show up in result
      FROM   report_la
      WHERE  profile_id = 3777614
      GROUP  BY landing_path_id
      LIMIT  1
      ) records
   ) selected_records;

You get the default attribute name f1 instead of entrances, since the ROW expression does not preserve the column name.
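The naming difference can be seen in isolation with a one-row sketch (the literal 42 and the alias x are placeholders chosen for illustration):

```sql
-- An anonymous ROW() loses the column name; a row taken from a named subquery keeps it.
SELECT row_to_json(ROW(entrances)) AS anonymous_row,  -- {"f1":42}
       row_to_json(x)              AS named_row       -- {"entrances":42}
FROM  (SELECT 42 AS entrances) x;
```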

If you need a certain attribute name, you could cast the row to a registered type. (Ab-)using a TEMP TABLE to register my row type for the session:

CREATE TEMP TABLE rec1 (entrances bigint);

...
        , array_to_json(array_agg(row_to_json(ROW(entrances)::rec1))) AS data
...

This would be a bit faster than the CTE. Or, more verbose but without cast:

...
        , array_to_json(array_agg(row_to_json(
                   (SELECT x FROM (SELECT records.entrances) x)))) AS data
...

Detailed explanation in this related answer:

Select columns inside json_agg
