PostgreSQL – Reduce Number of Rows with Same ID Based on Status

aggregate database-design postgresql

I have the following table:

-- id, location_id, status, posted_year, posted_quarter
CREATE TABLE foo AS
SELECT * FROM ( VALUES
(1 ,12,'active'  ,2014,3),  
(2 ,12,'inactive',2014,3),
(3 ,12,'active'  ,2014,3),
(4 ,12,'active'  ,2014,4),
(5 ,12,'inactive',2014,4),
(6 ,13,'active'  ,2015,1),
(7 ,13,'active'  ,2015,1),
(8 ,13,'inactive',2015,1),
(9 ,13,'active'  ,2015,2),
(10,13,'active'  ,2015,2),
(11,13,'inactive',2015,3),
(12,13,'active'  ,2015,4),
(13,13,'active'  ,2015,4),
(14,13,'inactive',2015,4),
(15,12,'active'  ,2015,1),
(16,13,'active'  ,2015,1),
(17,12,'inactive',2015,1),
(18,12,'active'  ,2015,2)
) AS t(id,location_id,status,posted_year,posted_quarter);

I want to recreate this table so that it has only one row per location, year, and quarter.

A location may have more than one record for the same year and quarter; in that case the status should be decided as follows:

If at least one record for that location, year, and quarter is active, the status is active; otherwise the status is inactive.

Examples:

  • location_id 12 for year 2014 and quarter 3 will have one record in the new table with status active.

  • location_id 13 for year 2015 and quarter 3 will have one record in the new table with status inactive.

How can I write this query?

Best Answer

The only trick here is to create a single value you can group by that combines the year and the quarter. This isn't the only way to do it, but one option is:

make_timestamp(posted_year,1,1,0,0,0)::date
+ posted_quarter*3*'1 month'::interval

Remember, there are three months in a quarter. You could also use posted_year + posted_quarter*0.25 as the key; any value that is unique per year and quarter works the same for grouping.
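As a quick sanity check, the sketch below evaluates that expression for a few (year, quarter) pairs. The inline VALUES list and the alias v are made up for illustration and are not part of the original table.

```sql
-- Evaluate the grouping key for three sample (year, quarter) pairs.
-- The VALUES list and alias "v" are illustrative only.
SELECT posted_year,
       posted_quarter,
       (make_timestamp(posted_year,1,1,0,0,0)::date
         + posted_quarter*3*'1 month'::interval)::date AS qtr
FROM ( VALUES (2014,3), (2014,4), (2015,1) ) AS v(posted_year, posted_quarter);
-- 2014, 3 -> 2014-10-01
-- 2014, 4 -> 2015-01-01
-- 2015, 1 -> 2015-04-01
```

Note that the resulting date is the first day of the following quarter. That makes no difference for grouping, but if you want qtr to read as the first day of the quarter itself, multiply (posted_quarter - 1) by 3 instead.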

From there, it's pretty basic. I'll use a CTE to separate formatting from calculation (feel free to inline it for speed; before PostgreSQL 12 a CTE is always materialized).

DISTINCT ON()

WITH t AS (
  SELECT (make_timestamp(posted_year,1,1,0,0,0)::date + posted_quarter*3*'1 month'::interval)::date AS qtr, *
  FROM foo
)
SELECT DISTINCT ON ( location_id, qtr ) location_id, qtr, status
FROM t
ORDER BY location_id, qtr, status='active' DESC;
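The ORDER BY is what makes this work: DISTINCT ON keeps the first row of each (location_id, qtr) group, and sorting on status='active' DESC puts the active rows first, because true sorts after false in ascending order. A minimal sketch of just that ordering, using a made-up two-row VALUES list:

```sql
-- Show how the boolean sort key orders the two statuses.
-- The VALUES list and alias "v" are illustrative only.
SELECT status, status = 'active' AS is_active
FROM ( VALUES ('inactive'), ('active') ) AS v(status)
ORDER BY status = 'active' DESC;
-- active   | t
-- inactive | f
```

One caveat, assuming status could ever be NULL (it is not in the sample data): NULL sorts first under DESC by default, so you would want NULLS LAST on that sort key.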

GROUP BY ... bool_or()

Or, alternatively (and maybe faster),

WITH t AS (
  SELECT (make_timestamp(posted_year,1,1,0,0,0)::date + posted_quarter*3*'1 month'::interval)::date AS qtr, *
  FROM foo
)
SELECT location_id,
  qtr,
  CASE WHEN bool_or(status='active') THEN 'active' ELSE 'inactive' END AS status
FROM t
GROUP BY location_id, qtr
ORDER BY location_id, qtr;
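Since the goal is a new table, here is a minimal sketch that materializes the result. It rests on two assumptions that are not in the original answer: the table name foo_reduced is hypothetical, and it groups directly on posted_year and posted_quarter rather than on the computed qtr, which keeps the original columns. The id column is dropped because a collapsed row no longer corresponds to a single source row.

```sql
-- Sketch: build the reduced table with one row per location, year, and quarter.
-- "foo_reduced" is a hypothetical name; pick whatever suits your schema.
CREATE TABLE foo_reduced AS
SELECT location_id,
       posted_year,
       posted_quarter,
       -- active if any row in the group is active, otherwise inactive
       CASE WHEN bool_or(status = 'active') THEN 'active' ELSE 'inactive' END AS status
FROM foo
GROUP BY location_id, posted_year, posted_quarter
ORDER BY location_id, posted_year, posted_quarter;
```

With the sample data from the question this produces eight rows, one for each location, year, and quarter combination.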