Postgresql – Split up interval into year slices

datetimepostgresqlset-returning-functions

Being a novice Postgres user, I have a table ad in my PostgreSQL 9.5 (x64) database with 87 rows. In addition to other columns, it has two columns 'start' and 'end' having date-time duration range like this:

ID      Start                   End
1       2003-06-07 00:00:00     2004-09-30 23:59:59

I need to split the range into one year windows in separate rows (from start year to the last year of interval) stored in database like this:

ID       Start
1_2003   2003-06-07 00:00:00     2003-12-31 23:59:59
1_2004   2004-01-01 00:00:00     2004-09-30 23:59:59

Using the operator ||'_'|| and Extract() function, I am able to concatenate ID with year. Also, this question addresses how to split up in interval in weeks and this one shows how to do the same for days but none of them addresses how to split up an interval in years specifically.

I avoided this question because I don't want to go for a stored procedure based approach. I am aware that generate_series() returns a series from start and stop parameters but actually I am struggling to break the interval at last day of the year and then re-starting from first day of the year in next row. I would highly appreciate if someone could guide me to this?

Best Answer

A couple of suggestions:

  • Use legal, lower-case, unquoted identifiers to save yourself a lot of confusion. end (and, to a lesser degree start) are reserved words.

  • Your id column seems to be numeric type like integer, not text.

  • Since you don't need a time component in start_day and end_day, the appropriate data type is date, not timestamp.

  • Why concatenate id and the year for a new "ID"? Add a second column year if you need it for a PK. Or don't add an additional column at all. It can cheaply be extracted from the new start_day on the fly. Generally, don't store data redundantly if you can avoid it.

  • Typically, timestamp ranges use an inclusive lower bound and an exclusive upper bound. Since timestamps can have fractional digits (up to 6 in Postgres) that is much cleaner. Your input 2003-12-31 23:59:59 would fail for 2003-12-31 23:59:59.123.

So, your table could look like this:

CREATE TABLE ad (
  id        int PRIMARY KEY
, start_day date NOT NULL    -- *inclusive* lower bound
, end_day   date NOT NULL    -- *exclusive* upper bound
CHECK (end_day > start_day)  -- enforce legal input
);

Proper test values:

INSERT INTO ad(id, start_day, end_day)
VALUES
  (1, '2003-06-07', '2004-10-01')   -- span 2 years (your example)
, (2, '2003-06-07', '2003-06-08')   -- 1 day in same year
, (3, '2003-06-07', '2003-10-01')   -- span 1 year
, (4, '2003-06-07', '2006-10-01');  -- span many years

Solution

Use generate_series() in a LATERAL join, based on start and end day, truncated to the year with date_trunc(). This produces one row per year with the new start date. Add a year and you have the new end date. Except for first and last row per id, where you substitute the proper start / end with GREATEST and LEAST respectively. Voilá.

-- CREATE TABLE ad_year AS
SELECT ad.id
     , extract('year' FROM y)::int AS year
     , GREATEST(y                    , ad.start_day) AS start_day
     , LEAST   (y + interval '1 year', ad.end_day)   AS end_day
FROM   ad
     , generate_series(date_trunc('year', start_day::timestamp)  -- cast to ts here!
                     , date_trunc('year', end_day::timestamp)
                     , interval '1 year') y;

Note that date_trunc() returns timestamptz for timestamptz input and timestamp for timestamp input. For date input it defaults to timestamptz. Since you seem to be ignoring time zones, cast the date to timestamp explicitly (start_day::timestamp).

Result:

 id | year |      start_day      |       end_day
----+------+---------------------+---------------------
  1 | 2003 | 2003-06-07 00:00:00 | 2004-01-01 00:00:00
  1 | 2004 | 2004-01-01 00:00:00 | 2004-10-01 00:00:00
  2 | 2003 | 2003-06-07 00:00:00 | 2003-06-08 00:00:00
  3 | 2003 | 2003-06-07 00:00:00 | 2003-10-01 00:00:00
  4 | 2003 | 2003-06-07 00:00:00 | 2004-01-01 00:00:00
  4 | 2004 | 2004-01-01 00:00:00 | 2005-01-01 00:00:00
  4 | 2005 | 2005-01-01 00:00:00 | 2006-01-01 00:00:00
  4 | 2006 | 2006-01-01 00:00:00 | 2006-10-01 00:00:00

If you create a new table from the result, I suggest (id, year) as primary key.

Aside: This is not an operator: ||'_'|| (nor a cute little face). It's 2 concatenation operators || and a string literal '_'.