A few thoughts. Consider range partitioning on "timestamp"; it will reduce the amount of work your queries have to do. A further optimization might be to calculate and store the aggregates for "closed" partitions. You will, of course, have to recalculate these when you modify historical data.
As a bonus, it will be much easier to roll off historical data that is no longer needed.
As mentioned, these are just some thoughts and reflections; they may or may not fit your situation. One pitfall may be the number of different intervals that you use for reporting: if the larger intervals are not multiples of the partitioning interval, this won't work well.
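As a sketch of the idea (partition boundaries and extra columns are placeholders, and this syntax requires PostgreSQL 10 or later), declarative range partitioning could look like this:

```sql
-- Parent table, partitioned by the event time
CREATE TABLE market_trades (
    id          bigint    NOT NULL,
    "timestamp" timestamp NOT NULL,
    price       numeric
) PARTITION BY RANGE ("timestamp");

-- One partition per month; aggregates for "closed" months can be precomputed
CREATE TABLE market_trades_2013_08 PARTITION OF market_trades
    FOR VALUES FROM ('2013-08-01') TO ('2013-09-01');

-- Rolling off old data then becomes a cheap metadata operation
DROP TABLE market_trades_2013_08;
```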
Edit: How is the plan affected by creating a temp table and an index on "timestamp"?
create index X on market_trades ( timestamp );

create temp table T ( time_range tsrange );

insert into T ( time_range )
WITH vals AS (
   SELECT '2013-08-19 0:00'::timestamp AS frame_start,
          '2013-08-26 0:00'::timestamp AS frame_end,
          '17h'::interval AS interval_length
), intervals AS (
   SELECT tsrange(start_time,
                  lead(start_time, 1, frame_end) OVER (ORDER BY start_time NULLS FIRST)) AS time_range
   FROM  (
      SELECT generate_series(frame_start, frame_end, interval_length) AS start_time,
             frame_end
      FROM   vals
      ) _
   WHERE start_time < frame_end
)
SELECT time_range
FROM   intervals i;

create index x on T ( time_range );
analyze T;

SELECT time_range, count(td.id) AS agg
FROM   T
LEFT   JOIN market_trades td ON td.timestamp <@ T.time_range
GROUP  BY T.time_range
ORDER  BY T.time_range;
You can use the `+` operator.
SELECT pk,ev_date,ev FROM events;
pk | ev_date | ev
----+------------+----------
1 | 2016-02-19 | 01:00:00
2 | 2016-02-19 | 02:00:00
3 | 2016-02-19 | 05:00:00
4 | 2016-02-19 | 12:00:00
5 | 2016-02-19 | 18:00:00
6 | 2016-02-19 | 23:00:00
7 | 2016-02-20 | 01:00:00
8 | 2016-02-20 | 05:00:00
9 | 2016-02-20 | 12:00:00
10 | 2016-02-20 | 18:00:00
(10 rows)
SELECT pk, ev_date, ev
FROM events
WHERE (ev_date + ev)
BETWEEN ('2016-02-19 04:00:00')
AND ('2016-02-20 02:00:00');
pk | ev_date | ev
----+------------+----------
3 | 2016-02-19 | 05:00:00
4 | 2016-02-19 | 12:00:00
5 | 2016-02-19 | 18:00:00
6 | 2016-02-19 | 23:00:00
7 | 2016-02-20 | 01:00:00
(5 rows)
Don't forget to create the index below:
CREATE INDEX events_ts_idx ON events ((ev_date + ev));
ANALYZE events;
I've inserted many dummy rows, so here is the output of EXPLAIN ANALYZE:
EXPLAIN ANALYZE SELECT pk, ev_date, ev FROM events WHERE (ev_date + ev)
BETWEEN ('2016-02-19 23:50:00')
AND ('2016-02-20 00:01:00');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using events_ts_idx on events (cost=0.29..8.52 rows=8 width=16) (actual time=0.014..0.029 rows=42 loops=1)
Index Cond: (((ev_date + ev) >= '2016-02-19 23:50:00'::timestamp without time zone) AND ((ev_date + ev) <= '2016-02-20 00:01:00'::timestamp without time zone))
Planning time: 0.082 ms
Execution time: 0.053 ms
(4 rows)
For comparison, I've created another index and tried another form:
CREATE INDEX events_ts2_idx ON events (ev_date,ev);
ANALYZE events;
EXPLAIN ANALYZE SELECT pk, ev_date, ev FROM events WHERE (ev_date,ev)
BETWEEN ('2016-02-19','23:50:00')
AND ('2016-02-20','0:01:00');
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on events  (cost=189.50..511.36 rows=7143 width=16) (actual time=0.027..0.042 rows=42 loops=1)
   Recheck Cond: ((ROW(ev_date, ev) >= ROW('2016-02-19'::date, '23:50:00'::time without time zone)) AND (ROW(ev_date, ev) <= ROW('2016-02-20'::date, '00:01:00'::time without time zone)))
   Heap Blocks: exact=7
   ->  Bitmap Index Scan on events_ts2_idx  (cost=0.00..187.72 rows=7143 width=0) (actual time=0.019..0.019 rows=42 loops=1)
         Index Cond: ((ROW(ev_date, ev) >= ROW('2016-02-19'::date, '23:50:00'::time without time zone)) AND (ROW(ev_date, ev) <= ROW('2016-02-20'::date, '00:01:00'::time without time zone)))
Planning time: 0.079 ms
Execution time: 0.071 ms
(7 rows)
According to my investigation, my way (using the `+` operator) is better, but I recommend comparing both approaches on your machine.
Best Answer
A couple of suggestions:

- Use legal, lower-case, unquoted identifiers to save yourself a lot of confusion. `end` (and, to a lesser degree, `start`) are reserved words.
- Your `id` column seems to be a numeric type like `integer`, not `text`.
- Since you don't need a time component in `start_day` and `end_day`, the appropriate data type is `date`, not `timestamp`.
- Why concatenate `id` and the year for a new "ID"? Add a second column `year` if you need it for a PK. Or don't add an additional column at all; it can cheaply be extracted from the new `start_day` on the fly. Generally, don't store data redundantly if you can avoid it.
- Typically, `timestamp` ranges use an inclusive lower bound and an exclusive upper bound. Since timestamps can have fractional digits (up to 6 in Postgres), that is much cleaner. Your input `2003-12-31 23:59:59` would fail for `2003-12-31 23:59:59.123`.

So, your table could look like this:
Proper test values:
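A minimal sketch of such a table with sample rows, following the suggestions above (the table name and the exact sample dates are assumptions; only the column names `id`, `start_day`, `end_day` come from the discussion):

```sql
CREATE TABLE tbl (
    id        integer NOT NULL,
    start_day date    NOT NULL,
    end_day   date    NOT NULL
);

INSERT INTO tbl (id, start_day, end_day) VALUES
    (1, '2000-05-03', '2003-12-31'),  -- range spanning several years
    (2, '2007-01-10', '2007-06-15');  -- range within a single year
```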
Solution

Use `generate_series()` in a `LATERAL` join, based on start and end day, truncated to the year with `date_trunc()`. This produces one row per year with the new start date. Add a year and you have the new end date. Except for the first and last row per `id`, where you substitute the proper start / end with `GREATEST` and `LEAST`, respectively. Voilà.

Note that `date_trunc()` returns `timestamptz` for `timestamptz` input and `timestamp` for `timestamp` input. For `date` input it defaults to `timestamptz`. Since you seem to be ignoring time zones, cast the `date` to `timestamp` explicitly (`start_day::timestamp`).

Result:
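The query described above can be sketched like this, assuming a table `tbl (id, start_day, end_day)` as in the earlier sketch (names are placeholders, not the original code):

```sql
SELECT t.id,
       EXTRACT(year FROM y.start_ts)::int AS year,
       GREATEST(t.start_day::timestamp, y.start_ts)                AS new_start,
       LEAST(t.end_day::timestamp, y.start_ts + interval '1 year') AS new_end
FROM   tbl t
CROSS  JOIN LATERAL generate_series(
           date_trunc('year', t.start_day::timestamp),  -- first year boundary
           t.end_day::timestamp,                        -- up to the end day
           interval '1 year') AS y(start_ts)
ORDER  BY t.id, year;
```

`GREATEST` keeps the real start for the first row of each `id`, and `LEAST` caps the last row at the real end; all rows in between get clean year boundaries.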
If you create a new table from the result, I suggest `(id, year)` as primary key.

Aside: this is not an operator: `||'_'||` (nor a cute little face). It's two concatenation operators `||` and a string literal `'_'`.
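For illustration (the values 7 and 2003 are made up), the pieces evaluate left to right:

```sql
SELECT 7 || '_' || 2003;  -- two || operators with the string literal '_' between them
-- result: '7_2003'
```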