PostgreSQL – Optimal Query to Show Active Records by Date

performancepostgresqlquery-performance

I'm trying to determine what the optimal query might be for associating a date with a record that would have been active for that day.

Here is a setup to demonstrate what I'm trying to do (demo at SQL Fiddle):

DDL:

CREATE TABLE component_data
    ("id" int, "generated_date" timestamp, "removed_date" timestamp, "active" bool, "owner_rel_id" int)
;

CREATE TABLE dates
    ("fulldate" date, "year" int, "month" int, "day" int)
;

component_data Sample Data:

INSERT INTO component_data
    ("id", "generated_date", "removed_date", "active", "owner_rel_id")
VALUES
    (2507562, '2016-04-23 07:37:51', NULL, 't', 977720),
    (2507563, '2016-04-23 07:37:51', NULL, 't', 977720),
    (2507564, '2016-04-23 07:37:51', NULL, 't', 977720),
    (2507565, '2016-04-23 07:37:51', NULL, 't', 977720),
    (2507566, '2016-04-23 07:37:51', NULL, 't', 977720),
    (2507567, '2016-04-23 07:37:51', NULL, 't', 977720),
    (1586966, '2013-09-16 18:28:23', NULL, 't', 653531),
    (1586854, '2013-09-16 17:02:43', NULL, 't', 678806),
    (2363032, '2015-12-26 13:59:02', NULL, 't', 666874),
    (1586929, '2013-09-16 18:04:41', NULL, 't', 678820),
    (1587071, '2013-09-16 18:47:32', NULL, 't', 463631),
    (1587072, '2013-09-16 18:53:10', NULL, 't', 678834),
    (2363033, '2015-12-26 13:59:02', NULL, 't', 666874),
    (2235362, '2015-09-07 17:30:58', NULL, 't', 882233),
    (1587065, '2013-11-17 05:12:42', '2016-05-16 09:12:58-04', NULL, 678831),
    (1587785, '2013-09-19 00:00:00', NULL, 't', 679404),
    (2363036, '2015-12-26 14:15:59', NULL, 't', 713679),
    (1581675, '2013-09-25 00:00:00', '2016-05-16 09:34:26-04', NULL, 677199),
    (1210333, '2013-02-14 00:00:00', NULL, 't', 681094),
    (1592753, '2013-09-24 18:04:49', NULL, 't', 680679),
    (1593061, '2013-09-27 00:00:00', '2016-05-15 14:28:35-04', NULL, 680794),
    (1593064, '2013-09-27 00:00:00', '2016-05-15 14:28:35-04', NULL, 680794)
;

dates Sample Data:

INSERT INTO dates
    ("fulldate", "year", "month", "day")
VALUES
    ('2016-05-01 00:00:00', 2016, 5, 1),
    ('2016-05-02 00:00:00', 2016, 5, 2),
    ('2016-05-03 00:00:00', 2016, 5, 3),
    ('2016-05-04 00:00:00', 2016, 5, 4),
    ('2016-05-05 00:00:00', 2016, 5, 5),
    ('2016-05-06 00:00:00', 2016, 5, 6),
    ('2016-05-07 00:00:00', 2016, 5, 7),
    ('2016-05-08 00:00:00', 2016, 5, 8),
    ('2016-05-09 00:00:00', 2016, 5, 9),
    ('2016-05-10 00:00:00', 2016, 5, 10),
    ('2016-05-11 00:00:00', 2016, 5, 11),
    ('2016-05-12 00:00:00', 2016, 5, 12),
    ('2016-05-13 00:00:00', 2016, 5, 13),
    ('2016-05-14 00:00:00', 2016, 5, 14),
    ('2016-05-15 00:00:00', 2016, 5, 15),
    ('2016-05-16 00:00:00', 2016, 5, 16),
    ('2016-05-17 00:00:00', 2016, 5, 17),
    ('2016-05-18 00:00:00', 2016, 5, 18),
    ('2016-05-19 00:00:00', 2016, 5, 19),
    ('2016-05-20 00:00:00', 2016, 5, 20),
    ('2016-05-21 00:00:00', 2016, 5, 21),
    ('2016-05-22 00:00:00', 2016, 5, 22),
    ('2016-05-23 00:00:00', 2016, 5, 23),
    ('2016-05-24 00:00:00', 2016, 5, 24),
    ('2016-05-25 00:00:00', 2016, 5, 25)
;

Query:

SELECT
    d.fulldate,
    cd.id,
    cd.owner_rel_id,
    cd.generated_date,
    cd.removed_date,
    cd.active
FROM
    component_data cd
        INNER JOIN dates d ON (cd.generated_date <= d.fulldate) AND ((cd.removed_date >= d.fulldate) OR cd.active)
WHERE
    d.fulldate IN ('2016-05-16', '2016-05-17')
ORDER BY
    cd.id,
    d.fulldate;

In the real world scenario, the component_data table will have ~2 million rows or so, and I'll be comparing against dates back to 1JAN2013

I'm wondering if there's a more efficient way to do what my query is doing.

Best Answer

So I happened to have stumbled on a more efficient pattern for this sort of operation (based on how much faster this is in testing).

Given the above, here's what it would look like (I've excluded fields beyond what are necessary to point to a compoent_data record):

Updated Query:

WITH cd_start_end AS (
    SELECT
        id,
        owner_rel_id,
        generated_date AS start_date,
        COALESCE(removed_date, now()) AS end_date
    FROM
        component_data
    ORDER BY
        id
), active_dates AS (
    SELECT
        id,
        generate_series(start_date::date, end_date::date, '1 day'::interval)::date AS active_dates
    FROM
        cd_start_end
)

SELECT
    *
FROM
    active_dates
WHERE
    active_dates >= '2013-01-01'
    AND active_dates < now()::date;