PostgreSQL – Joining Two Large Tables on a Date Range

performancepostgresqlquery-performance

I know this question must come up often, but I can't seem to find a recent answer that addresses something more simple like what I'm trying to do. I've read several similar questions.

I'm using PostgreSQL 9.4 and have 2 tables, each with close to 200M rows containing two types of metrics. The columns are the_date, feature_A, feature_B and 24 hourly metrics. The metrics in the two tables are different enough such that the tables can't be combined. Or at least I don't think so. Some metrics don't apply to all features and that's why they're separated. Both tables have indexes on the date column.

When I try to do a query that joins them, if I specify a single date the query will return in about 3 seconds which seems reasonable to me. But if I simply do a.the_date between '8/1/2017' and '8/1/2017' the query now takes essentially forever. I kill it after 10 minutes because I'm impatient. Obviously PostgreSQL is no longer using the indexes.

Why is that and how can I get it to use the indexes again? I'd like to do queries like a.the_date between '7/1/2017' and '7/31/2017' or a.the_date >= '1/1/2017'. Right now I've simply got a python program running the query for one date at a time and then totaling the results, which seems like a hack. But at least it's predictable and much faster at 3-seconds-per-date.

The query:

select a.feature_a, a.feature_b, count(*) tot_ct,
    count(case when b.br01 > 0 and a.bc01 <= 10 then 1 end) cx01,
    count(case when b.br02 > 0 and a.bc02 <= 10 then 1 end) cx02,
    count(case when b.br03 > 0 and a.bc03 <= 10 then 1 end) cx03,
    count(case when b.br04 > 0 and a.bc04 <= 10 then 1 end) cx04,
    count(case when b.br05 > 0 and a.bc05 <= 10 then 1 end) cx05,
    count(case when b.br06 > 0 and a.bc06 <= 10 then 1 end) cx06,
    count(case when b.br07 > 0 and a.bc07 <= 10 then 1 end) cx07,
    count(case when b.br08 > 0 and a.bc08 <= 10 then 1 end) cx08,
    count(case when b.br09 > 0 and a.bc09 <= 10 then 1 end) cx09,
    count(case when b.br10 > 0 and a.bc10 <= 10 then 1 end) cx10,
    count(case when b.br11 > 0 and a.bc11 <= 10 then 1 end) cx11,
    count(case when b.br12 > 0 and a.bc12 <= 10 then 1 end) cx12,
    count(case when b.br13 > 0 and a.bc13 <= 10 then 1 end) cx13,
    count(case when b.br14 > 0 and a.bc14 <= 10 then 1 end) cx14,
    count(case when b.br15 > 0 and a.bc15 <= 10 then 1 end) cx15,
    count(case when b.br16 > 0 and a.bc16 <= 10 then 1 end) cx16,
    count(case when b.br17 > 0 and a.bc17 <= 10 then 1 end) cx17,
    count(case when b.br18 > 0 and a.bc18 <= 10 then 1 end) cx18,
    count(case when b.br19 > 0 and a.bc19 <= 10 then 1 end) cx19,
    count(case when b.br20 > 0 and a.bc20 <= 10 then 1 end) cx20,
    count(case when b.br21 > 0 and a.bc21 <= 10 then 1 end) cx21,
    count(case when b.br22 > 0 and a.bc22 <= 10 then 1 end) cx22,
    count(case when b.br23 > 0 and a.bc23 <= 10 then 1 end) cx23,
    count(case when b.br24 > 0 and a.bc24 <= 10 then 1 end) cx24,
    avg(b.br01) av01, avg(b.br02) av02, avg(b.br03) av03,
    avg(b.br04) av04, avg(b.br05) av05, avg(b.br06) av06,
    avg(b.br07) av07, avg(b.br08) av08, avg(b.br09) av09,
    avg(b.br10) av10, avg(b.br11) av11, avg(b.br12) av12,
    avg(b.br13) av13, avg(b.br14) av14, avg(b.br15) av15,
    avg(b.br16) av16, avg(b.br17) av17, avg(b.br18) av18,
    avg(b.br19) av19, avg(b.br20) av20, avg(b.br21) av21,
    avg(b.br22) av22, avg(b.br23) av23, avg(b.br24) av24
from table_a a, table_b b
where a.the_date = '8/1/2017'
    and a.the_date = b.the_date
    and a.feature_a = b.feature_a
    and a.feature_b = b.feature_b
group by a.feature_a, a.feature_b

Update:

EXPLAIN on the above query:

GroupAggregate  (cost=3123833.79..5157342.05 rows=201 width=680)                                                         
Group Key: a.feature_a, a.feature_b                                                                                    
->  Merge Join  (cost=3123833.79..3220488.32 rows=5270312 width=680)                                                   
        Merge Cond: ((a.feature_a = b.feature_a) AND (a.feature_b = b.feature_b))                                        
        ->  Sort  (cost=1438291.32..1439351.90 rows=424234 width=348)                                                    
            Sort Key: a.feature_a, a.feature_b                                                                         
            ->  Bitmap Heap Scan on table_a a  (cost=14960.38..1262333.10 rows=424234 width=348)               
                    Recheck Cond: (the_day = '2017-08-01'::date)                                                   
                    ->  Bitmap Index Scan on table_a_day_idx  (cost=0.00..14854.32 rows=424234 width=0)        
                        Index Cond: (the_day = '2017-08-01'::date)                                               
        ->  Materialize  (cost=1685542.47..1688027.10 rows=496925 width=348)
            ->  Sort  (cost=1685542.47..1686784.79 rows=496925 width=348)
                    Sort Key: b.feature_a, b.feature_b                                                                   
                    ->  Bitmap Heap Scan on table_b b  (cost=17759.74..1478863.74 rows=496925 width=348)
                        Recheck Cond: (the_day = '2017-08-01'::date)                                             
                        ->  Bitmap Index Scan on table_b_day_idx (cost=0.00..17635.51 rows=496925 width=0)
                                Index Cond: (the_day = '2017-08-01'::date)

EXPLAIN on date-range query (8/1 to 8/5) not using table_b index:

GroupAggregate  (cost=84467702.82..87504872.78 rows=201 width=680)
Group Key: a.feature_a, a.feature_b
->  Merge Join  (cost=84467702.82..85568019.42 rows=5270311 width=680)
        Merge Cond: ((b.feature_a = a.feature_a) AND (b.feature_b = a.feature_b) AND (b.the_day = a.the_day))  
        ->  Sort  (cost=83027290.33..83275752.78 rows=99384980 width=348)                                                    
            Sort Key: b.feature_a, b.feature_b, b.the_day
            ->  Seq Scan on table_b b  (cost=0.00..5963098.80 rows=99384980 width=348)
        ->  Materialize  (cost=1440412.49..1442533.66 rows=424234 width=348)
            ->  Sort  (cost=1440412.49..1441473.07 rows=424234 width=348)
                    Sort Key: a.feature_a, a.feature_b, a.the_day
                    ->  Bitmap Heap Scan on table_a a  (cost=16020.97..1264454.27 rows=424234 width=348)
                        Recheck Cond: ((the_day >= '2017-08-01'::date) AND (the_day <= '2017-08-05'::date))    
                        ->  Bitmap Index Scan on table_a_day_idx  (cost=0.00..15914.91 rows=424234 width=0)      
                                Index Cond: ((the_day >= '2017-08-01'::date) AND (the_day <= '2017-08-05'::date))

Best Answer

Solved my index issue, for anyone who might find this...

It was as simple as adding the date range to the where clause of both tables. Now both indexes get used and no more full table scans.

and a.the_date between '7/1/2017' and '7/31/2017'
and b.the_date between '7/1/2017' and '7/31/2017'
and a.the_date = b.the_date

Seems unnecessary, but I remember doing things like this many years ago when optimizing Oracle queries. Sometimes I'd even have to do things like a.col1 = b.col1 and a.col1 = b.col1 because having the = clause in there once wasn't enough.