Postgresql – How to optimize an ordered date range query with several OR filters (for streaming)

performancepostgresqlquery-performance

I have a table of data which is be outputted into a stream, so the upper bound of the cost doesn't matter as much as the lower bound. The table contains about 200M rows. The question is how to structurize indexes and queries with order by such that Postgres would use the index instead of doing additional sorting.

Each entry is characterized by 5 fields each of which is not unique on its own:

  • date of occurrence (timestamp)
  • num (order within a timestamp)
  • type1
  • type2
  • id_in_type2 (never used without type2)

A typical query for this data is bounded by date (say, from 1 day before now until now) and zero or more (about 5 actually) filters by any combination of type1, type2 and id_in_type2 connected by OR, and expects a stream of data (it doesn't matter how long it takes to output it completely, but it should start streaming as fast as possible), example:

SELECT *
FROM MY_TABLE
WHERE date > ?
AND (
  (type1 = ?)
  OR (type2 = ?)
  OR (type2 = ? AND associated_id = ?)
  OR ...
)
ORDER BY date, num;

When there are no additional filters or there is just one, a simple index by (date, num, type1, type2, associated_id) works perfectly – since the results are stored along the sort order of the index, there's no problem to start outputting the results.

However, when things get more complicated, no combination of structures of my indexes and my queries seem to yield satisfying results.
Here's roughly what I tried so far:

Indexes: I tried having a separate index for (date, num) and 3 separate ones for other fields (like in the example below), I tried adding date to filters 2-4, i tried putting date and num as first fields in the index as well as the last ones.

CREATE INDEX my_table_1 ON my_table (date, num);
CREATE INDEX my_table_2 ON my_table (type1);
CREATE INDEX my_table_3 ON my_table (type2, id_in_type2);
CREATE INDEX my_table_4 ON my_table (type1, type2, id_in_type2);

Queries: I tried to select by date range and then by filters, I tried putting the filter by date to the end, I tried splitting the filters to separate ones and make a union.

No matter what I do, it won't apply the filters along the (date, num) index order, it will always do the sorting, yielding lower cost about the same as the upper cost.

Oh yes the table is also partitioned by date.

Any ideas on structurizing the indexes? Maybe I'd be better off returning all results with matching date and make the filtering in the app that's reading the stream?

Here's an execution plan for a naively constructed query with index setting as shown above (I ran this one on a test DB which is not partitioned and has 2M rows total, I'm presuming that on production table numbers will be different but orders of magnitude of difference between lower cost and upper cost are going to be about the same)

explain analyze
select * from my_table
where date >= now() - '1 day'::interval
  and (
    (type2 = 'T2A' and type1 = 'T1A')
    or (type2 = 'T2B' and type1 = 'T1B1')
    or (type1 = 'T1C')
    or (type2 = 'T2B' and type1 = 'T1B2')
  )
order by date, num;

Sort  (cost=157714.25..158431.63 rows=286949 width=262) (actual time=2153.327..2313.075 rows=600314 loops=1)
  Sort Key: date, num
  Sort Method: external merge  Disk: 144376kB
  ->  Bitmap Heap Scan on my_table  (cost=16602.76..96390.20 rows=286949 width=262) (actual time=199.193..1281.214 rows=600314 loops=1)
        Recheck Cond: (((type1)::text = 'T1A'::text) OR (((type1)::text = 'T1B1'::text) AND ((type2)::text = 'T2B'::text)) OR ((type1)::text = 'T1C'::text) OR (((type1)::text = 'T1B2'::text) AND ((type2)::text = 'T2B'::text)))
        Rows Removed by Index Recheck: 455892
        Filter: ((date >= (now() - '1 day'::interval)) AND ((((type2)::text = 'T2A'::text) AND ((type1)::text = 'T1A'::text)) OR (((type2)::text = 'T2B'::text) AND ((type1)::text = 'T1B1'::text)) OR ((type1)::text = 'T1C'::text) OR (((type2)::text = 'T2B'::text) AND ((type1)::text = 'T1B2'::text))))
        Rows Removed by Filter: 2
        Heap Blocks: exact=42763 lossy=26546
        ->  BitmapOr  (cost=16602.76..16602.76 rows=299384 width=0) (actual time=188.980..188.980 rows=0 loops=1)
              ->  Bitmap Index Scan on my_table_2  (cost=0.00..4.44 rows=1 width=0) (actual time=0.037..0.037 rows=30 loops=1)
                    Index Cond: ((type1)::text = 'T1A'::text)
              ->  Bitmap Index Scan on my_table_4  (cost=0.00..8094.98 rows=148643 width=0) (actual time=93.562..93.562 rows=300240 loops=1)
                    Index Cond: (((type1)::text = 'T1B1'::text) AND ((type2)::text = 'T2B'::text))
              ->  Bitmap Index Scan on my_table_2  (cost=0.00..4.44 rows=1 width=0) (actual time=0.059..0.059 rows=28 loops=1)
                    Index Cond: ((type1)::text = 'T1C'::text)
              ->  Bitmap Index Scan on my_table_4  (cost=0.00..8211.96 rows=150741 width=0) (actual time=95.315..95.315 rows=300018 loops=1)
                    Index Cond: (((type1)::text = 'T1B2'::text) AND ((type2)::text = 'T2B'::text))
Planning time: 0.435 ms
Execution time: 2356.305 ms

The query may return 100s of thousands of rows.

Best Answer

Have you tried making 4 separate queries, and then using union all, so that you can get rid of the or statements?

Example:

select * from my_table
where date >= now() - '1 day'::interval
  and type2 = 'T2A' 
  and type1 = 'T1A'

union all
select * from my_table
where date >= now() - '1 day'::interval
  and type2 = 'T2B' 
  and type1 = 'T1B1'
...