Postgresql – How to make Postgres Multi-Column and Multi-Table Search More Efficient

indexindex-tuningperformancepostgresqlpostgresql-10

I have a Shipment table with some basic data about a shipment and a ShipmentItem table which contains additional attributes about that shipment with a foreignKey on the Shipment table's primaryKey. The Shipment to ShipmentItem table is OneToMany relationship.

We have a requirement to include a text search option that takes a given inputted text string and searches over 2 of the (make) Shipment's columns in addition to three specific types of the ShipmentItem name column. This is my current query:

select *
from Shipment shipment
where shipment.deliveryRequestedDate >= '2019-06-09T00:00:00Z'
  and shipment.deliveryRequestedDate <= '2019-12-06T23:59:59Z'
  and (
        shipment.identifierkeyvalues = '12345'
        or shipment.carrierReferenceNumber = '12345'
        or shipment.uuid in (
            select shipmentItem.resultId
            from ShipmentItem shipmentItem
            where (
                shipmentItem.type in (
                                      'poNumber', 'deliveryNoteNumber', 'salesOrderNumber'
                )
            )
            and shipmentItem.name = '12345'
            and shipmentItem.deliveryRequestedDate >= '2019-06-09T00:00:00Z'
            and shipmentItem.deliveryRequestedDate <= '2019-12-06T23:59:59Z'
       )
    )
limit 25

The issue I've discovered is the combination of putting a subQuery as one of the or conditions is causing a MAJOR performance problem (even though the subQuery itself returns fast by utilizing a type_name_deliveryRequestedDate index on that table. Although we have multiple indexes on the main table (identifierKeyValues, carrierReferenceNumber, and even an index over all three Shipment columns queried over, it will only use the deliveryRequestedDate index which is extremely inefficient since the range is so large for this query.

Converting this into a JOIN has seemed to cause the same behavior. I'm just not quite sure what the best approach is at this point. We have a Java Persistence API layer above this query so would like to avoid making any major changes to the data model if possible but not sure what the best approach is. Any ideas would be greatly appreciated!

EXPLAIN PLAN:

Limit  (cost=110.61..209.98 rows=25 width=1370) (actual time=119503.030..124034.809 rows=1 loops=1)
      ->  Index Scan using shipment_deliveryrequesteddate_idx on shipment shipment  (cost=110.61..890840.18 rows=224084 width=1370) (actual time=119503.027..124034.805 rows=1 loops=1)
            Index Cond: ((deliveryrequesteddate >= '2019-06-09 00:00:00'::timestamp without time zone) AND (deliveryrequesteddate <= '2019-12-06 23:59:59'::timestamp without time zone))
            Filter: ((identifierkeyvalues = '12345'::text) OR (carrierreferencenumber = '12345'::text) OR (hashed SubPlan 1))
            Rows Removed by Filter: 496784
            SubPlan 1
              ->  Index Scan using "type_name_deliveryRequestedDate" on resultitem shipmentitem  (cost=0.56..110.11 rows=24 width=16) (actual time=10.706..16.416 rows=1 loops=1)
                    Index Cond: ((type = ANY ('{poNumber,deliveryNoteNumber,salesOrderNumber}'::text[])) AND (name = '12345'::text) AND (deliveryrequesteddate >= '2019-06-09 00:00:00'::timestamp without time zone) AND (deliveryrequesteddate <= '2019-12-06 23:59:59'::timestamp without time zone))
    Planning time: 3.175 ms
    Execution time: 124035.006 ms

EXPLAIN PLAN removing the subquery — Why does it use a completely different index?

Limit  (cost=9.51..273.71 rows=6 width=1370) (actual time=0.052..0.053 rows=0 loops=1)
  ->  Bitmap Heap Scan on shipment shipment  (cost=9.51..273.71 rows=6 width=1370) (actual time=0.051..0.051 rows=0 loops=1)
        Recheck Cond: (((identifierkeyvalues = '12345'::text) AND (deliveryrequesteddate >= '2019-06-09 00:00:00'::timestamp without time zone) AND (deliveryrequesteddate <= '2019-12-06 23:59:59'::timestamp without time zone)) OR (carrierreferencenumber = '12345'::text))
        Filter: ((deliveryrequesteddate >= '2019-06-09 00:00:00'::timestamp without time zone) AND (deliveryrequesteddate <= '2019-12-06 23:59:59'::timestamp without time zone))
        Rows Removed by Filter: 2
        Heap Blocks: exact=2
        ->  BitmapOr  (cost=9.51..9.51 rows=66 width=0) (actual time=0.041..0.041 rows=0 loops=1)
              ->  Bitmap Index Scan on shipment_identifierkeyvalues_idx  (cost=0.00..4.61 rows=4 width=0) (actual time=0.023..0.024 rows=0 loops=1)
                    Index Cond: ((identifierkeyvalues = '12345'::text) AND (deliveryrequesteddate >= '2019-06-09 00:00:00'::timestamp without time zone) AND (deliveryrequesteddate <= '2019-12-06 23:59:59'::timestamp without time zone))
              ->  Bitmap Index Scan on shipment_carrierreferencenumber_idx  (cost=0.00..4.90 rows=62 width=0) (actual time=0.016..0.016 rows=2 loops=1)
                    Index Cond: (carrierreferencenumber = '12345'::text)
Planning time: 1.668 ms
Execution time: 0.116 ms

Best Answer

It can't use a BitmapOr against a scan on a different table (or at least, it wasn't coded to be able to do that--perhaps it could be made to do that if someone put in the work--it would have to look up the UUID in the other table, then convert them to tids on the ipso table and stuff them into the bitmap), so it can't use the BitmapOr plan.

Your best bet is probably to write this as a UNION ALL of two different queries, one that just hits the single table and one that hits both.