Postgresql – Why does PostgreSQL not use an index when OR is used in the conditions

indexperformancepostgresqlquery-performance

DB structure summarized:

  • the main table is cases (about 136k rows)
  • each case can have 0 – n referencing rows in the table case_contacts
  • each case contact references a main contact in the table contacts
  • a case contact may also reference a secondary subcontact, also in the table contacts
  • contacts have their names in contacts.v_fullname, which is indexed with
    a trigram index

The goal is to find cases where the name of a contact or subcontact contains the string "test":

SELECT  c.id,
        c.number
  FROM  cases c
        JOIN case_contacts caco ON caco.case_id = c.id
        JOIN contacts con_main ON con_main.id = caco.contact_id
        LEFT JOIN contacts con_sub ON con_sub.id = caco.subcontact_id
 WHERE  con_main.v_fullname ILIKE '%test%'
        OR con_sub.v_fullname ILIKE '%test%'

This query (query plan) returns the correct result, but does not use the trigram index. It takes around 330ms.

Removing either of the match conditions (query plan), or having them point at the same table (query plan), removes the performance problem. Both of these use the trigram index and are executed in under 1ms, but do not solve the given task.

How can I get PostgreSQL to use my index?

I have simplified this example to the minimum necessary to demonstrate the effect. The actual query is much more complex (and partially auto-generated), so using a UNION of two queries with only one text match each would be very hard, if it's even possible.

I'm using PostgreSQL 9.5.5.
The schema is still open for modifications (to some degree).


As requested, more information about the indexes:

dbname=# \di+ *contacts*
                                         List of relations
 Schema |               Name               | Type  | Owner |     Table     |  Size
--------+----------------------------------+-------+-------+---------------+---------
 public | case_contacts_case_id_idx        | index | x     | case_contacts | 4544 kB
 public | case_contacts_contact_id_idx     | index | x     | case_contacts | 4544 kB
 public | case_contacts_id_case_id_idx     | index | x     | case_contacts | 4544 kB
 public | case_contacts_idx                | index | x     | case_contacts | 9608 kB
 public | case_contacts_pkey               | index | x     | case_contacts | 4544 kB
 public | case_contacts_reference_trgm_idx | index | x     | case_contacts | 4960 kB
 public | case_contacts_subcontact_id_idx  | index | x     | case_contacts | 4544 kB
 public | case_contacts_type_idx           | index | x     | case_contacts | 6208 kB
 public | case_contacts_unique_types_idx   | index | x     | case_contacts | 5464 kB
 public | contacts_parent_id_id_idx        | index | x     | contacts      | 456 kB
 public | contacts_parent_id_idx           | index | x     | contacts      | 360 kB
 public | contacts_pkey                    | index | x     | contacts      | 360 kB
 public | contacts_v_fullname_trgm_idx     | index | x     | contacts      | 1560 kB
(13 rows)

This is how the index on contacts.v_fullname is created:

CREATE INDEX contacts_v_fullname_trgm_idx ON contacts USING GIN (v_fullname gin_trgm_ops);

Best Answer

I cannot really answer your question, because I really don't know why, but I've found a way to make PostgreSQL do more or less what I guess you want. I've tested your situation with a simplified simulation scenario, and using PostgreSQL 9.6.1 (latest as of today). I get the same results.

Good news is: If you can change the way you make your query, you have a couple of options which use the trigram index.

The first one consists on moving the condition on the subcontacts. In this case, the trigram index is used for one of the situations (but not the other):

SELECT
    c.id, c.number
FROM  
    cases c
    JOIN case_contacts caco ON caco.case_id = c.id
    JOIN contacts con_main ON con_main.id = caco.contact_id
    LEFT JOIN 
    (
        SELECT
            * 
        FROM
            contacts  
        WHERE
            v_fullname ilike '%test%' 
    ) AS con_sub ON con_sub.id = caco.subcontact_id
WHERE  
    con_main.v_fullname ILIKE '%test%'
    or con_sub.id is not null /* if the left join gave an answer, it's got '%test%' */ ;

A very few trials with simulated data (where aprox. 0.1%, 2.5%, 5% or 25% of the v_fullname contain '%test%') show that the difference in execution times is minuscule. [My disc is SSD, a real HD might behave very differently.] This should actually be checked with a real system with real data... but it seems that using the trigram index or not, doesn't make a big difference.

PostgreSQL is not exceptionally good at estimating how many rows will appear searching "like '%test%'", but it seems not to matter on which plan decides to use.

There is another option, which (with my little experimentation) works a little bit faster in most cases, and a lot faster when the percentage of '%test%' is low. This option means using a CTE to "prefilter" the contacts (and it uses the trigram index once, because it doesn't need to use it twice):

WITH filtered_contacts AS
(
SELECT
    *
FROM
    contacts
WHERE
    v_fullname ilike '%test%'
)
SELECT
    c.id, c.number
FROM  
    cases c
    JOIN case_contacts caco ON caco.case_id = c.id
    JOIN filtered_contacts con_main ON con_main.id = caco.contact_id
    LEFT JOIN filtered_contacts con_sub ON con_sub.id = caco.subcontact_id
WHERE
    con_main.v_fullname ILIKE '%test%'
    or con_sub.id is not null /* we need this test again, or we'll miss rows */ ;