PostgreSQL – How to Speed Up JOIN ON LIKE

index-tuningjoin;performancepostgresqlquery-performance

I have millions of things in a table. I'd like to assign groups to the things based on theirs prefixes.

I tried to add all the indexes but the planner still does nested loop left join.

A demo case SQL:

create table things as
select 
    i as id, 
    left(md5(random()::text), 8) as name
from generate_series(1, 100000) as i;

create table match_group_rules as
select
    i as id,
    trunc(random() * 5 + 1) as group_id,
    left(md5(random()::text), 2) as rule
from generate_series(1, 100) as i;

create extension if not exists pg_trgm;
create index match_group_rules_rule on match_group_rules (rule);
create index match_group_rules_rule_pattern on match_group_rules (rule text_pattern_ops);
create index things_name_idx on things (name);
create index things_name_pattern_idx on things (name text_pattern_ops);
create index things_name_gin_trgm_idx on things using gin (name gin_trgm_ops);
create index things_name_gist_trgm_idx on things using gist (name gist_trgm_ops);

explain 
select *
from things t
left join match_group_rules r 
    on t.name like r.rule || '%';

The demo case planner output:

Nested Loop Left Join  (cost=0.00..176543.25 rows=100000 width=57)
  Join Filter: (t.name ~~ (r.rule || '%'::text))
  ->  Seq Scan on things t  (cost=0.00..1541.00 rows=100000 width=13)
  ->  Materialize  (cost=0.00..2.50 rows=100 width=44)
        ->  Seq Scan on match_group_rules r  (cost=0.00..2.00 rows=100 width=44)

Questions:

  1. What makes the planner ignore the indexes?
  2. Is there a quick way to compute the groups for the things?

Best Answer

The initial problem is the left join. That restricts it to using things as the driving table, but once you do that there is no way to use the indexes, as pg_trgm and text_pattern_ops offer ways to index for texts matching a given pattern, not for patterns matching a given text.

In order to use an index on the left join, you would have to first do a nested loop with match_group_rules driving, memorize all the rows of things retrieved, and then go back and make up NULL rows for the things not found. There is no inherent reason that I can see that PostgreSQL wouldn't be able to do this complex operation, it just isn't implemented. You can do it yourself with a CTE and a UNION ALL:

with foobar as 
(select t.id as t_id, name, r.id, r.group_id, r.rule
   from things t
   join match_group_rules r 
     on t.name like r.rule || '%'
)
select * from foobar 
    union all
select id, name, NULL, NULL, NULL from things t2 where not exists 
 (select 1 from foobar where t2.id=foobar.t_id);