Postgresql – Optimize query with SIMILAR TO and many prefixes

indexperformancepostgresqlpostgresql-performance

I have a query in PostgreSQL 9.6 that is taking a particularly long time to run:

SELECT DISTINCT ON (gc.number)
       gc.number, gc.code, ch.client_id, ch.client_parent_id
FROM client gc, client_hierarchy ch
WHERE gc.code = ch.client_id
AND gc.number in (SELECT NUMB FROM directory
                  WHERE ACTIVE = TRUE
                  AND NUMB SIMILAR TO '(ABTR|GREW|POEW)%');

SIMILAR TO receives many different NUMB's as parameter. ABTR|GREW|POEW and around 3.500 more.

Table definitions:

CREATE TABLE directory (
    id BIGINT PRIMARY KEY NOT NULL,
    active BOOLEAN NOT NULL,
    numb VARCHAR(8),
    branch VARCHAR(20),
    city VARCHAR(50),
    modified_date TIMESTAMP NOT NULL,
    name VARCHAR(200)
);
CREATE INDEX dir_numb_index ON directory (numb);
CREATE INDEX dir_branch_index ON bank_directory (branch);
CREATE INDEX dir_city_index ON bank_directory (city);
CREATE INDEX dir_name_index ON bank_directory (name);

CREATE TABLE client (
    id BIGINT PRIMARY KEY NOT NULL,
    active BOOLEAN NOT NULL,
    number VARCHAR(8),
    code VARCHAR(9) NOT NULL,
    modified_date TIMESTAMP NOT NULL,
    name VARCHAR(200),
);
CREATE INDEX client_number_index ON client (number);
CREATE INDEX client_code_index ON client (code);
CREATE INDEX client_name_index ON client (name);

CREATE TABLE client_hierarchy (
    id BIGINT PRIMARY KEY NOT NULL,
    active BOOLEAN NOT NULL,
    client_id VARCHAR(9),
    client_parent_id VARCHAR(9),
    modified_date TIMESTAMP NOT NULL
);
CREATE INDEX client_id_hierarchy_index ON client_hierarchy (client_id);
CREATE INDEX client_parent_id_hierarchy_index ON client_hierarchy (client_parent_id);

Table client has 5K rows and client_hierarchy 900K rows.

Output of EXPLAIN (ANALYZE, BUFFERS):

Unique  (cost=21869.20..21890.98 rows=3760 width=23) (actual time=3117.734..3118.110 rows=707 loops=1)
  Buffers: shared hit=7786
  ->  Sort  (cost=21869.20..21880.09 rows=4357 width=23) (actual time=3117.733..3117.869 rows=998 loops=1)
        Sort Key: gc.number
        Sort Method: quicksort  Memory: 102kB
        Buffers: shared hit=7786
        ->  Hash Semi Join  (cost=1179.86..21605.84 rows=4357 width=23) (actual time=2825.705..3115.586 rows=998 loops=1)
              Hash Cond: ((gc.number)::text = (directory.numb)::text)
              Buffers: shared hit=7786
              ->  Hash Join  (cost=158.03..20524.10 rows=4357 width=23) (actual time=2.101..311.200 rows=3777 loops=1)
                    Hash Cond: ((ch.client_id)::text = (gc.code)::text)
                    Buffers: shared hit=7280
                    ->  Seq Scan on client_hierarchy ch  (cost=0.00..15955.00 rows=873500 width=10) (actual time=0.008..161.576 rows=873349 loops=1)
                          Filter: active
                          Buffers: shared hit=7220
                    ->  Hash  (cost=103.57..103.57 rows=4357 width=13) (actual time=1.969..1.969 rows=4391 loops=1)
                          Buckets: 8192  Batches: 1  Memory Usage: 262kB
                          Buffers: shared hit=60
                          ->  Seq Scan on gbo_client gc  (cost=0.00..103.57 rows=4357 width=13) (actual time=0.004..1.016 rows=4391 loops=1)
                                Filter: active
                                Buffers: shared hit=60
              ->  Hash  (cost=953.95..953.95 rows=5430 width=9) (actual time=2803.112..2803.112 rows=5563 loops=1)
                    Buckets: 8192  Batches: 1  Memory Usage: 287kB
                    Buffers: shared hit=506
                    ->  Seq Scan on directory  (cost=0.00..953.95 rows=5430 width=9) (actual time=1.353..2800.528 rows=5563 loops=1)
                          Filter: (active AND ((numb)::text ~ '^(?:(?:AACM|AACO).*)$'::text))
                          Rows Removed by Filter: 30318
                          Buffers: shared hit=506
Planning time: 8.529 ms
Execution time: 3118.224 ms

It is not necessary to guarantee order, just need remove duplicated rows with DISTINCT ON, because is the same client…they are dirty data.

Is it possible to optimize this?

SOLUTION

Base on feedback here, my new query:

SELECT client.numb, client.code, ch.client_id, ch.client_parent_id
 FROM (
      SELECT DISTINCT ON (gc.number) gc.number, gc.code
      FROM client gc
      WHERE EXISTS (
                    SELECT *
                    FROM client_hierarchy ch
                    WHERE ch.client_id = gc.code
                    )
      AND EXISTS (
                    SELECT *
                    FROM bank_directory
                    WHERE active AND numb = gc.number
      )
      AND numb ~ '^(?:AACM|AACO)'
      ) as client
 INNER JOIN client_hierarchy ch ON client.code = ch.client_id;

New EXPLAIN (ANALYZE, BUFFERS):

Nested Loop  (cost=11080.16..20448.74 rows=1251 width=23) (actual time=368.303..373.574 rows=707 loops=1)
  Buffers: shared hit=8236
  ->  Unique  (cost=11079.73..11086.15 rows=1251 width=13) (actual time=368.286..368.684 rows=707 loops=1)
        Buffers: shared hit=5402
        ->  Sort  (cost=11079.73..11082.94 rows=1284 width=13) (actual time=368.285..368.410 rows=998 loops=1)
              Sort Key: gc.number
              Sort Method: quicksort  Memory: 71kB
              Buffers: shared hit=5402
              ->  Hash Semi Join  (cost=1312.74..11013.44 rows=1284 width=13) (actual time=19.826..366.642 rows=998 loops=1)
                    Hash Cond: ((gc.number)::text = (directory.numb)::text)
                    Buffers: shared hit=5402
                    ->  Nested Loop Semi Join  (cost=0.42..9683.47 rows=1284 width=13) (actual time=0.987..346.895 rows=1070 loops=1)
                          Buffers: shared hit=4896
                          ->  Seq Scan on client gc  (cost=0.00..114.46 rows=1284 width=13) (actual time=0.857..327.979 rows=1252 loops=1)
                                Filter: ((numb)::text ~ '^(?:AACM|AACO)'::text)
                                Rows Removed by Filter: 3139
                                Buffers: shared hit=60
                          ->  Index Only Scan using client_id_hierarchy_index on client_hierarchy ch_1  (cost=0.42..7.44 rows=1 width=5) (actual time=0.014..0.014 rows=1 loops=1252)
                                Index Cond: (client_id = (gc.code)::text)
                                Heap Fetches: 1070
                                Buffers: shared hit=4836
                    ->  Hash  (cost=864.36..864.36 rows=35836 width=9) (actual time=18.790..18.790 rows=35881 loops=1)
                          Buckets: 65536  Batches: 1  Memory Usage: 1949kB
                          Buffers: shared hit=506
                          ->  Seq Scan on directory  (cost=0.00..864.36 rows=35836 width=9) (actual time=0.010..11.200 rows=35881 loops=1)
                                Filter: active
                                Buffers: shared hit=506
  ->  Index Scan using client_id_hierarchy_index on client_hierarchy ch  (cost=0.42..7.46 rows=1 width=10) (actual time=0.006..0.006 rows=1 loops=707)
        Index Cond: ((client_id)::text = (gc.code)::text)
        Buffers: shared hit=2834
Planning time: 63.201 ms
Execution time: 373.721 ms

Now I get results in 300-400ms. Down from 3,5 seconds.

Best Answer

We don't actually have the selectivity and costs of the query to optimize this. You need to EXPLAIN (ANALYZE, BUFFERS) as mentioned in the comments. That said, a few points..

  • WHERE ACTIVE = TRUE is triggering a seq scan (you can also just say WHERE active for clarity). You can add an index on active which may produce a faster result.
  • You can rewrite the query, to use EXISTS which may be faster depending on how numb are returned from the inner query.
  • Depending on the domain of numb, that may be a suitable to be turned into an enum which is faster.
  • You call DISTINCT ON. You may find it faster to do two exist tests here..

like this

SELECT gc.number, gc.code, ch.client_id, ch.client_parent_id
FROM client gc
WHERE EXISTS (
  SELECT *
  FROM client_hierarchy ch
  WHERE ch.client_id = gc.code
)
AND EXISTS (
  SELECT *
  FROM directory
  WHERE active
    AND numb = gc.number
    AND numb SIMILAR TO '(ABTR|GREW|POEW)%'
);

Also, it becomes clear when written like this that this is awkward and can be pushed down..

    AND numb = gc.number
    AND numb SIMILAR TO '(ABTR|GREW|POEW)%'

into this..

SELECT gc.number, gc.code, ch.client_id, ch.client_parent_id
FROM client gc
WHERE EXISTS (
  SELECT *
  FROM client_hierarchy ch
  WHERE ch.client_id = gc.code
)
AND EXISTS (
  SELECT *
  FROM directory
  WHERE active
    AND numb = gc.number
)
AND number SIMILAR TO '(ABTR|GREW|POEW)%';

Lastly SIMILAR TO should probably never be used. See this awesome post by Erwin for more information.