I have a query in PostgreSQL 9.6 that is taking a particularly long time to run:
SELECT DISTINCT ON (gc.number)
gc.number, gc.code, ch.client_id, ch.client_parent_id
FROM client gc, client_hierarchy ch
WHERE gc.code = ch.client_id
AND gc.number in (SELECT NUMB FROM directory
WHERE ACTIVE = TRUE
AND NUMB SIMILAR TO '(ABTR|GREW|POEW)%');
SIMILAR TO
receives many different NUMB
's as parameter. ABTR|GREW|POEW
and around 3.500 more.
Table definitions:
CREATE TABLE directory (
id BIGINT PRIMARY KEY NOT NULL,
active BOOLEAN NOT NULL,
numb VARCHAR(8),
branch VARCHAR(20),
city VARCHAR(50),
modified_date TIMESTAMP NOT NULL,
name VARCHAR(200)
);
CREATE INDEX dir_numb_index ON directory (numb);
CREATE INDEX dir_branch_index ON bank_directory (branch);
CREATE INDEX dir_city_index ON bank_directory (city);
CREATE INDEX dir_name_index ON bank_directory (name);
CREATE TABLE client (
id BIGINT PRIMARY KEY NOT NULL,
active BOOLEAN NOT NULL,
number VARCHAR(8),
code VARCHAR(9) NOT NULL,
modified_date TIMESTAMP NOT NULL,
name VARCHAR(200),
);
CREATE INDEX client_number_index ON client (number);
CREATE INDEX client_code_index ON client (code);
CREATE INDEX client_name_index ON client (name);
CREATE TABLE client_hierarchy (
id BIGINT PRIMARY KEY NOT NULL,
active BOOLEAN NOT NULL,
client_id VARCHAR(9),
client_parent_id VARCHAR(9),
modified_date TIMESTAMP NOT NULL
);
CREATE INDEX client_id_hierarchy_index ON client_hierarchy (client_id);
CREATE INDEX client_parent_id_hierarchy_index ON client_hierarchy (client_parent_id);
Table client
has 5K rows and client_hierarchy
900K rows.
Output of EXPLAIN (ANALYZE, BUFFERS)
:
Unique (cost=21869.20..21890.98 rows=3760 width=23) (actual time=3117.734..3118.110 rows=707 loops=1)
Buffers: shared hit=7786
-> Sort (cost=21869.20..21880.09 rows=4357 width=23) (actual time=3117.733..3117.869 rows=998 loops=1)
Sort Key: gc.number
Sort Method: quicksort Memory: 102kB
Buffers: shared hit=7786
-> Hash Semi Join (cost=1179.86..21605.84 rows=4357 width=23) (actual time=2825.705..3115.586 rows=998 loops=1)
Hash Cond: ((gc.number)::text = (directory.numb)::text)
Buffers: shared hit=7786
-> Hash Join (cost=158.03..20524.10 rows=4357 width=23) (actual time=2.101..311.200 rows=3777 loops=1)
Hash Cond: ((ch.client_id)::text = (gc.code)::text)
Buffers: shared hit=7280
-> Seq Scan on client_hierarchy ch (cost=0.00..15955.00 rows=873500 width=10) (actual time=0.008..161.576 rows=873349 loops=1)
Filter: active
Buffers: shared hit=7220
-> Hash (cost=103.57..103.57 rows=4357 width=13) (actual time=1.969..1.969 rows=4391 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 262kB
Buffers: shared hit=60
-> Seq Scan on gbo_client gc (cost=0.00..103.57 rows=4357 width=13) (actual time=0.004..1.016 rows=4391 loops=1)
Filter: active
Buffers: shared hit=60
-> Hash (cost=953.95..953.95 rows=5430 width=9) (actual time=2803.112..2803.112 rows=5563 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 287kB
Buffers: shared hit=506
-> Seq Scan on directory (cost=0.00..953.95 rows=5430 width=9) (actual time=1.353..2800.528 rows=5563 loops=1)
Filter: (active AND ((numb)::text ~ '^(?:(?:AACM|AACO).*)$'::text))
Rows Removed by Filter: 30318
Buffers: shared hit=506
Planning time: 8.529 ms
Execution time: 3118.224 ms
It is not necessary to guarantee order, just need remove duplicated rows with DISTINCT ON
, because is the same client…they are dirty data.
Is it possible to optimize this?
SOLUTION
Base on feedback here, my new query:
SELECT client.numb, client.code, ch.client_id, ch.client_parent_id
FROM (
SELECT DISTINCT ON (gc.number) gc.number, gc.code
FROM client gc
WHERE EXISTS (
SELECT *
FROM client_hierarchy ch
WHERE ch.client_id = gc.code
)
AND EXISTS (
SELECT *
FROM bank_directory
WHERE active AND numb = gc.number
)
AND numb ~ '^(?:AACM|AACO)'
) as client
INNER JOIN client_hierarchy ch ON client.code = ch.client_id;
New EXPLAIN (ANALYZE, BUFFERS)
:
Nested Loop (cost=11080.16..20448.74 rows=1251 width=23) (actual time=368.303..373.574 rows=707 loops=1)
Buffers: shared hit=8236
-> Unique (cost=11079.73..11086.15 rows=1251 width=13) (actual time=368.286..368.684 rows=707 loops=1)
Buffers: shared hit=5402
-> Sort (cost=11079.73..11082.94 rows=1284 width=13) (actual time=368.285..368.410 rows=998 loops=1)
Sort Key: gc.number
Sort Method: quicksort Memory: 71kB
Buffers: shared hit=5402
-> Hash Semi Join (cost=1312.74..11013.44 rows=1284 width=13) (actual time=19.826..366.642 rows=998 loops=1)
Hash Cond: ((gc.number)::text = (directory.numb)::text)
Buffers: shared hit=5402
-> Nested Loop Semi Join (cost=0.42..9683.47 rows=1284 width=13) (actual time=0.987..346.895 rows=1070 loops=1)
Buffers: shared hit=4896
-> Seq Scan on client gc (cost=0.00..114.46 rows=1284 width=13) (actual time=0.857..327.979 rows=1252 loops=1)
Filter: ((numb)::text ~ '^(?:AACM|AACO)'::text)
Rows Removed by Filter: 3139
Buffers: shared hit=60
-> Index Only Scan using client_id_hierarchy_index on client_hierarchy ch_1 (cost=0.42..7.44 rows=1 width=5) (actual time=0.014..0.014 rows=1 loops=1252)
Index Cond: (client_id = (gc.code)::text)
Heap Fetches: 1070
Buffers: shared hit=4836
-> Hash (cost=864.36..864.36 rows=35836 width=9) (actual time=18.790..18.790 rows=35881 loops=1)
Buckets: 65536 Batches: 1 Memory Usage: 1949kB
Buffers: shared hit=506
-> Seq Scan on directory (cost=0.00..864.36 rows=35836 width=9) (actual time=0.010..11.200 rows=35881 loops=1)
Filter: active
Buffers: shared hit=506
-> Index Scan using client_id_hierarchy_index on client_hierarchy ch (cost=0.42..7.46 rows=1 width=10) (actual time=0.006..0.006 rows=1 loops=707)
Index Cond: ((client_id)::text = (gc.code)::text)
Buffers: shared hit=2834
Planning time: 63.201 ms
Execution time: 373.721 ms
Now I get results in 300-400ms. Down from 3,5 seconds.
Best Answer
We don't actually have the selectivity and costs of the query to optimize this. You need to
EXPLAIN (ANALYZE, BUFFERS)
as mentioned in the comments. That said, a few points..WHERE ACTIVE = TRUE
is triggering a seq scan (you can also just sayWHERE active
for clarity). You can add an index on active which may produce a faster result.EXISTS
which may be faster depending on hownumb
are returned from the inner query.numb
, that may be a suitable to be turned into anenum
which is faster.DISTINCT ON
. You may find it faster to do two exist tests here..like this
Also, it becomes clear when written like this that this is awkward and can be pushed down..
into this..
Lastly
SIMILAR TO
should probably never be used. See this awesome post by Erwin for more information.