I have a performance problem with PostgreSQL 9.6.17 (x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit). Sometimes a very inefficient query plan is chosen for a relatively simple query.
There is a dir_current table with 750M rows:
\d sf.dir_current
Table "sf.dir_current"
Column | Type | Collation | Nullable | Default
--------------------+-------------+-----------+----------+-----------------------------------------------
id | bigint | | not null | nextval('sf.object_id_seq'::regclass)
volume_id | bigint | | not null |
parent_id | bigint | | |
blocks | sf.blkcnt_t | | |
rec_aggrs | jsonb | | not null |
...
Indexes:
"dir_current_pk" PRIMARY KEY, btree (id), tablespace "sf_current"
"dir_current_parentid_idx" btree (parent_id), tablespace "sf_current"
"dir_current_volumeid_id_unq" UNIQUE CONSTRAINT, btree (volume_id, id), tablespace "sf_current"
Foreign-key constraints:
"dir_current_parentid_fk" FOREIGN KEY (parent_id) REFERENCES sf.dir_current(id) DEFERRABLE INITIALLY DEFERRED
(some columns omitted as they're irrelevant here).
Now, a temporary table is created with ca. 1K rows:
CREATE TEMP TABLE dir_process AS (
SELECT sf.dir_current.id, volume_id, parent_id, depth, size, blocks, atime, ctime, mtime, sync_time, local_aggrs FROM sf.dir_current
WHERE ....
);
CREATE INDEX dir_process_indx ON dir_process(volume_id, id);
ANALYZE dir_process;
The actual condition ....
doesn't matter here – it selects some rows to be processed.
Here is a query which is sometimes very slow:
SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM sf.dir_current AS dir
INNER JOIN dir_process ON dir.parent_id = dir_process.id
AND dir.volume_id = dir_process.volume_id
WHERE dir.volume_id = ANY($volume_ids); -- $volume_ids = manual input
This query is run thousands of times a day, and only a few (say, 5 a day) result in an inefficient plan.
I can't post EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) output, as the issue only occurs sometimes, so I have to depend on the logs. A few slow plans:
duration: 1822060.789 ms plan:
Merge Join (cost=150260.47..265775.37 rows=1 width=456) (actual rows=14305 loops=1)
Merge Cond: (dir.volume_id = dir_process.volume_id)
Join Filter: (dir.parent_id = dir_process.id)
Rows Removed by Join Filter: 23117117695
-> Index Scan using dir_current_volumeid_id_unq on dir_current dir (cost=0.12..922747.05 rows=624805 width=456) (actual rows=1231600 loops=1)
Index Cond: (volume_id = ANY ('{88}'::bigint[]))
-> Sort (cost=966.16..975.55 rows=18770 width=16) (actual rows=23115900401 loops=1)
Sort Key: dir_process.volume_id
Sort Method: quicksort Memory: 1648kB
-> Seq Scan on dir_process (cost=0.00..699.70 rows=18770 width=16) (actual rows=18770 loops=1)
duration: 10140968.829 ms plan:
Merge Join (cost=0.17..8389.13 rows=1 width=456) (actual rows=819 loops=1)
Merge Cond: (dir_process.volume_id = dir.volume_id)
Join Filter: (dir.parent_id = dir_process.id)
Rows Removed by Join Filter: 2153506735
-> Index Only Scan using dir_process_indx on dir_process (cost=0.06..659.76 rows=1166 width=16) (actual rows=1166 loops=1)
Heap Fetches: 1166
-> Index Scan using dir_current_volumeid_id_unq on dir_current dir (cost=0.12..885276.20 rows=602172 width=456) (actual rows=2153506389 loops=1)
Index Cond: (volume_id = ANY ('{5}'::bigint[]))
duration: 12524111.200 ms plan:
Merge Join (cost=480671.74..878819.79 rows=1 width=456) (actual rows=62 loops=1)
Merge Cond: (dir.volume_id = dir_process.volume_id)
Join Filter: (dir.parent_id = dir_process.id)
Rows Removed by Join Filter: 153595373018
-> Index Scan using dir_current_volumeid_id_unq on dir_current dir (cost=0.12..922747.05 rows=624805 width=456) (actual rows=2360101 loops=1)
Index Cond: (volume_id = ANY ('{441}'::bigint[]))
-> Sort (cost=2621.42..2653.96 rows=65080 width=16) (actual rows=153593012980 loops=1)
Sort Key: dir_process.volume_id
Sort Method: quicksort Memory: 4587kB
-> Seq Scan on dir_process (cost=0.00..1580.80 rows=65080 width=16) (actual rows=65080 loops=1)
The first reaction is, as usual: "do I have up-to-date statistics?" The answer is yes: dir_current is changed frequently, but it is also analyzed ca. once an hour. dir_process is analyzed as soon as it's created.
Note that the estimated number of rows is in the right ballpark:
- est. 624805, actual=1231600
- est. 624805, actual=2360101
Where the estimates are way off is the inner side of the merge join, which actually reports the number of rows times the number of loops (153593012980, 2153506389, or 23115900401). The poor executor is spinning in the inner loop, iterating over all rows with a given volume_id, looking for a given id.
The biggest problem seems to be that Postgres chooses to merge-join on a very inefficient condition: dir.volume_id = dir_process.volume_id instead of dir.parent_id = dir_process.id. For a given volume_id there are a few million rows in dir_current; for a given parent_id there are hundreds or thousands of rows, but not millions.
The other, effective query plan is a nested loop, where the outer loop iterates over dir_process and an index is used to fetch the matching rows from dir_current.
I understand that I could disable merge join before running this query but I was wondering if there is a better solution.
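For reference, the workaround I have in mind would look something like this (a sketch, not a fix; $volume_ids is the same manual input as above):

```sql
BEGIN;
SET LOCAL enable_mergejoin = off;  -- only affects this transaction

SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM   sf.dir_current AS dir
JOIN   dir_process ON dir.parent_id = dir_process.id
                  AND dir.volume_id = dir_process.volume_id
WHERE  dir.volume_id = ANY($volume_ids);

COMMIT;
```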
Any other ideas about what can be done to avoid this inefficient plan? How is it possible that it's chosen over a nested loop?
I don't have control over Postgres version, unfortunately.
Best Answer
Statistics for the query planner: n_distinct

I misread at first. You do run ANALYZE on the temporary table. You commented:

"But how do you do that exactly?"
Firstly, the rows in dir_process are the parents of the rows in dir_current (dir_current.parent_id = dir_process.id). The column relevant to your query is dir_process.id, not parent_id. Setting n_distinct for dir_process.parent_id does nothing for this query. Setting it for dir_process.id would be wrong (n_distinct is actually -1 there, i.e. all distinct). So that's either useless or wrong.

Secondly, if done exactly like you described it ("dir_current is analyzed and statistics are set"), then nothing is achieved. ALTER TABLE only saves the information to pg_attribute.attoptions. Only the next ANALYZE copies it to pg_statistic, which is what's actually consulted for query planning. Run ANALYZE after ALTER TABLE! Demo: db<>fiddle here
Finally, the temp table is immutable and comparatively small. I see rows=18770 and rows=65080 for the seq scan on dir_process in the plans. If that small table is reused many times, it should pay to use actual counts instead of an estimate taken from the big table (ignoring that said statistic does not seem to apply to begin with). The actual data distribution in a small sample may be dramatically different.

If you set n_distinct explicitly, use the actual distinct count for the current selection. But not for dir_process.id. You might set that to -1 if you don't make the column UNIQUE or PRIMARY KEY. See the manual about n_distinct.
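Such an explicit setting might be sketched like this (the count 50 is a placeholder; use the actual distinct count of the current selection):

```sql
-- Placeholder value: use the actual number of distinct volume_id values in dir_process.
ALTER TABLE dir_process ALTER COLUMN volume_id SET (n_distinct = 50);

-- Essential: only the next ANALYZE copies the setting into pg_statistic.
ANALYZE dir_process;
```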
Query

That aside, your query only retrieves columns from the main table. Assuming that the temp table dir_process has no duplicates on (volume_id, id), an EXISTS semi-join is equivalent, and the predicate volume_id = ANY(volume_ids) should rather be applied to the comparatively tiny temp table (also equivalent, and statistics are up to date now!).

Index
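Picking up that Query suggestion, the rewrite might look like this (a sketch; $volume_ids stands for the same manual input as before):

```sql
SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM   sf.dir_current AS dir
WHERE  EXISTS (
   SELECT FROM dir_process p
   WHERE  p.volume_id = dir.volume_id
   AND    p.id        = dir.parent_id
   AND    p.volume_id = ANY($volume_ids)  -- filter applied to the tiny temp table
   );
```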
I see a UNIQUE constraint on (volume_id, id) on the main table. Since the temp table seems to be a subset, the same constraint should apply, and you can replace the index on the temp table with a UNIQUE constraint as well. (That implements a unique index on (volume_id, id) internally.) Not sure whether the Postgres 9.6 query planner is helped by it, but it can only get better.

More importantly, an index on (parent_id, volume_id) on the main table should be useful. Currently there is an index on just (parent_id). But I suspect there can be many rows per parent_id, so the two-column index should be better. (Scratch this if parent_id is mostly unique!) The new index can mostly replace your existing index dir_current_parentid_idx on (parent_id). It's 50% bigger, which is the main downside, but you wouldn't want to maintain two indexes without good reason.
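Both changes might be sketched like this (the constraint name and the TABLESPACE clause are my assumptions):

```sql
-- Temp table: replace the plain index with a UNIQUE constraint.
DROP INDEX dir_process_indx;
ALTER TABLE dir_process
  ADD CONSTRAINT dir_process_volumeid_id_unq UNIQUE (volume_id, id);

-- Main table: two-column index supporting the join on (parent_id, volume_id).
CREATE INDEX dir_current_parentid_volumeid_idx
  ON sf.dir_current (parent_id, volume_id)
  TABLESPACE sf_current;  -- matching the existing tablespace is an assumption
```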
UNIQUE
constraintdir_current_volumeid_id_unq
on(volume_id, id)
has lost its usefulness for this query now. The "unique" aspect has always been redundant, asid
is the PK and unique on its own.Unless it has other purposes, it might be removed or reduced to an appendix to the PK with the
INCLUDE
clause in Postgres 11 or later!Or you might make my suggested index
dir_current_parentid_volumeid_idx
on(volume_id, parent_id)
(switched column order) instead to replace this. (And keep the index on just(parent_id)
, obviously.)See:
Should be very fast all the time now.