Postgresql – Very bad query plan in PostgreSQL 9.6

postgresql, postgresql-9.6, postgresql-performance, statistics

I have a performance problem with PostgreSQL 9.6.17 (x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-39), 64-bit). Sometimes a very inefficient query plan is chosen for a relatively simple query.

There is a dir_current table with 750M rows:

\d sf.dir_current

                                         Table "sf.dir_current"
       Column       |    Type     | Collation | Nullable |                    Default
--------------------+-------------+-----------+----------+-----------------------------------------------
 id                 | bigint      |           | not null | nextval('sf.object_id_seq'::regclass)
 volume_id          | bigint      |           | not null |
 parent_id          | bigint      |           |          |
 blocks             | sf.blkcnt_t |           |          |
 rec_aggrs          | jsonb       |           | not null |
...
Indexes:
    "dir_current_pk" PRIMARY KEY, btree (id), tablespace "sf_current"
    "dir_current_parentid_idx" btree (parent_id), tablespace "sf_current"
    "dir_current_volumeid_id_unq" UNIQUE CONSTRAINT, btree (volume_id, id), tablespace "sf_current"
Foreign-key constraints:
    "dir_current_parentid_fk" FOREIGN KEY (parent_id) REFERENCES sf.dir_current(id) DEFERRABLE INITIALLY DEFERRED

(some columns omitted as they're irrelevant here).

Now, a temporary table is created with ca. 1K rows:

CREATE TEMP TABLE dir_process AS (
  SELECT sf.dir_current.id, volume_id, parent_id, depth, size, blocks, atime, ctime, mtime, sync_time, local_aggrs FROM sf.dir_current
  WHERE ....
);
CREATE INDEX dir_process_indx ON dir_process(volume_id, id);
ANALYZE dir_process;

The actual condition .... doesn't matter here – it selects some rows to be processed.

Here is a query which is sometimes very slow:

SELECT dir.id, dir.volume_id, dir.parent_id, dir.rec_aggrs, dir.blocks
FROM sf.dir_current AS dir
INNER JOIN dir_process ON dir.parent_id = dir_process.id 
                      AND dir.volume_id = dir_process.volume_id
WHERE dir.volume_id = ANY($volume_ids);  -- $volume_ids = manual input

This query is run thousands of times a day and only a few executions (say 5 a day) end up with an inefficient plan.

I can't post EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) output as this issue only occurs sometimes, so I have to rely on the logs. A few slow plans:

duration: 1822060.789 ms  plan:
Merge Join  (cost=150260.47..265775.37 rows=1 width=456) (actual rows=14305 loops=1)
  Merge Cond: (dir.volume_id = dir_process.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 23117117695
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..922747.05 rows=624805 width=456) (actual rows=1231600 loops=1)
        Index Cond: (volume_id = ANY ('{88}'::bigint[]))
  ->  Sort  (cost=966.16..975.55 rows=18770 width=16) (actual rows=23115900401 loops=1)
        Sort Key: dir_process.volume_id
        Sort Method: quicksort  Memory: 1648kB
        ->  Seq Scan on dir_process  (cost=0.00..699.70 rows=18770 width=16) (actual rows=18770 loops=1)

duration: 10140968.829 ms  plan:
Merge Join  (cost=0.17..8389.13 rows=1 width=456) (actual rows=819 loops=1)
  Merge Cond: (dir_process.volume_id = dir.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 2153506735
  ->  Index Only Scan using dir_process_indx on dir_process  (cost=0.06..659.76 rows=1166 width=16) (actual rows=1166 loops=1)
        Heap Fetches: 1166
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..885276.20 rows=602172 width=456) (actual rows=2153506389 loops=1)
        Index Cond: (volume_id = ANY ('{5}'::bigint[]))   

duration: 12524111.200 ms  plan:
Merge Join  (cost=480671.74..878819.79 rows=1 width=456) (actual rows=62 loops=1)
  Merge Cond: (dir.volume_id = dir_process.volume_id)
  Join Filter: (dir.parent_id = dir_process.id)
  Rows Removed by Join Filter: 153595373018
  ->  Index Scan using dir_current_volumeid_id_unq on dir_current dir  (cost=0.12..922747.05 rows=624805 width=456) (actual rows=2360101 loops=1)
        Index Cond: (volume_id = ANY ('{441}'::bigint[]))
  ->  Sort  (cost=2621.42..2653.96 rows=65080 width=16) (actual rows=153593012980 loops=1)
        Sort Key: dir_process.volume_id
        Sort Method: quicksort  Memory: 4587kB
        ->  Seq Scan on dir_process  (cost=0.00..1580.80 rows=65080 width=16) (actual rows=65080 loops=1)        

The first reaction is the usual one: "do I have up-to-date statistics?". The answer is yes: dir_current changes frequently but is also analyzed roughly once an hour, and dir_process is analyzed as soon as it's created.
Note that the estimated number of rows for the index scans is at least in the right ballpark:

  • est. 624805, actual=1231600
  • est. 624805, actual=2360101

Where the numbers explode is the inner side of the merge join, whose actual row count reflects the number of rows times the number of times that side is rescanned (153593012980, 2153506389 or 23115900401). The poor executor is spinning in the inner loop, iterating over all rows with a given volume_id while looking for a given id.

The biggest problem seems to be that Postgres chooses to drive the merge join on a very unselective condition: dir.volume_id = dir_process.volume_id instead of dir.parent_id = dir_process.id. For a given volume_id there are a few million rows in dir_current; for a given parent_id there are hundreds or thousands of rows, but not millions.
The other, effective query plan is a nested loop whose outer loop iterates over dir_process and uses an index to fetch the matching rows from dir_current.

I understand that I could disable merge joins before running this query, but I was wondering if there is a better solution.

Any more ideas about what can be done to avoid this inefficient plan? How is it possible that it's chosen over a nested loop?

I don't have control over Postgres version, unfortunately.

Best Answer

Statistics for the query planner: n_distinct

I misread at first. You do run ANALYZE on the temporary table. You commented:

The temp table is immutable: it's created once, analyzed and never changes. It exists because we had a problem with plain query being inefficient, so we extracted the rows we're interested in to a temp table to have statistics about it. Then the second query is used to find rows from dir_current which are children of those interesting ones (i.e. dir_current.parent_id=dir_process.id)

dir_current is analyzed and statistics are set according to our experience (i.e. there are 10 rows on average with a given parent_id, so n_distinct is set to -0.1). A given volume_id, OTOH, has millions of rows.

But how do you do that exactly?

Firstly, the rows in dir_process are the parents of the rows in dir_current (dir_current.parent_id = dir_process.id). The column relevant to your query is dir_process.id, not parent_id. Setting n_distinct for dir_process.parent_id does nothing for this query. Setting it for dir_process.id would be wrong (n_distinct is actually -1, i.e. all distinct). So that's either useless or wrong.

Secondly, if done exactly like you described it (dir_current is analyzed and statistics are set) then nothing is achieved. ALTER TABLE only saves the setting in pg_attribute.attoptions. Only the next ANALYZE copies it to pg_statistic, which is what's actually consulted for query planning. Run ANALYZE after ALTER TABLE! Demo:

db<>fiddle here

Finally, the temp table is immutable and comparatively small. I see rows=18770 and rows=65080 for the seq scans on dir_process in the plans. If that small table is reused many times, it should pay to use actual counts instead of an estimate taken from the big table (ignoring that said statistic does not seem to apply to begin with). The actual data distribution in a small sample may be dramatically different.
If you set n_distinct explicitly, use the actual distinct count for the current selection. Something like:

DO
$$
BEGIN
EXECUTE format('ALTER TABLE dir_process ALTER COLUMN volume_id SET (n_distinct=%s)'
             , (SELECT count(DISTINCT volume_id) FROM dir_process));
END
$$;

But not for dir_process.id. You might set that to -1 if you don't make the column UNIQUE or PRIMARY KEY.

The manual about n_distinct:

If greater than zero, the estimated number of distinct values in the column. If less than zero, the negative of the number of distinct values divided by the number of rows. (The negated form is used when ANALYZE believes that the number of distinct values is likely to increase as the table grows; the positive form is used when the column seems to have a fixed number of possible values.) For example, -1 indicates a unique column in which the number of distinct values is the same as the number of rows.


Query

That aside, your query only retrieves columns from the main table. Assuming that the temp table dir_process has no duplicates on (volume_id, id), an EXISTS semi-join is equivalent, and the predicate volume_id = ANY($volume_ids) should rather be applied to the comparatively tiny temp table (also equivalent, and its statistics are up to date now!):

SELECT c.id, c.volume_id, c.parent_id, c.rec_aggrs, c.blocks
FROM   sf.dir_current c
WHERE  EXISTS (
   SELECT FROM dir_process p
   WHERE  p.volume_id = ANY($volume_ids)
   AND    p.id = c.parent_id
   AND    p.volume_id = c.volume_id
   );

Index

I see a UNIQUE constraint on (volume_id, id) on the main table. Since the temp table seems to be a subset, the same constraint should apply and you can replace the index on the temp table:

CREATE INDEX dir_process_indx ON dir_process(volume_id, id);

with a UNIQUE constraint as well:

ALTER TABLE dir_process ADD CONSTRAINT dir_process_volumeid_id_unq
UNIQUE (volume_id, id);

(This implements a unique index on (volume_id, id) internally.) I'm not sure whether the Postgres 9.6 query planner is helped by it, but it can only get better.

More importantly, an index on (parent_id, volume_id) on the main table should be useful:

CREATE INDEX dir_current_parentid_volumeid_idx ON sf.dir_current (parent_id, volume_id);

Currently there is an index on just (parent_id). But I suspect there can be many rows per parent_id, so the one above should be better. (Scratch this if parent_id is mostly unique!) The new index can mostly replace your existing index dir_current_parentid_idx on (parent_id). It's 50% bigger, which is the main downside. But you wouldn't want to maintain two indexes without good reason.

Also, the UNIQUE constraint dir_current_volumeid_id_unq on (volume_id, id) has lost its usefulness for this query now. The "unique" aspect has always been redundant, as id is the PK and unique on its own.

Unless it has other purposes, it might be removed or reduced to an appendix to the PK with the INCLUDE clause in Postgres 11 or later!

Or, alternatively, you might create my suggested index dir_current_parentid_volumeid_idx with (volume_id, parent_id) (switched column order) instead, to replace that constraint's index. (And keep the existing index on just (parent_id), obviously.)


Should be very fast all the time now.