Postgresql – Strange optimisation effect using WITH construction

Tags: amazon-rds, execution-plan, optimization, performance, postgresql, query-performance

I have 4 tables; let's name them:

  1. table A, 15M rows
  2. table B, 40K rows
  3. table C, 30K rows
  4. table D, 25M rows

(M = millions, K = thousands)

and I had a legacy query, constructed like this:

select C.<some_fields>,B.<some_fields>,D.<some_fields> from C
inner join A on C.x = A.x
inner join D on D.z = 123 and D.a_id = A.a_id
inner join B on C.x = B.x and B.z = 123
where A.type = 'Xxx'

This query was extremely slow: it took up to 3 minutes to return its results (for a particular case it returns 35K rows).

But when I changed it to the following structure:

with t as (
   select C.<some_fields>,D.<some_fields> from C
   inner join A on C.x = A.x
   inner join D on D.z = 123 and D.a_id = A.a_id
   where A.type = 'Xxx'
)
select t.*, B.<some_fields> from t
inner join B on t.x = B.x and B.z = 123

It started working 30 times faster (i.e. it now takes up to 6 seconds to retrieve the same results).

Let's assume that the indexes are constructed properly. The idea for this trick was born when I noticed that the block I wrapped into with ( ... ) runs very fast on its own (and returns a very similar amount of data to the entire query).

So my question is: what could be the reason? Why can't Postgres construct a proper plan or perform the same trick internally?

UPDATE:

Execution plan for old query:

Nested Loop  (cost=1.83..1672.82 rows=1 width=54) (actual time=8.178..91515.625 rows=37373 loops=1)
  ->  Nested Loop  (cost=1.42..1671.47 rows=1 width=62) (actual time=8.108..90883.567 rows=37373 loops=1)
        Join Filter: (a.x = b.x)
        Rows Removed by Join Filter: 9132436
        ->  Index Scan using b_pkey on B b  (cost=0.41..8.43 rows=1 width=71) (actual time=0.022..0.782 rows=241 loops=1)
              Index Cond: (z = 123)
        ->  Nested Loop  (cost=1.00..1660.48 rows=146 width=149) (actual time=0.027..363.227 rows=38049 loops=241)
              ->  Index Only Scan using idx_1 on D d  (cost=0.56..424.59 rows=146 width=8) (actual time=0.017..50.869 rows=64176 loops=241)
                    Index Cond: (z = 123)
                    Heap Fetches: 15564503
              ->  Index Scan using a_pkey on A a  (cost=0.44..8.46 rows=1 width=149) (actual time=0.003..0.004 rows=1 loops=15466416)
                    Index Cond: (a_id = d.a_id)
  ->  Index Scan using c_pkey on C c (cost=0.41..1.08 rows=1 width=8) (actual time=0.005..0.007 rows=1 loops=37373)
        Index Cond: (x = a.x)
        Filter: ((type)::text = 'Xxx')
Planning time: 3.468 ms
Execution time: 91541.019 ms

Execution plan for new query:

Hash Join  (cost=1828.09..1830.28 rows=1 width=94) (actual time=0.654..1130.542 rows=37376 loops=1)
  Hash Cond: (t.x = b.x)
  CTE t
    ->  Nested Loop  (cost=1.42..1819.64 rows=81 width=158) (actual time=0.060..761.058 rows=38052 loops=1)
          ->  Nested Loop  (cost=1.00..1660.48 rows=146 width=149) (actual time=0.039..461.235 rows=38052 loops=1)
                ->  Index Only Scan using idx_1 on D d  (cost=0.56..424.59 rows=146 width=8) (actual time=0.024..73.972 rows=64179 loops=1)
                      Index Cond: (z = 123)
                      Heap Fetches: 64586
                ->  Index Scan using a_pkey on A a  (cost=0.44..8.46 rows=1 width=149) (actual time=0.004..0.004 rows=1 loops=64179)
                      Index Cond: (a_id = d.a_id)
          ->  Index Scan using c_pkey on C c  (cost=0.41..1.07 rows=1 width=17) (actual time=0.004..0.005 rows=1 loops=38052)
                Index Cond: (x = a.x)
                Filter: ((type)::text = 'Xxx')
  ->  CTE Scan on t  (cost=0.00..1.62 rows=81 width=104) (actual time=0.063..854.405 rows=38052 loops=1)
  ->  Hash  (cost=8.43..8.43 rows=1 width=71) (actual time=0.353..0.353 rows=241 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 34kB
        ->  Index Scan using b_pkey on B b  (cost=0.41..8.43 rows=1 width=71) (actual time=0.012..0.262 rows=241 loops=1)
              Index Cond: (z = 123)
Planning time: 1.221 ms
Execution time: 1147.267 ms

UPDATE-2:

Commenters have pointed out that the issue is caused by bad row-count estimates, and they suggested running VACUUM ANALYZE. But I'm running this server on Amazon RDS, where the autovacuum feature is enabled. I've also tried running the script for showing tables eligible for vacuum suggested in the Amazon RDS documentation, and it shows 0 tables eligible for vacuum.
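
(For completeness: a quick way to see when autovacuum and auto-ANALYZE last touched these tables is the standard pg_stat_user_tables view; the table names below are placeholders:)

SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('a', 'b', 'c', 'd');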

UPDATE-3: Running ANALYZE, as suggested in the comments, didn't fix the bad row estimates in the plans, but it did increase the speed of the "old" variant of the query. I still don't fully understand my core question: why is the second form of the query dramatically faster (even without ANALYZE)?

Best Answer

From the EXPLAIN output, I can clearly see that PostgreSQL produced bad estimates and chose a bad plan for the first query. The equality selectivity estimate on column z of table D is completely wrong: 146 rows estimated vs. 64,176 actual (over 400 times off).

The second query fared better because WITH is a planning fence, as Craig pointed out.
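
Note that this fence behavior is version-dependent: up to PostgreSQL 11 every CTE is an optimization fence, while from PostgreSQL 12 onward CTEs can be inlined and the fence must be requested explicitly with the MATERIALIZED keyword, e.g.:

with t as materialized (  -- PostgreSQL 12+: forces the CTE to stay a fence
   select C.<some_fields>, D.<some_fields> from C
   inner join A on C.x = A.x
   inner join D on D.z = 123 and D.a_id = A.a_id
   where A.type = 'Xxx'
)
select t.*, B.<some_fields> from t
inner join B on t.x = B.x and B.z = 123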

Let's concentrate on the 1st query.

The planner produces bad estimates because of missing, stale, or inadequate statistics.

Almost all of the statistics are visible in pg_stats. You should inspect this view and preferably paste the relevant rows here:

SELECT * FROM pg_stats WHERE tablename='d' and attname='z';
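
The columns that drive an equality-selectivity estimate are the null fraction, the distinct count, and the most-common-values lists; a narrower variant of the same query:

SELECT null_frac, n_distinct, most_common_vals, most_common_freqs, histogram_bounds
FROM pg_stats WHERE tablename='d' and attname='z';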

If a column has a peculiar statistical distribution (non-uniform, non-Gaussian, skewed, pseudo-random with hidden patterns, etc.), then the statistics module behind ANALYZE may be unable to capture the regularities the planner needs.
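
A quick way to check whether a single predicate is the culprit is to compare its estimated and actual row counts in isolation (a sketch for your table D):

EXPLAIN ANALYZE SELECT count(*) FROM D WHERE z = 123;
-- compare the planner's "rows=" estimate with the "actual ... rows=" figure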

In many cases, a larger sample helps produce truthful estimates. There is a configuration parameter that raises the sample size for ANALYZE; it can be used like this:

SET default_statistics_target TO 200;
ANALYZE A;
ANALYZE B;
ANALYZE C;
SET default_statistics_target TO 1000;
ANALYZE D;

Experiment with the values and retry your SELECT queries. If it helps, you can raise the sample size permanently for a single column, as shown below.
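
For example (the per-column setting is stored permanently and takes effect on the next ANALYZE):

ALTER TABLE D ALTER COLUMN z SET STATISTICS 1000;
ANALYZE D;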

PS. As always, you should make sure that the memory/resource configuration is correct. All the settings in the "Resource Configuration" section should be adjusted to the actual server resources (database and memory size, RAID array type, SSD drives).
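
For example (illustrative values only; adapt them to the actual instance, and note that on RDS they are changed through parameter groups rather than postgresql.conf):

shared_buffers = 4GB            -- commonly around 25% of RAM
effective_cache_size = 12GB     -- rough size of cache available to the database
work_mem = 64MB                 -- per sort/hash operation, per connection
random_page_cost = 1.1          -- lower than the default 4.0 on SSD storage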