PostgreSQL Recursive CTE Index Usage – Why Recursive CTE Does Not Use an Index

Tags: postgresql, postgresql-13

Suppose the following relations:

  • match(match_id)
  • event(match_id, seq, gt, …)

There are the following indexes:

  • match(match_id)
  • event(match_id, seq)

Further notes:

  • gt is monotonically increasing
  • For a given match, I have a collection of events, each of which happens at a specific gt time
  • Both match and event are materialized views.

I am using PostgreSQL 13.1.

My goal is to come up with a RECURSIVE CTE query that calculates the delta between one event and the next, but I find it very slow.
While this can be solved practically with a self-join, I am not interested in that; I want to find out why my CTE is slow, because I believe it should not be.
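For reference, the self-join alternative alluded to above might look something like this (a sketch only, assuming the table and column names used elsewhere in this question):

```sql
-- Sketch of the self-join approach (not the point of the question):
-- pair each event with the event having the previous seq in the same match.
SELECT cur.match_id,
       cur.seq,
       cur.gt,
       cur.gt - prev.gt AS delta
FROM   events cur
JOIN   events prev
       ON prev.match_id = cur.match_id
      AND prev.seq = cur.seq - 1;
```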

Further numbers:

  • number of matches is 400
  • each match has an average of 541 events
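These numbers give a quick sanity check on the expected result size (using 408 matches as reported in the plan below, rather than the rounded 400):

```python
# Rough sanity check of the recursive CTE's output size from the numbers
# above: 408 matches appear in the plan, ~541 events per match on average.
matches = 408            # rows returned by the seq scan on matches
avg_events = 541         # average events per match (from the question)

# The CTE emits one anchor row per match (seq = 1) plus roughly one row
# per event reached through the recursion.
expected_rows = matches * (avg_events + 1)
print(expected_rows)  # 221136, close to the rows=220268 reported in the plan
```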

My RECURSIVE CTE query is the following:


WITH RECURSIVE
    delta_gts AS (
        SELECT m.match_id, 1 AS seq, 0 AS gt, 0 AS delta
        FROM matches m

        UNION

        SELECT dgt.match_id, ev.seq AS seq, ev.gt AS gt, (ev.gt - dgt.gt) AS delta
        FROM delta_gts dgt
        JOIN events ev ON ev.match_id = dgt.match_id AND ev.seq = (dgt.seq + 1)

    )

SELECT * FROM delta_gts g

Further note: I also tried adding the following (to restrict the query to one match):

WHERE g.match_id = 'ita_1672780'

and I found in the plan that there is no predicate pushdown. I thought this had been implemented in PostgreSQL 13.
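Since the planner will not push a predicate from the outer query into a recursive CTE, one workaround (a sketch, reusing the literal match_id from above) is to filter in the non-recursive term, so the recursion only ever starts from that one match:

```sql
WITH RECURSIVE
    delta_gts AS (
        SELECT m.match_id, 1 AS seq, 0 AS gt, 0 AS delta
        FROM matches m
        WHERE m.match_id = 'ita_1672780'  -- filter the anchor, not the outer query

        UNION

        SELECT dgt.match_id, ev.seq, ev.gt, ev.gt - dgt.gt AS delta
        FROM delta_gts dgt
        JOIN events ev
          ON ev.match_id = dgt.match_id
         AND ev.seq = dgt.seq + 1
    )
SELECT * FROM delta_gts;
```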

This is the plan:

QUERY PLAN
CTE Scan on delta_gts g  (cost=160601.44..161032.40 rows=21548 width=76) (actual time=173.940..354185.831 rows=220268 loops=1)
"  Buffers: shared hit=5453034 read=596370, temp read=1340253 written=1581611"
  CTE delta_gts
    ->  Recursive Union  (cost=0.00..160601.44 rows=21548 width=76) (actual time=173.931..353944.926 rows=220268 loops=1)
"          Buffers: shared hit=5453034 read=596370, temp read=1340253 written=1580590"
          ->  Seq Scan on netcastingdocument_matches m  (cost=0.00..10.08 rows=408 width=28) (actual time=173.917..174.265 rows=408 loops=1)
                Buffers: shared hit=6
          ->  Hash Join  (cost=14121.22..16016.04 rows=2114 width=76) (actual time=259.550..305.356 rows=190 loops=1158)
                Hash Cond: ((dgt.match_id = ev.match_id) AND ((dgt.seq + 1) = ev.seq))
"                Buffers: shared hit=5453028 read=596370, temp read=1340253 written=1580590"
                ->  WorkTable Scan on delta_gts dgt  (cost=0.00..81.60 rows=4080 width=72) (actual time=0.005..0.067 rows=190 loops=1158)
                ->  Hash  (cost=8106.89..8106.89 rows=288289 width=24) (actual time=257.949..257.949 rows=288323 loops=1158)
                      Buckets: 65536  Batches: 8  Memory Usage: 2484kB
"                      Buffers: shared hit=5453022 read=596370, temp written=1565616"
                      ->  Seq Scan on netcastingdocument_events ev  (cost=0.00..8106.89 rows=288289 width=24) (actual time=0.016..92.171 rows=288323 loops=1158)
                            Buffers: shared hit=5453022 read=596370
Planning:
  Buffers: shared hit=107
Planning Time: 50.290 ms
JIT:
  Functions: 13
"  Options: Inlining false, Optimization false, Expressions true, Deforming true"
"  Timing: Generation 4.108 ms, Inlining 0.000 ms, Optimization 19.158 ms, Emission 154.531 ms, Total 177.796 ms"
Execution Time: 355489.930 ms

Considerations:

  • It is not using the index (match_id, seq) on the events table at all when the recursive part of the CTE is executed.
  • Disabling seqscan does the trick as it will use the index for events.

After some investigation, it looks like the issue is that a sequential scan is performed to look up the next event, which is not right in my situation.
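The "disabling seqscan" experiment mentioned above can be done per session, so it does not affect other connections (a sketch; run the recursive query under EXPLAIN in between):

```sql
-- Discourage sequential scans for the current session only; the planner
-- then prefers the (match_id, seq) index for the recursive join.
SET enable_seqscan = off;

-- ... run EXPLAIN (ANALYZE, BUFFERS) on the recursive query here ...

RESET enable_seqscan;  -- restore the default afterwards
```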

Best Answer

There may be several causes; I cannot be certain, because you didn't post the EXPLAIN (ANALYZE, BUFFERS) output for both executions.

  • PostgreSQL might mis-estimate the row counts. Running ANALYZE as you did is a good approach here, but in a recursive CTE the row counts are often hard to predict, and it is hard to fix these estimates.

    If you don't mind a nasty trick, you could try adding another superfluous join condition to make PostgreSQL think that the result will have fewer rows:

    JOIN events ev
       ON ev.match_id = dgt.match_id
          AND ev.seq = dgt.seq + 1
          AND ev.seq - 1 = dgt.seq
    
  • PostgreSQL might estimate the cost of an index scan too high, which induces it to choose a sequential scan and a hash join instead of a nested loop join.

    • If your data are on an SSD, lower random_page_cost to 1 or 1.1 to tell the PostgreSQL optimizer that random (index) page reads are not four times as expensive as sequential reads.

    • If you have enough RAM, you should set effective_cache_size high enough, so that PostgreSQL knows that data are likely cached. That will also lower the cost of an index scan.
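The two settings above could be adjusted like this (the values are illustrative assumptions; pick effective_cache_size based on the RAM actually available for caching on your machine):

```sql
-- Assume SSD storage: random page reads cost about the same as sequential ones.
ALTER SYSTEM SET random_page_cost = 1.1;

-- Assume roughly 8 GB of RAM is available for caching (adjust to your machine).
ALTER SYSTEM SET effective_cache_size = '8GB';

SELECT pg_reload_conf();  -- apply both settings without restarting the server
```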