PostgreSQL – Equivalent Join Conditions Yield Vastly Different Results

optimization, performance, postgresql, postgresql-9.6, query-performance

The following query aligns each test row with the closest truth row at or earlier than the test row's epoch, for every possible truth_id. There are very few truth_ids and very few test ids, but each id has thousands of rows with consecutive epoch times. Truth ids are not the same as test ids.

So for test id X, each of its rows will be paired with the closest truth row for each truth id A, B, C, …

For 4 unique test ids and 3 unique truth ids, with 1000 rows per test id and 2000 rows per truth id, the query should return (4*1000)*3 = 12,000 rows. Note that the number of rows per truth id does not affect the number of rows returned.
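For context, a minimal schema consistent with the queries below (column names are taken from the queries; the types are assumptions, since the post doesn't include table definitions) might look like:

```sql
CREATE TABLE test_data (
    id    integer,          -- test id (few distinct values)
    epoch bigint,           -- consecutive epoch times
    x     double precision,
    y     double precision
);

CREATE TABLE truth_data (
    id    integer,          -- truth id, disjoint from test ids
    epoch bigint,
    x     double precision,
    y     double precision
);
```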

This version quickly completes in 0.27 seconds:

SELECT * 
FROM
(
    SELECT
        epoch as test_epoch,
        id as test_id,
        truth_id,
        sqrt(x^2 + y^2) as test_r
    FROM test_data 
    CROSS JOIN ( SELECT DISTINCT id AS truth_id FROM truth_data) as tids
) AS test
LEFT JOIN
(
    SELECT
        epoch as truth_epoch,
        id as truth_id,
        sqrt(x^2 + y^2) as truth_r
    FROM truth_data     
) AS truth_low
ON (
    test.truth_id = truth_low.truth_id
    AND truth_low.truth_epoch = (select max(epoch) from truth_data where epoch <= test_epoch AND id = test.truth_id )
)

I already enforce truth_id equality in the third-to-last line. But if I use truth_low.truth_id instead of test.truth_id, i.e.:

AND truth_low.truth_epoch = (select max(epoch) from truth_data where epoch <= test_epoch AND id = truth_low.truth_id )

The query is terribly slow and takes 27.53 seconds.

I don't understand how something so minor could cause such a drastic shift in the query plan. I understand that the query plans are indeed different and that one is faster, but the queries are so similar that it seems unreasonable that this would occur. Can anyone explain?

The fast query has the query plan:

Hash Left Join  (cost=165.00..212716.05 rows=2800 width=44) (actual time=0.723..207.941 rows=800 loops=1)
  Hash Cond: ((truth_data_1.id = truth_data.id) AND ((SubPlan 1) = truth_data.epoch))
  Buffers: shared hit=72121
  ->  Nested Loop  (cost=70.00..150.04 rows=2800 width=32) (actual time=0.304..1.589 rows=800 loops=1)
        Buffers: shared hit=76
        ->  Seq Scan on track_data  (cost=0.00..45.00 rows=1400 width=28) (actual time=0.035..0.099 rows=400 loops=1)
              Buffers: shared hit=31
        ->  Materialize  (cost=70.00..70.05 rows=2 width=4) (actual time=0.001..0.001 rows=2 loops=400)
              Buffers: shared hit=45
              ->  HashAggregate  (cost=70.00..70.02 rows=2 width=4) (actual time=0.262..0.263 rows=2 loops=1)
                    Group Key: truth_data_1.id
                    Buffers: shared hit=45
                    ->  Seq Scan on truth_data truth_data_1  (cost=0.00..65.00 rows=2000 width=4) (actual time=0.009..0.099 rows=600 loops=1)
                          Buffers: shared hit=45
  ->  Hash  (cost=65.00..65.00 rows=2000 width=28) (actual time=0.237..0.237 rows=600 loops=1)
        Buckets: 2048  Batches: 1  Memory Usage: 41kB
        Buffers: shared hit=45
        ->  Seq Scan on truth_data  (cost=0.00..65.00 rows=2000 width=28) (actual time=0.012..0.112 rows=600 loops=1)
              Buffers: shared hit=45
  SubPlan 1
    ->  Aggregate  (cost=75.83..75.84 rows=1 width=8) (actual time=0.125..0.125 rows=1 loops=1600)
          Buffers: shared hit=72000
          ->  Seq Scan on truth_data truth_data_2  (cost=0.00..75.00 rows=333 width=8) (actual time=0.021..0.096 rows=150 loops=1600)
                Filter: ((epoch <= track_data.epoch) AND (id = truth_data_1.id))
                Rows Removed by Filter: 450
                Buffers: shared hit=72000
Planning time: 0.293 ms
Execution time: 208.129 ms

While the slow query has the plan:

Hash Left Join  (cost=160.00..212397803.04 rows=2800 width=44) (actual time=31.818..29884.990 rows=800 loops=1)
  Hash Cond: (truth_data_1.id = truth_data.id)
  Join Filter: (truth_data.epoch = (SubPlan 1))
  Rows Removed by Join Filter: 239200
  Buffers: shared hit=10800121
  ->  Nested Loop  (cost=70.00..150.04 rows=2800 width=32) (actual time=0.490..2.688 rows=800 loops=1)
        Buffers: shared hit=76
        ->  Seq Scan on track_data  (cost=0.00..45.00 rows=1400 width=28) (actual time=0.030..0.116 rows=400 loops=1)
              Buffers: shared hit=31
        ->  Materialize  (cost=70.00..70.05 rows=2 width=4) (actual time=0.001..0.003 rows=2 loops=400)
              Buffers: shared hit=45
              ->  HashAggregate  (cost=70.00..70.02 rows=2 width=4) (actual time=0.447..0.449 rows=2 loops=1)
                    Group Key: truth_data_1.id
                    Buffers: shared hit=45
                    ->  Seq Scan on truth_data truth_data_1  (cost=0.00..65.00 rows=2000 width=4) (actual time=0.026..0.207 rows=600 loops=1)
                          Buffers: shared hit=45
  ->  Hash  (cost=65.00..65.00 rows=2000 width=28) (actual time=0.235..0.235 rows=600 loops=1)
        Buckets: 2048  Batches: 1  Memory Usage: 41kB
        Buffers: shared hit=45
        ->  Seq Scan on truth_data  (cost=0.00..65.00 rows=2000 width=28) (actual time=0.017..0.117 rows=600 loops=1)
              Buffers: shared hit=45
  SubPlan 1
    ->  Aggregate  (cost=75.83..75.84 rows=1 width=8) (actual time=0.122..0.122 rows=1 loops=240000)
          Buffers: shared hit=10800000
          ->  Seq Scan on truth_data truth_data_2  (cost=0.00..75.00 rows=333 width=8) (actual time=0.020..0.093 rows=150 loops=240000)
                Filter: ((epoch <= track_data.epoch) AND (id = truth_data.id))
                Rows Removed by Filter: 450
                Buffers: shared hit=10800000
Planning time: 0.582 ms
Execution time: 29885.446 ms

EDIT 1:
For Erwin:

Thank you for your response. Would you please elaborate? I don't understand what you mean by the second variant's correlated subquery depending on two disjunct tables while the original does not. It seems that both versions of the correlated subquery reference a source id and a source epoch from either the test or truth table, and then query the truth table.

I added more rows to my tables to get a better idea of runtime and to try your suggested solution. I also added indexes on truth_data (epoch) and truth_data (id, epoch DESC NULLS LAST).
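For reference, the two indexes described above correspond to the following DDL (the first index name matches the one appearing in the plans below; the second name is illustrative):

```sql
-- Plain btree index on epoch alone
CREATE INDEX truth_data_epoch_idx ON truth_data (epoch);

-- Composite index matching the per-id, latest-epoch-first access pattern
CREATE INDEX truth_data_id_epoch_idx ON truth_data (id, epoch DESC NULLS LAST);
```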

Variant 1 yields the following explain:

Hash Left Join  (cost=248.00..2937.81 rows=4000 width=44) (actual time=4.621..135.524 rows=4000 loops=1)
  Hash Cond: ((truth_data_1.id = truth_data.id) AND ((SubPlan 2) = truth_data.epoch))
  Buffers: shared hit=28126
  ->  Nested Loop  (cost=99.00..197.05 rows=4000 width=32) (actual time=1.409..7.245 rows=4000 loops=1)
        Buffers: shared hit=77
        ->  Seq Scan on test_data  (cost=0.00..48.00 rows=2000 width=28) (actual time=0.039..0.297 rows=2000 loops=1)
              Buffers: shared hit=28
        ->  Materialize  (cost=99.00..99.05 rows=2 width=4) (actual time=0.001..0.001 rows=2 loops=2000)
              Buffers: shared hit=49
              ->  HashAggregate  (cost=99.00..99.02 rows=2 width=4) (actual time=1.355..1.355 rows=2 loops=1)
                    Group Key: truth_data_1.id
                    Buffers: shared hit=49
                    ->  Seq Scan on truth_data truth_data_1  (cost=0.00..89.00 rows=4000 width=4) (actual time=0.025..0.440 rows=4000 loops=1)
                          Buffers: shared hit=49
  ->  Hash  (cost=89.00..89.00 rows=4000 width=28) (actual time=3.168..3.168 rows=4000 loops=1)
        Buckets: 4096  Batches: 1  Memory Usage: 235kB
        Buffers: shared hit=49
        ->  Seq Scan on truth_data  (cost=0.00..89.00 rows=4000 width=28) (actual time=0.020..0.952 rows=4000 loops=1)
              Buffers: shared hit=49
  SubPlan 2
    ->  Result  (cost=0.60..0.61 rows=1 width=8) (actual time=0.012..0.012 rows=1 loops=8000)
          Buffers: shared hit=28000
          InitPlan 1 (returns $2)
            ->  Limit  (cost=0.28..0.60 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=8000)
                  Buffers: shared hit=28000
                  ->  Index Scan Backward using truth_data_epoch_idx on truth_data truth_data_2  (cost=0.28..212.35 rows=667 width=8) (actual time=0.007..0.007 rows=1 loops=8000)
                        Index Cond: ((epoch IS NOT NULL) AND (epoch <= test_data.epoch))
                        Filter: (id = truth_data_1.id)
                        Rows Removed by Filter: 1
                        Buffers: shared hit=28000
Planning time: 0.590 ms
Execution time: 136.088 ms

Your suggested solution yields:

Nested Loop Left Join  (cost=178.79..371076.05 rows=4000 width=40) (actual time=0.077..4886.702 rows=4000 loops=1)
  Buffers: shared hit=47454
  ->  Seq Scan on test_data test  (cost=0.00..48.00 rows=2000 width=28) (actual time=0.039..0.691 rows=2000 loops=1)
        Buffers: shared hit=28
  ->  Unique  (cost=178.79..185.45 rows=2 width=20) (actual time=2.169..2.438 rows=2 loops=2000)
        Buffers: shared hit=47426
        ->  Sort  (cost=178.79..182.12 rows=1333 width=20) (actual time=2.097..2.234 rows=2000 loops=2000)
              Sort Key: tru1.id, tru1.epoch DESC NULLS LAST
              Sort Method: quicksort  Memory: 346kB
              Buffers: shared hit=47426
              ->  Bitmap Heap Scan on truth_data tru1  (cost=30.61..109.60 rows=1333 width=20) (actual time=0.142..1.161 rows=2000 loops=2000)
                    Recheck Cond: (epoch <= test.epoch)
                    Heap Blocks: exact=33486
                    Buffers: shared hit=47426
                    ->  Bitmap Index Scan on truth_data_epoch_idx  (cost=0.00..30.28 rows=1333 width=0) (actual time=0.131..0.131 rows=2000 loops=2000)
                          Index Cond: (epoch <= test.epoch)
                          Buffers: shared hit=13940
Planning time: 0.310 ms
Execution time: 4887.492 ms

I noticed that it does not make use of the truth_data (id, epoch) index and instead uses only the truth_data (epoch) index. I'm guessing that's because there are so few ids that filtering by id in the index doesn't help?

I think the big reason for this performance difference is that variant 1 yields a hash join while the suggested solution yields a nested-loop join.

EDIT 2:

I was able to answer at least my own question about the disjunct tables. Comparing just the subplans from the correlated subqueries of my variations 1 and 2:

Variation 1:

  SubPlan 2
    ->  Result  (cost=0.60..0.61 rows=1 width=8) (actual time=0.013..0.013 rows=1 loops=8000)
          Buffers: shared hit=28000
          InitPlan 1 (returns $2)
            ->  Limit  (cost=0.28..0.60 rows=1 width=8) (actual time=0.010..0.010 rows=1 loops=8000)
                  Buffers: shared hit=28000
                  ->  Index Scan Backward using truth_data_epoch_idx on truth_data truth_data_2  (cost=0.28..212.35 rows=667 width=8) (actual time=0.008..0.008 rows=1 loops=8000)
                        Index Cond: ((epoch IS NOT NULL) AND (epoch <= test_data.epoch))
                        Filter: (id = truth_data_1.id)
                        Rows Removed by Filter: 1
                        Buffers: shared hit=28000

Variation 2:

  SubPlan 2
    ->  Result  (cost=0.60..0.61 rows=1 width=8) (actual time=0.011..0.011 rows=1 loops=8000000)
          Buffers: shared hit=28000000
          InitPlan 1 (returns $2)
            ->  Limit  (cost=0.28..0.60 rows=1 width=8) (actual time=0.009..0.009 rows=1 loops=8000000)
                  Buffers: shared hit=28000000
                  ->  Index Scan Backward using truth_data_epoch_idx on truth_data truth_data_2  (cost=0.28..212.35 rows=667 width=8) (actual time=0.007..0.007 rows=1 loops=8000000)
                        Index Cond: ((epoch IS NOT NULL) AND (epoch <= test_data.epoch))
                        Filter: (id = truth_data.id)
                        Rows Removed by Filter: 1
                        Buffers: shared hit=28000000

It becomes evident that even though the ON condition references both tables in each case, variation 1's correlated subquery references only test_data and the small distinct-id derivation (truth_data_1), while variation 2's references the full truth_data table being joined. Variation 1's subquery can therefore be evaluated as part of the hash condition, once per left-hand row (8,000 loops), whereas variation 2's must run as a join filter against every candidate row pair (8,000,000 loops), which is vastly more often.

Best Answer

Why the difference in performance? The correlated subquery in the join condition of the 1st variant is already potentially expensive. The second variant makes it worse by depending on two disjunct tables, forcing a Cartesian product between the two (for nothing, as it turns out later). The query planner is no AI, nor does it try to be: that could quickly add more overhead to the plan than what the actual query costs. Postgres is not smart enough to make the logical deductions you mention. Many other factors are possible, too, depending on undisclosed information.

Be that as it may, this equivalent query is less complex and should be (much) faster:

SELECT * 
FROM  (
   SELECT epoch           AS test_epoch
        , id              AS test_id
        , tids.truth_id
        , sqrt(x^2 + y^2) AS test_r
   FROM   test_data test
   CROSS  JOIN (SELECT DISTINCT id AS truth_id FROM truth_data) tids
   ) test
LEFT   JOIN LATERAL (
   SELECT tru1.epoch                AS truth_epoch
   --   , tru1.id                   AS truth_id   -- don't repeat same id
        , sqrt(tru1.x^2 + tru1.y^2) AS truth_r
   FROM   truth_data tru1
   WHERE  tru1.id     = test.truth_id
   AND    tru1.epoch <= test.test_epoch
   ORDER  BY tru1.epoch DESC NULLS LAST
   LIMIT  1
   ) tru ON true;

One minor difference: truth_id is not output a second time. It's noise and many clients can't deal with duplicate column names in the result.

If you don't need truth_ids that have no matching data (as I suspect), you can simplify further:

SELECT test.epoch                AS test_epoch
     , test.id                   AS test_id
     , sqrt(test.x^2 + test.y^2) AS test_r
     , tru.*
FROM   test_data test
LEFT   JOIN LATERAL (
   SELECT DISTINCT ON (tru1.id)
          tru1.epoch                AS truth_epoch
        , tru1.id                   AS truth_id
        , sqrt(tru1.x^2 + tru1.y^2) AS truth_r
   FROM   truth_data tru1
   WHERE  tru1.epoch <= test.epoch
   ORDER  BY tru1.id, tru1.epoch DESC NULLS LAST
   ) tru ON true;
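As a quick illustration of the DISTINCT ON semantics used here (the table contents below are made up for the example): ordering by id, then epoch DESC NULLS LAST, and deduplicating on id keeps exactly one row per id, namely the one with the greatest qualifying epoch.

```sql
-- Suppose truth_data contains (id, epoch): (1, 10), (1, 20), (2, 15)
SELECT DISTINCT ON (id) id, epoch
FROM   truth_data
WHERE  epoch <= 18
ORDER  BY id, epoch DESC NULLS LAST;
-- keeps the latest epoch <= 18 per id: (1, 10) and (2, 15)
```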

Be sure to have a btree index on truth_data (id, epoch DESC NULLS LAST) for best performance.

NULLS LAST may be unnecessary, depending on undisclosed information (if epoch is defined NOT NULL, plain DESC suffices).

More performance optimization may be possible, depending on undisclosed table definitions, cardinalities, and value distributions / frequencies.