PostgreSQL – Single-column join vs multi-column join

postgresql, postgresql-9.1

In a test with two schemas that conceptually contain the same data, I got a bigger performance difference than expected.

One schema uses a surrogate key, which means the join between the two tables needs only a single join condition. This join is executed as a hash join.

The other schema uses a natural key of 4 columns. In this case PostgreSQL uses a merge join, which is much slower. However, I had understood that PostgreSQL can do multi-column hash joins.

Is this speed difference normal? Can I bring the speed of the 4-column join more on par with the single-column join?

Here are the two queries with their EXPLAIN ANALYZE output:

Natural key

EXPLAIN ANALYZE
SELECT *
FROM kassabon kb
    JOIN kassabon_regel kbr ON (
        kb.filiaal_id=kbr.kassabon_filiaal_id AND
        kb.kassanr=kbr.kassabon_kassanr AND
        kb.jaar=kbr.kassabon_jaar AND
        kb.bonnr=kbr.kassabon_bonnr);

"Merge Join  (cost=289690.28..309517.94 rows=11981 width=280) (actual time=5507.702..9675.564 rows=907105 loops=1)"
"  Merge Cond: ((kbr.kassabon_bonnr = kb.bonnr) AND (kbr.kassabon_filiaal_id = kb.filiaal_id) AND (kbr.kassabon_kassanr = kb.kassanr) AND (kbr.kassabon_jaar = kb.jaar))"
"  ->  Sort  (cost=183246.72..185514.48 rows=907105 width=141) (actual time=4608.588..7210.120 rows=907105 loops=1)"
"        Sort Key: kbr.kassabon_bonnr, kbr.kassabon_filiaal_id, kbr.kassabon_kassanr, kbr.kassabon_jaar"
"        Sort Method: external merge  Disk: 134664kB"
"        ->  Seq Scan on kassabon_regel kbr  (cost=0.00..28374.05 rows=907105 width=141) (actual time=0.013..617.881 rows=907105 loops=1)"
"  ->  Materialize  (cost=106442.89..109233.22 rows=558067 width=139) (actual time=899.101..1321.491 rows=907133 loops=1)"
"        ->  Sort  (cost=106442.89..107838.05 rows=558067 width=139) (actual time=899.097..1182.711 rows=558067 loops=1)"
"              Sort Key: kb.bonnr, kb.filiaal_id, kb.kassanr, kb.jaar"
"              Sort Method: external merge  Disk: 48808kB"
"              ->  Seq Scan on kassabon kb  (cost=0.00..13117.67 rows=558067 width=139) (actual time=0.009..95.594 rows=558067 loops=1)"
"Total runtime: 9757.286 ms"

Surrogate key

EXPLAIN ANALYZE
SELECT *
FROM kassabon kb
    JOIN kassabon_regel kbr ON (
        kb.id=kbr.kassabon_id);

"Hash Join  (cost=31538.51..121752.89 rows=907105 width=270) (actual time=226.936..1817.946 rows=907105 loops=1)"
"  Hash Cond: (kbr.kassabon_id = kb.id)"
"  ->  Seq Scan on kassabon_regel kbr  (cost=0.00..27461.05 rows=907105 width=131) (actual time=0.005..159.972 rows=907105 loops=1)"
"  ->  Hash  (cost=13117.67..13117.67 rows=558067 width=139) (actual time=226.770..226.770 rows=558067 loops=1)"
"        Buckets: 16384  Batches: 8  Memory Usage: 7652kB"
"        ->  Seq Scan on kassabon kb  (cost=0.00..13117.67 rows=558067 width=139) (actual time=0.004..70.969 rows=558067 loops=1)"
"Total runtime: 1866.136 ms"

Best Answer

First, your estimates are bad: the planner expects 11,981 rows from the merge join but gets 907,105. I would expect estimates to be harder to calculate for multi-column joins because, as here, the planner treats the columns as independent even when their values actually co-occur in correlated ways. Additionally, the rows being joined are wider, so you may have work_mem issues too; both sorts in the merge-join plan spill to disk ("external merge").
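As a starting point for both issues, you could raise the statistics target on the join columns and give the session more work_mem; this is only a sketch, and the target of 1000 and the '256MB' value are arbitrary example values, not recommendations:

-- Better per-column statistics may improve the join estimate
-- (repeat for the other join columns as needed).
ALTER TABLE kassabon ALTER COLUMN bonnr SET STATISTICS 1000;
ALTER TABLE kassabon_regel ALTER COLUMN kassabon_bonnr SET STATISTICS 1000;
ANALYZE kassabon;
ANALYZE kassabon_regel;

-- Enough work_mem keeps the ~130 MB and ~48 MB sorts shown above
-- out of "external merge" territory; set per session, not globally.
SET work_mem = '256MB';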

Another important area to check is caching. Try:

EXPLAIN (ANALYZE TRUE, BUFFERS TRUE) [query]
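For example, applied to the surrogate-key query above, the buffers output shows how much of each scan was served from cache (shared hit) versus disk (read):

EXPLAIN (ANALYZE TRUE, BUFFERS TRUE)
SELECT *
FROM kassabon kb
    JOIN kassabon_regel kbr ON (
        kb.id=kbr.kassabon_id);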

The two issues I can see have to do with planning estimates over multiple columns, particularly the question of how values co-occur across the join columns, and with memory usage. Multi-column joins complicate both areas.
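Finally, if you want to confirm that PostgreSQL can indeed hash-join on all four columns and measure it directly, you can disable merge joins for one session; this is a diagnostic, not something to leave enabled in production:

SET enable_mergejoin = off;
EXPLAIN ANALYZE
SELECT *
FROM kassabon kb
    JOIN kassabon_regel kbr ON (
        kb.filiaal_id=kbr.kassabon_filiaal_id AND
        kb.kassanr=kbr.kassabon_kassanr AND
        kb.jaar=kbr.kassabon_jaar AND
        kb.bonnr=kbr.kassabon_bonnr);
RESET enable_mergejoin;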