In a test with two schemas that conceptually contain the same data, I got a bigger performance difference than expected.
One schema uses a surrogate key, which means the join between the two tables needs only a single join condition. This join is done using a hash join.
The other schema uses a natural key of 4 columns. In this case PostgreSQL uses a merge join, which is much slower. However, I had understood that PostgreSQL can do multi-column hash joins.
Is this speed difference normal? Can I get the speed of the 4-column join more on par?
Here are the two queries and the explain analyze:
Natural key
EXPLAIN ANALYZE
SELECT *
FROM kassabon kb
JOIN kassabon_regel kbr ON (
  kb.filiaal_id = kbr.kassabon_filiaal_id AND
  kb.kassanr = kbr.kassabon_kassanr AND
  kb.jaar = kbr.kassabon_jaar AND
  kb.bonnr = kbr.kassabon_bonnr);
"Merge Join (cost=289690.28..309517.94 rows=11981 width=280) (actual time=5507.702..9675.564 rows=907105 loops=1)"
" Merge Cond: ((kbr.kassabon_bonnr = kb.bonnr) AND (kbr.kassabon_filiaal_id = kb.filiaal_id) AND (kbr.kassabon_kassanr = kb.kassanr) AND (kbr.kassabon_jaar = kb.jaar))"
" -> Sort (cost=183246.72..185514.48 rows=907105 width=141) (actual time=4608.588..7210.120 rows=907105 loops=1)"
" Sort Key: kbr.kassabon_bonnr, kbr.kassabon_filiaal_id, kbr.kassabon_kassanr, kbr.kassabon_jaar"
" Sort Method: external merge Disk: 134664kB"
" -> Seq Scan on kassabon_regel kbr (cost=0.00..28374.05 rows=907105 width=141) (actual time=0.013..617.881 rows=907105 loops=1)"
" -> Materialize (cost=106442.89..109233.22 rows=558067 width=139) (actual time=899.101..1321.491 rows=907133 loops=1)"
" -> Sort (cost=106442.89..107838.05 rows=558067 width=139) (actual time=899.097..1182.711 rows=558067 loops=1)"
" Sort Key: kb.bonnr, kb.filiaal_id, kb.kassanr, kb.jaar"
" Sort Method: external merge Disk: 48808kB"
" -> Seq Scan on kassabon kb (cost=0.00..13117.67 rows=558067 width=139) (actual time=0.009..95.594 rows=558067 loops=1)"
"Total runtime: 9757.286 ms"
Surrogate key
EXPLAIN ANALYZE
SELECT *
FROM kassabon kb
JOIN kassabon_regel kbr ON (
kb.id=kbr.kassabon_id);
"Hash Join (cost=31538.51..121752.89 rows=907105 width=270) (actual time=226.936..1817.946 rows=907105 loops=1)"
" Hash Cond: (kbr.kassabon_id = kb.id)"
" -> Seq Scan on kassabon_regel kbr (cost=0.00..27461.05 rows=907105 width=131) (actual time=0.005..159.972 rows=907105 loops=1)"
" -> Hash (cost=13117.67..13117.67 rows=558067 width=139) (actual time=226.770..226.770 rows=558067 loops=1)"
" Buckets: 16384 Batches: 8 Memory Usage: 7652kB"
" -> Seq Scan on kassabon kb (cost=0.00..13117.67 rows=558067 width=139) (actual time=0.004..70.969 rows=558067 loops=1)"
"Total runtime: 1866.136 ms"
Best Answer
First, your estimates are bad: the merge join is estimated at 11981 rows but actually produces 907105. I would expect estimates to be harder to calculate on multi-column joins because, as here, the planner assumes the join columns are independent while their values may in fact be correlated. Additionally, the rows being sorted are wider, so you may be running into work_mem issues too: both sorts spill to disk ("Sort Method: external merge").
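Since both sorts in the natural-key plan spill to disk, one thing worth trying is raising work_mem for the session and re-running the query. A sketch, assuming the server has memory to spare; the 256MB figure is my guess, sized above the larger 134664kB spill:

```sql
-- Both sorts report "Sort Method: external merge Disk", so give them room
-- to run in memory. 256MB is an assumed value, not tuned for this server.
SET work_mem = '256MB';

EXPLAIN ANALYZE
SELECT *
FROM kassabon kb
JOIN kassabon_regel kbr ON (
  kb.filiaal_id = kbr.kassabon_filiaal_id AND
  kb.kassanr = kbr.kassabon_kassanr AND
  kb.jaar = kbr.kassabon_jaar AND
  kb.bonnr = kbr.kassabon_bonnr);

RESET work_mem;  -- restore the configured default afterwards
```

With enough memory the sorts should run in-memory, and the planner may even switch to a hash join on its own once it expects the hash table to fit.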
Another important area to check is caching: the faster run may partly reflect data already sitting in the buffer cache from an earlier run.
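One way to check caching effects is EXPLAIN's BUFFERS option, which reports shared-buffer hits versus reads per plan node; a sketch:

```sql
-- BUFFERS adds "Buffers: shared hit=... read=..." lines to each plan node,
-- showing how much data came from cache versus disk on this particular run.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM kassabon kb
JOIN kassabon_regel kbr ON (kb.id = kbr.kassabon_id);
```

Running each query twice and comparing the hit/read counts shows whether one schema is being measured warm and the other cold.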
In short, the two issues I see are planning costs with multiple join columns, particularly whether the values in those columns coexist across the join in correlated ways, and memory usage. Multi-column joins complicate both.
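If you can use PostgreSQL 10 or later (the "Total runtime" line in your output suggests an older version), extended statistics address the correlated-columns estimation problem directly; a sketch, with a statistics name I made up:

```sql
-- ndistinct statistics record how many distinct combinations the four key
-- columns actually have, instead of multiplying independent per-column
-- estimates, which should improve the join row estimate.
CREATE STATISTICS kassabon_regel_key_stats (ndistinct)
  ON kassabon_filiaal_id, kassabon_kassanr, kassabon_jaar, kassabon_bonnr
  FROM kassabon_regel;

ANALYZE kassabon_regel;  -- populate the new statistics object
```

A better row estimate makes it more likely the planner chooses the hash join for the natural-key query in the first place.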