I'm using Postgres 9.1
I have two tables I am joining:
wikidb=> \d page
Table "public.page"
Column | Type | Modifiers
-----------------------+---------------+------------------------------
page_id | bigint | not null
page_namespace | integer | not null default 0
page_title | text | not null default ''::text
[...]
Indexes:
[...]
"page_page_namespace_page_title_idx" UNIQUE, btree (page_namespace, page_title)
wikidb=> \d pagelinks
Table "public.pagelinks"
Column | Type | Modifiers
-------------------+---------+----------------------------
pl_from | bigint | not null default 0::bigint
pl_namespace | integer | not null default 0
pl_title | text | not null default ''::text
[...]
Indexes:
[...]
"pagelinks_pl_namespace_pl_title_pl_from_idx" btree (pl_namespace, pl_title, pl_from)
Notice how both have indices on the (namespace, title) columns. I am interested in finding out how many (pl_namespace,pl_title) pairs in the pagelinks table do not show up in the page table as (page_namespace, page_title).
If I use a join, then I get the following plan:
wikidb=> explain SELECT COUNT(*)
FROM pagelinks
LEFT OUTER JOIN page
ON page.page_namespace = pagelinks.pl_namespace AND
page.page_title = pagelinks.pl_title
WHERE page.page_namespace IS NULL AND page.page_title IS NULL;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Aggregate (cost=1310748.56..1310748.57 rows=1 width=0)
-> Merge Anti Join (cost=1189384.02..1310748.56 rows=1 width=0)
Merge Cond: ((pagelinks.pl_title = page.page_title) AND (pagelinks.pl_namespace = page.page_namespace))
-> Sort (cost=1144343.89..1164498.31 rows=8061768 width=19)
Sort Key: pagelinks.pl_title, pagelinks.pl_namespace
-> Seq Scan on pagelinks (cost=0.00..219551.68 rows=8061768 width=19)
-> Sort (cost=45038.32..45975.52 rows=374880 width=20)
Sort Key: page.page_title, page.page_namespace
-> Seq Scan on page (cost=0.00..10331.80 rows=374880 width=20)
(9 rows)
Which, as you can see, sorts each of both tables and merges them. I don't understand why it does this if the indices already contain both columns in sorted order.
Any explanations?
Best Answer
Most importantly, index-only scans are a major performance feature added to Postgres 9.2. Details in the Postgres Wiki.
Consider upgrading to a current version, Postgres 9.1 is getting old anyway.
In older versions, Postgres has to visit the table either way, so indexes must offer much more performance improvements to beat the added overhead. And the
UNIQUE
index tells us, there is at most a single row per(page_namespace, page_title)
, so your indexes are not going to help much - if at all, while you count the whole table.One minor improvement:
Testing just one column already proves everything there is to prove.
Otherwise your query is pretty much the optimum. Maybe
NOT EXISTS
can compete:A join or
NOT IN
on a row expression(pl_namespace,pl_title)
are typically slower.Basic techniques: