PostgreSQL – Why Query Plan Sorts Table Despite Indexed Columns

execution-planoptimizationperformancepostgresqlpostgresql-9.1query-performance

I'm using Postgres 9.1
I have two tables I am joining:

wikidb=> \d page
                         Table "public.page"
        Column         |     Type      |          Modifiers           
-----------------------+---------------+------------------------------
 page_id               | bigint        | not null
 page_namespace        | integer       | not null default 0
 page_title            | text          | not null default ''::text
 [...]
Indexes:
    [...]
    "page_page_namespace_page_title_idx" UNIQUE, btree (page_namespace, page_title)

wikidb=> \d pagelinks
                 Table "public.pagelinks"
      Column       |  Type   |         Modifiers          
-------------------+---------+----------------------------
 pl_from           | bigint  | not null default 0::bigint
 pl_namespace      | integer | not null default 0
 pl_title          | text    | not null default ''::text
 [...]
Indexes:
    [...]
    "pagelinks_pl_namespace_pl_title_pl_from_idx" btree (pl_namespace, pl_title, pl_from)

Notice how both have indices on the (namespace, title) columns. I am interested in finding out how many (pl_namespace,pl_title) pairs in the pagelinks table do not show up in the page table as (page_namespace, page_title).

If I use a join, then I get the following plan:

wikidb=> explain SELECT COUNT(*)
FROM pagelinks
LEFT OUTER JOIN page
  ON page.page_namespace = pagelinks.pl_namespace AND
     page.page_title = pagelinks.pl_title
WHERE page.page_namespace IS NULL AND page.page_title IS NULL;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1310748.56..1310748.57 rows=1 width=0)
   ->  Merge Anti Join  (cost=1189384.02..1310748.56 rows=1 width=0)
         Merge Cond: ((pagelinks.pl_title = page.page_title) AND (pagelinks.pl_namespace = page.page_namespace))
         ->  Sort  (cost=1144343.89..1164498.31 rows=8061768 width=19)
               Sort Key: pagelinks.pl_title, pagelinks.pl_namespace
               ->  Seq Scan on pagelinks  (cost=0.00..219551.68 rows=8061768 width=19)
         ->  Sort  (cost=45038.32..45975.52 rows=374880 width=20)
               Sort Key: page.page_title, page.page_namespace
               ->  Seq Scan on page  (cost=0.00..10331.80 rows=374880 width=20)
(9 rows)

Which, as you can see, sorts each of both tables and merges them. I don't understand why it does this if the indices already contain both columns in sorted order.

Any explanations?

Best Answer

I don't understand why it does this if the indices already contain both columns in sorted order.

I'm using Postgres 9.1.

Most importantly, index-only scans are a major performance feature added to Postgres 9.2. Details in the Postgres Wiki.

Consider upgrading to a current version, Postgres 9.1 is getting old anyway.

In older versions, Postgres has to visit the table either way, so indexes must offer much more performance improvements to beat the added overhead. And the UNIQUE index tells us, there is at most a single row per (page_namespace, page_title), so your indexes are not going to help much - if at all, while you count the whole table.

One minor improvement:

SELECT COUNT(*) AS ct
FROM   pagelinks l
LEFT   JOIN page p ON p.page_namespace = l.pl_namespace
                  AND p.page_title = l.pl_title
WHERE  p.page_namespace IS NULL;
AND    p.page_title IS NULL;

Testing just one column already proves everything there is to prove.

Otherwise your query is pretty much the optimum. Maybe NOT EXISTS can compete:

SELECT count(*) AS ct
FROM   pagelinks l
WHERE  NOT EXISTS (
   SELECT 1
   FROM   page 
   WHERE  page_namespace = l.pl_namespace
   AND    page_title = l.pl_title
   );

A join or NOT IN on a row expression (pl_namespace,pl_title) are typically slower.

Basic techniques: