Postgresql – Postgres is performing sequential scan instead of index scan

indexperformancepostgresqlpostgresql-9.4query-performance

I have a table with about 10 million rows in it and an index on a date field. When I try and extract the unique values of the indexed field Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimiser picking this plan? And what can I do avoid it?

From other answers I suspect this is as much related to the query as to the index.

explain select "labelDate" from pages group by "labelDate";
                              QUERY PLAN
-----------------------------------------------------------------------
 HashAggregate  (cost=524616.78..524617.04 rows=26 width=4)
   Group Key: "labelDate"
   ->  Seq Scan on pages  (cost=0.00..499082.42 rows=10213742 width=4)
(3 rows)

Table structure:

http=# \d pages
                                       Table "public.pages"
     Column      |          Type          |        Modifiers
-----------------+------------------------+----------------------------------
 pageid          | integer                | not null default nextval('...
 createDate      | integer                | not null
 archive         | character varying(16)  | not null
 label           | character varying(32)  | not null
 wptid           | character varying(64)  | not null
 wptrun          | integer                | not null
 url             | text                   |
 urlShort        | character varying(255) |
 startedDateTime | integer                |
 renderStart     | integer                |
 onContentLoaded | integer                |
 onLoad          | integer                |
 PageSpeed       | integer                |
 rank            | integer                |
 reqTotal        | integer                | not null
 reqHTML         | integer                | not null
 reqJS           | integer                | not null
 reqCSS          | integer                | not null
 reqImg          | integer                | not null
 reqFlash        | integer                | not null
 reqJSON         | integer                | not null
 reqOther        | integer                | not null
 bytesTotal      | integer                | not null
 bytesHTML       | integer                | not null
 bytesJS         | integer                | not null
 bytesCSS        | integer                | not null
 bytesHTML       | integer                | not null
 bytesJS         | integer                | not null
 bytesCSS        | integer                | not null
 bytesImg        | integer                | not null
 bytesFlash      | integer                | not null
 bytesJSON       | integer                | not null
 bytesOther      | integer                | not null
 numDomains      | integer                | not null
 labelDate       | date                   |
 TTFB            | integer                |
 reqGIF          | smallint               | not null
 reqJPG          | smallint               | not null
 reqPNG          | smallint               | not null
 reqFont         | smallint               | not null
 bytesGIF        | integer                | not null
 bytesJPG        | integer                | not null
 bytesPNG        | integer                | not null
 bytesFont       | integer                | not null
 maxageMore      | smallint               | not null
 maxage365       | smallint               | not null
 maxage30        | smallint               | not null
 maxage1         | smallint               | not null
 maxage0         | smallint               | not null
 maxageNull      | smallint               | not null
 numDomElements  | integer                | not null
 numCompressed   | smallint               | not null
 numHTTPS        | smallint               | not null
 numGlibs        | smallint               | not null
 numErrors       | smallint               | not null
 numRedirects    | smallint               | not null
 maxDomainReqs   | smallint               | not null
 bytesHTMLDoc    | integer                | not null
 maxage365       | smallint               | not null
 maxage30        | smallint               | not null
 maxage1         | smallint               | not null
 maxage0         | smallint               | not null
 maxageNull      | smallint               | not null
 numDomElements  | integer                | not null
 numCompressed   | smallint               | not null
 numHTTPS        | smallint               | not null
 numGlibs        | smallint               | not null
 numErrors       | smallint               | not null
 numRedirects    | smallint               | not null
 maxDomainReqs   | smallint               | not null
 bytesHTMLDoc    | integer                | not null
 fullyLoaded     | integer                |
 cdn             | character varying(64)  |
 SpeedIndex      | integer                |
 visualComplete  | integer                |
 gzipTotal       | integer                | not null
 gzipSavings     | integer                | not null
 siteid          | numeric                |
Indexes:
    "pages_pkey" PRIMARY KEY, btree (pageid)
    "pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
    "idx_pages_cdn" btree (cdn)
    "idx_pages_labeldate" btree ("labelDate") CLUSTER
    "idx_pages_urlshort" btree ("urlShort")
Triggers:
    pages_label_date BEFORE INSERT OR UPDATE ON pages
      FOR EACH ROW EXECUTE PROCEDURE fix_label_date()

Best Answer

This is a known issue regarding Postgres optimization. If the distinct values are few - like in your case - and you are in 8.4+ version, a very fast workaround using a recursive query is described here: Loose Indexscan.

Your query could be rewritten (the LATERAL needs 9.3+ version):

WITH RECURSIVE pa AS 
( ( SELECT labelDate FROM pages ORDER BY labelDate LIMIT 1 ) 
  UNION ALL
    SELECT n.labelDate 
    FROM pa AS p
         , LATERAL 
              ( SELECT labelDate 
                FROM pages 
                WHERE labelDate > p.labelDate 
                ORDER BY labelDate 
                LIMIT 1
              ) AS n
) 
SELECT labelDate 
FROM pa ;

Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different issue): Optimize GROUP BY query to retrieve latest record per user