Postgresql being slow on count distinct for dates

performancepostgresqlquery-performance

I have a very simple, but very big, table.
Its schema is like this

(yadda int, yadda1 int, yaddate date, ... other stuff).

Now, yaddate has an index by itself and it is also in other indexes together with other columns (eg. (yadda1, date)).

The table itself is some 100M rows.

When I run

  select distinct date from mybigtable;

the time needed to get the list is in the range of 200 seconds.
Explain Analyze tells me it's doing a seq scan and I don't understand why, since I the index is there.

First thing I am trying is reindex on the date only column index.

  1. Am I doing something wrong?
  2. Since obviously there's something I am missing about seq and index scan, can someone shed some light?
  3. How can I make that query faster?

TIA.

Best Answer

There is a trick with distinct to get it fast using index, that you can try. It involves creating a function looking like that:

CREATE OR REPLACE FUNCTION small_distinct(IN tablename character varying, IN fieldname character varying, IN sample anyelement DEFAULT '1800-01-01'::date)
  RETURNS SETOF anyelement AS
$BODY$
BEGIN
   EXECUTE 'SELECT '||fieldName||' FROM '||tableName||' ORDER BY '||fieldName
      ||' LIMIT 1'  INTO result;
   WHILE result IS NOT NULL LOOP
      RETURN NEXT;
      EXECUTE 'SELECT '||fieldName||' FROM '||tableName
         ||' WHERE '||fieldName||' > $1 ORDER BY ' || fieldName || ' LIMIT 1'
         INTO result USING result;
   END LOOP;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100
  ROWS 1000;

Then create an index on the column you want to count distinct, and select small_distinct('yourtable', 'yaddate'); should return you the distinct values you want, without the need to read the table.

Try it, be beware, I'm not sure it will work right out of the box, as I quickly adapted it from a varchar function.