Postgresql being slow on count distinct for dates

performance postgresql query-performance

I have a very simple, but very big, table.
Its schema is like this

(yadda int, yadda1 int, yaddate date, ... other stuff).

Now, yaddate has an index by itself, and it is also part of other indexes together with other columns (e.g. (yadda1, date)).

The table itself is some 100M rows.

When I run

  select distinct date from mybigtable;

the time needed to get the list is in the range of 200 seconds.
EXPLAIN ANALYZE tells me it's doing a seq scan, and I don't understand why, since the index is there.

The first thing I am trying is a REINDEX on the index that covers only the date column.

Am I doing something wrong?
Since obviously there's something I am missing about seq and index scan, can someone shed some light?
How can I make that query faster?

TIA.

Best Answer

There is a trick for making DISTINCT fast by using the index, which you can try. It involves creating a function like this:

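CREATE OR REPLACE FUNCTION small_distinct(IN tablename character varying, IN fieldname character varying, IN sample anyelement DEFAULT '1800-01-01'::date)
  RETURNS TABLE (result anyelement) AS
$BODY$
BEGIN
   -- sample is never read; it only fixes the result type (date by default).
   -- The names are concatenated unquoted, so pass trusted input only.
   -- Fetch the smallest value; with an index this touches a single entry.
   EXECUTE 'SELECT '||fieldName||' FROM '||tableName||' ORDER BY '||fieldName
      ||' LIMIT 1'  INTO result;
   WHILE result IS NOT NULL LOOP
      RETURN NEXT;  -- emit the current value of result
      -- Jump to the next larger value through the index.
      EXECUTE 'SELECT '||fieldName||' FROM '||tableName
         ||' WHERE '||fieldName||' > $1 ORDER BY ' || fieldName || ' LIMIT 1'
         INTO result USING result;
   END LOOP;
END;
$BODY$
  LANGUAGE plpgsql VOLATILE
  COST 100
  ROWS 1000;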

Then create an index on the column whose distinct values you want. After that, select small_distinct('yourtable', 'yaddate'); should return the distinct values without needing to read the whole table.

Try it, but beware: I'm not sure it will work right out of the box, as I quickly adapted it from a varchar function.
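
The same stepping idea can also be written as a recursive CTE, a technique often called a loose index scan. This is a minimal sketch and not part of the answer above; it assumes the table and column names from the question (mybigtable, yaddate) and an index whose leading column is yaddate.

```sql
-- Walk the index one distinct value at a time instead of scanning every row.
WITH RECURSIVE t AS (
   -- Anchor: the smallest yaddate in the table.
   (SELECT yaddate FROM mybigtable ORDER BY yaddate LIMIT 1)
   UNION ALL
   -- Step: the next yaddate strictly greater than the previous one.
   -- The scalar subquery yields NULL when no larger value exists,
   -- which ends the recursion through the WHERE clause below.
   SELECT (SELECT yaddate
           FROM mybigtable
           WHERE yaddate > t.yaddate
           ORDER BY yaddate
           LIMIT 1)
   FROM t
   WHERE t.yaddate IS NOT NULL
)
SELECT yaddate FROM t WHERE yaddate IS NOT NULL;
```

Each step is a single index probe, so the total work grows with the number of distinct dates rather than with the 100M rows. NULL dates, if any, are not reported by this form.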

Related Solutions

PostgreSQL Sequential Scan instead of Index Scan Why

Notice this line:

->  Index Scan using data_area_pkey on data_area  (cost=0.00..52.13 rows=1 width=8) 
    (actual time=0.006..0.008 rows=0 loops=335130)

If you compute the total cost, taking the loops into account, it is 52.13 * 335130 ≈ 17470327. This is larger than the 14857017.62 for the seq scan alternative. That is why it does not use the index.

So the optimizer is overestimating the cost of the index scan. I'd guess that your data is sorted on the index (either due to a clustered index or to how it was loaded) and/or you have plenty of cache memory and/or a nice fast disk. Hence there is little random I/O going on.

You should also check the correlation in pg_stats, which the optimizer uses to assess clustering when computing the index cost, and finally try changing random_page_cost and cpu_index_tuple_cost to match your system.
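
As a rough sketch of those two checks: the table name data_area is taken from the plan line above, while the value 1.5 is only an illustrative assumption, not a recommendation for any particular system.

```sql
-- Correlation close to 1 or -1 means the physical row order follows the
-- column order, so an index scan causes little random I/O.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'data_area';

-- Try a lower random-page cost for this session only (the default is 4.0),
-- then re-run EXPLAIN ANALYZE on the query to see whether the plan changes.
SET random_page_cost = 1.5;

-- Return to the configured value afterwards.
RESET random_page_cost;
```

Changing the setting with SET affects only the current session, which makes it a safe way to experiment before touching postgresql.conf.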

Postgresql – Postgres not using the index even when rows returned is 5% of the table

From your query plans, it looks like you're comparing ints to ints in the first query plan, and int to numeric in the second plan.

Your first compare:

Index Cond: (("timestamp" >= 1431100800) AND ("timestamp" <= 1431108000))

and

timestamp >= 1431100800 and timestamp <= 1431108000

In the second query, it's numeric values:

Filter: ((numvalues[1] IS NOT NULL) AND (("timestamp")::numeric >= 1431100800.00) AND (("timestamp")::numeric <= 1431108000.00))

and

timestamp >= 1431093600.00 and timestamp <= 1431100800.00

Casting to numeric causes the index to be ignored in favor of a sequential scan.

You can see this with a very simple example, set up below:

CREATE TABLE t2 (a int);
CREATE INDEX t2_a_idx ON t2(a);
INSERT INTO t2 (a) SELECT i FROM generate_series(1,1000000) AS i;
VACUUM ANALYZE VERBOSE t2;

My first query plan looks like this:

EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000;

Index Only Scan using t2_a_idx on t2 (cost=0.42..7134.65 rows=250413 width=4) 
(actual time=0.019..29.926 rows=250000 loops=1)
Index Cond: (a > 750000)
Heap Fetches: 0
Planning time: 0.137 ms
Execution time: 39.114 ms
(5 rows)
Time: 39.540 ms

While a second query using numerics looks like this:

EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000.00;

Seq Scan on t2  (cost=0.00..19425.00 rows=333333 width=4) (actual time=122.803..175.326 rows=250000 loops=1)
Filter: ((a)::numeric > 750000.00)
Rows Removed by Filter: 750000
Planning time: 0.058 ms
Execution time: 184.194 ms
(5 rows)
Time: 184.487 ms

In the second instance here, the index is ignored in favor of a sequential scan because of the cast to a numeric value, which looks like exactly what's happening in your two examples.
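
One way to keep the index usable, sketched against the t2 table from the example above, is to make sure the constant has the column's type rather than letting the column be cast. The cast below is only equivalent to the original condition when the literal has no fractional part.

```sql
-- The constant is cast once at planning time, so the comparison stays
-- int > int and the planner can consider t2_a_idx again.
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > CAST(750000.00 AS int);

-- Equivalent here: use an integer literal in the first place.
EXPLAIN ANALYZE SELECT * FROM t2 WHERE a > 750000;
```

If the bound can carry a fraction, casting it to int rounds it, so floor() or ceil() on the constant would be needed to preserve the exact meaning.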

One last aside: if your server has the memory for it, you might also be able to speed your query up by raising work_mem with a SET before executing it:

SET work_mem = '2GB';

This may help because your sorts are spilling to disk, as noted in this line from your query plan:

Sort Method: external merge  Disk: 1387704kB
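
If raising work_mem for the whole session is a concern, a hedged alternative is to scope it to a single transaction with SET LOCAL. The 2GB figure is simply the value used above, and the query itself is left as a placeholder comment.

```sql
BEGIN;
-- Applies only until COMMIT or ROLLBACK; other sessions are unaffected.
SET LOCAL work_mem = '2GB';
-- Run the sorting query here and check that EXPLAIN ANALYZE no longer
-- reports "Sort Method: external merge".
COMMIT;
```

Keep in mind that work_mem is a per-operation limit, so a query with several sorts or hashes can use a multiple of it.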

Hope this helps. =)
