PostgreSQL – Why is PostgreSQL selecting a slow Seq Scan when an index scan would be better?

optimization, performance, postgresql, postgresql-9.6, query-performance

I have the following tables in a Postgres 9.6 DB:

create table baseDimensions(
    id serial not null primary key,
    dimension1 date not null,
    dimension2 int not null,
    dimension3 text not null,
    -- ...
    dimension10 boolean not null,
    unique(dimension1, dimension2, ..., dimension10)
);

create table interestingData(
    baseDimensionId int not null references baseDimensions(id),
    subdimension1 int not null,
    subdimension2 boolean not null,
    -- ...
    value1 int not null,
    value2 bigint not null,
    -- ...
    primary key(baseDimensionId, subdimension1, subdimension2)
);

So, in words: I have lots of dimensions (e.g. if the tables were for the sales of a retail store, the dimensions could be the date of sale, the customerId, whether the customer is a repeat customer, the color of the jacket the customer was wearing, and so on) and some values for those dimensions (to stay with the retail example: a value could be the amount the customer paid). There are multiple "interestingData" tables, but they all share the same ten base dimensions, so to save some disk space I extracted those base dimensions into a separate table.

I deliberately chose the column order of the unique constraint so that the typical filter conditions come left-most; most of the time I am going to filter the data by dimension1.
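For context, the unique constraint above implicitly creates a multi-column btree index, and a filter on dimension1 alone can use its leading column. A rough sketch of the equivalent explicit definition (the name matches the one appearing in the plans below; the middle columns are elided just like in the table definition):

-- sketch: roughly what the unique(...) constraint creates behind the scenes
create unique index baseDimensions_dimensions_key
    on baseDimensions (dimension1, dimension2, ..., dimension10);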

Now I would like to know the average value1 for each dimension3 on a specific day (dimension1). This can be answered with the following query:

select dimension3, avg(value1)
from baseDimensions b
join interestingData c on b.id = c.baseDimensionId
where b.dimension1 = '2016-08-20'::date
group by dimension3

The explain+analyze for this query looks like this:

HashAggregate  (cost=1141685.72..1141686.14 rows=28 width=41) (actual time=55824.222..55827.068 rows=3358 loops=1)
  Group Key: b.dimension3
  ->  Hash Join  (cost=63915.68..1139740.91 rows=259308 width=41) (actual time=9956.393..55684.492 rows=267004 loops=1)
        Hash Cond: (c.basedimensionid = b.id)
        ->  Seq Scan on interestingData c  (cost=0.00..628619.84 rows=28705684 width=20) (actual time=0.007..31909.112 rows=28705684 loops=1)
        ->  Hash  (cost=62303.57..62303.57 rows=83369 width=29) (actual time=93.587..93.587 rows=81101 loops=1)
              Buckets: 131072  Batches: 2  Memory Usage: 3565kB
              ->  Index Scan using baseDimensions_dimensions_key on baseDimensions b  (cost=0.56..62303.57 rows=83369 width=29) (actual time=0.021..59.422 rows=81101 loops=1)
                    Index Cond: (dimension1 = '2016-08-20'::date)
Planning time: 0.232 ms
Execution time: 55827.909 ms

At 55 seconds this query is slow, and my interpretation is that the reason is the sequential scan on the interestingData table. But is the query planner right here? Would using an index be even slower? I wanted to know, so I forcefully disabled sequential scans with set enable_seqscan=false.
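A minimal sketch of that test session (the exact commands are assumed, not taken from the original):

-- assumed: disable sequential scans for this session only, re-run the query,
-- then restore the default
set enable_seqscan = false;

explain analyze
select dimension3, avg(value1)
from baseDimensions b
join interestingData c on b.id = c.baseDimensionId
where b.dimension1 = '2016-08-20'::date
group by dimension3;

reset enable_seqscan;

Here is the new explain+analyze: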

HashAggregate  (cost=1790548.21..1790548.63 rows=28 width=41) (actual time=1023.655..1025.661 rows=3358 loops=1)
  Group Key: b.dimension3
  ->  Nested Loop  (cost=1.12..1788603.40 rows=259308 width=41) (actual time=0.034..848.152 rows=267004 loops=1)
        ->  Index Scan using baseDimensions_dimensions_key on baseDimensions b  (cost=0.56..62303.57 rows=83369 width=29) (actual time=0.019..76.750 rows=81101 loops=1)
              Index Cond: (dimension1 = '2016-08-20'::date)
        ->  Index Scan using interestingData_pkey on interestingData c  (cost=0.56..20.08 rows=63 width=20) (actual time=0.003..0.007 rows=3 loops=81101)
              Index Cond: (basedimensionid = b.id)
Planning time: 0.250 ms
Execution time: 1026.478 ms

And wow… The query is suddenly more than 50x as fast!

However, I am not supposed to use set enable_seqscan=false in production. So why is the query planner performing so badly here, and what can I do about it?

Stats

Table stats, to give a better idea of the situation (the tables are going to grow a lot in the future, as new data arrives every day, but there are currently no plans to delete old data):

select
    count(*) total,
    count(distinct dimension1) distinctDays,
    min(dimension1) startDate,
    max(dimension1) endDate
from baseDimensions

  total    | distinctdays | startdate  |  enddate   
-----------+--------------+------------+------------
 9,229,054 |          171 | 2016-07-17 | 2017-01-19
select
    count(*) total
from interestingData;

   total   
------------
 28,705,684

Variants

At the request of ypercube (in the comments), here is a variant with an index on (dimension1, id, dimension3).
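The exact statement was not shown; judging by the index name in the plan below, it was presumably created with something like this sketch:

create index dim1_id_dim3_idx
    on baseDimensions (dimension1, id, dimension3);

The explain+analyze with this index in place: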

HashAggregate  (cost=1141043.72..1141044.14 rows=28 width=41) (actual time=45213.910..45217.416 rows=3358 loops=1)
  Group Key: b.dimension3
  ->  Hash Join  (cost=63273.68..1139098.91 rows=259308 width=41) (actual time=6871.808..45082.855 rows=267004 loops=1)
        Hash Cond: (combined.basedimensionid = b.id)
        ->  Seq Scan on interestingData c  (cost=0.00..628619.84 rows=28705684 width=20) (actual time=0.007..22862.174 rows=28705684 loops=1)
        ->  Hash  (cost=61661.57..61661.57 rows=83369 width=29) (actual time=67.638..67.638 rows=81101 loops=1)
              Buckets: 131072  Batches: 2  Memory Usage: 3565kB
              ->  Index Only Scan using dim1_id_dim3_idx on baseDimensions b  (cost=0.56..61661.57 rows=83369 width=29) (actual time=0.060..36.704 rows=81101 loops=1)
                    Index Cond: (dimension1 = '2016-08-20'::date)
                    Heap Fetches: 81101
Planning time: 0.265 ms
Execution time: 45218.174 ms

I also tried an index on (dimension1, dimension3, id), but the explain result is exactly the same.

Best Answer

It looks like all the data needed to resolve this query via the nested loop was already cached in RAM. PostgreSQL's planner is not aware of that; it assumes it is going to have to go to disk to get much of the data.

If it is all cached, that could happen either fairly or unfairly.

"Fairly" could mean that you have a lot of RAM and most of your data is in cache most of the time. In that case, lowering random_page_cost to be the same as, or nearly the same as, seq_page_cost is the correct response. But be warned that if you ever restart the server, you could be in for a painful warm-up period while the cache is reloaded from disk.
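A sketch of what that change could look like (the value 1.1 is only an illustrative choice, not a recommendation from the original answer):

-- per session, for experimentation:
set random_page_cost = 1.1;

-- or persistently, for the whole cluster:
alter system set random_page_cost = 1.1;
select pg_reload_conf();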

Another "fairly in cache" would be that you actually run the exact same query with the exact same parameter (2016-08-20) very often in production, and so that particular set of data stays in memory even though most of your data does not. In this case, lowering random_page_cost could fix this particularly query, but make other ones worse. One solution would be to lower random_page_cost or set enable_seqscan=off just for this query and reset it afterwards, if your framework lets you do that.

"Unfairly in cache" would be that you keep testing this query with a parameter of 2016-08-20, when your real query will use a different parameter each time. This means your performance testing server gets to re-use the same data over and over without reading it from disk, while your production server would not be able to do so. In this case you need to improve your testing/benchmarking method to be more realistic.