PostgreSQL – Optimizing queries on a range of timestamps (one column)

index, optimization, performance, postgresql, postgresql-9.3, postgresql-performance

I am using Postgres 9.3 through Heroku.

I have a table, "traffic", with 1M+ records and many inserts and updates every day. I need to perform SUM operations across this table over different time ranges; those calls can take up to 40 seconds, and I would love to hear suggestions on how to improve that.

I have the following index in place on this table:

CREATE INDEX idx_traffic_partner_only ON traffic (dt_created) WHERE campaign_id IS NULL AND uuid_self <> uuid_partner;

Here is an example SELECT statement:

SELECT SUM("clicks") AS clicks, SUM("impressions") AS impressions
FROM "traffic"
WHERE "uuid_self" != "uuid_partner"
AND "campaign_id" is NULL
AND "dt_created" >= 'Sun, 29 Mar 2015 00:00:00 +0000'
AND "dt_created" <= 'Mon, 27 Apr 2015 23:59:59 +0000' 

And this is the EXPLAIN ANALYZE:

Aggregate  (cost=21625.91..21625.92 rows=1 width=16) (actual time=41804.754..41804.754 rows=1 loops=1)
  ->  Index Scan using idx_traffic_partner_only on traffic  (cost=0.09..20085.11 rows=308159 width=16) (actual time=1.409..41617.976 rows=302392 loops=1)
      Index Cond: ((dt_created >= '2015-03-29'::date) AND (dt_created <= '2015-04-27'::date))
Total runtime: 41804.893 ms

http://explain.depesz.com/s/gGA

This question is very similar to another one on SE, but that one used an index across a two-column timestamp range, and the planner's estimates for that query were way off. The main suggestion there was to create a sorted multi-column index, but for a single-column index that doesn't have much of an effect. The other suggestions were to use CLUSTER / pg_repack and GiST indexes, but I haven't tried them yet, since I'd like to see if there is a better solution using regular indexes.

Optimizing queries on a range of timestamps (two columns)

For reference, I tried the following indexes, which were not used by the DB:

CREATE INDEX idx_traffic_2 ON traffic (campaign_id, uuid_self, uuid_partner, dt_created);
CREATE INDEX idx_traffic_3 ON traffic (dt_created);
CREATE INDEX idx_traffic_4 ON traffic (uuid_self);
CREATE INDEX idx_traffic_5 ON traffic (uuid_partner);

EDIT: Ran EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS) and these were the results:

Aggregate  (cost=20538.62..20538.62 rows=1 width=8) (actual time=526.778..526.778 rows=1 loops=1)
  Output: sum(clicks), sum(impressions)
  Buffers: shared hit=47783 read=29803 dirtied=4
  I/O Timings: read=184.936
  ->  Index Scan using idx_traffic_partner_only on public.traffic  (cost=0.09..20224.74 rows=313881 width=8) (actual time=0.049..431.501 rows=302405 loops=1)
      Output: id, uuid_self, uuid_partner, impressions, clicks, dt_created... (other fields redacted)
      Index Cond: ((traffic.dt_created >= '2015-03-29'::date) AND (traffic.dt_created <= '2015-04-27'::date))
      Buffers: shared hit=47783 read=29803 dirtied=4
      I/O Timings: read=184.936
Total runtime: 526.881 ms

http://explain.depesz.com/s/7Gu6

Table definition:

CREATE TABLE traffic (
    id              serial,
    uuid_self       uuid not null,
    uuid_partner    uuid not null,
    impressions     integer NOT NULL DEFAULT 1,
    clicks          integer NOT NULL DEFAULT 0,
    campaign_id     integer,
    dt_created      DATE DEFAULT CURRENT_DATE NOT NULL,
    dt_updated      TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

id is the primary key and uuid_self, uuid_partner, and campaign_id are all foreign keys. The dt_updated field is updated with a Postgres function.
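For illustration, a generic trigger-based sketch of keeping dt_updated current (an assumption - not necessarily my exact setup; the names are placeholders):

CREATE OR REPLACE FUNCTION trg_traffic_touch_dt_updated()
  RETURNS trigger
  LANGUAGE plpgsql AS
$$
BEGIN
   NEW.dt_updated := now();   -- stamp the row on every UPDATE
   RETURN NEW;
END
$$;

CREATE TRIGGER traffic_touch_dt_updated
BEFORE UPDATE ON traffic
FOR EACH ROW EXECUTE PROCEDURE trg_traffic_touch_dt_updated();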

Best Answer

Two things that are very odd here:

  1. The query selects 300k rows from a table with 1M+ rows. For 30 % (or anything over roughly 5 %, depending on row size and other factors) it typically doesn't pay to use an index at all. We should see a sequential scan.

    The exception would be index-only scans, which I don't see here. The multicolumn index @Craig suggested would be the best option if you can get index-only scans out of it. With lots of updates like you mentioned, this may not work out, in which case you are better off without the additional columns - and just the index you already have. You might be able to make it work with more aggressive autovacuum settings for the table; parameters can be adjusted for individual tables (see the sketch after this list).

  2. While Postgres is going to use the index, I would certainly expect a bitmap index scan for that many rows, not a plain index scan - a plain index scan is typically only the better choice for a low percentage of rows. As soon as Postgres expects multiple hits per data page (judging from its statistics on the table) it will typically switch to a bitmap index scan.
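For point 1, a rough sketch of both suggestions: the multicolumn covering index (to make index-only scans possible) and more aggressive per-table autovacuum (to keep the visibility map current). The index name and the numbers are made up, not tested recommendations:

CREATE INDEX idx_traffic_partner_only_covering ON traffic (dt_created, clicks, impressions)
WHERE  campaign_id IS NULL AND uuid_self <> uuid_partner;

ALTER TABLE traffic SET (
  autovacuum_vacuum_scale_factor  = 0.05
, autovacuum_analyze_scale_factor = 0.05
);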

Judging from point 2, I would suspect that your cost settings are inadequate (and possibly the table statistics, too). You may have set random_page_cost and / or cpu_index_tuple_cost too low relative to seq_page_cost. See the chapter on planner cost constants in the manual.

This would also fit with the observation that a cold cache is a large factor, as we worked out in the comments. Either you are accessing (parts of) tables that nobody has touched in a long time, or you are running on a test system where the cache isn't populated yet. Or you just don't have enough RAM available to cache most of the relevant data in your DB. Either way, random access is much more expensive than sequential access when the data does not reside in cache. Depending on the actual situation you may have to adjust the respective cost settings to get better query plans.
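A sketch for inspecting and trying those settings (the values shown are placeholders, not recommendations, and mydb is a made-up database name):

SELECT name, setting
FROM   pg_settings
WHERE  name IN ('seq_page_cost', 'random_page_cost', 'cpu_index_tuple_cost', 'effective_cache_size');

-- try a different value in the current session, then re-run EXPLAIN ANALYZE:
SET random_page_cost = 4;

-- persist it for the database once a better value is found:
ALTER DATABASE mydb SET random_page_cost = 4;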

One other factor has to be mentioned, which only causes a slow response on the first read: hint bits. Read the details in the Postgres Wiki and this related question:

Or the table is extremely bloated, in which case an index scan would make sense and I would refer back to CLUSTER / pg_repack from my previous answer that you quoted (or just VACUUM FULL). Also investigate your VACUUM settings; those are important with many inserts and updates every day.
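A sketch for checking table size and compacting it. Both rewrite commands take an exclusive lock, and CLUSTER cannot use a partial index, so the example assumes a plain index on dt_created like the idx_traffic_3 you tried:

SELECT pg_size_pretty(pg_total_relation_size('traffic')) AS total_size
     , pg_size_pretty(pg_relation_size('traffic'))       AS heap_size
     , pg_size_pretty(pg_indexes_size('traffic'))        AS index_size;

-- rewrite the table in index order (exclusive lock!):
CLUSTER traffic USING idx_traffic_3;

-- or just compact it without reordering:
VACUUM FULL ANALYZE traffic;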

Depending on UPDATE patterns, also consider a FILLFACTOR below 100. If you mostly update only newly added rows, set the lower FILLFACTOR after compacting your table, so that only new pages keep some wiggle room for updates.
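A minimal sketch; 90 is only a placeholder value. The new FILLFACTOR applies to newly written pages only, hence the compacting step:

ALTER TABLE traffic SET (fillfactor = 90);
-- then rewrite the table (VACUUM FULL / CLUSTER / pg_repack, as above)
-- so that existing pages pick up the new setting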

Schema

campaign_id is 99%+ NULL and dt_updated is 0% NULL.

Adjust the sequence of columns slightly, to save 8 bytes per row (in the 99% of cases where campaign_id is NULL):

CREATE TABLE traffic (
    uuid_self       uuid not null REFERENCES ... ,
    uuid_partner    uuid not null REFERENCES ... ,
    id              serial PRIMARY KEY,
    impressions     integer NOT NULL DEFAULT 1,
    clicks          integer NOT NULL DEFAULT 0,
    campaign_id     integer,
    dt_created      DATE DEFAULT CURRENT_DATE NOT NULL,
    dt_updated      TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP
);

Detailed explanation and links to more:

To measure:
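A quick sketch using standard size functions (not the exact query from the linked answer):

-- average on-disk size of a row value (sampled), to compare before / after reordering columns:
SELECT avg(pg_column_size(t))::int AS avg_row_bytes
FROM  (SELECT * FROM traffic LIMIT 1000) t;

-- overall size of the table (heap + TOAST, without indexes):
SELECT pg_size_pretty(pg_table_size('traffic')) AS table_size;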