PostgreSQL – Optimize View for Averaging Timestamps into Hours

index-tuningperformancepostgresqlpostgresql-9.4postgresql-performance

I have this table:

CREATE TABLE spp.rtprices (
  "interval" timestamp without time zone NOT NULL,
  rtlmp numeric(12,6),
  rtmcc numeric(12,6),
  rtmcl numeric(12,6),
  node_id integer NOT NULL,
  CONSTRAINT rtprices_pkey PRIMARY KEY ("interval", node_id),
  CONSTRAINT rtprices_node_id_fkey FOREIGN KEY (node_id)
      REFERENCES spp.nodes (node_id) MATCH SIMPLE
      ON UPDATE RESTRICT ON DELETE RESTRICT
)

And one more relevant index:

CREATE INDEX rtprices_node_id_interval_idx ON spp.rtprices (node_id, "interval");

Against it I've made this view:

  CREATE OR REPLACE VIEW spp.rtprices_hourly AS 
  SELECT (rtprices."interval" - '00:05:00'::interval)::date::timestamp without time zone AS pricedate,
  date_part('hour'::text, date_trunc('hour'::text, rtprices."interval" - '00:05:00'::interval))::integer + 1 AS hour,
  rtprices.node_id,
  round(avg(rtprices.rtlmp), 2) AS rtlmp,
  round(avg(rtprices.rtmcc), 2) AS rtmcc,
  round(avg(rtprices.rtmcl), 2) AS rtmcl
  FROM spp.rtprices
  GROUP BY date_part('hour'::text, date_trunc('hour'::text, rtprices."interval" - '00:05:00'::interval))::integer + 1,
           rtprices.node_id,
           (rtprices."interval" - '00:05:00'::interval)::date::timestamp without time zone;

The point of which is to give averages of the numeric columns for each hour (The timestamps have data every 5 minutes). The problem is that a query for a single day for a single node_id takes upwards of 30 seconds for 24 records.

explain analyze select * from spp.rtprices_hourly
where node_id=20 and pricedate='2015-02-02'

Returns this:

   "HashAggregate  (cost=1128767.71..1128773.79 rows=135 width=28) (actual time=31155.023..31155.065 rows=24 loops=1)"
"  Group Key: ((date_part('hour'::text, date_trunc('hour'::text, (rtprices."interval" - '00:05:00'::interval))))::integer + 1), rtprices.node_id, (((rtprices."interval" - '00:05:00'::interval))::date)::timestamp without time zone"
"  ->  Bitmap Heap Scan on rtprices  (cost=10629.42..1128732.91 rows=2320 width=28) (actual time=25071.410..31153.715 rows=288 loops=1)"
"        Recheck Cond: (node_id = 20)"
"        Rows Removed by Index Recheck: 7142233"
"        Filter: (((("interval" - '00:05:00'::interval))::date)::timestamp without time zone = '2015-02-02 00:00:00'::timestamp without time zone)"
"        Rows Removed by Filter: 124909"
"        Heap Blocks: exact=43076 lossy=82085"
"        ->  Bitmap Index Scan on rtprices_node_id_interval_idx  (cost=0.00..10628.84 rows=464036 width=0) (actual time=68.999..68.999 rows=125197 loops=1)"
"              Index Cond: (node_id = 20)"
"Planning time: 5.243 ms"
"Execution time: 31155.392 ms"

Best Answer

Simpler view

For this objective:

The point of which is to give averages of the numeric columns for each hour

.. it seems just as good to truncate to full hours, which is simpler and cheaper:

CREATE OR REPLACE VIEW spp.rtprices_hourly AS 
SELECT date_trunc('hour', "interval") AS hour
     , node_id
     , round(avg(rtlmp), 2) AS rtlmp
     , round(avg(rtmcc), 2) AS rtmcc
     , round(avg(rtmcl), 2) AS rtmcl
FROM   spp.rtprices
GROUP  BY 1, 2;

Faster query

Either way, an equivalent query on the view with sargable predicates would be:

SELECT *
FROM   spp.rtprices_hourly
WHERE  node_id = 20
AND    hour >= '2015-02-02 0:0'::timestamp
AND    hour <  '2015-02-03 0:0'::timestamp;

This is faster, but still not as fast as it could be. The major performance hit is because the index can only be used with an index condition on node_id, which is preserved in its original state in the view. That's why your index rtprices_node_id_interval_idx with node_id first is important. Why?

The second predicate on hour has to be filtered after the tuple has been fetched from the heap (the row has been read from the table). The large majority of rows is discarded late into the process, much work for nothing.

Much faster with direct query

It would be much faster to run the original query and apply predicates before you aggregate:

SELECT date_trunc('hour', "interval") AS hour
     , node_id
     , round(avg(rtlmp), 2) AS rtlmp
     , round(avg(rtmcc), 2) AS rtmcc
     , round(avg(rtmcl), 2) AS rtmcl
FROM   spp.rtprices
WHERE  node_id = 20
AND    "interval" >= '2015-02-02 0:0'::timestamp
AND    "interval" <  '2015-02-03 0:0'::timestamp
GROUP  BY 1, 2;

You'll see index conditions for all predicates now. The more efficient index is still the one with node_id first. Why?

Fast & short: create a function

So, this won't work nicely with a view. Use a function instead:

CREATE OR REPLACE FUNCTION rtprices_hourly(_node_id int
                                         , _from timestamp
                                         , _to timestamp = NULL)
  RETURNS TABLE (
    hour    timestamp
  , node_id int
  , rtlmp   numeric
  , rtmcc   numeric
  , rtmcl   numeric) AS
$func$
SELECT date_trunc('hour', r."interval")  -- AS hour
     , r.node_id
     , round(avg(r.rtlmp), 2)  -- AS rtlmp
     , round(avg(r.rtmcc), 2)  -- AS rtmcc
     , round(avg(r.rtmcl), 2)  -- AS rtmcl
FROM   spp.rtprices r
WHERE  r.node_id     = _node_id
AND    r."interval" >= _from
AND    r."interval" <  COALESCE(_to, _from + interval '1 day')
GROUP  BY 1, 2
$func$  LANGUAGE sql STABLE;
  • Be wary of naming conflicts between OUT parameters and column names. That's why I table-qualified all columns here.

Now you get top performance with a simple query:

SELECT * FROM rtprices_hourly(1, '2015-2-2 0:0'::timestamp, '2015-2-3 0:0'::timestamp);

I put in a convenience feature, the second parameter defaults to "one day later" if you omit it:

SELECT * FROM rtprices_hourly(1, '2015-2-2 0:0'::timestamp);

More about function parameters and default values:

You can query any range:

SELECT * FROM rtprices_hourly(1, '2015-2-2 10:0'::timestamp, '2015-2-2 20:0'::timestamp);