Postgresql – Postgres Index a query with MAX and groupBy

aggregate, index, index-tuning, postgresql, postgresql-9.4

Is there any way to index the following query?

SELECT run_id, MAX(frame), MAX(time) FROM run.frames_stat GROUP BY run_id;

I've tried creating sorted (non-composite) indexes on frame and time, and an index on run_id, but the query planner doesn't use them.

Misc info:

  • Unfortunately (and for reasons I won't get into) I cannot change the query
  • The frames_stat table has 42 million rows
  • The table is unchanging (no further inserts/deletes will ever take place)
  • The query was always slow; it has just gotten slower because the dataset is larger than in the past.
  • There are no indexes on the table
  • We are using Postgres 9.4
  • The db's "work_mem" size is 128MB (if that's relevant).
  • Hardware: 130 GB RAM, 10-core Xeon

Schema:

CREATE TABLE run.frames_stat (
  id bigint NOT NULL,
  run_id bigint NOT NULL,
  frame bigint NOT NULL,
  heap_size bigint NOT NULL,
  "time" timestamp without time zone NOT NULL,
  CONSTRAINT frames_stat_pkey PRIMARY KEY (id)
)

Explain analyze:

HashAggregate  (cost=1086240.000..1086242.800 rows=280 width=24) (actual time=14182.426..14182.545 rows=280 loops=1)
  Group Key: run_id
  ->  Seq Scan on zulu  (cost=0.000..770880.000 rows=42048000 width=24) (actual time=0.037..4077.182 rows=42048000 loops=1)

Best Answer

Too bad

If you cannot change the query at all, that's too bad: you won't get a good solution. Had the query not schema-qualified the table (run.frames_stat), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt the search_path (optionally only in sessions where this is desirable) - for hugely superior performance.

Here's a recipe for such a technique:
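
A minimal sketch of the idea (the schema name mv is purely illustrative, and again: it only works as long as the query does not schema-qualify the table):

CREATE SCHEMA mv;

-- Same table and column names as the original, so the unchanged (unqualified)
-- query would return identical results from just 280 pre-aggregated rows:
CREATE MATERIALIZED VIEW mv.frames_stat AS
SELECT run_id, max(frame) AS frame, max("time") AS "time"
FROM   run.frames_stat
GROUP  BY run_id;

-- Resolve the unqualified table name to the MV first:
SET search_path = mv, run, public;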

@Joishi's idea with a RULE would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.

Better query / indexes

If you could change the query, you should try to emulate a loose index scan:
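
For reference, the distinct run_id values themselves can be collected along these lines (just a sketch of the emulation; it assumes an index whose leading column is run_id, like either of the indexes suggested below):

WITH RECURSIVE cte AS (
   (SELECT run_id FROM run.frames_stat ORDER BY run_id LIMIT 1)
   UNION ALL
   SELECT (SELECT f.run_id
           FROM   run.frames_stat f
           WHERE  f.run_id > c.run_id
           ORDER  BY f.run_id
           LIMIT  1)
   FROM   cte c
   WHERE  c.run_id IS NOT NULL   -- stop once no bigger run_id is found
   )
SELECT run_id FROM cte WHERE run_id IS NOT NULL;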

This is even more efficient when based on a separate table with one row per relevant run_id - let's call it run_tbl. Create it if you don't have it yet!
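
If run_tbl does not exist yet, a one-off creation could look like this (a sketch; on a 42-million-row table, filling it via the recursive CTE above would be considerably faster than the plain DISTINCT):

-- One row per relevant run_id; a one-off, since the base table never changes:
CREATE TABLE run_tbl AS
SELECT DISTINCT run_id
FROM   run.frames_stat;

ALTER TABLE run_tbl ADD PRIMARY KEY (run_id);
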
Implemented with correlated subqueries:

SELECT run_id
    , (SELECT frame
       FROM   run.frames_stat
       WHERE  run_id = r.run_id
       ORDER  BY frame DESC NULLS LAST
       LIMIT  1) AS max_frame
    , (SELECT "time"
       FROM   run.frames_stat
       WHERE  run_id = r.run_id
       ORDER  BY "time" DESC NULLS LAST
       LIMIT  1) AS max_time
FROM   run_tbl r;

Create two multicolumn indexes with matching sort order for lightning-fast performance:

CREATE INDEX fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE INDEX fun_time_idx  ON run.frames_stat (run_id, "time" DESC NULLS LAST);

NULLS LAST is only necessary if there can be NULL values (not the case here, since both columns are defined NOT NULL), but it doesn't hurt either way.

With only 280 distinct run_id values, this will be very fast.

MATERIALIZED VIEW

Or, based on these key pieces of information:

The "frames_stat" table has 42 million rows

rows=280 -- number of returned rows = distinct run_id values

The table is unchanging (no inserts/deletes)

Use a MATERIALIZED VIEW: it will be tiny (only 280 rows) and super fast.
You still need to change the query to base it on the MV instead of the table.
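
A sketch of such an MV (the name run.frames_stat_mv is illustrative; it holds the same aggregate as the search_path sketch above, just without the naming trick):

CREATE MATERIALIZED VIEW run.frames_stat_mv AS
SELECT run_id, max(frame) AS max_frame, max("time") AS max_time
FROM   run.frames_stat
GROUP  BY run_id;

-- The adapted query then reads the 280-row MV instead of scanning 42M rows:
SELECT run_id, max_frame, max_time FROM run.frames_stat_mv;

Since the base table never changes, the MV never even needs a REFRESH.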

Aside: never use reserved words like time (reserved in standard SQL) as identifiers.