Postgresql – Postgres Index a query with MAX and groupBy

aggregate, index, index-tuning, postgresql, postgresql-9.4

Is there any way to index the following query?

SELECT run_id, MAX(frame), MAX(time) FROM run.frames_stat GROUP BY run_id;

I've tried creating sorted (non-composite) indexes on frame and time, and an index on run_id, but the query planner doesn't use them.

Misc info:

  • Unfortunately (and for reasons I won't get into) I cannot change the query
  • The frames_stat table has 42 million rows
  • The table is unchanging (no further inserts/deletes will ever take place)
  • The query was always slow; it has just gotten slower because the dataset is larger than in the past.
  • There are no indexes on the table
  • We are using Postgres 9.4
  • The db's "work_mem" size is 128MB (if that's relevant).
  • Hardware: 130 GB RAM, 10-core Xeon

Schema:

CREATE TABLE run.frames_stat (
  id bigint NOT NULL,
  run_id bigint NOT NULL,
  frame bigint NOT NULL,
  heap_size bigint NOT NULL,
  "time" timestamp without time zone NOT NULL,
  CONSTRAINT frames_stat_pkey PRIMARY KEY (id)
)

Explain analyze:

HashAggregate  (cost=1086240.000..1086242.800 rows=280 width=24) (actual time=14182.426..14182.545 rows=280 loops=1)
  Group Key: run_id
  ->  Seq Scan on zulu  (cost=0.000..770880.000 rows=42048000 width=24) (actual time=0.037..4077.182 rows=42048000 loops=1)

Best Answer

Too bad

If you cannot change the query at all, that's too bad: you won't get a good solution. Had the query not schema-qualified the table (run.frames_stat), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt the search_path (optionally only in sessions where this is desirable) - for hugely superior performance.

Here's a recipe for such a technique:
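
A minimal sketch of the idea (the schema name mv is purely illustrative, and again: it only works as long as the query does not schema-qualify the table):

CREATE SCHEMA mv;

-- Same table and column names as the original, so the unchanged (unqualified)
-- query would return identical results from just 280 pre-aggregated rows:
CREATE MATERIALIZED VIEW mv.frames_stat AS
SELECT run_id, max(frame) AS frame, max("time") AS "time"
FROM   run.frames_stat
GROUP  BY run_id;

-- Resolve the unqualified table name to the MV first:
SET search_path = mv, run, public;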

@Joishi's idea with a RULE would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.

Better query / indexes

If you could change the query, you should try to emulate a loose index scan:
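
For reference, the distinct run_id values themselves can be collected along these lines (just a sketch of the emulation; it assumes an index whose leading column is run_id, like either of the indexes suggested below):

WITH RECURSIVE cte AS (
   (SELECT run_id FROM run.frames_stat ORDER BY run_id LIMIT 1)
   UNION ALL
   SELECT (SELECT f.run_id
           FROM   run.frames_stat f
           WHERE  f.run_id > c.run_id
           ORDER  BY f.run_id
           LIMIT  1)
   FROM   cte c
   WHERE  c.run_id IS NOT NULL   -- stop once no bigger run_id is found
   )
SELECT run_id FROM cte WHERE run_id IS NOT NULL;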

This is even more efficient when based on a separate table with one row per relevant run_id - let's call it run_tbl. Create it if you don't have it yet!
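
If run_tbl does not exist yet, a one-off creation could look like this (a sketch; on a 42-million-row table, filling it via the recursive CTE above would be considerably faster than the plain DISTINCT):

-- One row per relevant run_id; a one-off, since the base table never changes:
CREATE TABLE run_tbl AS
SELECT DISTINCT run_id
FROM   run.frames_stat;

ALTER TABLE run_tbl ADD PRIMARY KEY (run_id);
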
Implemented with correlated subqueries:

SELECT run_id
    , (SELECT frame
       FROM   run.frames_stat
       WHERE  run_id = r.run_id
       ORDER  BY frame DESC NULLS LAST
       LIMIT  1) AS max_frame
    , (SELECT "time"
       FROM   run.frames_stat
       WHERE  run_id = r.run_id
       ORDER  BY "time" DESC NULLS LAST
       LIMIT  1) AS max_time
FROM   run_tbl r;

Create two multicolumn indexes with matching sort order for lightning-fast performance:

CREATE INDEX fun_frame_idx ON run.frames_stat (run_id, frame DESC NULLS LAST);
CREATE INDEX fun_time_idx  ON run.frames_stat (run_id, "time" DESC NULLS LAST);

NULLS LAST is only necessary if there can be NULL values (not the case here, since both columns are defined NOT NULL), but it doesn't hurt either way.

With only 280 distinct run_id values, this will be very fast.

MATERIALIZED VIEW

Or, based on these key pieces of information:

The "frames_stat" table has 42 million rows

rows=280 -- number of returned rows = distinct run_id values

The table is unchanging (no inserts/deletes)

Use a MATERIALIZED VIEW: it will be tiny (only 280 rows) and super fast.
You still need to change the query to base it on the MV instead of the table.
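
A sketch of such an MV (the name run.frames_stat_mv is illustrative; it holds the same aggregate as the search_path sketch above, just without the naming trick):

CREATE MATERIALIZED VIEW run.frames_stat_mv AS
SELECT run_id, max(frame) AS max_frame, max("time") AS max_time
FROM   run.frames_stat
GROUP  BY run_id;

-- The adapted query then reads the 280-row MV instead of scanning 42M rows:
SELECT run_id, max_frame, max_time FROM run.frames_stat_mv;

Since the base table never changes, the MV never even needs a REFRESH.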

Aside: never use reserved words like time (reserved in standard SQL) as identifiers.