Is there any way to index the following query?
SELECT run_id, MAX ( frame ) , MAX ( time ) FROM run.frames_stat GROUP BY run_id;
I've tried creating sorted (non-composite) indexes on frame
and time
, and an index on run_id
, but the query planner doesn't use them.
Misc info:
- Unfortunately (and for reasons I won't get into) I cannot change the query
- The
frames_stat
table has 42 million rows - The table is unchanging (no further inserts/deletes will ever take place)
- The query was always slow, it's just gotten slower because this dataset is larger than in the past.
- There are no indexes on the table
- We are using Postgres 9.4
- The db's "work_mem" size is 128MB (if that's relevant).
- Hardware: 130GB Ram, 10 core Xeon
Schema:
CREATE TABLE run.frame_stat (
id bigint NOT NULL,
run_id bigint NOT NULL,
frame bigint NOT NULL,
heap_size bigint NOT NULL,
"time" timestamp without time zone NOT NULL,
CONSTRAINT frame_stat_pkey PRIMARY KEY (id)
)
Explain analyze:
HashAggregate (cost=1086240.000..1086242.800 rows=280 width=24) (actual time=14182.426..14182.545 rows=280 loops=1)
Group Key: run_id
-> Seq Scan on zulu (cost=0.000..770880.000 rows=42048000 width=24) (actual time=0.037..4077.182 rows=42048000 loops=1)
Best Answer
Too bad
If you cannot change the query at all, that's too bad. You won't get a good solution. If you had not table-qualified the table (
run.
frames_stat
), you could create a materialized view (see below) with the same name in another schema (or just a temporary one) and adapt thesearch_path
(optionally just in sessions where this is desirable) - for hugely superior performance.Here's a recipe for such a technique:
@Joishi's idea with a
RULE
would be a measure of (desperate) last resort. But I would rather not go there. Too many pitfalls with unexpected behavior.Better query / indexes
If you could change the query, you should try to emulate a loose index scan:
This is even more efficient when based on a separate table with one row per relevant
run_id
- let's call itrun_tbl
. Create it if you don't have it, yet!Implemented with correlated subqueries:
Create two multicolumn indexes with matching sort order for lightening performance:
NULLS LAST
is only necessary if there can be null values. But it won't hurt either way.With only 280 distinct
run_id
, this will be very fast.MATERIALIZED VIEW
Or, based on these key pieces of information:
Use a
MATERIALIZED VIEW
, it will be tiny (only 280 rows) and super fast.You still need to change the query to base it on the MV instead of the table.
Aside: never use reserved words like
time
(in standard SQL) as identifier.