I need to select N rows with seq > X: all rows whose reduce_key is null, plus, for each group of rows sharing the same reduce_key, only the row with the biggest seq. The result should be strictly sequential (ordered by seq, with same-reduce_key rows filtered out except the one with the biggest seq) and limited to N.
Here is a query that does this:
EXPLAIN ANALYZE
SELECT x2."user_id", x2."seq", x2."timestamp", x2."reduce_key", x2."mapping"
FROM   "user_sequence" x2
WHERE  (x2."user_id" = 860862404)
AND    (x2."seq" > 3562974)
AND    (
         (x2."reduce_key" IS NULL)
         OR (x2."seq" IN (
               SELECT MAX("seq")
               FROM   "user_sequence"
               WHERE  ("user_id" = 860862404)
               AND    ("seq" >= x2."seq")
               AND    ("reduce_key" = x2."reduce_key")
               GROUP  BY "reduce_key"
            ))
       )
ORDER  BY x2."seq"
LIMIT  1000
The problem is that it takes ~70 ms on a table with tens of millions of rows. Is there a more efficient way to reduce rows by key while keeping all rows with a null key?
Table structure:
actor=# \d user_sequence;
Table "public.user_sequence"
   Column   |  Type   | Modifiers
------------+---------+-----------
 user_id    | integer | not null
 seq        | integer | not null
 timestamp  | bigint  | not null
 mapping    | bytea   | not null
 reduce_key | text    |
Indexes:
"user_sequence_pkey" PRIMARY KEY, btree (user_id, seq)
"user_sequence_user_id_reduce_key_seq_idx" btree (user_id, reduce_key, seq)
EXPLAIN ANALYZE output:
Limit  (cost=0.57..231389.06 rows=1000 width=103) (actual time=0.408..74.754 rows=308 loops=1)
  ->  Index Scan using user_sequence_pkey on user_sequence x2  (cost=0.57..24444343.74 rows=105642 width=103) (actual time=0.406..74.645 rows=308 loops=1)
        Index Cond: ((user_id = 860862404) AND (seq > 3562974))
        Filter: ((reduce_key IS NULL) OR (SubPlan 1))
        Rows Removed by Filter: 692
        SubPlan 1
          ->  GroupAggregate  (cost=0.57..447.59 rows=1 width=33) (actual time=0.096..0.097 rows=1 loops=730)
                Group Key: user_sequence.reduce_key
                ->  Index Only Scan using user_sequence_user_id_reduce_key_seq_idx on user_sequence  (cost=0.57..447.03 rows=110 width=33) (actual time=0.036..0.081 rows=68 loops=730)
                      Index Cond: ((user_id = 860862404) AND (reduce_key = x2.reduce_key) AND (seq >= x2.seq))
                      Heap Fetches: 49371
Planning time: 0.808 ms
Execution time: 74.890 ms
(13 rows)
I think the problem is the filter step, but how can I make it use an index?
I am using PostgreSQL 9.4.
Best Answer
Assuming you want what your current query actually does (which seems different from what your description says).
You ask how to make the filter use an index. Your query plan shows that Postgres is already using indexes for every step of the way; the added Filter step only filters rows according to your additional predicates.
I posted a different idea at first that was missing ORDER BY seq. A query using a simple NOT EXISTS anti-semi-join should be a better equivalent. Your existing indexes look good for it; the second one might be a bit more efficient with descending sort order for the last column.
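A NOT EXISTS formulation along these lines should be equivalent (a sketch; the exact query and index from the original answer may differ, and the index name below is illustrative):

```sql
-- Anti-semi-join: keep a row unless a later row with the same
-- (non-null) reduce_key exists. NULL = NULL is never true, so
-- rows with reduce_key IS NULL always pass the NOT EXISTS test.
SELECT user_id, seq, "timestamp", reduce_key, mapping
FROM   user_sequence u
WHERE  user_id = 860862404
AND    seq > 3562974
AND    NOT EXISTS (
          SELECT 1
          FROM   user_sequence
          WHERE  user_id    = u.user_id
          AND    reduce_key = u.reduce_key
          AND    seq        > u.seq
       )
ORDER  BY seq
LIMIT  1000;

-- Variant of the second index with descending sort order on the
-- last column, which can serve the NOT EXISTS probe slightly better:
CREATE INDEX user_sequence_user_id_reduce_key_seq_desc_idx
    ON user_sequence (user_id, reduce_key, seq DESC);
```

Note that the inner subquery deliberately omits the seq > 3562974 bound, matching the original query's correlated subquery: a row is dropped if any later row with the same reduce_key exists anywhere for that user.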
Normally, there would be a simpler alternative with DISTINCT ON, but that would fold all rows with reduce_key IS NULL into a single row, while you want to keep them all.
Related answer with more explanation:
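For illustration, a DISTINCT ON version might look like the sketch below; DISTINCT ON treats NULLs as equal for the purpose of deduplication, so all reduce_key IS NULL rows would collapse into one, which is why it does not fit this case:

```sql
-- Sketch only: DISTINCT ON keeps one row per reduce_key value,
-- picking the row with the biggest seq because of the sort order.
-- All NULL-key rows would be folded into a single row here.
SELECT DISTINCT ON (reduce_key)
       user_id, seq, "timestamp", reduce_key, mapping
FROM   user_sequence
WHERE  user_id = 860862404
AND    seq > 3562974
ORDER  BY reduce_key, seq DESC;
```

An outer query would additionally be needed to re-sort the result by seq and apply the LIMIT.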
Aside: You have a lot of unnecessary parentheses and double-quotes. Maybe an inefficient ORM?