Consistent rows
The important question, which does not seem to be on your radar yet: from each set of rows for the same seriesName, do you want the columns of one row, or just any values from multiple rows (which may or may not go together)?
Your answer does the latter: you combine the maximum dbid with the maximum retreivaltime, which may come from a different row.
To get consistent rows, use DISTINCT ON and wrap it in a subquery to order the result differently:
SELECT * FROM (
SELECT DISTINCT ON (seriesName)
dbid, seriesName, retreivaltime
FROM FileItems
WHERE sourceSite = 'mk'
ORDER BY seriesName, retreivaltime DESC NULLS LAST -- latest retreivaltime
) sub
ORDER BY retreivaltime DESC NULLS LAST
LIMIT 100;
Details for DISTINCT ON:
Aside: the column should probably be named retrievalTime, or better yet retrieval_time. Unquoted mixed-case identifiers are a common source of confusion in Postgres.
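A minimal, hypothetical illustration of the pitfall (Postgres folds unquoted identifiers to lower case):

CREATE TABLE tmp ("retrievalTime" timestamptz);  -- quoted: case is preserved

SELECT retrievalTime   FROM tmp;  -- fails: column "retrievaltime" does not exist
SELECT "retrievalTime" FROM tmp;  -- works: quoted identifier matches exactly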
Better Performance with rCTE
Since we are dealing with a big table here, we need a query that can use an index, which is not the case for the above query (except for WHERE sourceSite = 'mk').
On closer inspection, your problem seems to be a special case of a loose index scan. Postgres does not support loose index scans natively, but it can be emulated with a recursive CTE. There is a code example for the simple case in the Postgres Wiki.
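For context, here is a sketch of the basic emulation pattern for the simple case (distinct values of a single column), adapted to seriesName for illustration; the full solution for your more complex case follows below:

WITH RECURSIVE cte AS (
   (SELECT seriesName FROM MangaItems ORDER BY seriesName LIMIT 1)
   UNION ALL
   SELECT (SELECT m.seriesName          -- next distinct value, or NULL when done
           FROM   MangaItems m
           WHERE  m.seriesName > c.seriesName
           ORDER  BY m.seriesName
           LIMIT  1)
   FROM   cte c
   WHERE  c.seriesName IS NOT NULL      -- stop recursion after the last value
   )
SELECT seriesName FROM cte
WHERE  seriesName IS NOT NULL;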
Related answer on SO with more advanced solutions, explanation, fiddle:
Your case is more complex, though. But I think I found a variant to make it work for you. Building on this index (without WHERE sourceSite = 'mk'):
CREATE INDEX mi_special_full_idx ON MangaItems
   (retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid);
Or (with WHERE sourceSite = 'mk'):
CREATE INDEX mi_special_granulated_idx ON MangaItems
   (sourceSite, retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid);
The first index can be used for both queries, but is not fully efficient with the additional WHERE condition. The second index is of very limited use for the first query. Since you have both variants of the query, consider creating both indexes.
I added dbid at the end to allow index-only scans.
This query with a recursive CTE makes use of the index. I tested with Postgres 9.3 and it works for me: no sequential scan, all index-only scans:
WITH RECURSIVE cte AS (
(
SELECT dbid, seriesName, retreivaltime, 1 AS rn, ARRAY[seriesName] AS arr
FROM MangaItems
WHERE sourceSite = 'mk'
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT i.dbid, i.seriesName, i.retreivaltime, c.rn + 1, c.arr || i.seriesName
FROM cte c
, LATERAL (
SELECT dbid, seriesName, retreivaltime
FROM MangaItems
WHERE (retreivaltime, seriesName) < (c.retreivaltime, c.seriesName)
AND sourceSite = 'mk' -- repeat condition!
AND seriesName <> ALL(c.arr)
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
) i
WHERE c.rn < 101
)
SELECT dbid
FROM cte
ORDER BY rn;
You need to include seriesName in the ORDER BY, since retreivaltime is not unique. "Almost" unique is still not unique.
Explain
- The non-recursive term starts with the latest row.
- The recursive term adds the next-latest row with a seriesName that's not in the list yet, etc., until we have 100 rows.
- Essential parts are the join condition (retreivaltime, seriesName) < (c.retreivaltime, c.seriesName) and the ORDER BY clause ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST. Both match the sort order of the index, which allows for the magic to happen.
- seriesName is collected in an array to rule out duplicates. The cost for seriesName <> ALL(c.arr) grows progressively with the number of rows, but for just 100 rows it is still cheap.
- Just returning dbid, as clarified in the comments.
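Should you need whole rows instead of just dbid, a sketch of the variant: replace the final SELECT of the query above with a join back to the table.

-- substitute for the final "SELECT dbid FROM cte ORDER BY rn;"
SELECT m.*
FROM   cte c
JOIN   MangaItems m USING (dbid)
ORDER  BY c.rn;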
Alternative with partial indexes:
We have been dealing with similar problems before. Here is a highly optimized complete solution based on partial indexes and a looping function:
Probably the fastest way (except for a materialized view) if done right. But more complex.
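The core idea, sketched with a hypothetical index name (the linked answer has the complete solution including the looping function): one partial index per relevant sourceSite, dropping the column from the index in exchange for a WHERE clause.

CREATE INDEX mi_mk_partial_idx ON MangaItems
   (retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)
WHERE  sourceSite = 'mk';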
Materialized View
Since you do not have a lot of write operations and they are not performance-critical as stated in the comments (should be in the question), save the top n pre-computed rows in a materialized view and refresh it after relevant changes to the underlying table. Base your performance-critical queries on the materialized view instead.
It could just be a "thin" MV of the latest 1000 dbid or so, joined to the original table in the query. That works, for instance, if content is sometimes updated but the top n rows remain unchanged.
Or a "fat" MV with whole rows to return. Faster, yet. It needs to be refreshed more often, obviously.
Details in the manual here and here.
The query you have
You could simplify your query using a WINDOW clause, but that's just shortening the syntax, not changing the query plan.
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
ORDER BY trans_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
- Also using the slightly faster count(*), since id is certainly defined NOT NULL?
- And you don't need to ORDER BY entity_id, since you already PARTITION BY entity_id.
You can simplify further, though: don't add ORDER BY to the window definition at all; it's not relevant to your query. Then you don't need to define a custom window frame, either:
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month', trans_date));
Simpler, faster, but still just a better version of what you have, with static months.
The query you might want
... is not clearly defined, so I'll build on these assumptions:
Count transactions and amount for every 30-day period within the first and last transaction of any entity_id. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.
SELECT entity_id, trans_date
, COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
, COALESCE(sum(daily_count) OVER w, 0) AS trans_count
FROM (
SELECT entity_id
, generate_series (min(trans_date)::timestamp
, GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
, interval '1 day')::date AS trans_date
FROM transactiondb
GROUP BY 1
) x
LEFT JOIN (
SELECT entity_id, trans_date
, sum(amount) AS daily_amount, count(*) AS daily_count
FROM transactiondb
GROUP BY 1, 2
) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);
This lists all 30-day periods for each entity_id with your aggregates and with trans_date being the first day (incl.) of the period. To get values for each individual row, join to the base table once more, as sketched below.
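A hedged sketch of that extra join, wrapping the query above in a CTE (the name periods is made up). Note that, under the stated assumptions, no period starts within the last 29 days of an entity's activity, so rows there come out with NULL totals from the LEFT JOIN:

WITH periods AS (
   -- the query above, verbatim
   SELECT entity_id, trans_date
        , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
        , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
   FROM  (
      SELECT entity_id
           , generate_series (min(trans_date)::timestamp
                            , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                            , interval '1 day')::date AS trans_date
      FROM   transactiondb
      GROUP  BY 1
      ) x
   LEFT   JOIN (
      SELECT entity_id, trans_date
           , sum(amount) AS daily_amount, count(*) AS daily_count
      FROM   transactiondb
      GROUP  BY 1, 2
      ) t USING (entity_id, trans_date)
   WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
                ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING)
   )
SELECT t.id, t.trans_ref_no, t.amount, t.trans_date, t.entity_id
     , p.trans_total, p.trans_count
FROM   transactiondb t
LEFT   JOIN periods p USING (entity_id, trans_date);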
The basic difficulty is the same as discussed here:
The frame definition of a window cannot depend on values of the current row.
And rather call generate_series() with timestamp input:
The query you actually want
After question update and discussion: accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.
Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins yet:
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
, sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM transactiondb t0
JOIN transactiondb t1 USING (entity_id)
WHERE t1.trans_date >= t0.trans_date
AND t1.trans_date < t0.trans_date + 30 -- exclude upper bound
-- AND t0.entity_id = 114284 -- or pick a single entity ...
GROUP BY t0.id -- is PK!
ORDER BY t0.trans_date, t0.id;
SQL Fiddle.
A rolling window could only make sense (with respect to performance) with data for most days.
This does not aggregate duplicates on (trans_date, entity_id) per day, but all rows of the same day are always included in the 30-day window.
For a big table, a covering index like this could help quite a bit:
CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);
The last column amount is only useful if you get index-only scans out of it; else drop it.
The index is not going to be used while you select the whole table anyway, but it would support queries for a small subset.
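To check whether the index pays off for such a subset, something like this (reusing the entity_id value from the commented line above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT t0.id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t0.entity_id = 114284
AND    t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30
GROUP  BY t0.id;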
Best Answer
Your assumptions are almost correct.
The order of events:
1. FROM list
2. WHERE clause
3. GROUP BY / aggregate functions
4. ORDER BY
5. DISTINCT ON (has to match leading columns of ORDER BY)
6. LIMIT (not in your case)
Window functions operate logically independently of the ORDER BY clause, but there are synergies for performance if both use the same sort order.
There is not enough information to tell whether this query does what it is supposed to do.
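To illustrate the DISTINCT ON / ORDER BY interaction, a minimal sketch (using transactiondb from above for illustration):

-- picks the latest transaction per entity_id;
-- the DISTINCT ON expression matches the leading ORDER BY column
SELECT DISTINCT ON (entity_id) *
FROM   transactiondb
ORDER  BY entity_id, trans_date DESC;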
Related answers to shed some more light on the topic: