The query you have
You could simplify your query using a WINDOW
clause, but that's just shortening the syntax, not changing the query plan.
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
ORDER BY trans_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
- Also using the slightly faster count(*), since id is certainly defined NOT NULL?
- And you don't need to ORDER BY entity_id, since you already PARTITION BY entity_id.
You can simplify further, though: don't add ORDER BY to the window definition at all; it's not relevant to your query. Then you don't need to define a custom window frame, either:
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month', trans_date));
Simpler, faster, but still just a better version of what you have, with static months.
The query you might want
... is not clearly defined, so I'll build on these assumptions:
Count transactions and amount for every 30-day period within the first and last transaction of any entity_id
. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.
SELECT entity_id, trans_date
     , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
     , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
FROM  (
   SELECT entity_id
        , generate_series(min(trans_date)::timestamp
                        , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                        , interval '1 day')::date AS trans_date
   FROM   transactiondb
   GROUP  BY 1
   ) x
LEFT   JOIN (
   SELECT entity_id, trans_date
        , sum(amount) AS daily_amount, count(*) AS daily_count
   FROM   transactiondb
   GROUP  BY 1, 2
   ) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
             ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);
This lists all 30-day periods for each entity_id with your aggregates, with trans_date being the first day (incl.) of the period. To get values for each individual row, join to the base table once more, as sketched below.
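A minimal sketch of that extra join, wrapping the query above in a CTE (the name period is an assumption, not from the original):

WITH period AS (
   SELECT entity_id, trans_date
        , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
        , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
   FROM  (
      SELECT entity_id
           , generate_series(min(trans_date)::timestamp
                           , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                           , interval '1 day')::date AS trans_date
      FROM   transactiondb
      GROUP  BY 1
      ) x
   LEFT   JOIN (
      SELECT entity_id, trans_date
           , sum(amount) AS daily_amount, count(*) AS daily_count
      FROM   transactiondb
      GROUP  BY 1, 2
      ) t USING (entity_id, trans_date)
   WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
                ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING)
   )
SELECT t.*, p.trans_total, p.trans_count  -- every transaction row with its period totals
FROM   transactiondb t
JOIN   period p USING (entity_id, trans_date);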
The basic difficulty is the same as discussed here: the frame definition of a window cannot depend on values of the current row. Also, rather call generate_series() with timestamp input:
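The reason, as a sketch with illustrative dates: date input resolves to the timestamptz variant of generate_series(), so the series then depends on the session time zone and can stumble over DST switches; plain timestamp input is immune to that:

-- timestamp input keeps the series stable regardless of time zone:
SELECT generate_series(timestamp '2016-03-24'
                     , timestamp '2016-03-31'
                     , interval '1 day')::date AS the_day;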
The query you actually want
After question update and discussion: accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.

Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins yet:
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30  -- exclude upper bound
-- AND t0.entity_id = 114284  -- or pick a single entity ...
GROUP  BY t0.id  -- is PK!
ORDER  BY t0.trans_date, t0.id;
SQL Fiddle.
A rolling window could only make sense (with respect to performance) with data for most days.
This does not aggregate duplicates on (trans_date, entity_id)
per day, but all rows of the same day are always included in the 30-day window.
For a big table, a covering index like this could help quite a bit:
CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);
The last column amount
is only useful if you get index-only scans out of it. Else drop it.
But it's not going to be used while you select the whole table anyway. It would support queries for a small subset.
The COUNT(*) version is able to simply seek into the index you have on the status column once for each status you are selecting, whereas the SUM(...) version needs to seek the index twelve times (the total number of unique status types). Clearly, seeking an index three times is going to be faster than seeking it twelve times.
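The COUNT(*) version itself is not shown here; hypothetically, it would look something like this, with one index seek per status:

SELECT Approved = (SELECT COUNT(*) FROM dbo.Claims WHERE Status = 'Approved'),
       Valid    = (SELECT COUNT(*) FROM dbo.Claims WHERE Status = 'Valid'),
       Reject   = (SELECT COUNT(*) FROM dbo.Claims WHERE Status = 'Reject');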
The first plan requires a memory grant of 238MB, whereas the second plan requires a memory grant of 650MB. It may be that the larger memory grant could not be immediately filled, making the query that much slower.
Alter the second query to be:
SELECT Approved = SUM(CASE WHEN Status = 'Approved' THEN 1 ELSE 0 END),
Valid = SUM(CASE WHEN Status = 'Valid' THEN 1 ELSE 0 END),
Reject = SUM(CASE WHEN Status = 'Reject' THEN 1 ELSE 0 END)
FROM dbo.Claims c
WHERE c.Status = 'Approved'
OR c.Status = 'Valid'
OR c.Status = 'Reject';
This will allow the query optimizer to eliminate 75% of the index seeks, and should result in a lower required memory grant, lower I/O requirements, and faster time-to-result.
The SUM(CASE WHEN ...)
construct essentially prevents the query optimizer from pushing the Status
predicates down into the index seek portion of the plan.
Best Answer
You mostly answered the question yourself already. I have a few morsels to add:
In PostgreSQL (and other RDBMS that support the boolean type) you can use the boolean result of the test directly:

- Cast it to integer and SUM() it.
- Or use it in a NULLIF() expression and COUNT().
- Or use a simple OR NULL and COUNT().
- Or various other expressions. Performance is almost identical (sketches of the first three follow this list).
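Minimal sketches of those three variants, assuming a table tbl and the test condition kat_id > 50 from the benchmark below:

-- 1) Cast the boolean result to integer and SUM() it:
SELECT sum((kat_id > 50)::int) AS ct FROM tbl;

-- 2) NULLIF() turns FALSE into NULL, which COUNT() ignores:
SELECT count(NULLIF(kat_id > 50, false)) AS ct FROM tbl;

-- 3) OR NULL also turns FALSE into NULL:
SELECT count(kat_id > 50 OR NULL) AS ct FROM tbl;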
COUNT() is typically very slightly faster than SUM(). Unlike SUM(), and as Paul already commented, COUNT() never returns NULL, which may be convenient.

Since Postgres 9.4 there's also the aggregate FILTER clause. It's faster than all of the above by around 5 - 10 %:
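A sketch of the FILTER variant, under the same assumptions as above:

SELECT count(*) FILTER (WHERE kat_id > 50) AS ct FROM tbl;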
If the query is as simple as your test case, with only a single count and nothing else, you can rewrite:
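That rewrite, as a sketch under the same assumptions:

SELECT count(*) FROM tbl WHERE kat_id > 50;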
... which is the true king of performance, even without index.
With an applicable index it can be faster by orders of magnitude, especially with index-only scans.
Benchmarks
Postgres 13
db<>fiddle here
Basically the same results as for Postgres 10 below. (I added a test without the new parallelism.)
Postgres 10
I ran a new series of tests for Postgres 10, including the aggregate FILTER clause and demonstrating the role of an index for small and big counts. Simple setup:
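A comparable test bed, as a sketch (table and column names are assumptions, carried over from the old 9.1 test):

-- 1M rows; kat_id spread over 0..99 so a predicate can match
-- ~1 % (e.g. kat_id = 42) or ~33 % (e.g. kat_id > 66) of all rows
CREATE TABLE tbl AS
SELECT g AS id
     , trunc(random() * 100)::int AS kat_id
FROM   generate_series(1, 1000000) g;

CREATE INDEX tbl_kat_id_idx ON tbl (kat_id);  -- enables index(-only) scans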
Actual times vary quite a bit due to background noise and specifics of the test bed. Showing typical best times from a bigger set of tests. These two cases should capture the essence:
Test 1 counting ~ 1 % of all rows
db<>fiddle here
Test 2 counting ~ 33 % of all rows
db<>fiddle here
The last test in each set used an index-only scan, which is why it helped for counting one third of all rows. Plain index or bitmap index scans cannot compete with a sequential scan when involving roughly 5 % or more of all rows.
Old test for Postgres 9.1
To verify, I ran a quick test with EXPLAIN ANALYZE on a real-life table in PostgreSQL 9.1.6. 74208 of 184568 rows qualified with the condition kat_id > 50. All queries return the same result. I ran each about 10 times in turns to exclude caching effects and appended the best result as a note. Hardly any real difference in performance.
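For reference, one round of such a test, as a sketch (one EXPLAIN ANALYZE per variant from the list above, reusing the hypothetical tbl):

EXPLAIN ANALYZE
SELECT count(kat_id > 50 OR NULL) AS ct FROM tbl;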