The query you have
You could simplify your query using a WINDOW
clause, but that's just shortening the syntax, not changing the query plan.
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
ORDER BY trans_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
- Also using the slightly faster count(*), since id is certainly defined NOT NULL.
- And you don't need ORDER BY entity_id, since you already PARTITION BY entity_id.
You can simplify further, though:
Don't add ORDER BY to the window definition at all; it's not relevant to your query. Then you don't need to define a custom window frame, either:
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date));
Simpler, faster, but still just a better version of what you have, with static months.
The query you might want
... is not clearly defined, so I'll build on these assumptions:
Count transactions and amount for every 30-day period within the first and last transaction of any entity_id
. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.
SELECT entity_id, trans_date
, COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
, COALESCE(sum(daily_count) OVER w, 0) AS trans_count
FROM (
SELECT entity_id
, generate_series (min(trans_date)::timestamp
, GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
, interval '1 day')::date AS trans_date
FROM transactiondb
GROUP BY 1
) x
LEFT JOIN (
SELECT entity_id, trans_date
, sum(amount) AS daily_amount, count(*) AS daily_count
FROM transactiondb
GROUP BY 1, 2
) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);
This lists all 30-day periods for each entity_id
with your aggregates and with trans_date
being the first day (incl.) of the period. To get values for each individual row, join to the base table once more ...
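That extra join might look like this, a sketch that wraps the query above in a CTE (the CTE name periods is an assumption):

```sql
-- Sketch: joins each transaction row to the totals
-- of the 30-day period starting on its trans_date.
WITH periods AS (
   SELECT entity_id, trans_date
        , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
        , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
   FROM  (
      SELECT entity_id
           , generate_series (min(trans_date)::timestamp
                            , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                            , interval '1 day')::date AS trans_date
      FROM   transactiondb
      GROUP  BY 1
      ) x
   LEFT   JOIN (
      SELECT entity_id, trans_date
           , sum(amount) AS daily_amount, count(*) AS daily_count
      FROM   transactiondb
      GROUP  BY 1, 2
      ) t USING (entity_id, trans_date)
   WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
                ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING)
   )
SELECT t.*, p.trans_total, p.trans_count
FROM   transactiondb t
JOIN   periods p USING (entity_id, trans_date);
```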
The basic difficulty is the same as discussed here:
The frame definition of a window cannot depend on values of the current row.
And rather call generate_series() with timestamp input.
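For illustration (the dates are placeholders): with timestamp input the series is generated in plain timestamps, which you can cast back to date:

```sql
-- One row per day; the timestamp result is cast back to date.
SELECT generate_series(timestamp '2017-01-01'
                     , timestamp '2017-01-31'
                     , interval '1 day')::date AS the_day;
```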
The query you actually want
After question update and discussion:
Accumulate rows of the same entity_id
in a 30-day window starting at each actual transaction.
Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not yet have LATERAL joins:
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
, sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM transactiondb t0
JOIN transactiondb t1 USING (entity_id)
WHERE t1.trans_date >= t0.trans_date
AND t1.trans_date < t0.trans_date + 30 -- exclude upper bound
-- AND t0.entity_id = 114284 -- or pick a single entity ...
GROUP BY t0.id -- is PK!
ORDER BY t0.trans_date, t0.id
SQL Fiddle.
A rolling window could only make sense (with respect to performance) with data for most days.
This does not aggregate duplicates on (trans_date, entity_id)
per day, but all rows of the same day are always included in the 30-day window.
For a big table, a covering index like this could help quite a bit:
CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);
The last column amount
is only useful if you get index-only scans out of it. Else drop it.
But it's not going to be used while you select the whole table anyway. It would support queries for a small subset.
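For example, a query restricted to one entity and a narrow date range (the ID and dates are placeholders) touches only columns covered by the index:

```sql
-- All three referenced columns are in the index,
-- so Postgres can (at best) answer this with an index-only scan.
SELECT trans_date, amount
FROM   transactiondb
WHERE  entity_id = 114284                  -- placeholder entity
AND    trans_date >= date '2012-01-01'     -- placeholder range
AND    trans_date <  date '2012-02-01';
```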
Your query can be simplified in various respects:
SELECT to_char(day, 'DD/MM/YYYY') AS date
, hashtag
, count(d.*)::int AS count
FROM (
campaign c
CROSS JOIN json_array_elements_text(c.configuration#>'{query,hashtags}') ch(hashtag)
CROSS JOIN (SELECT g::date AS day
FROM generate_series(timestamp '2016-01-20', '2016-01-26', interval '1 day') g) day
)
NATURAL LEFT JOIN (
SELECT "createdAt"::date AS day, dh.hashtag
FROM data, json_array_elements_text(meta#>'{matchedOn,hashtags}') dh(hashtag)
WHERE "campaignId" = 1
AND "createdAt" >= '2016-01-20'
AND "createdAt" < '2016-01-27'
) d
WHERE c.id = 1
GROUP BY day, hashtag
ORDER BY day, hashtag, count;
This should be faster for multiple reasons. Not least it can use a multi-column index on data("campaignId", "createdAt")
- which you should create unless you have it.
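Such an index could be created like this (the index name is an assumption):

```sql
CREATE INDEX data_campaign_created_idx ON data ("campaignId", "createdAt");
```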
LEFT JOIN is the core feature you need. Then count the column: only non-null values are counted ...
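A minimal sketch of that behavior (the VALUES rows are made-up stand-ins for the calendar and data rows):

```sql
-- count(d.*) counts only rows that actually matched;
-- null-extended rows from the LEFT JOIN are skipped.
SELECT day, count(d.*) AS ct
FROM  (VALUES (date '2016-01-20'), (date '2016-01-21')) AS cal(day)
LEFT  JOIN (VALUES (date '2016-01-20')) AS d(day) USING (day)
GROUP BY day
ORDER BY day;
-- day 2016-01-20 → ct = 1; day 2016-01-21 → ct = 0
```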
One of the very rare cases where a NATURAL JOIN
is useful, btw. You don't need it, though. It's just a minor syntax shortcut.
And rather use timestamp input for generate_series().
Best Answer
This is special case of relational-division. Predicate dates just happen to be in sequence and duplicates are excluded. There are many possible solutions. The best fit depends on the complete use case.
If, for example, you need to optimize performance, your table is big, the column did has high cardinality, typically only few did qualify, and there is an index on (dt, did), then this query should be substantially faster than queries with GROUP BY & count().

Why? Because it can answer your query with a few comparatively very cheap index (only) scans, excluding non-qualifying rows early, while queries based on GROUP BY always have to process the whole table.

ctid in my example is a safe fallback to identify individual rows, since we don't know your actual design.

This gets verbose for long date ranges. It may pay to generate the query dynamically. Or a mixed approach might make sense. Or use a recursive CTE to make it short and still very fast:
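A sketch of such a recursive CTE, where the table and column names (tbl, dt, did) and the date range are all assumptions:

```sql
-- Each recursion step keeps only those did that also appear on the next day,
-- so only did present on every day of the range survive to the last day.
WITH RECURSIVE cte AS (
   SELECT dt, did
   FROM   tbl
   WHERE  dt = date '2017-01-01'          -- first day of the range
   UNION ALL
   SELECT t.dt, t.did
   FROM   cte c
   JOIN   tbl t ON t.did = c.did
               AND t.dt  = c.dt + 1
   WHERE  c.dt < date '2017-01-07'        -- stop at the last day
   )
SELECT array_agg(did) AS ids, count(*) AS ct
FROM   cte
WHERE  dt = date '2017-01-07';            -- only did that made it to the last day
```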
This variant returns an array of qualifying IDs and a count.
db<>fiddle here
It all depends on the actual, complete use case.