Consistent rows
The important question, which does not seem to be on your radar yet: from each set of rows for the same seriesName, do you want the columns of one row, or just any values from multiple rows (which may or may not go together)?
Your answer does the latter: you combine the maximum dbid with the maximum retreivaltime, which may come from a different row.
To get consistent rows, use DISTINCT ON and wrap it in a subquery to order the result differently:
SELECT * FROM (
SELECT DISTINCT ON (seriesName)
dbid, seriesName, retreivaltime
FROM FileItems
WHERE sourceSite = 'mk'
ORDER BY seriesName, retreivaltime DESC NULLS LAST -- latest retreivaltime
) sub
ORDER BY retreivaltime DESC NULLS LAST
LIMIT 100;
Details for DISTINCT ON:
Aside: the column should probably be named retrievalTime, or better yet retrieval_time. Unquoted mixed-case identifiers are a common source of confusion in Postgres.
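A minimal, hypothetical illustration of the pitfall (Postgres folds unquoted identifiers to lower case):

CREATE TABLE tmp ("retrievalTime" timestamptz);  -- quoted: case is preserved

SELECT retrievalTime   FROM tmp;  -- fails: column "retrievaltime" does not exist
SELECT "retrievalTime" FROM tmp;  -- works: quoted identifier matches exactly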
Better Performance with rCTE
Since we are dealing with a big table here, we need a query that can use an index, which is not the case for the above query (except for WHERE sourceSite = 'mk').
On closer inspection, your problem seems to be a special case of a loose index scan. Postgres does not support loose index scans natively, but it can be emulated with a recursive CTE. There is a code example for the simple case in the Postgres Wiki.
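For context, here is a sketch of the basic emulation pattern for the simple case (distinct values of a single column), adapted to seriesName for illustration; the full solution for your more complex case follows below:

WITH RECURSIVE cte AS (
   (SELECT seriesName FROM MangaItems ORDER BY seriesName LIMIT 1)
   UNION ALL
   SELECT (SELECT m.seriesName          -- next distinct value, or NULL when done
           FROM   MangaItems m
           WHERE  m.seriesName > c.seriesName
           ORDER  BY m.seriesName
           LIMIT  1)
   FROM   cte c
   WHERE  c.seriesName IS NOT NULL      -- stop recursion after the last value
   )
SELECT seriesName FROM cte
WHERE  seriesName IS NOT NULL;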
Related answer on SO with more advanced solutions, explanation, fiddle:
Your case is more complex, though. But I think I found a variant to make it work for you. Building on this index (without WHERE sourceSite = 'mk'):
CREATE INDEX mi_special_full_idx ON MangaItems
   (retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid);
Or (with WHERE sourceSite = 'mk'):
CREATE INDEX mi_special_granulated_idx ON MangaItems
   (sourceSite, retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid);
The first index can be used for both queries, but is not fully efficient with the additional WHERE condition. The second index is of very limited use for the first query. Since you have both variants of the query, consider creating both indexes.
I added dbid at the end to allow index-only scans.
This query with a recursive CTE makes use of the index. I tested with Postgres 9.3 and it works for me: no sequential scan, all index-only scans:
WITH RECURSIVE cte AS (
(
SELECT dbid, seriesName, retreivaltime, 1 AS rn, ARRAY[seriesName] AS arr
FROM MangaItems
WHERE sourceSite = 'mk'
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
)
UNION ALL
SELECT i.dbid, i.seriesName, i.retreivaltime, c.rn + 1, c.arr || i.seriesName
FROM cte c
, LATERAL (
SELECT dbid, seriesName, retreivaltime
FROM MangaItems
WHERE (retreivaltime, seriesName) < (c.retreivaltime, c.seriesName)
AND sourceSite = 'mk' -- repeat condition!
AND seriesName <> ALL(c.arr)
ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST
LIMIT 1
) i
WHERE c.rn < 101
)
SELECT dbid
FROM cte
ORDER BY rn;
You need to include seriesName in the ORDER BY, since retreivaltime is not unique. "Almost" unique is still not unique.
Explain
- The non-recursive term starts with the latest row.
- The recursive term adds the next-latest row with a seriesName that's not in the list yet, etc., until we have 100 rows.
- Essential parts are the join condition (retreivaltime, seriesName) < (c.retreivaltime, c.seriesName) and the ORDER BY clause ORDER BY retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST. Both match the sort order of the index, which allows for the magic to happen.
- seriesName is collected in an array to rule out duplicates. The cost for seriesName <> ALL(c.arr) grows progressively with the number of rows, but for just 100 rows it is still cheap.
- Just returning dbid, as clarified in the comments.
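Should you need whole rows instead of just dbid, a sketch of the variant: replace the final SELECT of the query above with a join back to the table.

-- substitute for the final "SELECT dbid FROM cte ORDER BY rn;"
SELECT m.*
FROM   cte c
JOIN   MangaItems m USING (dbid)
ORDER  BY c.rn;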
Alternative with partial indexes:
We have been dealing with similar problems before. Here is a highly optimized complete solution based on partial indexes and a looping function:
Probably the fastest way (except for a materialized view) if done right. But more complex.
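The core idea, sketched with a hypothetical index name (the linked answer has the complete solution including the looping function): one partial index per relevant sourceSite, dropping the column from the index in exchange for a WHERE clause.

CREATE INDEX mi_mk_partial_idx ON MangaItems
   (retreivaltime DESC NULLS LAST, seriesName DESC NULLS LAST, dbid)
WHERE  sourceSite = 'mk';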
Materialized View
Since you do not have a lot of write operations and they are not performance-critical as stated in the comments (should be in the question), save the top n pre-computed rows in a materialized view and refresh it after relevant changes to the underlying table. Base your performance-critical queries on the materialized view instead.
It could just be a "thin" MV of the latest 1000 dbid or so, joined to the original table in the query. That works, for instance, if content is sometimes updated but the top n rows remain unchanged.
Or a "fat" MV with whole rows to return. Faster, yet. It needs to be refreshed more often, obviously.
Details in the manual here and here.
The query you have
You could simplify your query using a WINDOW clause, but that's just shortening the syntax, not changing the query plan.
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month',trans_date)
ORDER BY trans_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
- Also using the slightly faster count(*), since id is certainly defined NOT NULL?
- And you don't need to ORDER BY entity_id, since you already PARTITION BY entity_id.
You can simplify further, though: don't add ORDER BY to the window definition at all; it's not relevant to your query. Then you don't need to define a custom window frame, either:
SELECT id, trans_ref_no, amount, trans_date, entity_id
, SUM(amount) OVER w AS trans_total
, COUNT(*) OVER w AS trans_count
FROM transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month', trans_date));
Simpler, faster, but still just a better version of what you have, with static months.
The query you might want
... is not clearly defined, so I'll build on these assumptions:
Count transactions and amount for every 30-day period within the first and last transaction of any entity_id. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.
SELECT entity_id, trans_date
, COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
, COALESCE(sum(daily_count) OVER w, 0) AS trans_count
FROM (
SELECT entity_id
, generate_series (min(trans_date)::timestamp
, GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
, interval '1 day')::date AS trans_date
FROM transactiondb
GROUP BY 1
) x
LEFT JOIN (
SELECT entity_id, trans_date
, sum(amount) AS daily_amount, count(*) AS daily_count
FROM transactiondb
GROUP BY 1, 2
) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);
This lists all 30-day periods for each entity_id with your aggregates and with trans_date being the first day (incl.) of the period. To get values for each individual row, join to the base table once more, as sketched below.
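A hedged sketch of that extra join, wrapping the query above in a CTE (the name periods is made up). Note that, under the stated assumptions, no period starts within the last 29 days of an entity's activity, so rows there come out with NULL totals from the LEFT JOIN:

WITH periods AS (
   -- the query above, verbatim
   SELECT entity_id, trans_date
        , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
        , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
   FROM  (
      SELECT entity_id
           , generate_series (min(trans_date)::timestamp
                            , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                            , interval '1 day')::date AS trans_date
      FROM   transactiondb
      GROUP  BY 1
      ) x
   LEFT   JOIN (
      SELECT entity_id, trans_date
           , sum(amount) AS daily_amount, count(*) AS daily_count
      FROM   transactiondb
      GROUP  BY 1, 2
      ) t USING (entity_id, trans_date)
   WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
                ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING)
   )
SELECT t.id, t.trans_ref_no, t.amount, t.trans_date, t.entity_id
     , p.trans_total, p.trans_count
FROM   transactiondb t
LEFT   JOIN periods p USING (entity_id, trans_date);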
The basic difficulty is the same as discussed here:
The frame definition of a window cannot depend on values of the current row.
And rather call generate_series() with timestamp input:
The query you actually want
After question update and discussion: accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.
Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins yet:
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
, sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM transactiondb t0
JOIN transactiondb t1 USING (entity_id)
WHERE t1.trans_date >= t0.trans_date
AND t1.trans_date < t0.trans_date + 30 -- exclude upper bound
-- AND t0.entity_id = 114284 -- or pick a single entity ...
GROUP BY t0.id -- is PK!
ORDER BY t0.trans_date, t0.id;
SQL Fiddle.
A rolling window could only make sense (with respect to performance) with data for most days.
This does not aggregate duplicates on (trans_date, entity_id) per day, but all rows of the same day are always included in the 30-day window.
For a big table, a covering index like this could help quite a bit:
CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);
The last column amount is only useful if you get index-only scans out of it; else drop it.
The index is not going to be used while you select the whole table anyway, but it would support queries for a small subset.
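To check whether the index pays off for such a subset, something like this (reusing the entity_id value from the commented line above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT t0.id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t0.entity_id = 114284
AND    t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30
GROUP  BY t0.id;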
Best Answer
Your assumptions are almost correct.
The order of events:
1. FROM list
2. WHERE clause
3. GROUP BY / aggregate functions
4. ORDER BY
5. DISTINCT ON (has to match leading columns of ORDER BY)
6. LIMIT (not in your case)
Window functions operate logically independently of the ORDER BY clause, but there are synergies for performance if both use the same sort order.
There is not enough information to tell whether this query does what it is supposed to do.
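To illustrate the DISTINCT ON / ORDER BY interaction, a minimal sketch (using transactiondb from above for illustration):

-- picks the latest transaction per entity_id;
-- the DISTINCT ON expression matches the leading ORDER BY column
SELECT DISTINCT ON (entity_id) *
FROM   transactiondb
ORDER  BY entity_id, trans_date DESC;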
Related answers to shed some more light on the topic: