Postgresql – How to get next/previous partition count using window function

postgresqlwindow functions

I'm trying to get the current_count, next_count, and previous_count for date.month partitions with window functions:

WITH tmp1 AS (
SELECT location.city, date.month, date.year
     , COUNT(*) OVER W AS current_count, COUNT(*) OVER W2 AS 
count_next
FROM fact, crime, date, location 
WHERE fact.location_key = location.location_key and crime.crime_key = 
fact.crime_key and fact.date_key = date.date_key

WINDOW W AS (
        PARTITION BY location.city, date.year, date.month
        ORDER BY date.month
    ),

    W2 AS (
        PARTITION BY location.city, date.year, date.month
        ORDER BY date.month
        RANGE BETWEEN CURRENT ROW AND FOLLOWING
    )
)

SELECT *
FROM tmp1 
GROUP BY city, year, month,current_count, count_next

However, the code above works for current_count only, the count_next shows the same result as the current_count.

How to show the count for next month and previous month? With window functions?

Best Answer

Do the aggregation first. Then use the lead() and lag() window functions. Looks like you were aiming for something like this:

SELECT *
     , lead(curr_count) OVER w AS next_count
     , lag(curr_count)  OVER w AS prev_count
FROM (
   SELECT l.city, d.year, d.month, count(*) AS curr_count
   FROM   fact     f
   JOIN   crime    c ON c.crime_key = f.crime_key    -- see below
   JOIN   date     d ON d.date_key = f.date_key
   JOIN   location l ON l.location_key = f.location_key
   GROUP  BY l.city, d.year, d.month
   ) sub
WINDOW w AS (PARTITION BY city ORDER BY year, month)
ORDER  BY city, year, month;

You can use a CTE or a subquery (like I did) to apply the logic in two steps. May be easier to understand at first. But you can add window functions over aggregate functions in the same query level. See:

So this is equivalent and shorter (probably same query plan / performance):

SELECT l.city, d.year, d.month
     , count(*)              AS curr_count
     , lead(count(*)) OVER w AS next_count
     , lag(count(*))  OVER w AS prev_count
FROM   fact     f
JOIN   crime    c USING (crime_key)    -- see below
JOIN   date     d USING (date_key)
JOIN   location l USING (location_key)
GROUP  BY l.city, d.year, d.month
WINDOW w AS (PARTITION BY l.city ORDER BY d.year, d.month);

Use equivalent explicit join syntax. Much easier to read.

The USING shorthand in the join clause of my second query is only legit if column names are unambiguous. Else stick to the explicit ON syntax in the first query.

Why join to table crime? Does the join eliminate or multiply rows? Else you can remove it completely.

Use table aliases to keep the noise low.

You had PARTITION BY location.city, date.year, date.month ORDER BY date.month. But I suggest you really want PARTITION BY city ORDER BY year, month to make the query work across year boundaries?

It would probably be more efficient to use a single column of type date (4 bytes) instead of year and month. (But don't name it just "date".)

Note how I skipped ORDER BY city, year, month in the second query. You normally get rows sorted that way anyway due to the window definition, but there are no guarantees without explicit ORDER BY. And you did not specify that order is required. If so, you might want to be explicit about this and add ORDER BY.