Postgresql – How to get next/previous partition count using window function

postgresql, window functions

I'm trying to get the current_count, next_count, and previous_count for date.month partitions with window functions:

WITH tmp1 AS (
SELECT location.city, date.month, date.year
     , COUNT(*) OVER W AS current_count, COUNT(*) OVER W2 AS count_next
FROM fact, crime, date, location 
WHERE fact.location_key = location.location_key and crime.crime_key = fact.crime_key and fact.date_key = date.date_key

WINDOW W AS (
        PARTITION BY location.city, date.year, date.month
        ORDER BY date.month
    ),

    W2 AS (
        PARTITION BY location.city, date.year, date.month
        ORDER BY date.month
        RANGE BETWEEN CURRENT ROW AND FOLLOWING
    )
)

SELECT *
FROM tmp1 
GROUP BY city, year, month,current_count, count_next

However, the code above works only for current_count; count_next shows the same result as current_count.

How can I show the counts for the next month and the previous month with window functions?

Best Answer

Do the aggregation first. Then use the lead() and lag() window functions. Looks like you were aiming for something like this:

SELECT *
     , lead(curr_count) OVER w AS next_count
     , lag(curr_count)  OVER w AS prev_count
FROM (
   SELECT l.city, d.year, d.month, count(*) AS curr_count
   FROM   fact     f
   JOIN   crime    c ON c.crime_key = f.crime_key    -- see below
   JOIN   date     d ON d.date_key = f.date_key
   JOIN   location l ON l.location_key = f.location_key
   GROUP  BY l.city, d.year, d.month
   ) sub
WINDOW w AS (PARTITION BY city ORDER BY year, month)
ORDER  BY city, year, month;

You can use a CTE or a subquery (like I did) to apply the logic in two steps, which may be easier to understand at first. But you can also apply window functions over aggregate functions in the same query level. See:

Best way to get result count before LIMIT was applied

So this is equivalent and shorter (probably with the same query plan and performance):

SELECT l.city, d.year, d.month
     , count(*)              AS curr_count
     , lead(count(*)) OVER w AS next_count
     , lag(count(*))  OVER w AS prev_count
FROM   fact     f
JOIN   crime    c USING (crime_key)    -- see below
JOIN   date     d USING (date_key)
JOIN   location l USING (location_key)
GROUP  BY l.city, d.year, d.month
WINDOW w AS (PARTITION BY l.city ORDER BY d.year, d.month);
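To try the single-level form without the original tables, here is a minimal sketch that swaps in a hypothetical inline table events(city, year, month) built with VALUES. The table name and the sample rows are invented for illustration and are not part of the question's schema.

```sql
-- Hypothetical sample data: one row per recorded incident (made up for illustration).
WITH events (city, year, month) AS (
   VALUES ('Boston', 2019, 11)
        , ('Boston', 2019, 12)
        , ('Boston', 2019, 12)
        , ('Boston', 2020, 1)
        , ('Denver', 2019, 12)
   )
SELECT city, year, month
     , count(*)              AS curr_count
     , lead(count(*)) OVER w AS next_count  -- NULL for the last month of each city
     , lag(count(*))  OVER w AS prev_count  -- NULL for the first month of each city
FROM   events
GROUP  BY city, year, month
WINDOW w AS (PARTITION BY city ORDER BY year, month)
ORDER  BY city, year, month;
```

For Boston, the row for December 2019 gets next_count from January 2020, so the year boundary is crossed as intended. Note that lead() and lag() step to the neighboring row that exists in the result, so a month without any rows is skipped rather than reported as 0.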

Use explicit JOIN syntax instead of the comma-separated FROM list. It is equivalent and much easier to read.

The USING shorthand in the join clauses of my second query is only valid if the column names are unambiguous. Otherwise, stick to the explicit ON syntax of the first query.

Why join to the table crime? Unless the join eliminates or multiplies rows, you can remove it completely.

Use table aliases to keep the noise low.

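You had PARTITION BY location.city, date.year, date.month ORDER BY date.month. With year and month in the partition, every partition covers a single month, so a window function can never see the next or previous month. I suspect you really want PARTITION BY city ORDER BY year, month, which also makes the query work across year boundaries.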

It would probably be more efficient to use a single column of type date (4 bytes) instead of year and month. (But don't name it just "date".)

Note how I skipped ORDER BY city, year, month in the second query. You normally get rows sorted that way anyway due to the window definition, but there are no guarantees without an explicit ORDER BY. You did not specify whether sorted output is required. If it is, add an explicit ORDER BY.
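The single-date-column suggestion could look roughly like the sketch below. The inline table events and its column event_date are hypothetical names chosen for illustration, and truncating with date_trunc() is just one possible way to group by month.

```sql
-- Hypothetical sample data with one date column instead of separate year and month.
WITH events (city, event_date) AS (
   VALUES ('Boston', date '2019-12-03')
        , ('Boston', date '2019-12-17')
        , ('Boston', date '2020-01-05')
   )
SELECT city
     , date_trunc('month', event_date)::date AS month_start
     , count(*)              AS curr_count
     , lead(count(*)) OVER w AS next_count
     , lag(count(*))  OVER w AS prev_count
FROM   events
GROUP  BY city, date_trunc('month', event_date)
WINDOW w AS (PARTITION BY city ORDER BY date_trunc('month', event_date))
ORDER  BY city, month_start;  -- explicit ORDER BY, since sort order is not guaranteed otherwise
```

Sorting by the truncated date handles year boundaries without a second sort column.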

Related Solutions

Postgresql – Get all partition names for a table

Use the first query from the answer you linked and add a simple WHERE clause to get the partitions of a single table:

SELECT
    nmsp_parent.nspname AS parent_schema,
    parent.relname      AS parent,
    nmsp_child.nspname  AS child_schema,
    child.relname       AS child
FROM pg_inherits
    JOIN pg_class parent            ON pg_inherits.inhparent = parent.oid
    JOIN pg_class child             ON pg_inherits.inhrelid   = child.oid
    JOIN pg_namespace nmsp_parent   ON nmsp_parent.oid  = parent.relnamespace
    JOIN pg_namespace nmsp_child    ON nmsp_child.oid   = child.relnamespace
WHERE parent.relname='parent_table_name';
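The filter on relname matches a table of that name in any schema. If that matters, a variant is to compare inhparent with a regclass cast instead. The sketch below assumes PostgreSQL 10 or later (declarative partitioning) and creates a hypothetical temporary partitioned table only so that the query has something to list.

```sql
-- Hypothetical partitioned table and partitions, created for illustration only.
CREATE TEMP TABLE parent_table_name (
   id      int
 , created date NOT NULL
) PARTITION BY RANGE (created);

CREATE TEMP TABLE parent_table_name_2019 PARTITION OF parent_table_name
   FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
CREATE TEMP TABLE parent_table_name_2020 PARTITION OF parent_table_name
   FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

-- The regclass cast resolves the name through the search_path (or accepts a
-- schema-qualified name), so only this one table is matched.
SELECT inhrelid::regclass AS child
FROM   pg_inherits
WHERE  inhparent = 'parent_table_name'::regclass
ORDER  BY inhrelid::regclass::text;
```

This lists direct children only, the same as the query above.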

Postgresql – How to get the aggregate of a window function in Postgres

You can have window functions on the result of aggregate functions in a single query level.

This would all work nicely after a few modifications, except that it fails for the standard deviation on mathematical principle. The calculations involved are not linear, so you cannot simply combine the standard deviations of sub-populations.

SELECT perm
      ,combo
      ,avg(value)                 AS perm_average_value
      ,sum(avg(value) * count(*)) OVER w_combo /
       sum(count(*)) OVER w_combo AS combo_average_value
      ,stddev_pop(value)          AS perm_stddev
      ,0                          AS combo_stddev  -- doesn't work!
      ,count(*)                   AS perm_count
      ,sum(count(*)) OVER w_combo AS combo_count
FROM   foo
GROUP  BY perm, combo
WINDOW w_combo  AS (PARTITION BY combo);

For combo_average_value you need this expression:

sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo

The reason is that you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
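A small worked example may make the weighting visible. This is a sketch with made-up numbers, and it uses plain integers for perm and combo instead of arrays to keep the sample short.

```sql
-- Hypothetical data: perm 1 has three values, perm 2 has one, both in combo 1.
WITH foo (perm, combo, value) AS (
   VALUES (1, 1, 10.0)
        , (1, 1, 20.0)
        , (1, 1, 30.0)
        , (2, 1, 100.0)
   )
SELECT perm, combo
     , avg(value)                 AS perm_average_value
     , avg(avg(value)) OVER w     AS unweighted_average   -- average of averages
     , sum(avg(value) * count(*)) OVER w
     / sum(count(*)) OVER w       AS combo_average_value  -- weighted by group size
FROM   foo
GROUP  BY perm, combo
WINDOW w AS (PARTITION BY combo);
```

The group averages are 20 (three rows) and 100 (one row). The unweighted average of those is 60, while the weighted expression yields (20 * 3 + 100 * 1) / 4 = 40, which equals the plain average over all four values.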

This works:

SELECT DISTINCT ON (perm, combo)
       perm
      ,combo
      ,avg(value)        OVER wpc AS perm_average_value
      ,avg(value)        OVER wc  AS combo_average_value
      ,stddev_pop(value) OVER wpc AS perm_stddev
      ,stddev_pop(value) OVER wc  AS combo_stddev
      ,count(*)          OVER wpc AS perm_count
      ,count(*)          OVER wc  AS combo_count
FROM   foo
WINDOW wc  AS (PARTITION BY combo)
      ,wpc AS (PARTITION BY perm, combo);

I am using two different windows here and reducing the rows with DISTINCT ON, which is applied after window functions.

But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.

Better performance with altered table layout

Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:

CREATE TABLE combo ( 
  combo_id serial PRIMARY KEY
 ,combo    int[] NOT NULL
);

CREATE TABLE perm ( 
  perm_id  serial PRIMARY KEY
 ,perm     int[] NOT NULL
);

CREATE TABLE value (
  perm_id  int REFERENCES perm(perm_id)
 ,combo_id int REFERENCES combo(combo_id)
 ,value numeric NOT NULL DEFAULT 0
);

If you don't need referential integrity you can omit the foreign key constraints.

The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.

This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x int: 8 bytes, no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)

More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance

Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.

You would first aggregate in a subquery and then join to perm and combo for best performance.
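The aggregate-first approach could look like the following sketch. It recreates the example layout as temporary tables and fills them with made-up rows, so all data and the choice of output columns are assumptions for illustration.

```sql
-- Temporary copies of the example layout, with hypothetical rows.
CREATE TEMP TABLE combo (combo_id serial PRIMARY KEY, combo int[] NOT NULL);
CREATE TEMP TABLE perm  (perm_id  serial PRIMARY KEY, perm  int[] NOT NULL);
CREATE TEMP TABLE value (
   perm_id  int REFERENCES perm(perm_id)
 , combo_id int REFERENCES combo(combo_id)
 , value    numeric NOT NULL DEFAULT 0
);

INSERT INTO combo (combo) VALUES ('{1,2}');
INSERT INTO perm  (perm)  VALUES ('{1,2}'), ('{2,1}');
INSERT INTO value (perm_id, combo_id, value)
VALUES (1, 1, 10), (1, 1, 20), (2, 1, 100);

-- Aggregate on the integer keys first, then join to fetch each array once per group.
SELECT p.perm, c.combo, v.perm_average_value, v.perm_stddev, v.perm_count
FROM  (
   SELECT perm_id, combo_id
        , avg(value)        AS perm_average_value
        , stddev_pop(value) AS perm_stddev
        , count(*)          AS perm_count
   FROM   value
   GROUP  BY perm_id, combo_id
   ) v
JOIN   perm  p USING (perm_id)
JOIN   combo c USING (combo_id);
```

Grouping happens on two integer columns, and the arrays are only looked up for the already reduced set of rows.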
