The elementary difference is that window functions compute additional columns over all rows in a result set after the rest of the result has been determined. No row is dropped. They have been available since PostgreSQL 8.4.
The LIMIT and OFFSET clauses of the SELECT command, on the other hand, do not compute additional columns. They just pick a certain "window" of rows from the result set (in cooperation with the ORDER BY clause) and discard the rest. They have been around practically forever.
While certain tasks can be tackled with either of these tools, they are very different in nature.
For your simple task of "sorting data on date and then bringing the latest data first", you don't need either of them. Just add:
ORDER BY date_col DESC
According to your comment, you would need:
SELECT col1, col2
FROM tbl
ORDER BY date_col DESC
LIMIT 100 -- 100 latest rows
OFFSET 0; -- just noise, but may be easier to code
Retrieve more:
...
LIMIT 100
OFFSET 100; -- next page of 100 rows ...
Be sure to have an index on date_col in either case!
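As a cross-check, the paging behavior can be sketched with Python's standard-library sqlite3 module (SQLite stands in for PostgreSQL here; the names tbl, date_col, and col1 are just the placeholders used above, and the sample rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col1 TEXT, date_col TEXT)")
rows = [(f"row{i}", f"2024-01-{i:02d}") for i in range(1, 6)]
conn.executemany("INSERT INTO tbl VALUES (?, ?)", rows)

# An index on date_col lets the ORDER BY be served by an index scan.
conn.execute("CREATE INDEX tbl_date_col_idx ON tbl (date_col)")

# Page 1: the two latest rows.
page1 = conn.execute(
    "SELECT col1 FROM tbl ORDER BY date_col DESC LIMIT 2 OFFSET 0"
).fetchall()

# Page 2: the next two.
page2 = conn.execute(
    "SELECT col1 FROM tbl ORDER BY date_col DESC LIMIT 2 OFFSET 2"
).fetchall()

print(page1)  # [('row5',), ('row4',)]
print(page2)  # [('row3',), ('row2',)]
```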
You can have window functions on the result of aggregate functions in a single query level.
This would all work nicely after a few modifications, except that it fails for the standard deviation on mathematical principle. The involved calculations are not linear, so you cannot simply combine standard deviations of sub-populations.
SELECT perm
      ,combo
      ,avg(value) AS perm_average_value
      ,sum(avg(value) * count(*)) OVER w_combo /
       sum(count(*)) OVER w_combo AS combo_average_value
      ,stddev_pop(value) AS perm_stddev
      ,0 AS combo_stddev  -- doesn't work!
      ,count(*) AS perm_count
      ,sum(count(*)) OVER w_combo AS combo_count
FROM   foo
GROUP  BY perm, combo
WINDOW w_combo AS (PARTITION BY combo);
For combo_average_value you would need this expression:

sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo

since you need a weighted average. (The average of a group with 10 members weighs more than the average of a group with just 2 members!)
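A quick sketch in plain Python (with made-up group averages and counts) shows why the naive average of group averages goes wrong:

```python
# Two groups within one combo: one with 10 members, one with 2.
groups = [
    {"avg": 4.0, "count": 10},  # avg(value), count(*) of one (perm, combo) group
    {"avg": 1.0, "count": 2},
]

# Naive average of the group averages ignores group size:
naive = sum(g["avg"] for g in groups) / len(groups)  # (4.0 + 1.0) / 2 = 2.5

# Weighted average, the Python analogue of
# sum(avg(value) * count(*)) OVER w_combo / sum(count(*)) OVER w_combo:
weighted = sum(g["avg"] * g["count"] for g in groups) / sum(
    g["count"] for g in groups
)  # (4.0*10 + 1.0*2) / 12 = 3.5

print(naive, weighted)
```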
This works:
SELECT DISTINCT ON (perm, combo)
perm
,combo
,avg(value) OVER wpc AS perm_average_value
,avg(value) OVER wc AS combo_average_value
,stddev_pop(value) OVER wpc AS perm_stddev
,stddev_pop(value) OVER wc AS combo_stddev
,count(*) OVER wpc AS perm_count
,count(*) OVER wc AS combo_count
FROM foo
WINDOW wc AS (PARTITION BY combo)
,wpc AS (PARTITION BY perm, combo);
I am using two different windows here and reduce the rows with DISTINCT ON, which is applied after window functions.
But I seriously doubt it will be faster than your original query. I am pretty sure it isn't.
Better performance with altered table layout
Arrays have an overhead of 24 bytes (slight variations depending on type). Also, you seem to have quite a few items per array and many repetitions. For a huge table like yours it would pay to normalize the schema. Example layout:
CREATE TABLE combo (
   combo_id serial PRIMARY KEY
  ,combo int[] NOT NULL
);

CREATE TABLE perm (
   perm_id serial PRIMARY KEY
  ,perm int[] NOT NULL
);

CREATE TABLE value (
   perm_id int REFERENCES perm(perm_id)
  ,combo_id int REFERENCES combo(combo_id)
  ,value numeric NOT NULL DEFAULT 0
);
If you don't need referential integrity you can omit the foreign key constraints.
The connection to combo_id could also be placed in the table perm, but in this scenario I would store it (slightly de-normalized) in value for better performance.
This would result in a row size of 32 bytes (tuple header + padding: 24 bytes, 2 x integer: 8 bytes, no padding), plus the unknown size of your numeric column. (If you don't need extreme precision, a double precision or even a real column might do, too.)
More on physical storage in this related answer on SO or here:
Configuring PostgreSQL for read performance
Anyway, that's only a fraction of what you have now and would make your query a lot faster by size alone. Grouping and sorting on simple integers is also a lot faster.
You would first aggregate in a subquery and then join to perm and combo for best performance.
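A minimal sketch of that pattern, using Python's sqlite3 as a stand-in for PostgreSQL (serial becomes INTEGER PRIMARY KEY in SQLite, and the sample arrays and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema mirrors the normalized layout above.
conn.executescript("""
CREATE TABLE combo (combo_id INTEGER PRIMARY KEY, combo TEXT NOT NULL);
CREATE TABLE perm  (perm_id  INTEGER PRIMARY KEY, perm  TEXT NOT NULL);
CREATE TABLE value (perm_id INT, combo_id INT, value REAL NOT NULL DEFAULT 0);
INSERT INTO combo VALUES (1, '{1,2}');
INSERT INTO perm  VALUES (1, '{1,2}'), (2, '{2,1}');
INSERT INTO value VALUES (1, 1, 1.0), (1, 1, 3.0), (2, 1, 5.0);
""")

# Group and sort on small integers in the subquery; join to the
# (wider) array columns only after aggregation.
rows = conn.execute("""
SELECT p.perm, c.combo, v.perm_avg, v.perm_count
FROM  (SELECT perm_id, combo_id
             ,avg(value) AS perm_avg
             ,count(*)   AS perm_count
       FROM   value
       GROUP  BY perm_id, combo_id) v
JOIN   perm  p ON p.perm_id  = v.perm_id
JOIN   combo c ON c.combo_id = v.combo_id
ORDER  BY p.perm_id
""").fetchall()

print(rows)  # [('{1,2}', '{1,2}', 2.0, 2), ('{2,1}', '{1,2}', 5.0, 1)]
```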
Best Answer
Your issue appears to be that you are applying the same WINDOW (named w) for both your COUNT(*) and your rank().

When you use a WINDOW that contains an ORDER BY clause and then apply certain aggregates such as SUM or COUNT, the aggregation runs cumulatively across the ordering, which is why your COUNT and rank() are identical.

If you modify your query to have multiple windows, so that you apply only the PARTITION BY to your COUNT(*) window and have both PARTITION BY and ORDER BY clauses for your rank(), I believe you'll get the results you want.

Refer to this SQL Fiddle as a reference, where I have a generic id field, a com_num field to represent the competition id, and a com_time field to represent a competitor's time.
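The two-window fix can be sketched with Python's sqlite3 (window functions require SQLite 3.25+; the table name results and the sample times are made up, while the field names follow the answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER PRIMARY KEY, com_num INT, com_time REAL)"
)
conn.executemany(
    "INSERT INTO results (com_num, com_time) VALUES (?, ?)",
    [(1, 10.5), (1, 11.0), (1, 12.2), (2, 9.8), (2, 10.1)],
)

rows = conn.execute("""
SELECT com_num
      ,com_time
      -- PARTITION BY only: total competitors per competition
      ,count(*) OVER (PARTITION BY com_num)                   AS competitors
      -- PARTITION BY + ORDER BY: rank within the competition
      ,rank()   OVER (PARTITION BY com_num ORDER BY com_time) AS position
FROM   results
ORDER  BY com_num, position
""").fetchall()

for r in rows:
    print(r)
```

Without the ORDER BY in its window, the count covers the whole partition instead of running cumulatively, so it no longer mirrors the rank.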