Okay, so here is the query modified to work the way you want:
DECLARE @players table
(
    PlayerID uniqueidentifier NOT NULL PRIMARY KEY,
    PlayerName nvarchar(64) NOT NULL
);

DECLARE @playerScores table
(
    ID bigint NOT NULL IDENTITY PRIMARY KEY,
    PlayerID uniqueidentifier NOT NULL,
    DateCreated datetime NOT NULL,
    Score int NOT NULL,
    TimeTaken bigint NOT NULL,
    PuzzleID int NOT NULL
);
DECLARE @puzzleId int = 0;
SELECT TOP 50
    a.PlayerID,
    p.PlayerName,
    a.Score,
    a.TimeTaken,
    a.PlayedDate
FROM
(
    SELECT
        ps.PlayerID,
        ps.Score,
        ps.TimeTaken,
        ps.DateCreated AS PlayedDate,
        ROW_NUMBER() OVER
        (
            PARTITION BY ps.PlayerID
            ORDER BY ps.Score DESC, ps.TimeTaken, ps.DateCreated
        ) AS RN
    FROM @playerScores ps
    WHERE ps.PuzzleID = @puzzleId
) a
INNER JOIN @players p ON p.PlayerID = a.PlayerID
WHERE a.RN = 1
ORDER BY
    a.Score DESC,
    a.TimeTaken,
    a.PlayedDate;
Having written this (note: indexes are not optimized), and looking at the other queries you're going to need to write, what I would actually recommend is to abandon this type of query entirely and create a denormalized high-score table (rows are unique on the combination of PlayerID, PuzzleID), on which to run aggregates instead.
The reason is that the GameResult table is going to grow huge in the database, so it will become less and less efficient to run aggregates on it directly as time passes, and the requirements are incompatible with doing something like creating an indexed view to summarize the information.
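For illustration, here is a minimal sketch of what that could look like; the HighScores name, the MERGE-based upsert, and the @-parameters are assumptions made for the example, not part of your schema:

-- Hypothetical denormalized high-score table: one row per (PlayerID, PuzzleID).
CREATE TABLE dbo.HighScores
(
    PlayerID uniqueidentifier NOT NULL,
    PuzzleID int NOT NULL,
    Score int NOT NULL,
    TimeTaken bigint NOT NULL,
    PlayedDate datetime NOT NULL,
    CONSTRAINT PK_HighScores PRIMARY KEY (PuzzleID, PlayerID)
);

-- Upsert run for each new game result; @playerId, @puzzleId, @score,
-- @timeTaken, @playedDate are assumed to carry the incoming result.
MERGE dbo.HighScores AS t
USING (SELECT @playerId, @puzzleId, @score, @timeTaken, @playedDate)
      AS s (PlayerID, PuzzleID, Score, TimeTaken, PlayedDate)
ON t.PlayerID = s.PlayerID AND t.PuzzleID = s.PuzzleID
WHEN MATCHED AND (s.Score > t.Score
               OR (s.Score = t.Score AND s.TimeTaken < t.TimeTaken)) THEN
    -- Keep only the player's best result per puzzle.
    UPDATE SET Score = s.Score, TimeTaken = s.TimeTaken, PlayedDate = s.PlayedDate
WHEN NOT MATCHED THEN
    INSERT (PlayerID, PuzzleID, Score, TimeTaken, PlayedDate)
    VALUES (s.PlayerID, s.PuzzleID, s.Score, s.TimeTaken, s.PlayedDate);

The leaderboard query above then collapses to a plain SELECT TOP 50 ... FROM dbo.HighScores WHERE PuzzleID = @puzzleId ORDER BY Score DESC, TimeTaken, PlayedDate, with no ROW_NUMBER() pass over the full history.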
Also, if you aren't doing this already, it's highly likely you'll want to use an asynchronous process to compute the "leaderboards" periodically and cache the results, instead of computing them just-in-time. (You could do something like merge the current player's score with the cached leaderboards, so the player can see themselves on the leaderboards immediately if they got a high score.) See my answer here for some ideas to consider when implementing a caching mechanism.
The query you have
You could simplify your query using a WINDOW clause, but that's just shortening the syntax, not changing the query plan.
SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*) OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month', trans_date)
             ORDER BY trans_date
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
- Also using the slightly faster count(*), since id is certainly defined NOT NULL? (A quick illustration follows this list.)
- And you don't need to ORDER BY entity_id since you already PARTITION BY entity_id.
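That count() difference, as a toy illustration (the values are made up):

SELECT count(*) AS all_rows    -- counts every row
     , count(x) AS non_null_x  -- skips NULLs; slightly more work per row
FROM  (VALUES (1), (NULL), (3)) AS t (x);
-- all_rows = 3, non_null_x = 2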
You can simplify further, though:
Don't add ORDER BY to the window definition at all; it's not relevant to your query. Then you don't need to define a custom window frame, either:
SELECT id, trans_ref_no, amount, trans_date, entity_id
     , SUM(amount) OVER w AS trans_total
     , COUNT(*) OVER w AS trans_count
FROM   transactiondb
WINDOW w AS (PARTITION BY entity_id, date_trunc('month', trans_date));
Simpler, faster, but still just a better version of what you have, with static months.
The query you might want
... is not clearly defined, so I'll build on these assumptions:
Count transactions and amount for every 30-day period within the first and last transaction of any entity_id. Exclude leading and trailing periods without activity, but include all possible 30-day periods within those outer bounds.
SELECT entity_id, trans_date
     , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
     , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
FROM  (
   SELECT entity_id
        , generate_series(min(trans_date)::timestamp
                        , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                        , interval '1 day')::date AS trans_date
   FROM   transactiondb
   GROUP  BY 1
   ) x
LEFT JOIN (
   SELECT entity_id, trans_date
        , sum(amount) AS daily_amount, count(*) AS daily_count
   FROM   transactiondb
   GROUP  BY 1, 2
   ) t USING (entity_id, trans_date)
WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
             ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING);
This lists all 30-day periods for each entity_id with your aggregates and with trans_date being the first day (incl.) of the period. To get values for each individual row, join to the base table once more, as sketched below.
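A sketch of that join-back, wrapping the query above in a CTE (same table and column names as before):

WITH periods AS (  -- the 30-day-period query from above, unchanged
   SELECT entity_id, trans_date
        , COALESCE(sum(daily_amount) OVER w, 0) AS trans_total
        , COALESCE(sum(daily_count)  OVER w, 0) AS trans_count
   FROM  (
      SELECT entity_id
           , generate_series(min(trans_date)::timestamp
                           , GREATEST(min(trans_date), max(trans_date) - 29)::timestamp
                           , interval '1 day')::date AS trans_date
      FROM   transactiondb
      GROUP  BY 1
      ) x
   LEFT JOIN (
      SELECT entity_id, trans_date
           , sum(amount) AS daily_amount, count(*) AS daily_count
      FROM   transactiondb
      GROUP  BY 1, 2
      ) t USING (entity_id, trans_date)
   WINDOW w AS (PARTITION BY entity_id ORDER BY trans_date
                ROWS BETWEEN CURRENT ROW AND 29 FOLLOWING)
   )
SELECT b.id, b.trans_ref_no, b.amount, b.trans_date, b.entity_id
     , p.trans_total, p.trans_count
FROM   transactiondb b
JOIN   periods p USING (entity_id, trans_date);

Note that transactions in the trailing 29 days have no 30-day period starting at them under the assumptions above, so they drop out of this inner join; switch to a LEFT JOIN to keep them.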
The basic difficulty is the same as discussed here:
The frame definition of a window cannot depend on values of the current row.
And rather call generate_series() with timestamp input, like the query above does.
The query you actually want
After question update and discussion:
Accumulate rows of the same entity_id in a 30-day window starting at each actual transaction.
Since your data is distributed sparsely, it should be more efficient to run a self-join with a range condition, all the more since Postgres 9.1 does not have LATERAL joins yet:
SELECT t0.id, t0.amount, t0.trans_date, t0.entity_id
     , sum(t1.amount) AS trans_total, count(*) AS trans_count
FROM   transactiondb t0
JOIN   transactiondb t1 USING (entity_id)
WHERE  t1.trans_date >= t0.trans_date
AND    t1.trans_date <  t0.trans_date + 30  -- exclude upper bound
-- AND t0.entity_id = 114284                -- or pick a single entity ...
GROUP  BY t0.id  -- id is the PK, so this covers all t0 columns
ORDER  BY t0.trans_date, t0.id;
SQL Fiddle.
A rolling window could only make sense (with respect to performance) with data for most days.
This does not aggregate duplicates on (trans_date, entity_id) per day, but all rows of the same day are always included in the 30-day window.
For a big table, a covering index like this could help quite a bit:
CREATE INDEX transactiondb_foo_idx
ON transactiondb (entity_id, trans_date, amount);
The last column amount is only useful if you get index-only scans out of it. Else drop it.
But it's not going to be used while you select the whole table anyway. It would support queries for a small subset.
Best Answer
ROLLUP was implemented in Postgres 9.5.
Original answer for Postgres 9.4:
The rollup() in group by rollup(i_category, i_class) causes aggregation in hierarchical steps. It's an extension of the ISO SQL standard in SQL Server that's not implemented in Postgres 9.4. For two items, you need three steps in standard SQL, as well as in Postgres:
- group by i_category, i_class
- group by i_category
- one aggregation over the whole set (the grand total)
- The grouping() function comes with this extension and indicates whether a given column is aggregated in each row. It's used here to sort the total and group sums first in the result.
- rank() is a standard window function that works the same in Postgres, but we have to rewrite it due to the above adjustments.
- top 100 translates to FETCH FIRST 100 ROWS ONLY in standard SQL, which is implemented in Postgres as well as the shorter LIMIT 100. (FETCH FIRST syntax works in SQL Server 2012 and later, too.) It's a bit odd to cut off after 100 rows in this aggregate query.
- I am using a CTE to reuse the intermediary results in multiple aggregations below. Except for the LIMIT, everything should basically work in SQL Server just as well.
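A minimal sketch of the three-step technique, using a hypothetical sales table with columns i_category, i_class, and price as placeholders:

-- Hypothetical table: sales (i_category, i_class, price).
WITH cte AS (          -- reused by all three grouping steps
   SELECT i_category, i_class, sum(price) AS total_sum
   FROM   sales
   GROUP  BY 1, 2      -- finest level: group by i_category, i_class
   )
SELECT i_category, i_class, total_sum
     , 0 AS lochierarchy            -- stands in for grouping(): detail row
FROM   cte
UNION ALL
SELECT i_category, NULL, sum(total_sum)
     , 1                            -- subtotal per i_category
FROM   cte
GROUP  BY 1
UNION ALL
SELECT NULL, NULL, sum(total_sum)
     , 2                            -- grand total over the whole set
FROM   cte
ORDER  BY lochierarchy DESC, i_category, total_sum DESC
LIMIT  100;                         -- stands in for TOP 100 / FETCH FIRST

On Postgres 9.5 or later, GROUP BY ROLLUP (i_category, i_class) produces the same three grouping sets in a single statement.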