SQL Server – Query to get queue position for each group

Tags: gaps-and-islands, group-by, sql-server, sql-server-2008-r2, t-sql

I have a queue table with the following sample data:

id  company location
1   acme    new york
2   acme    philadelphia
3   genco   st.louis
4   genco   san diego
5   genco   san francisco
6   acme    miami

I want to query for the queue position of each company group, to show where in line each company is. (Imagine acme has 1,000 rows, then Genco has 500, and then acme has 5,000 more that run after the Genco records are processed.) The result I want would look like the following:

queuePositionId company
1    acme
3    genco
6    acme

I played around with ranking and grouping, but because a company group can appear multiple times anywhere in the queue, my aggregates kept coming out wrong. I also tried DENSE_RANK but couldn't figure out the ordering. Any ideas?

Best Answer

This is a gaps-and-islands problem: each unbroken run of rows for the same company is an island, and the result you want is the first row of each island.

This should do what you need:

-- Generate test data
DECLARE @Companies TABLE
(
    ID INT
    ,Company NVARCHAR(100)
    ,Location NVARCHAR(100)
);

INSERT @Companies
SELECT *
FROM    (VALUES (1, 'acme', 'new york')
                ,(2, 'acme', 'philadelphia')
                ,(3, 'genco', 'st.louis')
                ,(4, 'genco', 'san diego')
                ,(5, 'genco', 'san francisco')
                ,(6, 'acme', 'miami')
        ) AS CompanyLocations(ID, Company, Location);

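-- Find company positions
-- Flag each row: 1 if the previous row (by ID) has the same company, else 0.
-- LAG returns NULL for the first row, so that row falls into the ELSE branch.
;WITH cte_Companies
AS
(
    SELECT ID
           ,Company
           ,CASE 
              WHEN LAG(Company) OVER(ORDER BY ID) = Company  
              THEN 1
              ELSE 0
            END AS CompanyPosition
    FROM @Companies
)

-- Rows flagged 0 are the first row of each consecutive company group
SELECT ID AS queuePositionId, Company
FROM cte_Companies
WHERE CompanyPosition = 0
ORDER BY ID;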
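To make the "compare each row with the previous one" idea concrete outside SQL, here is a minimal Python sketch of the same logic. The function name `island_starts` and the in-memory `queue` list are illustrative assumptions, not part of the original answer.

```python
def island_starts(rows):
    """Yield (id, company) for every row whose company differs from the previous row's."""
    previous = object()  # sentinel that never equals a real company, like LAG's NULL on the first row
    for row_id, company in sorted(rows):  # walk the queue in ID order
        if company != previous:
            yield row_id, company
        previous = company


# Sample data from the question (location omitted, it plays no role here)
queue = [(1, "acme"), (2, "acme"), (3, "genco"), (4, "genco"), (5, "genco"), (6, "acme")]
starts = list(island_starts(queue))
print(starts)  # [(1, 'acme'), (3, 'genco'), (6, 'acme')]
```

Like the LAG query, this depends only on the ordering of the IDs, not on the IDs being consecutive.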

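UPDATE: Andriy noted that my solution only works on SQL Server 2012 and later, because LAG was introduced in that version. The following query should work for versions down to 2005, although the multi-row VALUES constructor used to build the test data needs 2008 or later.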

-- Generate test data
DECLARE @Companies TABLE
(
    ID INT
    ,Company NVARCHAR(100)
    ,Location NVARCHAR(100)
);

INSERT @Companies
SELECT *
FROM    (VALUES (1, 'acme', 'new york')
                ,(2, 'acme', 'philadelphia')
                ,(3, 'genco', 'st.louis')
                ,(4, 'genco', 'san diego')
                ,(5, 'genco', 'san francisco')
                ,(6, 'acme', 'miami')
                -- Further test data
                ,(7, 'genco', 'London')
                ,(8, 'genco', 'Portsmouth')
        ) AS CompanyLocations(ID, Company, Location);

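-- Find company positions
-- Keep a row only when the row with the preceding ID is missing
-- or belongs to a different company.
-- Note: c1.ID - 1 = c2.ID assumes the IDs are contiguous;
-- a gap in the IDs is treated as the start of a new group.
SELECT ID, Company
FROM @Companies c1
WHERE NOT EXISTS    (
                        SELECT *
                        FROM @Companies c2
                        WHERE c1.Company = c2.Company
                        AND c1.ID - 1 = c2.ID
                    )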
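The `c1.ID - 1 = c2.ID` test relies on the IDs having no gaps. If rows can be deleted from the queue, one possible variant looks up the nearest lower ID instead. The sketch below is an assumption on my part, not part of the original answer: it runs against SQLite through Python's sqlite3 module as a stand-in for SQL Server, with a hypothetical `companies` table, and uses only a correlated subquery with MAX.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (id INTEGER PRIMARY KEY, company TEXT)")
# Hypothetical queue with gaps in the IDs (3, 5, 6, 9 and 10 are missing)
conn.executemany(
    "INSERT INTO companies (id, company) VALUES (?, ?)",
    [(1, "acme"), (2, "acme"), (4, "genco"), (7, "genco"), (8, "genco"), (11, "acme")],
)

# Original test: the previous row is assumed to be exactly ID - 1
CONTIGUOUS = """
SELECT c1.id, c1.company
FROM companies AS c1
WHERE NOT EXISTS (
    SELECT 1 FROM companies AS c2
    WHERE c2.company = c1.company
      AND c2.id = c1.id - 1
)
ORDER BY c1.id
"""

# Variant: the previous row is the one with the highest ID below the current ID
GAP_TOLERANT = """
SELECT c1.id, c1.company
FROM companies AS c1
WHERE NOT EXISTS (
    SELECT 1 FROM companies AS c2
    WHERE c2.company = c1.company
      AND c2.id = (SELECT MAX(c3.id) FROM companies AS c3 WHERE c3.id < c1.id)
)
ORDER BY c1.id
"""

contiguous_starts = conn.execute(CONTIGUOUS).fetchall()
gap_tolerant_starts = conn.execute(GAP_TOLERANT).fetchall()
print(contiguous_starts)    # [(1, 'acme'), (4, 'genco'), (7, 'genco'), (11, 'acme')]
print(gap_tolerant_starts)  # [(1, 'acme'), (4, 'genco'), (11, 'acme')]
```

With gaps in the IDs, the contiguous test reports ID 7 as a new group even though the row before it (ID 4) is also genco. The MAX lookup avoids that, at the cost of an extra correlated subquery per row.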

Related Solutions

T-SQL Purpose of MAX in this Group By Query

You say: "My best educated guess is that somehow max is being used to avoid multiple grouping columns"

That is correct.

and then: "... but how can this return the correct results?"

It returns correct results because Symbol is the primary key in both the Investments and the Price tables. Therefore, within each Symbol group, any aggregate function over a P.column or an I.column is aggregating identical values. And MAX(c) when c is 2, 2, 2 and 2 is of course 2.
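A small demonstration of that point, as a sketch only: the two tables below are simplified, hypothetical stand-ins for Investments and Holdings_Secure, and the query runs on SQLite through Python's sqlite3 module rather than on SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE investments (symbol TEXT PRIMARY KEY, name TEXT);  -- one row per symbol
CREATE TABLE holdings (symbol TEXT, quantity INTEGER);          -- many rows per symbol
INSERT INTO investments VALUES ('ABC', 'Abc Corp'), ('XYZ', 'Xyz Inc');
INSERT INTO holdings VALUES ('ABC', 10), ('ABC', 5), ('XYZ', 7);
""")

# Within each symbol group every joined row carries the same i.name,
# so MAX and MIN both return that single value.
rows = conn.execute("""
SELECT h.symbol,
       MAX(i.name)     AS name_via_max,
       MIN(i.name)     AS name_via_min,
       SUM(h.quantity) AS total_quantity
FROM holdings AS h
JOIN investments AS i ON i.symbol = h.symbol
GROUP BY h.symbol
ORDER BY h.symbol
""").fetchall()
print(rows)  # [('ABC', 'Abc Corp', 'Abc Corp', 15), ('XYZ', 'Xyz Inc', 'Xyz Inc', 7)]
```

MAX and MIN agree because the join repeats one investments row for every holdings row of the same symbol. Only the holdings columns vary inside a group.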

Could the query be written somehow else, possibly without all these aggregations? Yes, see a related question: Why do wildcards in GROUP BY statements not work?

Either the query needs a rather long GROUP BY clause, or the aggregations can be moved into a subquery that touches only the Holdings_Secure table (where Symbol is not the primary key), which is then joined to the other two tables:

SELECT
    I.Symbol Symbol
,   I.Ticker CUSIP
,   I.Name Name

,   H.TotalQuantity
,   H.TotalMarketValue
,   H.Price

,   I.CategoryCode5 BUY_SELL
,   I.EquivFactor1 PriceTgt
,   P.LastPrice CurrPrice
,   I.AssetClass Target
,   I.Industry Industry
,   I.CategoryCode1 Risk
FROM
    ( SELECT 
          SUM(Quantity) TotalQuantity
      ,   SUM(MarketValue) TotalMarketValue
      ,   MAX(PriceLC) Price
      ,   Symbol
      FROM
          HOLDINGS_SECURE
      WHERE
          Quantity > 0
      GROUP BY
          Symbol
    ) H
  JOIN
    INVESTMENTS I
      ON H.Symbol = I.Symbol
  JOIN
    PRICE P
      ON H.Symbol = P.Symbol
WHERE
        I.Product = 'stock'
    AND I.CategoryCode5 NOT IN ('X', '') ;

PostgreSQL: Generate a series of dates for each group in a table

1. `CROSS JOIN`, `LEFT JOIN LATERAL` to subquery

SELECT a.user_id, COALESCE(b.balance, 0) AS balance, d.as_of_date
FROM   (
   SELECT d::date AS as_of_date  -- cast to date right away
   FROM   generate_series(timestamp '2016-01-01', '2016-01-03', interval '1 day') d
   ) d
JOIN   accounts a ON a.create_date <= d.as_of_date
LEFT   JOIN LATERAL (
   SELECT balance
   FROM   balances
   WHERE  user_id = a.user_id
   AND    as_of_date <= d.as_of_date
   ORDER  BY as_of_date DESC
   LIMIT  1
   ) b ON true
ORDER  BY a.user_id, d.as_of_date;

Returns your desired result, except that as_of_date is an actual date, not a timestamp as in your example. That should be more appropriate.

Users that already exist but have no transactions yet are listed with a balance of 0. You did not define how to deal with this corner case.
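For readers who want to see what the LATERAL subquery computes for one user, here is a rough Python sketch of "latest balance on or before each date, else 0". The helper `balance_as_of` and the sample `history` are hypothetical and stand in for the balances table.

```python
from bisect import bisect_right
from datetime import date, timedelta


def balance_as_of(history, day, default=0):
    """Return the most recent balance dated on or before `day`.

    `history` is a list of (as_of_date, balance) tuples sorted by date.
    """
    dates = [d for d, _ in history]
    i = bisect_right(dates, day)  # number of entries with date <= day
    return history[i - 1][1] if i else default


# Hypothetical balances for one user: no entry on 2016-01-02
history = [(date(2016, 1, 1), 50), (date(2016, 1, 3), 20)]

# Same date range as generate_series('2016-01-01', '2016-01-03', '1 day')
days = [date(2016, 1, 1) + timedelta(days=n) for n in range(3)]
series = [(d.isoformat(), balance_as_of(history, d)) for d in days]
print(series)  # [('2016-01-01', 50), ('2016-01-02', 50), ('2016-01-03', 20)]
```

The binary search plays the role of `ORDER BY as_of_date DESC LIMIT 1`, and the `default` argument plays the role of `COALESCE(b.balance, 0)`.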

Rather use timestamp input for generate_series():

Generating time series between two dates in PostgreSQL

It's crucial for performance that you back this up with a multicolumn index:

CREATE INDEX balances_multi_idx ON balances (user_id, as_of_date DESC, balance);

We have had a very similar case on SO just this week:

Aggregating the most recent joined records per week

Find more explanation there.

2. `CROSS JOIN`, `LEFT JOIN`, window functions

SELECT user_id
     , COALESCE(max(balance) OVER (PARTITION BY user_id, grp
                                   ORDER BY as_of_date), 0) AS balance
     , as_of_date
FROM  (
   SELECT a.user_id, b.balance, d.as_of_date
        , count(b.user_id) OVER (PARTITION BY user_id ORDER BY as_of_date) AS grp
   FROM   (
      SELECT d::date AS as_of_date  -- cast to date right away
      FROM   generate_series(timestamp '2016-01-01', '2016-01-03', interval '1 day') d
      ) d
   JOIN   accounts a ON a.create_date <= d.as_of_date
   LEFT   JOIN balances b USING (user_id, as_of_date)
   ) sub
ORDER  BY user_id, as_of_date;

Same result. If you have the multicolumn index mentioned above and can get index-only scans out of it, the first solution is most probably faster.

The main feature is the running count of values to form groups. Since count() does not count NULL values, all dates without a balance fall into the same group (grp) as the most recent balance. Then a simple max() over the same window, with grp added to the partition, copies the last balance into the dangling gaps.
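The running-count trick can be sketched outside SQL as well. The following Python is only an illustration for a single user, with None standing in for NULL; the `forward_fill` name and the sample `daily` list are assumptions.

```python
from itertools import accumulate


def forward_fill(balances, default=0):
    """Carry the last known balance forward over None entries."""
    # Running count of non-None values: each known balance opens a new group,
    # and the None entries that follow it inherit the same group number.
    grp = list(accumulate(0 if b is None else 1 for b in balances))

    # Each group contains at most one non-None balance, so max() per group
    # (as in the SQL) simply picks that value.
    known = {}
    for g, b in zip(grp, balances):
        if b is not None:
            known[g] = b

    # Group 0 (dates before the first balance) has no value: fall back to the default
    return [known.get(g, default) for g in grp]


daily = [None, 100, None, None, 40, None]
filled = forward_fill(daily)
print(filled)  # [0, 100, 100, 100, 40, 40]
```

Here grp comes out as [0, 1, 1, 1, 2, 2], which is the same grouping that `count(b.user_id) OVER (...)` produces in the query.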
