SQL Server Query – Mixing Aggregates and Non-Aggregates

sql serversql-server-2012

Hope someone can clarify how to do this query from SQL Server 2012, or maybe what it is asking;
it seems I am being asked to do a query that combines aggregates with non-aggregates, leading to what seems like an impossible query, since one can only group by fields that appear in the Select clause, which messes up
the query.

I am asked to do a query on the table Invoices:

Invoices (InvoiceID, VendorID,InvoiceDate, InvoiceTotal,...)

I am asked to do a query that produces 6 columns:
3 columns already appear "as is" in the table: VendorID, InvoiceDate, InvoiceTotal

And 3 other columns are aggregates:

4) Sum(InvoiceTotal) AS VendorTotal: Sum of invoices for each VendorID

5) Count(InvoiceID) AS VendorCount: Count of Invoices for each VendorId

6) Avg(InvoiceTotal) AS VendorAvg: Average of Invoices per VendorId

Now, problem is that one cannot combine in the same query (with exceptions I cannot use here) aggregates and non-aggregates unless one groups by the non-
aggregates. So I can do:

Select 
    VendorId, InvoiceDate, InvoiceTotal, 
    Sum(InvoiceTotal) AS VendorTotal, 
    Count(InvoiceId) AS VendorCount, 
    Avg(InvoiceTotal) AS VendorAvg 
GROUP BY
    VendorID

Now, I have no trouble if I only group by VendorID. But in order to get a
valid query, if I select either InvoiceDate or InvoiceTotal — both non-aggregates — I must group by each of them in order to get a valid query.

But grouping by these last two basically undoes the other aggregates: If
I group by invoice date, I lose the aggregation by Vendor, since each vendor
has different invoice dates.

I have tried doing self joins, using Invoices AS I1 join Invoices AS I2
and doing aggregates in I1 and non-aggregates on I2, but this does not seem to work.

Am I missing something obvious here? Any Ideas?

Edit: The answer was found to be using OVER (PARTITION BY VendorId) , like so:

Select 
VendorId, InvoiceDate, InvoiceTotal, 
    Sum(InvoiceTotal) OVER (PARTITION BY VendorId) AS VendorTotal, 
    Count(InvoiceId) OVER (PARTITION BY VendorId) AS VendorCount, 
    Avg(InvoiceTotal) OVER (PARTITION BY VendorId) AS VendorAvg 
GROUP BY
    VendorID

Best Answer

The solution is what you already found, in DBMS like SQL Server 2005+ that have implemented window functions, we can use them to get aggregates over each partition while keeping the original table, so not collapsing the table as a group by does. I'm not writing the query here, please edit your answer.

In older versions (or other DBMS) that don't have the OVER () clause syntax available, we could use GROUP BY in a derived table (or a CTE) and then a "self" join back to the original table, like this:

SELECT 
    i.VendorID, i.InvoiceDate, i.InvoiceTotal, 
    grp.VendorTotal, 
    grp.VendorCount, 
    grp.VendorAvg 
FROM
      dbo.Invoices AS i             -- our original table
  JOIN
      ( SELECT 
            VendorID,
            Sum(InvoiceTotal) AS VendorTotal, 
            Count(InvoiceId) AS VendorCount, 
            Avg(InvoiceTotal) AS VendorAvg 
        FROM 
            dbo.Invoices  
        GROUP BY
            VendorID 
      ) AS grp                      -- the aggregated table 
  ON 
      grp.VendorID = i.VendorID ;

Related Solutions

SQL Server Running Total – How to Use Running Total Aggregates for Financial Accumulations

Interesting problem. Here is my attempt.

First we calculate a running total of the Allocation.InvoiceAmt over the various InvoiceId:

WITH inv AS
  ( SELECT 
      InvoiceId, Account, 
      InvAmt = InvoiceAmt,
      SumInvAmt = SUM(InvoiceAmt) OVER 
        (PARTITION BY InvoiceId 
         ORDER BY Account
         ROWS BETWEEN UNBOUNDED PRECEDING
                  AND CURRENT ROW)
    FROM Allocation 
  )

and the same for Payment.Amount:

, pay AS
  ( SELECT 
      PaymentId, InvoiceId, PaymentType, CheckNumber, 
      PayAmt = Amount,
      SumPayAmt = SUM(Amount) OVER 
        (PARTITION BY InvoiceId 
         ORDER BY PaymentId
         ROWS BETWEEN UNBOUNDED PRECEDING
                  AND CURRENT ROW)
    FROM Payment 
  )

Then we combine the previous two CTEs:

SELECT 
    inv.InvoiceId,
    pay.PaymentId,
    inv.Account,
    PaymentAllocated = 
      CASE WHEN SumPayAmt <= SumInvAmt - InvAmt
             OR SumInvAmt <= SumPayAmt - PayAmt
      THEN 0
      ELSE
          CASE WHEN SumPayAmt <= SumInvAmt THEN SumPayAmt      -- these would be
               ELSE SumInvAmt END                              -- simpler if there
        - CASE WHEN SumPayAmt-PayAmt <= SumInvAmt-InvAmt       -- was a LEAST() and 
               THEN SumInvAmt-InvAmt                           -- a GREATEST() function
               ELSE SumPayAmt-PayAmt END
      END
FROM inv JOIN pay
  ON inv.InvoiceId = pay.InvoiceId
ORDER BY 
    inv.InvoiceId,
    pay.PaymentId,
    inv.Account ;

Test at SQL-Fiddle

SQL Server Query Performance – Optimizing Group By with Many Columns

The non-clustered index you have tested is not the best for this query. It can be used for the WHERE clause and for doing an index scan instead of a full table scan but it cannot be used for the GROUP BY.

The best possible index would have to be a partial index (to filter the unwanted rows from the WHERE clause), then have all the columns used in the GROUP BY and then INCLUDE all the other columns used in the SELECT:

CREATE INDEX special_ix 
  ON dbo.Commissions_Output
    ( company, location, account, 
      salesroute, employee, producttype, 
      item, loadjdate, commissionrate ) 
INCLUDE 
  ( [Extended Sales Price], [Delivered Qty] ) 
WHERE 
  ( [Extended Sales Price] <> 0 ) ;

Best Answer

Related Solutions

SQL Server Running Total – How to Use Running Total Aggregates for Financial Accumulations

SQL Server Query Performance – Optimizing Group By with Many Columns

Related Question