Sql-server – Which is faster, SUM(CASE) or CTE PIVOT

performancepivotquery-performancesql server

There are two types of ways to perform a PIVOT. Before SQL Server 2005, when PIVOT was introduced, most people did this:

SELECT RateID
              SUM(CASE WHEN RateItemTypeID = 1 THEN UnitPrice ELSE 0 END),
              SUM(CASE WHEN RateItemTypeID = 2 THEN UnitPrice ELSE 0 END),
              SUM(CASE WHEN RateItemTypeID = 3 THEN UnitPrice ELSE 0 END)
              FROM rate_item WHERE _WhereClause_
              GROUP BY RateID

Later, when 2005 introduced PIVOT it became this:

   SELECT RateID, [1], [2], [3]
          FROM PertinentRates -- PertinentRates is a CTE with WHERE clause applied
          PIVOT (SUM(UnitPrice) FOR RateItemTypeID IN ([1], [2], [3])) PVT)

Across SQL Server 2005, 2008 R2, 2012 and 2014 (the versions of SQL Server I've worked with that implement PIVOT), in my experience, it has always been faster than SUM(CASE) or in a few cases equally fast. Are there examples where PIVOT is slower?

I can't give the DDL because it is an example from my work. But the table is pretty simple. In the PIVOT example it is drawing from a CTE whereas the SUM(CASE) is drawing directly from the table. But the SUM(CASE) performs the same drawing from the CTE.

In my work example, the PIVOT comes back in 10 seconds while the SUM(CASE) comes back in 14. Clearly it must be doing something different under the covers. The plans are the same, 50% of total each. PIVOT converted to SUM(CASE) in the query analyzer. Yet SUM(CASE) never comes back in less than 13 seconds, and PIVOT never comes back in over 11 seconds.

I've tried running them back and forth, it doesn't matter the order they are run in. If I run them both from a cold cache they both take longer, but PIVOT is still faster, 12 vs 17 seconds. Can't reproduce on a second server, but that one is considerably better; it's 5 seconds each there with minor variations. PIVOT is a little better, but percentage-wise it doesn't have the same edge as on the first server.

The IO stats, like the query plan, is identical between the two. That is odd, I had kind of expected to see different IO stats, even though I've never looked at them for this particular example.

Best Answer

Are there examples where PIVOT is slower?

This is unlikely in simple cases. As Itzik Ben-Gan notes in his SQL Server Pro article, Pivoting Data when looking at the plan for a PIVOT query (emphasis added):

Figure 3 shows the plan for the PIVOT query. As you can see, this plan is very similar to that of the standard solution—so much so that if you look at the properties of the Aggregate operator, under Defined Values, you’ll find that SQL Server constructed CASE expressions behind the scenes:

…
[Expr1022] = Scalar Operator(SUM(CASE WHEN [InsideTSQL2008].[Sales].[Orders].[shipcity]=N'Barcelona' THEN [InsideTSQL2008].[Sales].[Orders].[freight] ELSE NULL END)) …

With this in mind, you shouldn’t expect the solution that’s based on the PIVOT operator to perform better than the standard solution. The main benefit in the PIVOT operator at the moment is that it’s less verbose.

For more advanced pivoting requirements that the (non-standard) PIVOT syntax does not directly support, workarounds are needed. These may or may not lead to worse performance compared with CASE, depending on various factors including the skill level of the implementor.

Examples of these problematic cases are covered in Itzik's article, and also well explained in Robert Sheldon's Simple Talk article, Questions About Pivoting Data in SQL Server You Were Too Shy to Ask.

My experience has been that PIVOT and Agg(CASE... generate extremely similar plans with extremely close performance characteristics when both are written optimally. My usual advice is to write queries using whatever syntax feels most natural to you, and to only try rewrites if performance is not acceptable.

Internals

The SQL Server query processor does have a built-in Pivot logical operator (LogOp_Pivot), so it is maybe not quite correct to say that SQL Server rewrites pivots to aggregates and case expressions, at least if we are talking about parsing and compilation activities that take place prior to cost-based optimization (trivial plans are not available for pivot queries).

On the other hand, it is true that the only way the optimizer can implement a query tree containing LogOp_Pivot is via the exploration rule ExpandPivot. This rule expands LogOp_Pivot into a normal grouped aggregate (LogOp_GbAgg) with associated scalar expressions. When this rule is disabled, pivot queries fail to compile.

In practice then, we can say that pivots are always (eventually) 'rewritten' as aggregates and scalar expressions before an executable plan can be produced.

Anyway, the result of the rewrite to LogOp_GbAgg is converted to the physical operators needed for an executable plan by the regular group-by aggregate implementation rules GbAggToHS (hash) or GbAggToStrm (stream).

As a side note, the reason 'manual pivots' (aggregates on case expressions) have an extra Compute Scalar below the aggregate is that the case expressions are pushed toward the leaf level of the query tree during Project Normalization (an early stage of compilation, before cost-based optimization).

Queries that use the PIVOT syntax do not have this because the expressions are not created until ExpandPivot runs during cost-based optimization. At the (earlier) time Project Normalization runs, the query tree still has LogOp_Pivot elements, so there are no projections to push down, and the case expressions typically end up inside the hash or stream aggregate.

There is typically no advantage in avoiding the Compute Scalar, since from SQL Server 2005 onward, expression evaluation is normally deferred until the result is required by a later operator. In this case, evaluation of the case expressions is deferred until the aggregate (hash or stream) requires it.

Related Solutions

SQL Server CTE Benefits – Advantages of Using Common Table Expressions

As a rule, a CTE will NEVER improve performance.

A CTE is essentially a disposable view. There are no additional statistics stored, no indexes, etc. It functions as a shorthand for a subquery.

In my opinion they can be EASILY overused (I see a lot of overuse in code in my job). Some good answers are here, but if you need to refer to something more than once, or it's more than a few hundred thousand rows, put it into a #temp table instead and index it.

Sql-server – Optimizing a CTE hierarchy

Edit: this is second attempt

Based on @Hannah Vernon's answer, here is a way to bypass the use of CTE inside an inline subquery, which is like self-joining the CTE and I presume is the reason for the poor efficiency. It uses analytic functions available only in the 2012 version of SQL-Server. Tested at SQL-Fiddle

This part can be skipped from reading, it's a copy-paste from Hannah's answer:

;With AccountHierarchy AS
(                                                                           
    SELECT
        Root.AcctId                                         AccountId
        , Root.Name                                         AccountName
        , Root.ParentId                                     ParentId
        , 1                                                 HierarchyLevel  
        , cast(Root.Acctid as varchar(4000))                IdHierarchyMatch        
        , cast(Root.Acctid as varchar(4000))                IdHierarchy
        , cast(replace(Root.Name,'.','') as varchar(4000))  NameHierarchy   
        , cast(Root.Acctid as varchar(4000))                HierarchySort
        , cast(Root.Name as varchar(4000))                  HierarchyLabel          ,
        Root.CustomerCount                                  CustomerCount   

    FROM 
        account Root

    WHERE
        Root.ParentID is null

    UNION ALL

    SELECT
        Recurse.Acctid                                      AccountId
        , Recurse.Name                                      AccountName
        , Recurse.ParentId                                  ParentId
        , Root.HierarchyLevel + 1                           HierarchyLevel
        , CAST(CAST(Root.IdHierarchyMatch as varchar(40)) + '.' 
            + cast(recurse.Acctid as varchar(40))   as varchar(4000))   IdHierarchyMatch
        , cast(cast(recurse.Acctid as varchar(40)) + '.' 
            + Root.IdHierarchy  as varchar(4000))           IdHierarchy
        , cast(replace(recurse.Name,'.','') + '.' 
            + Root.NameHierarchy as varchar(4000))          NameHierarchy
        , cast(Root.AccountName + '.' 
            + Recurse.Name as varchar(4000))                HierarchySort   
        , cast(space(root.HierarchyLevel * 4) 
            + Recurse.Name as varchar(4000))                HierarchyLabel
        , Recurse.CustomerCount                             CustomerCount
    FROM
        account Recurse INNER JOIN
        AccountHierarchy Root on Root.AccountId = Recurse.ParentId
)

Here we order the rows of the CTE using the IdHierarchyMatch and we calculate row numbers and a running total (from the next row up to the end.)

, cte1 AS 
(
SELECT
    h.AccountId
    , h.AccountName
    , h.ParentId
    , h.HierarchyLevel
    , h.IdHierarchy
    , h.NameHierarchy
    , h.HierarchyLabel
    , parsename(h.IdHierarchy,1) Acct1Id
    , parsename(h.NameHierarchy,1) Acct1Name
    , parsename(h.IdHierarchy,2) Acct2Id
    , parsename(h.NameHierarchy,2) Acct2Name
    , parsename(h.IdHierarchy,3) Acct3Id
    , parsename(h.NameHierarchy,3) Acct3Name
    , parsename(h.IdHierarchy,4) Acct4Id
    , parsename(h.NameHierarchy,4) Acct4Name
    , h.CustomerCount
    , h.HierarchySort
    , h.IdHierarchyMatch
        , Rn = ROW_NUMBER() OVER 
                  (ORDER BY h.IdHierarchyMatch)
        , RunningCustomerCount = COALESCE(
            SUM(h.CustomerCount)
            OVER
              (ORDER BY h.IdHierarchyMatch
               ROWS BETWEEN 1 FOLLOWING
                        AND UNBOUNDED FOLLOWING)
          , 0) 
FROM
    AccountHierarchy AS h  
)

Then we have one more intermediate CTE where we use the previous running totals and row numbers - basically to find where the end points for the branches of the tree structure:

, cte2 AS
(
SELECT
    cte1.*
    , rn3  = LAST_VALUE(Rn) OVER 
               (PARTITION BY Acct1Id, Acct2Id, Acct3Id 
                ORDER BY Acct4Id
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)       
    , rn2  = LAST_VALUE(Rn) OVER 
               (PARTITION BY Acct1Id, Acct2Id 
                ORDER BY Acct3Id, Acct4Id
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) 
    , rn1  = LAST_VALUE(Rn) OVER 
               (PARTITION BY Acct1Id 
                ORDER BY Acct2Id, Acct3Id, Acct4Id
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) 
    , rcc3 = LAST_VALUE(RunningCustomerCount) OVER 
               (PARTITION BY Acct1Id, Acct2Id, Acct3Id 
                ORDER BY Acct4Id
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)       
    , rcc2 = LAST_VALUE(RunningCustomerCount) OVER 
               (PARTITION BY Acct1Id, Acct2Id 
                ORDER BY Acct3Id, Acct4Id
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) 
    , rcc1 = LAST_VALUE(RunningCustomerCount) OVER 
               (PARTITION BY Acct1Id 
                ORDER BY Acct2Id, Acct3Id, Acct4Id
                ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) 
FROM
    cte1 
)

and finally we build the last part:

SELECT
    hier.AccountId
    , hier.AccountName
    ---                        -- columns skipped 
    , hier.CustomerCount

    , TotalCustomerCount = hier.CustomerCount
        + hier.RunningCustomerCount 
        - ca.LastRunningCustomerCount

    , hier.HierarchySort
    , hier.IdHierarchyMatch
FROM
    cte2 hier
  OUTER APPLY
    ( SELECT  LastRunningCustomerCount, Rn
      FROM
      ( SELECT LastRunningCustomerCount
              = RunningCustomerCount, Rn
        FROM (SELECT NULL a) x  WHERE 4 <= HierarchyLevel 
      UNION ALL
        SELECT rcc3, Rn3
        FROM (SELECT NULL a) x  WHERE 3 <= HierarchyLevel 
      UNION ALL
        SELECT rcc2, Rn2 
        FROM (SELECT NULL a) x  WHERE 2 <= HierarchyLevel 
      UNION ALL
        SELECT rcc1, Rn1
        FROM (SELECT NULL a) x  WHERE 1 <= HierarchyLevel 
      ) x
      ORDER BY Rn 
      OFFSET 0 ROWS
      FETCH NEXT 1 ROWS ONLY
      ) ca
ORDER BY
    hier.HierarchySort ;

And a simplification, using the same cte1 as the code above. Test at SQL-Fiddle-2. Please note that both solutions work under the assumption that you have a maximum of four levels in your tree:

SELECT
    hier.AccountId
    ---                      -- skipping rows
    , hier.CustomerCount

    , TotalCustomerCount = CustomerCount
        + RunningCustomerCount 
        - CASE HierarchyLevel
            WHEN 4 THEN RunningCustomerCount
            WHEN 3 THEN LAST_VALUE(RunningCustomerCount) OVER 
                   (PARTITION BY Acct1Id, Acct2Id, Acct3Id 
                    ORDER BY Acct4Id
                    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)       
            WHEN 2 THEN LAST_VALUE(RunningCustomerCount) OVER 
                   (PARTITION BY Acct1Id, Acct2Id 
                    ORDER BY Acct3Id, Acct4Id
                    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) 
            WHEN 1 THEN LAST_VALUE(RunningCustomerCount) OVER 
                   (PARTITION BY Acct1Id 
                    ORDER BY Acct2Id, Acct3Id, Acct4Id
                    ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) 
          END

    , hier.HierarchySort
    , hier.IdHierarchyMatch
FROM cte1 AS hier
ORDER BY
    hier.HierarchySort ;

A third approach, with only one CTE, for the recursive part and then only window aggregate functions (SUM() OVER (...)), so it should work in any version from 2005 upwards. Test at SQL-Fiddle-3 This solution assumes, like the previous ones, that there are 4 levels maximum in the hierarchy tree:

;WITH AccountHierarchy AS
(                                                                           
    SELECT
          AccountId      = Root.AcctId                                         
        , AccountName    = Root.Name                                         
        , ParentId       = Root.ParentId                                     
        , HierarchyLevel = 1                                                   
        , HierarchySort  = CAST(Root.Acctid AS VARCHAR(4000))                
        , HierarchyLabel = CAST(Root.Name AS VARCHAR(4000))                   
        , Acct1Id        = CAST(Root.Acctid AS VARCHAR(4000))                
        , Acct2Id        = CAST(NULL AS VARCHAR(4000))                       
        , Acct3Id        = CAST(NULL AS VARCHAR(4000))                       
        , Acct4Id        = CAST(NULL AS VARCHAR(4000))                       
        , Acct1Name      = CAST(Root.Name AS VARCHAR(4000))                  
        , Acct2Name      = CAST(NULL AS VARCHAR(4000))                       
        , Acct3Name      = CAST(NULL AS VARCHAR(4000))                       
        , Acct4Name      = CAST(NULL AS VARCHAR(4000))                       
        , CustomerCount  = Root.CustomerCount                                   

    FROM 
        account AS Root

    WHERE
        Root.ParentID IS NULL

    UNION ALL

    SELECT
          Recurse.Acctid 
        , Recurse.Name 
        , Recurse.ParentId 
        , Root.HierarchyLevel + 1 
        , CAST(Root.AccountName + '.' 
            + Recurse.Name AS VARCHAR(4000)) 
        , CAST(SPACE(Root.HierarchyLevel * 4) 
            + Recurse.Name AS VARCHAR(4000)) 
        , Root.Acct1Id 
        , CASE WHEN Root.HierarchyLevel = 1 
              THEN cast(Recurse.Acctid AS VARCHAR(4000)) 
              ELSE Root.Acct2Id 
          END 
        , CASE WHEN Root.HierarchyLevel = 2 
              THEN CAST(Recurse.Acctid AS VARCHAR(4000)) 
              ELSE Root.Acct3Id 
          END 
        , CASE WHEN Root.HierarchyLevel = 3 
              THEN CAST(Recurse.Acctid AS VARCHAR(4000)) 
              ELSE Root.Acct4Id 
          END 
 
        , cast(Root.AccountName as varchar(4000))          
        , CASE WHEN Root.HierarchyLevel = 1 
              THEN CAST(Recurse.Name AS VARCHAR(4000)) 
              ELSE Root.Acct2Name 
          END 
        , CASE WHEN Root.HierarchyLevel = 2 
              THEN CAST(Recurse.Name AS VARCHAR(4000)) 
              ELSE Root.Acct3Name 
          END 
        , CASE WHEN Root.HierarchyLevel = 3 
              THEN CAST(Recurse.Name AS VARCHAR(4000)) 
              ELSE Root.Acct4Name 
          END 
        , Recurse.CustomerCount 
    FROM 
        account AS Recurse INNER JOIN 
        AccountHierarchy AS Root ON Root.AccountId = Recurse.ParentId
)

SELECT
      h.AccountId
    , h.AccountName
    , h.ParentId
    , h.HierarchyLevel
    , IdHierarchy = 
          CAST(COALESCE(h.Acct4Id+'.','') 
               + COALESCE(h.Acct3Id+'.','') 
               + COALESCE(h.Acct2Id+'.','') 
               + h.Acct1Id AS VARCHAR(4000))
    , NameHierarchy = 
          CAST(COALESCE(h.Acct4Name+'.','') 
               + COALESCE(h.Acct3Name+'.','') 
               + COALESCE(h.Acct2Name+'.','') 
               + h.Acct1Name AS VARCHAR(4000))   
    , h.HierarchyLabel
    , h.Acct1Id
    , h.Acct1Name
    , h.Acct2Id
    , h.Acct2Name
    , h.Acct3Id
    , h.Acct3Name
    , h.Acct4Id
    , h.Acct4Name
    , h.CustomerCount
    , TotalCustomerCount =  
          CASE h.HierarchyLevel
            WHEN 4 THEN h.CustomerCount
            WHEN 3 THEN SUM(h.CustomerCount) OVER 
                   (PARTITION BY h.Acct1Id, h.Acct2Id, h.Acct3Id)       
            WHEN 2 THEN SUM(h.CustomerCount) OVER 
                   (PARTITION BY Acct1Id, h.Acct2Id) 
            WHEN 1 THEN SUM(h.CustomerCount) OVER 
                   (PARTITION BY h.Acct1Id) 
          END
    , h.HierarchySort
    , IdHierarchyMatch = 
          CAST(h.Acct1Id 
               + COALESCE('.'+h.Acct2Id,'') 
               + COALESCE('.'+h.Acct3Id,'') 
               + COALESCE('.'+h.Acct4Id,'') AS VARCHAR(4000))   
FROM
    AccountHierarchy AS h  
ORDER BY
    h.HierarchySort ;

A 4th approach, that calculates as an intermediate CTE, the closure table of the hierarchy. Test at SQL-Fiddle-4. The benefit is that for the sums calculations, there is no resctriction on the number of levels.

;WITH AccountHierarchy AS
( 
    -- skipping several line, identical to the 3rd approach above
)

, ClosureTable AS
( 
    SELECT
          AccountId      = Root.AcctId  
        , AncestorId     = Root.AcctId  
        , CustomerCount  = Root.CustomerCount 
    FROM 
        account AS Root

    UNION ALL

    SELECT
          Recurse.Acctid 
        , Root.AncestorId 
        , Recurse.CustomerCount
    FROM 
        account AS Recurse INNER JOIN 
        ClosureTable AS Root ON Root.AccountId = Recurse.ParentId
)

, ClosureGroup AS
(                                                                           
    SELECT
          AccountId           = AncestorId  
        , TotalCustomerCount  = SUM(CustomerCount)                             
    FROM 
        ClosureTable AS a
    GROUP BY
        AncestorId
)

SELECT
      h.AccountId
    , h.AccountName
    , h.ParentId
    , h.HierarchyLevel 
    , h.HierarchyLabel
    , h.CustomerCount
    , cg.TotalCustomerCount 

    , h.HierarchySort
FROM
    AccountHierarchy AS h  
  JOIN
    ClosureGroup AS cg
      ON cg.AccountId = h.AccountId
ORDER BY
    h.HierarchySort ;

Best Answer

Internals

Related Solutions

SQL Server CTE Benefits – Advantages of Using Common Table Expressions

Sql-server – Optimizing a CTE hierarchy

Related Question