The relatively low row-mode performance of the LEAD and LAG window functions compared with self joins is nothing new. For example, Michael Zilberstein wrote about it on SQLblog.com back in 2012. There is quite a bit of overhead in the (repeated) Segment, Sequence Project, Window Spool, and Stream Aggregate plan operators.
In SQL Server 2016, you have a new option, which is to enable batch mode processing for the window aggregates. This requires some sort of columnstore index on the table, even if it is empty. The presence of a columnstore index is currently required for the optimizer to consider batch mode plans. In particular, it enables the much more efficient Window Aggregate batch-mode operator.
To test this in your case, create an empty nonclustered columnstore index:
-- Empty CS index: the contradictory filter guarantees it never contains any rows
CREATE NONCLUSTERED COLUMNSTORE INDEX dummy
ON dbo.tab1 (id, [time], [value])
WHERE id < 0 AND id > 0;
The query:
SELECT
    T1.id,
    T1.[time],
    T1.[value],
    value_lag =
        LAG(T1.[value]) OVER (
            PARTITION BY T1.id
            ORDER BY T1.[time]),
    value_lead =
        LEAD(T1.[value]) OVER (
            PARTITION BY T1.id
            ORDER BY T1.[time])
FROM dbo.tab1 AS T1;
This should now give an execution plan featuring the batch-mode Window Aggregate operator, which may well execute much faster.
You may need to use an OPTION (MAXDOP 1) or other hint to get the same plan shape when storing the results in a new table. The parallel version of the plan requires a batch mode sort (or possibly two), which may well be a little slower. It rather depends on your hardware.
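For example, a minimal sketch of the hinted statement (dbo.tab1 is from above; the target table name is assumed):

SELECT
    T1.id,
    T1.[time],
    T1.[value],
    value_lag =
        LAG(T1.[value]) OVER (
            PARTITION BY T1.id
            ORDER BY T1.[time]),
    value_lead =
        LEAD(T1.[value]) OVER (
            PARTITION BY T1.id
            ORDER BY T1.[time])
INTO dbo.tab1_results -- assumed target table name
FROM dbo.tab1 AS T1
OPTION (MAXDOP 1); -- force a serial plan to avoid the batch mode sort(s)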
For more on the batch-mode Window Aggregate operator, see the series of articles by Itzik Ben-Gan.
The first single update reads and writes every row from the table. The second and third then re-read and re-write a subset of those rows. Look at the Actual Number of Rows. When the three statements are combined into one, the optimizer figures that if it has to read everything to satisfy the first change, it can piggy-back off that for the second and third changes.
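As a purely hypothetical illustration (the actual statements aren't shown here, so the table and column names below are invented), the pattern being compared is something like:

-- Three separate passes over the table (hypothetical schema)
UPDATE #ResultSet SET total = qty * price;
UPDATE #ResultSet SET flag = 1 WHERE total > 100;
UPDATE #ResultSet SET bucket = 'HIGH' WHERE flag = 1;

-- One pass: the optimizer folds all three changes into a single scan,
-- nesting the later expressions over the earlier ones
UPDATE #ResultSet
SET total  = qty * price,
    flag   = CASE WHEN qty * price > 100 THEN 1 ELSE flag END,
    bucket = CASE WHEN qty * price > 100 THEN 'HIGH' ELSE bucket END;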
Have a look at the XML version of the query plans, specifically the <ComputeScalar> operators and the <ScalarOperator ScalarString=""> parts. In the original plan you'll see each is relatively simple and maps very closely to the SQL. For the all-in-one plan it's a monster. This is the optimizer re-writing the SQL into a logically equivalent form. A plan works¹ by passing each row through the operators once. The optimizer is doing all the work needed to satisfy all three changes as each row passes through one time.
I'd expect the second query to be faster because the data is only read and written once whereas it is touched three times in the first.
As the second query has no predicates (no WHERE clause), the optimizer has no choice but to read every single row and process it. I'm surprised the second form takes longer than the first. Are both starting from clean buffers? Is there other work happening on the system? Since it's reading and writing to a temp table, the IO is happening in tempdb. Is there file growth or some such happening?
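If you want to time from cold buffers on a test box, a common approach (sysadmin only, and never on a production server) is:

-- Flush dirty pages to disk, then drop clean pages from the buffer pool
CHECKPOINT;
DBCC DROPCLEANBUFFERS;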
By one measure you have achieved your desired outcome. You say you want to make changes "so that the IO can be reduced." The all-in-one does less IO than the three separate statements do in total. I suspect what you really want, however, is reduced elapsed time, and this is obviously not happening.
¹ More or less; lots of details omitted.
I ran your routine to generate test data, then ran the three single-update statements and the all-in-one statement. Although there are some differences (no clustered index, no parallelism) I get more-or-less the same results. Specifically, the plans are about the same shape, the three individual queries complete in about two seconds, and the one big query in about thirty to thirty-five seconds.
I set:
set nocount off;
set statistics io on;
set statistics time on;
With the plan in cache and the data in memory I get:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
Table '#ResultSet...'. Scan count 1, logical reads 125223, physical reads 0
SQL Server Execution Times:
CPU time = 1422 ms, elapsed time = 1417 ms.
(242906 row(s) affected)
Table '#ResultSet...'. Scan count 1, logical reads 125223, physical reads 0
SQL Server Execution Times:
CPU time = 344 ms, elapsed time = 337 ms.
(0 row(s) affected)
Table '#ResultSet...'. Scan count 1, logical reads 125223, physical reads 0
SQL Server Execution Times:
CPU time = 734 ms, elapsed time = 747 ms.
(0 row(s) affected)
I've removed some bits that aren't relevant. Since physical reads is zero for all three, the table fits in memory. logical reads is the same for all three, which makes sense: as there are no indexes, the only approach is to scan every row of the table. The second and third queries affect zero rows because I'd already run them a few times. CPU time and elapsed time both work out to about 2,500 ms in total (1422 + 344 + 734 ms CPU).
For the bigger query it is:
SQL Server parse and compile time:
CPU time = 0 ms, elapsed time = 0 ms.
Table '#ResultSet...'. Scan count 1, logical reads 125223
SQL Server Execution Times:
CPU time = 33093 ms, elapsed time = 33137 ms.
(242906 row(s) affected)
The same number of pages are read and the same number of rows are updated. The huge difference is the CPU time. This is reflected in casual observation of Task Manager, which shows 30% utilisation for the duration of query execution. The question is: why does it take so much CPU?
Taken individually, the queries have simple calculations, and two of the three have predicates that greatly reduce the number of rows touched. The optimizer has good heuristics for processing these and finds a quick plan. The all-in-one query applies the monster Compute Scalar against every single row. My suggestion is that, for whatever reason, the optimizer cannot unravel the logic into a plan that's quick to run, and ends up using a lot of CPU. The optimizer has to work with what it's given, which in the second case is complex, nested SQL. Perhaps by refactoring the SQL the optimizer will follow different heuristics and achieve a better outcome. Perhaps some (filtered) indexes or (filtered) statistics will convince it to write a different plan. Maybe persisted computed columns would help it along. Or perhaps your first attempt really is the best that can be achieved, and you need to find a way to run those three statements in parallel. Sorry I can't be more scientific.
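As a hypothetical illustration only (the real table and columns aren't shown, so these names are invented, reusing the schema sketched earlier), the kinds of helpers mentioned above look like:

-- Filtered index covering only the rows one of the updates touches
CREATE NONCLUSTERED INDEX IX_ResultSet_flag
ON #ResultSet (flag)
WHERE flag = 1;

-- Persisted computed column, so the expression is stored rather than recomputed
ALTER TABLE #ResultSet
ADD total_p AS (qty * price) PERSISTED;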
Best Answer
First query
The optimizer knows:
gl.SnapshotKey = fpc.SnapshotKey; and
fpc.SnapshotKey = 201703
so it can infer:
gl.SnapshotKey = 201703
Just as if you had written:
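(The original query text is not shown here, so the following is a hypothetical sketch with assumed table names, making the inferred predicate explicit.)

SELECT gl.SnapshotKey, fpc.SnapshotKey
FROM dbo.GeneralLedger AS gl          -- assumed table name
JOIN dbo.FiscalPeriodCalendar AS fpc  -- assumed table name
    ON gl.SnapshotKey = fpc.SnapshotKey
WHERE fpc.SnapshotKey = 201703
  AND gl.SnapshotKey = 201703;        -- the predicate inferred by the optimizer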
The literal value 201703 can also be used by the optimizer to determine the partition id. With both SnapshotKey predicates (one given, one inferred), the optimizer knows the partition id for both tables.

Going further, with a literal value (201703) for SnapshotKey now available on both tables, the join predicate:

gl.SnapshotKey = fpc.SnapshotKey

simplifies to:

201703 = 201703

or simply true.
Meaning there is no join predicate at all. The result is a logical cross join. Expressing the final execution plan using the closest available T-SQL syntax, it is as if you wrote:
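(Again a hypothetical sketch, since the original query text is not shown; the table names are assumed as before.)

-- Logical cross join, with each table filtered independently
SELECT gl.SnapshotKey, fpc.SnapshotKey
FROM dbo.GeneralLedger AS gl
CROSS JOIN dbo.FiscalPeriodCalendar AS fpc
WHERE gl.SnapshotKey = 201703
  AND fpc.SnapshotKey = 201703;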
Second query
The optimizer can no longer infer anything about gl.SnapshotKey, so the simplifications and transformations made for the first query are no longer possible.

Indeed, unless it is true that each partition holds only a single SnapshotKey, the rewrite is not guaranteed to produce the same results.

Again, expressing the execution plan produced using the closest available T-SQL syntax:
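(A hypothetical sketch; the actual table names and the partition function name PF_SnapshotKey are assumed.)

-- Correlated (APPLY) join on the partition id
SELECT fpc.SnapshotKey, gl.SnapshotKey
FROM dbo.FiscalPeriodCalendar AS fpc
CROSS APPLY
(
    SELECT g.SnapshotKey
    FROM dbo.GeneralLedger AS g
    WHERE $PARTITION.PF_SnapshotKey(g.SnapshotKey)
        = $PARTITION.PF_SnapshotKey(fpc.SnapshotKey)
) AS gl
WHERE fpc.SnapshotKey = 201703;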
This time there is no logical cross join. Instead, there is a correlated join (an apply) on the partition id.
This is hard to assess from the information given. Using mock data and tables based on the queries and plan image provided, I found the first query outperformed the second in every case.
The same query expressed using different syntax can often produce a different execution plan, simply because the optimizer started from a different point, and explored options in a different order before it found a suitable execution plan. Plan search is not exhaustive, and not every possible logical transformation is available, so the end result is likely to be different. As noted above, the two queries do not necessarily express the same requirement anyway (at least given the information available to the optimizer).
On a separate note, be aware that the initial columnstore implementation in SQL Server 2012 (and to a somewhat lesser extent, 2014) has many limitations, not least on the optimization side of things. You will likely get better, and more consistent, results by upgrading to a more recent release (ideally the very latest). This is particularly true if you're going to be using partitioning.
I certainly would not recommend you get into the habit of rewriting joins using $PARTITION, except as a very last resort, and with a very deep understanding of what you are doing.

That's about all I can say without being able to see the schema or plan detail.