SQL Server Functions – How to Perform a Function on a Returned Table Subset

functionsself-joinsql serversql-server-2012

I have a query that takes 10 minutes to run (1.4 million rows and 15-ish joins), so I think the time isn't terrible.

What I need from this is to effectively INNER JOIN on itself an additional column based on a key of 'n' other columns. This key would be selecting 3 specific rows and performing a mathematical function to produce one result. This result would be joined onto the 3 rows in the final column:

I know I could (and have with smaller queries) just run the query again with a group by and join it on to the original, but this (I believe) would double the query time.

Is there a standard way to do this that I am just not aware of?

Completely stumped by this one, and have had to resort to asking a question on here (after searching). Bear with me as I am self taught!

Best Answer

Below is an example of how you might use the FIRST_VALUE and LAST_VALUE Windows Functions to calculate your Gain_loss column and then use a CASE expression to give you a Classification

SET NOCOUNT ON
declare @Table table (Col1 Char(1), Col2 Char(2), Col3 Char(2), [Year] int, [Year+n] int, Result int)
insert into @Table 
    (Col1, Col2, Col3, [Year], [Year+n], Result) 
    values 
        ('A', 'UK', '1Z', 2015, 2, 30),
        ('A', 'UK', '1Z', 2015, 1, 20), 
        ('A', 'UK', '1Z', 2015, 0, 10), 
        ('A', 'UK', '1Y', 2015, 2, 30),
        ('A', 'UK', '1Y', 2015, 1, 35),
        ('A', 'UK', '1Y', 2015, 0, 40)

--Use the FIRST_VALUE and LAST_VALUE Window Functions to calculate the Gain_loss
SELECT Col1
    ,Col2
    ,Col3
    ,[YEAR]
    ,[Year+n]
    ,Result
    ,LAST_VALUE(Result) OVER(PARTITION BY Col1, Col2, Col3 
                            ORDER by Col1, Col2, Col3, [Year], [Year+n] 
                            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 
    FIRST_VALUE(Result) OVER(PARTITION BY Col1, Col2, Col3 
                            ORDER BY Col1, Col2, Col3, [Year], [Year+n]) 
    AS Gain_loss
    ,CASE
        WHEN 
            LAST_VALUE(Result) OVER(PARTITION BY Col1, Col2, Col3 
                            ORDER BY Col1, Col2, Col3, [Year], [Year+n] 
                            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 
            FIRST_VALUE(Result) OVER(PARTITION BY Col1, Col2, Col3 
                            ORDER BY Col1, Col2, Col3, [Year], [Year+n])  > 0 THEN 'Up'
        WHEN
            LAST_VALUE(Result) OVER(PARTITION BY Col1, Col2, Col3 
                            ORDER BY Col1, Col2, Col3, [Year], [Year+n] 
                            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) - 
            FIRST_VALUE(Result) OVER(PARTITION BY Col1, Col2, Col3 
                                ORDER BY Col1, Col2, Col3, [Year], [Year+n]) < 0 THEN 'Down'
        ELSE 'NoChange' 
    END AS Classification
FROM @Table
ORDER BY Col1, Col2, Col3 DESC, [YEAR], [Year+n]

That produces the following results (which appears to match your desired results)

| Col1 | Col2 | Col3 | YEAR | Year+n | Result | Gain_loss | Classification |
|------|------|------|------|--------|--------|-----------|----------------|
| A    | UK   | 1Z   | 2015 | 0      | 10     | 20        | Up             |
| A    | UK   | 1Z   | 2015 | 1      | 20     | 20        | Up             |
| A    | UK   | 1Z   | 2015 | 2      | 30     | 20        | Up             |
| A    | UK   | 1Y   | 2015 | 0      | 40     | -10       | Down           |
| A    | UK   | 1Y   | 2015 | 1      | 35     | -10       | Down           |
| A    | UK   | 1Y   | 2015 | 2      | 30     | -10       | Down           |

To learn more about SQL Server Window Functions, check out:

SELECT - OVER Clause (Transact-SQL)

How to Use Microsoft SQL Server 2012's Window Functions

Related Solutions

SQL Server MERGE – Subset of Target Table

The MERGE statement has a complex syntax and an even more complex implementation, but essentially the idea is to join two tables, filter down to rows that need to be changed (inserted, updated, or deleted), and then to perform the requested changes. Given the following sample data:

DECLARE @CategoryItem AS TABLE
(
    CategoryId  integer NOT NULL,
    ItemId      integer NOT NULL,

    PRIMARY KEY (CategoryId, ItemId),
    UNIQUE (ItemId, CategoryId)
);

DECLARE @DataSource AS TABLE
(
    CategoryId  integer NOT NULL,
    ItemId      integer NOT NULL

    PRIMARY KEY (CategoryId, ItemId)
);

INSERT @CategoryItem
    (CategoryId, ItemId)
VALUES
    (1, 1),
    (1, 2),
    (1, 3),
    (2, 1),
    (2, 3),
    (3, 5),
    (3, 6),
    (4, 5);

INSERT @DataSource
    (CategoryId, ItemId)
VALUES
    (2, 2);

Target

╔════════════╦════════╗
║ CategoryId ║ ItemId ║
╠════════════╬════════╣
║          1 ║      1 ║
║          2 ║      1 ║
║          1 ║      2 ║
║          1 ║      3 ║
║          2 ║      3 ║
║          3 ║      5 ║
║          4 ║      5 ║
║          3 ║      6 ║
╚════════════╩════════╝

Source

╔════════════╦════════╗
║ CategoryId ║ ItemId ║
╠════════════╬════════╣
║          2 ║      2 ║
╚════════════╩════════╝

The desired outcome is to replace data in the target with data from the source, but only for CategoryId = 2. Following the description of MERGE given above, we should write a query that joins the source and target on the keys only, and filter rows only in the WHEN clauses:

MERGE INTO @CategoryItem AS TARGET
USING @DataSource AS SOURCE ON 
    SOURCE.ItemId = TARGET.ItemId 
    AND SOURCE.CategoryId = TARGET.CategoryId
WHEN NOT MATCHED BY SOURCE 
    AND TARGET.CategoryId = 2 
    THEN DELETE
WHEN NOT MATCHED BY TARGET 
    AND SOURCE.CategoryId = 2 
    THEN INSERT (CategoryId, ItemId)
        VALUES (CategoryId, ItemId)
OUTPUT 
    $ACTION, 
    ISNULL(INSERTED.CategoryId, DELETED.CategoryId) AS CategoryId,
    ISNULL(INSERTED.ItemId, DELETED.ItemId) AS ItemId
;

This gives the following results:

╔═════════╦════════════╦════════╗
║ $ACTION ║ CategoryId ║ ItemId ║
╠═════════╬════════════╬════════╣
║ DELETE  ║          2 ║      1 ║
║ INSERT  ║          2 ║      2 ║
║ DELETE  ║          2 ║      3 ║
╚═════════╩════════════╩════════╝
╔════════════╦════════╗
║ CategoryId ║ ItemId ║
╠════════════╬════════╣
║          1 ║      1 ║
║          1 ║      2 ║
║          1 ║      3 ║
║          2 ║      2 ║
║          3 ║      5 ║
║          3 ║      6 ║
║          4 ║      5 ║
╚════════════╩════════╝

The execution plan is: Merge plan

Notice both tables are scanned fully. We might think this inefficient, because only rows where CategoryId = 2 will be affected in the target table. This is where the warnings in Books Online come in. One misguided attempt to optimize to touch only necessary rows in the target is:

MERGE INTO @CategoryItem AS TARGET
USING 
(
    SELECT CategoryId, ItemId
    FROM @DataSource AS ds 
    WHERE CategoryId = 2
) AS SOURCE ON
    SOURCE.ItemId = TARGET.ItemId
    AND TARGET.CategoryId = 2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CategoryId, ItemId)
    VALUES (CategoryId, ItemId)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
OUTPUT 
    $ACTION, 
    ISNULL(INSERTED.CategoryId, DELETED.CategoryId) AS CategoryId,
    ISNULL(INSERTED.ItemId, DELETED.ItemId) AS ItemId
;

The logic in the ON clause is applied as part of the join. In this case, the join is a full outer join (see this Books Online entry for why). Applying the check for category 2 on the target rows as part of an outer join ultimately results in rows with a different value being deleted (because they do not match the source):

╔═════════╦════════════╦════════╗
║ $ACTION ║ CategoryId ║ ItemId ║
╠═════════╬════════════╬════════╣
║ DELETE  ║          1 ║      1 ║
║ DELETE  ║          1 ║      2 ║
║ DELETE  ║          1 ║      3 ║
║ DELETE  ║          2 ║      1 ║
║ INSERT  ║          2 ║      2 ║
║ DELETE  ║          2 ║      3 ║
║ DELETE  ║          3 ║      5 ║
║ DELETE  ║          3 ║      6 ║
║ DELETE  ║          4 ║      5 ║
╚═════════╩════════════╩════════╝

╔════════════╦════════╗
║ CategoryId ║ ItemId ║
╠════════════╬════════╣
║          2 ║      2 ║
╚════════════╩════════╝

The root cause is the same reason predicates behave differently in an outer join ON clause than they do if specified in the WHERE clause. The MERGE syntax (and the join implementation depending on the clauses specified) just make it harder to see that this is so.

The guidance in Books Online (expanded in the Optimizing Performance entry) offers guidance that will ensure the correct semantic is expressed using MERGE syntax, without the user necessarily having to understand all the implementation details, or account for the ways in which the optimizer might legitimately rearrange things for execution efficiency reasons.

The documentation offers three potential ways to implement early filtering:

Specifying a filtering condition in the WHEN clause guarantees correct results, but may mean that more rows are read and processed from the source and target tables than is strictly necessary (as seen in the first example).

Updating through a view that contains the filtering condition also guarantees correct results (since changed rows must be accessible for update through the view) but this does require a dedicated view, and one that follows the odd conditions for updating views.

Using a common table expression carries similar risks to adding predicates to the ON clause, but for slightly different reasons. In many cases it will be safe, but it requires expert analysis of the execution plan to confirm this (and extensive practical testing). For example:

WITH TARGET AS 
(
    SELECT * 
    FROM @CategoryItem
    WHERE CategoryId = 2
)
MERGE INTO TARGET
USING 
(
    SELECT CategoryId, ItemId
    FROM @DataSource
    WHERE CategoryId = 2
) AS SOURCE ON
    SOURCE.ItemId = TARGET.ItemId
    AND SOURCE.CategoryId = TARGET.CategoryId
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CategoryId, ItemId)
    VALUES (CategoryId, ItemId)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE
OUTPUT 
    $ACTION, 
    ISNULL(INSERTED.CategoryId, DELETED.CategoryId) AS CategoryId,
    ISNULL(INSERTED.ItemId, DELETED.ItemId) AS ItemId
;

This produces correct results (not repeated) with a more optimal plan:

Merge plan 2

The plan only reads rows for category 2 from the target table. This might be an important performance consideration if the target table is large, but it is all too easy to get this wrong using MERGE syntax.

Sometimes, it is easier to write the MERGE as separate DML operations. This approach can even perform better than a single MERGE, a fact which often surprises people.

DELETE ci
FROM @CategoryItem AS ci
WHERE ci.CategoryId = 2
AND NOT EXISTS 
(
    SELECT 1 
    FROM @DataSource AS ds 
    WHERE 
        ds.ItemId = ci.ItemId
        AND ds.CategoryId = ci.CategoryId
);

INSERT @CategoryItem
SELECT 
    ds.CategoryId, 
    ds.ItemId
FROM @DataSource AS ds
WHERE
    ds.CategoryId = 2;

SQL Server Join Syntax and Style Performance Considerations

In a world where the query optimizer considered all possible join orders, and contained all possible logical transformations, the syntax we use for our queries would not matter at all.

As it is, the optimizer generally uses heuristics to pick an initial join order and explores a number of join order rewrites from there. It does this to avoid excessive compilation time and resource usage. It doesn't take all that many joins for the number of possible combinations to become unreasonable to explore exhaustively.

To take an extreme example, 42 joins are enough to generate more alternatives than there are atoms in the observable universe. More realistically, even 7 tables are enough to produce 665,280 alternatives. Although this is not a mind-boggling number, it would still take very significant time (and memory) to explore those alternatives completely.

Although the heuristics are largely based on the type of join (inner, outer, cross...) and cardinality estimates, the textual order of the query can also have an impact. Sometimes, this is an optimizer limitation - NOT EXISTS clauses are not reordered, and outer join reordering is very limited. Even with simple inner joins, the interaction between textual order, initial join order heuristics, and optimizer internals can be difficult to predict with certainty.

To take an example using the AdventureWorks sample database, I can write a query using the a common syntax form as:

SELECT
    P.Name,
    PS.Name,
    SUM(TH.Quantity),
    SUM(INV.Quantity)
FROM Production.Product AS P
JOIN Production.ProductSubcategory AS PS
    ON PS.ProductSubcategoryID = P.ProductSubcategoryID
JOIN Production.TransactionHistory AS TH
    ON TH.ProductID = P.ProductID
JOIN Production.ProductInventory AS INV
    ON INV.ProductID = P.ProductID
GROUP BY
    P.ProductID,
    P.Name,
    PS.ProductSubcategoryID,
    PS.Name;

Before cost-based optimization, the logical query tree looks like this (note the join order is not the same as the written order):

Logical tree

I can (carefully) rewrite the query to use 'nested' syntax:

SELECT
    P.Name,
    PS.Name,
    SUM(TH.Quantity),
    SUM(INV.Quantity)
FROM Production.ProductSubcategory AS PS
JOIN Production.Product AS P
JOIN Production.TransactionHistory AS TH
JOIN Production.ProductInventory AS INV
    ON INV.ProductID = TH.ProductID
    ON TH.ProductID = P.ProductID
    ON P.ProductSubcategoryID = PS.ProductSubcategoryID
GROUP BY
    P.ProductID,
    P.Name,
    PS.ProductSubcategoryID,
    PS.Name;

In which case the logical tree at the same point is:

Input tree 2

The two different syntaxes produce a different initial join order in this case. After cost-based optimization, both produce the same output plan shape:

Plan shape

There are detailed differences between the two plans, with the 'nested' syntax producing a plan with a somewhat lower estimated cost:

Plan 2

The two inputs took a slightly different path through the optimizer, so it isn't all that surprising there are slight differences.

In general, using different syntax will sometimes (definitely not always!) produce different plan results. There is no broad correlation between one syntax and better plans. Most people write and maintain queries using something like the non-nested join syntax, so it often makes practical sense to use that.

To summarize, my advice is to write queries using whichever syntax seems most natural (and maintainable!) to you and your peers. If you get a better plan for a specific query using a particular syntax, by all means use it - but be sure to test that you still get the better plan whenever you patch or upgrade SQL Server :)

Best Answer

Related Solutions

SQL Server MERGE – Subset of Target Table

SQL Server Join Syntax and Style Performance Considerations

Related Question