Sql-server – Unusual column comparisons and query performance

exceptperformancequery-performancesql serversql-server-2016

We have some consultants working on expanding an inhouse data warehouse. I was doing a code review and ran across this pattern in all of the load procs:

    MERGE [EDHub].[Customer].[Class] AS TARGET
    USING (
        SELECT <columns>
        FROM [dbo].[vw_CustomerClass]
            WHERE JHAPostingDate = @PostingDate   
        ) AS SOURCE
        ON  TARGET.BankId = SOURCE.BankId       -- This join is on the business keys
            AND TARGET.Code = SOURCE.Code
    WHEN NOT MATCHED BY TARGET  
        THEN
            <INSERT Statement>
    WHEN MATCHED
        AND TARGET.IsLatest = 1
        AND EXISTS (
            SELECT SOURCE.[HASH]   
            EXCEPT          
            SELECT TARGET.[Hash]
            )
        THEN 
            <UPDATE Statement>

The gist is, if we have a new business key, insert but if the business key exists and the hash of the attributes don't match our current row then update the old row and insert a new one (later in the code). It all works fine but I paused when I got to this code

AND EXISTS (
            SELECT SOURCE.[HASH]   
            EXCEPT          
            SELECT TARGET.[Hash]
            )

It seems overly complicated compared to SOURCE.[HASH] <> TARGET.[Hash]. The EXCEPT will do an accurate NULL comparison but in our case hashes will never be NULL (or we have bigger problems). I want our code to be easy to read so that when someone has to maintain it, it doesn't confuse. I asked our consultants about it and they speculated that it might be faster because of set operations but I decided to write a simple test (test code below).

The first thing I noticed was the EXISTS/EXCEPT had a more complicated query plan but that's not always bad

I ran each select client statistics on and the <> join yielded total execution time of 12,000 vs 25,000 with the EXISTS/EXCEPT. I want to take this to our consultants with the request to refactor that statement but wanted to get feedback here on:

is this a good test? – am I missing anything?
is there a case where EXISTS/EXCEPT would be a better comparison?

Test script:

CREATE TABLE r (hash VARBINARY(8000))
CREATE TABLE l (hash VARBINARY(8000))

SET NOCOUNT ON
DECLARE @x INT = 10000
WHILE @x <> 0 BEGIN

   INSERT INTO dbo.r ( hash ) SELECT HASHBYTES('SHA2_256',CAST(NEWID() AS VARCHAR(200)))
   INSERT INTO dbo.l ( hash ) SELECT HASHBYTES('SHA2_256',CAST(NEWID() AS VARCHAR(200)))    

   SET @x = @x-1
END

INSERT INTO dbo.r ( hash ) VALUES ( NULL  )
INSERT INTO dbo.l ( hash ) VALUES ( NULL  )

SELECT COUNT(1) 
FROM dbo.l
CROSS JOIN dbo.r 
WHERE ISNULL(r.hash,0) <> ISNULL(l.hash,0)

SELECT COUNT(1) 
FROM dbo.l
CROSS JOIN dbo.r 
WHERE EXISTS(SELECT r.hash except select l.HASH)

Best Answer

I don't like ISNULL with sentinel values, it requires picking values that can't ever legitimately appear in the data now or forever after and personally I find expressions containing these more difficult to reason about.

For your test rig I tried four different ways of expressing the query and got the stated results.

SELECT COUNT(1)
FROM   dbo.l
       CROSS JOIN dbo.r
WHERE  r.hash <> l.hash
        OR ( r.hash IS NULL
             AND l.hash IS NOT NULL )
        OR ( l.hash IS NULL
             AND r.hash IS NOT NULL )

SQL Server Execution Times: CPU time = 30968 ms, elapsed time = 8230 ms.

SELECT COUNT(1)
FROM   dbo.l
       CROSS JOIN dbo.r
WHERE  ISNULL(r.hash, 0) <> ISNULL(l.hash, 0)

SQL Server Execution Times: CPU time = 31594 ms, elapsed time = 9230 ms.

SELECT COUNT(1)
FROM   dbo.l
       CROSS JOIN dbo.r
WHERE  EXISTS(SELECT r.hash
              EXCEPT
              SELECT l.HASH)

SQL Server Execution Times: CPU time = 46531 ms, elapsed time = 13191 ms.

SELECT COUNT(1)
FROM   dbo.l
       CROSS JOIN dbo.r
WHERE  NOT EXISTS(SELECT r.hash
                  INTERSECT
                  SELECT l.HASH)

SQL Server Execution Times: CPU time = 23812 ms, elapsed time = 6760 ms.

So on that basis the last one would be a clear winner - along with a code comment with a link to Undocumented Query Plans: Equality Comparisons for anyone unfamiliar with the pattern.

But you should test whether this pattern is reproducible with your actual MERGE query too.

Related Solutions

Sql-server – Way to do MERGE with update source when target not match

Updates are tracked by MERGE by providing:

the $Action value UPDATE;
the old values in the memory table deleted; and
the new values in the memory table inserted.

Therefore both in the case where a match is found, and in the case where there is no match by target, the values of the identity column is available in the inserted memory table.

Postgresql – Best way of finding rows referencing a given id on PostgreSQL

I suggest your first option, with two improvements and some simplifications.

(
SELECT 1      -- irrelevant what you select here
FROM   client_category_price
WHERE  sellable_id = '9bc202ca-f7c1-11e2-a751-062b1fc90460'
LIMIT  1      -- may be redundant
)
UNION ALL     -- not just UNION

  ...

UNION ALL
(
SELECT 1
FROM   work_order_item
WHERE  sellable_id = '9bc202ca-f7c1-11e2-a751-062b1fc90460'
LIMIT  1
)
LIMIT  1;      -- this one is crucial

Given that all you want to know is

if any of those (table, column) have my row's id there, which would prevent its deletion.

You don't need a full list of violating rows. Stop searching at the first one. All you need to do is add another LIMIT 1 at the end of the query. This way, Postgres skips rest of the query as soon as the first row is found. You probably don't need LIMIT 1 for each SELECT, just the one at the end. Test without, it may produce different query plans.
Use UNION ALL instead of UNION. Faster.
Some other simplifications.

Best Answer

Related Solutions

Sql-server – Way to do MERGE with update source when target not match

Postgresql – Best way of finding rows referencing a given id on PostgreSQL

Related Question