Sql-server – Unusual column comparisons and query performance

exceptperformancequery-performancesql serversql-server-2016

We have some consultants working on expanding an inhouse data warehouse. I was doing a code review and ran across this pattern in all of the load procs:

    MERGE [EDHub].[Customer].[Class] AS TARGET
    USING (
        SELECT <columns>
        FROM [dbo].[vw_CustomerClass]
            WHERE JHAPostingDate = @PostingDate   
        ) AS SOURCE
        ON  TARGET.BankId = SOURCE.BankId       -- This join is on the business keys
            AND TARGET.Code = SOURCE.Code
    WHEN NOT MATCHED BY TARGET  
        THEN
            <INSERT Statement>
    WHEN MATCHED
        AND TARGET.IsLatest = 1
        AND EXISTS (
            SELECT SOURCE.[HASH]   
            EXCEPT          
            SELECT TARGET.[Hash]
            )
        THEN 
            <UPDATE Statement>

The gist is, if we have a new business key, insert but if the business key exists and the hash of the attributes don't match our current row then update the old row and insert a new one (later in the code). It all works fine but I paused when I got to this code

AND EXISTS (
            SELECT SOURCE.[HASH]   
            EXCEPT          
            SELECT TARGET.[Hash]
            )

It seems overly complicated compared to SOURCE.[HASH] <> TARGET.[Hash]. The EXCEPT will do an accurate NULL comparison but in our case hashes will never be NULL (or we have bigger problems). I want our code to be easy to read so that when someone has to maintain it, it doesn't confuse. I asked our consultants about it and they speculated that it might be faster because of set operations but I decided to write a simple test (test code below).

The first thing I noticed was the EXISTS/EXCEPT had a more complicated query plan but that's not always bad

I ran each select client statistics on and the <> join yielded total execution time of 12,000 vs 25,000 with the EXISTS/EXCEPT. I want to take this to our consultants with the request to refactor that statement but wanted to get feedback here on:

  1. is this a good test? – am I missing anything?
  2. is there a case where EXISTS/EXCEPT would be a better comparison?

Test script:

CREATE TABLE r (hash VARBINARY(8000))
CREATE TABLE l (hash VARBINARY(8000))

SET NOCOUNT ON
DECLARE @x INT = 10000
WHILE @x <> 0 BEGIN

   INSERT INTO dbo.r ( hash ) SELECT HASHBYTES('SHA2_256',CAST(NEWID() AS VARCHAR(200)))
   INSERT INTO dbo.l ( hash ) SELECT HASHBYTES('SHA2_256',CAST(NEWID() AS VARCHAR(200)))    

   SET @x = @x-1
END

INSERT INTO dbo.r ( hash ) VALUES ( NULL  )
INSERT INTO dbo.l ( hash ) VALUES ( NULL  )

SELECT COUNT(1) 
FROM dbo.l
CROSS JOIN dbo.r 
WHERE ISNULL(r.hash,0) <> ISNULL(l.hash,0)

SELECT COUNT(1) 
FROM dbo.l
CROSS JOIN dbo.r 
WHERE EXISTS(SELECT r.hash except select l.HASH)

Best Answer

I don't like ISNULL with sentinel values, it requires picking values that can't ever legitimately appear in the data now or forever after and personally I find expressions containing these more difficult to reason about.

For your test rig I tried four different ways of expressing the query and got the stated results.

SELECT COUNT(1)
FROM   dbo.l
       CROSS JOIN dbo.r
WHERE  r.hash <> l.hash
        OR ( r.hash IS NULL
             AND l.hash IS NOT NULL )
        OR ( l.hash IS NULL
             AND r.hash IS NOT NULL )

SQL Server Execution Times: CPU time = 30968 ms, elapsed time = 8230 ms.

SELECT COUNT(1)
FROM   dbo.l
       CROSS JOIN dbo.r
WHERE  ISNULL(r.hash, 0) <> ISNULL(l.hash, 0)

SQL Server Execution Times: CPU time = 31594 ms, elapsed time = 9230 ms.

SELECT COUNT(1)
FROM   dbo.l
       CROSS JOIN dbo.r
WHERE  EXISTS(SELECT r.hash
              EXCEPT
              SELECT l.HASH)

SQL Server Execution Times: CPU time = 46531 ms, elapsed time = 13191 ms.

SELECT COUNT(1)
FROM   dbo.l
       CROSS JOIN dbo.r
WHERE  NOT EXISTS(SELECT r.hash
                  INTERSECT
                  SELECT l.HASH) 

SQL Server Execution Times: CPU time = 23812 ms, elapsed time = 6760 ms.

So on that basis the last one would be a clear winner - along with a code comment with a link to Undocumented Query Plans: Equality Comparisons for anyone unfamiliar with the pattern.

But you should test whether this pattern is reproducible with your actual MERGE query too.