Sql-server – the algorithm behind the EXCEPT operator

excepthashingperformanceperformance-tuningsql serversql-server-2016

What is the internal algorithm of how the Except operator works under the covers in SQL Server? Does it internally take a hash of each row and compare?

David Lozinksi ran a study, SQL: Fastest way to insert new records where one doesn’t already exist He showed Except statement is the fastest for large number rows; closely tying to our results below.

Assumption: I would think Left join would be fastest, as it only compares 1 column, Except would take the longest, since it has to compare All the columns.
With these results, now our thinking is Except automatically and internally takes a hash of each row? I looked at Except execution plan and it does utilize some hash.

Background:
Our team was comparing two heap tables. Table A Rows not in Table B, were inserted into Table B.

The heap tables (from legacy text filesystem) do not have primary keys/guids/identifiers. Some of the tables had duplicate rows, so we found the Hash of each row, and removed duplicates, and created Primary key identifiers.

1) First we ran an except statement, excluding (hash column)

select * from TableA
Except
Select * from TableB,

2) Then we ran a left join compare between the two tables on the HashRowId

select * 
FROM dbo.TableA A
left join dbo.TableB B
    on A.RowHash =  B.RowHash
where B.Hash is null

surprisingly the Except Statement Insert was the fastest.

Results actually map close to testing results from David Lozinksi

Best Answer

What is the internal algorithm of how the Except operator works under the covers in SQL Server?

I wouldn't say that there's a special internal algorithm for EXCEPT. For A EXCEPT B, the engine takes distinct (if needed) tuples from A and subtracts rows that match in B. There are no special query plan operators. The distinct and the subtraction are implemented through the typical operators that you would see with a sort or with a join. Nested loop join, merge join, and hash join are all supported. To show this, I'll throw 15 million rows into a pair of heaps:

DROP TABLE IF EXISTS dbo.TABLE_1;

CREATE TABLE dbo.TABLE_1 (
    COL1 BIGINT NULL,
    COL2 BIGINT NULL
);

INSERT INTO dbo.TABLE_1 WITH (TABLOCK)
SELECT TOP (15000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS dbo.TABLE_2;

CREATE TABLE dbo.TABLE_2 (
    COL1 BIGINT NULL,
    COL2 BIGINT NULL
);

INSERT INTO dbo.TABLE_2 WITH (TABLOCK)
SELECT TOP (15000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

The optimizer makes it usual cost-based decisions about how to implement the sort and the join. With two heaps I get a hash join as expected. You can see other join types naturally by adding indexes or by changing the data in either table. Below I force merge and loop joins with hints just for illustrative purposes:

Does it internally take a hash of each row and compare?

No. It is implemented as any other join. One difference is that NULLs are treated as equal. This is a special type of comparison which you can see in the execution plan: <Compare CompareOp="IS">. However, you can get that same plan with T-SQL that does not include the EXCEPT keyword. For example, the following has the exact same query plan as the EXCEPT query that uses a hash join:

SELECT t1.*
FROM
(
    SELECT DISTINCT COL1, COL2
    FROM dbo.TABLE_1
) t1
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.TABLE_2 t2
    WHERE (t1.COL1 = t2.COL1 OR (t1.COL1 IS NULL AND t2.COL1 IS NULL))
    AND (t1.COL2 = t2.COL2 OR (t1.COL2 IS NULL AND t2.COL2 IS NULL))
);

Diffing the XML of the execution plans only reveals superficial differences around aliases and things like that. The probe residuals for the hash joins do the row comparison. They are the same for both queries:

If you still have doubts, I ran PerfView with the highest available sample rate to get call stacks for the query with EXCEPT and the query without it. Here are the results side by side:

There is no real difference. The call stacks there reference hashing are present because of the hash matches in the plan. If I add indexes to get a natural merge join, you won't see any references to hashing in the call stacks:

Any hashing that occurs is due to the implementation of hash match operators. There isn't anything special about EXCEPT which leads to a special, internal hashing comparison.

Related Solutions

Sql-server – Easily show rows that are different between two tables or queries

You don't need 30 join conditions for a FULL OUTER JOIN here.

You can just Full Outer Join on the PK, preserve rows with at least one difference with WHERE EXISTS (SELECT A.* EXCEPT SELECT B.*) and use CROSS APPLY (SELECT A.* UNION ALL SELECT B.*) to unpivot out both sides of the JOINed rows into individual rows.

WITH TableA(Col1, Col2, Col3) 
     AS (SELECT 'Dog',1,1     UNION ALL 
         SELECT 'Cat',27,86   UNION ALL 
         SELECT 'Cat',128,92), 
     TableB(Col1, Col2, Col3) 
     AS (SELECT 'Dog',1,1     UNION ALL 
         SELECT 'Cat',27,105  UNION ALL 
         SELECT 'Lizard',83,NULL) 
SELECT CA.*
FROM   TableA A 
       FULL OUTER JOIN TableB B 
         ON A.Col1 = B.Col1 
            AND A.Col2 = B.Col2 
/*Unpivot the joined rows*/
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
             SELECT 'TableB' AS what, B.*) AS CA     
/*Exclude identical rows*/
WHERE  EXISTS (SELECT A.* 
               EXCEPT 
               SELECT B.*) 
/*Discard NULL extended row*/
AND CA.Col1 IS NOT NULL      
ORDER BY CA.Col1, CA.Col2

Gives

what   Col1   Col2        Col3
------ ------ ----------- -----------
TableA Cat    27          86
TableB Cat    27          105
TableA Cat    128         92
TableB Lizard 83          NULL

Or a version dealing with the moved goalposts.

SELECT DISTINCT CA.*
FROM   TableA A 
       FULL OUTER JOIN TableB B 
         ON EXISTS (SELECT A.*  INTERSECT  SELECT B.*) 
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
             SELECT 'TableB' AS what, B.*) AS CA     
WHERE NOT EXISTS (SELECT A.*  INTERSECT  SELECT B.*) 
AND CA.Col1 IS NOT NULL
ORDER BY CA.Col1, CA.Col2

For tables with many columns it can still be difficult to identify the specific column(s) that differ. For that you can potentially use the below.

(though just on relatively small tables as otherwise this method likely won't have adequate performance)

SELECT t1.primary_key,
       y1.c,
       y1.v,
       y2.v
FROM   t1
       JOIN t2
         ON t1.primary_key = t2.primary_key
       CROSS APPLY (SELECT t1.*
                    FOR xml path('row'), elements xsinil, type) x1(x)
       CROSS APPLY (SELECT t2.*
                    FOR xml path('row'), elements xsinil, type) x2(x)
       CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
                           n.n.value('.', 'nvarchar(max)')
                    FROM   x1.x.nodes('row/*') AS n(n)) y1(c, v)
       CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
                           n.n.value('.', 'nvarchar(max)')
                    FROM   x2.x.nodes('row/*') AS n(n)) y2(c, v)
WHERE  y1.c = y2.c
       AND EXISTS(SELECT y1.v
                  EXCEPT
                  SELECT y2.v)

Sql-server – Identifying rows which don’t match a master row

You can also do this with dynamic SQL without having to manually build out all the column names.

DECLARE @sql NVARCHAR(MAX), @c1 NVARCHAR(MAX), @c2 NVARCHAR(MAX);

SELECT @c1 = N'', @c2 = N'';

SELECT 
  @c1 = @c1 + ',' + QUOTENAME(name),
  @c2 = @c2 + ' AND m.' + QUOTENAME(name) + ' = s.' + QUOTENAME(name)
 FROM sys.columns
 WHERE name <> 'LocationID'
 AND [object_id] = OBJECT_ID('dbo.table1');

SET @sql = ';WITH s AS (
       SELECT ' + STUFF(@c1, 1, 1, '') + ' FROM dbo.table1
       EXCEPT 
       SELECT ' + STUFF(@c1, 1, 1, '') + ' FROM dbo.table1_master
     ) 
     SELECT m.LocationID
 FROM s INNER JOIN dbo.table1 AS m ON 1 = 1
 ' + @c2;

SELECT @sql;
--EXEC sp_executesql @sql;

You can take the output of this query as is and store the query somewhere, or you can comment out the SELECT and uncomment the EXEC and leave it as permanent dynamic SQL - in this case it will automatically adapt to column changes in the two tables.

Another idea (assuming LocationID is unique) - and it occurred to me you may want to include the master row so you can quickly spot the columns that are different:

  ;WITH c AS 
  (
    SELECT t.LocationID, m.setting1, m.setting2, ...
      FROM dbo.table1 AS t CROSS JOIN dbo.table1_master AS m
  )
  SELECT DISTINCT src = '> master', setting1, setting2, ...
    FROM c
  UNION ALL
  (
    SELECT RTRIM(LocationID), setting1, setting2, ...
      FROM dbo.table1
    EXCEPT
    SELECT RTRIM(LocationID), setting1, setting2, ...
      FROM c
  )
  ORDER BY src;

This version is a little cheaper (mostly by avoiding the DISTINCT against the master table, at the cost of needing to specify all of the columns one more time - which again you can automate as per above):

  ;WITH m AS 
  (
    SELECT setting1, setting2, ... 
      FROM dbo.table1_master
  ),
  c AS 
  (
    SELECT src = RTRIM(t.LocationID), m.setting1, m.setting2, ...
      FROM dbo.table1 AS t CROSS JOIN m
  )
  SELECT src = '> master', setting1, setting2, ...
    FROM m
  UNION ALL
  (
    SELECT RTRIM(LocationID), setting1, setting2, ...
      FROM dbo.table1
    EXCEPT
    SELECT src, setting1, setting2, ...
      FROM c
  )
  ORDER BY src;

However all of these options are poorer performers with worse plans than Rachel's simple LEFT JOIN. I tried to stick to the theme of using EXCEPT even though it is more about syntax than performance.

The key takeaway is that if the column count is too high to deal with manually, you can use the dynamic SQL approach above to construct whatever query you want to use - and you can do that one time and store the result, or have the code generated every time. To generate Rachel's query using dynamic SQL, not much needs to change:

DECLARE @sql NVARCHAR(MAX), @and NVARCHAR(MAX), @anycol NVARCHAR(128);
SELECT @sql = N'', @and = N'';

SELECT @and = @and + ' AND t.' + QUOTENAME(name) + ' = m.' + QUOTENAME(name)
  FROM sys.columns
  WHERE [object_id] = OBJECT_ID('dbo.table1_master');

SELECT TOP (1) @anycol = QUOTENAME(name)
  FROM sys.columns
  WHERE [object_id] = OBJECT_ID('dbo.table1_master')
  ORDER BY name;

SET @sql = 'SELECT locationID
FROM dbo.table1 AS t
LEFT OUTER JOIN dbo.table1_master AS m ON 1 = 1' 
  + @and + ' WHERE m.' + @anycol + ' IS NULL;';

SELECT @sql;
--EXEC sp_executesql @sql;

Best Answer

Related Solutions

Sql-server – Easily show rows that are different between two tables or queries

Sql-server – Identifying rows which don’t match a master row

Related Question