Sql-server – the algorithm behind the EXCEPT operator

excepthashingperformanceperformance-tuningsql serversql-server-2016

What is the internal algorithm of how the Except operator works under the covers in SQL Server? Does it internally take a hash of each row and compare?

David Lozinksi ran a study, SQL: Fastest way to insert new records where one doesn’t already exist He showed Except statement is the fastest for large number rows; closely tying to our results below.

Assumption: I would think Left join would be fastest, as it only compares 1 column, Except would take the longest, since it has to compare All the columns.
With these results, now our thinking is Except automatically and internally takes a hash of each row? I looked at Except execution plan and it does utilize some hash.

Background:
Our team was comparing two heap tables. Table A Rows not in Table B, were inserted into Table B.

The heap tables (from legacy text filesystem) do not have primary keys/guids/identifiers. Some of the tables had duplicate rows, so we found the Hash of each row, and removed duplicates, and created Primary key identifiers.

1) First we ran an except statement, excluding (hash column)

select * from TableA
Except
Select * from TableB,

2) Then we ran a left join compare between the two tables on the HashRowId

select * 
FROM dbo.TableA A
left join dbo.TableB B
    on A.RowHash =  B.RowHash
where B.Hash is null

surprisingly the Except Statement Insert was the fastest.

Results actually map close to testing results from David Lozinksi

enter image description here

Best Answer

What is the internal algorithm of how the Except operator works under the covers in SQL Server?

I wouldn't say that there's a special internal algorithm for EXCEPT. For A EXCEPT B, the engine takes distinct (if needed) tuples from A and subtracts rows that match in B. There are no special query plan operators. The distinct and the subtraction are implemented through the typical operators that you would see with a sort or with a join. Nested loop join, merge join, and hash join are all supported. To show this, I'll throw 15 million rows into a pair of heaps:

DROP TABLE IF EXISTS dbo.TABLE_1;

CREATE TABLE dbo.TABLE_1 (
    COL1 BIGINT NULL,
    COL2 BIGINT NULL
);

INSERT INTO dbo.TABLE_1 WITH (TABLOCK)
SELECT TOP (15000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);


DROP TABLE IF EXISTS dbo.TABLE_2;

CREATE TABLE dbo.TABLE_2 (
    COL1 BIGINT NULL,
    COL2 BIGINT NULL
);

INSERT INTO dbo.TABLE_2 WITH (TABLOCK)
SELECT TOP (15000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), NULL
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

The optimizer makes it usual cost-based decisions about how to implement the sort and the join. With two heaps I get a hash join as expected. You can see other join types naturally by adding indexes or by changing the data in either table. Below I force merge and loop joins with hints just for illustrative purposes:

joins

Does it internally take a hash of each row and compare?

No. It is implemented as any other join. One difference is that NULLs are treated as equal. This is a special type of comparison which you can see in the execution plan: <Compare CompareOp="IS">. However, you can get that same plan with T-SQL that does not include the EXCEPT keyword. For example, the following has the exact same query plan as the EXCEPT query that uses a hash join:

SELECT t1.*
FROM
(
    SELECT DISTINCT COL1, COL2
    FROM dbo.TABLE_1
) t1
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.TABLE_2 t2
    WHERE (t1.COL1 = t2.COL1 OR (t1.COL1 IS NULL AND t2.COL1 IS NULL))
    AND (t1.COL2 = t2.COL2 OR (t1.COL2 IS NULL AND t2.COL2 IS NULL))
);

Diffing the XML of the execution plans only reveals superficial differences around aliases and things like that. The probe residuals for the hash joins do the row comparison. They are the same for both queries:

enter image description here

If you still have doubts, I ran PerfView with the highest available sample rate to get call stacks for the query with EXCEPT and the query without it. Here are the results side by side:

enter image description here

There is no real difference. The call stacks there reference hashing are present because of the hash matches in the plan. If I add indexes to get a natural merge join, you won't see any references to hashing in the call stacks:

enter image description here

Any hashing that occurs is due to the implementation of hash match operators. There isn't anything special about EXCEPT which leads to a special, internal hashing comparison.