You don't need 30 join conditions for a FULL OUTER JOIN
here.
You can just Full Outer Join on the PK, preserve rows with at least one difference with WHERE EXISTS (SELECT A.* EXCEPT SELECT B.*)
and use CROSS APPLY (SELECT A.* UNION ALL SELECT B.*)
to unpivot out both sides of the JOIN
ed rows into individual rows.
WITH TableA(Col1, Col2, Col3)
AS (SELECT 'Dog',1,1 UNION ALL
SELECT 'Cat',27,86 UNION ALL
SELECT 'Cat',128,92),
TableB(Col1, Col2, Col3)
AS (SELECT 'Dog',1,1 UNION ALL
SELECT 'Cat',27,105 UNION ALL
SELECT 'Lizard',83,NULL)
SELECT CA.*
FROM TableA A
FULL OUTER JOIN TableB B
ON A.Col1 = B.Col1
AND A.Col2 = B.Col2
/*Unpivot the joined rows*/
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
SELECT 'TableB' AS what, B.*) AS CA
/*Exclude identical rows*/
WHERE EXISTS (SELECT A.*
EXCEPT
SELECT B.*)
/*Discard NULL extended row*/
AND CA.Col1 IS NOT NULL
ORDER BY CA.Col1, CA.Col2
Gives
what Col1 Col2 Col3
------ ------ ----------- -----------
TableA Cat 27 86
TableB Cat 27 105
TableA Cat 128 92
TableB Lizard 83 NULL
Or a version dealing with the moved goalposts.
SELECT DISTINCT CA.*
FROM TableA A
FULL OUTER JOIN TableB B
ON EXISTS (SELECT A.* INTERSECT SELECT B.*)
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
SELECT 'TableB' AS what, B.*) AS CA
WHERE NOT EXISTS (SELECT A.* INTERSECT SELECT B.*)
AND CA.Col1 IS NOT NULL
ORDER BY CA.Col1, CA.Col2
For tables with many columns it can still be difficult to identify the specific column(s) that differ. For that you can potentially use the below.
(though just on relatively small tables as otherwise this method likely won't have adequate performance)
SELECT t1.primary_key,
y1.c,
y1.v,
y2.v
FROM t1
JOIN t2
ON t1.primary_key = t2.primary_key
CROSS APPLY (SELECT t1.*
FOR xml path('row'), elements xsinil, type) x1(x)
CROSS APPLY (SELECT t2.*
FOR xml path('row'), elements xsinil, type) x2(x)
CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
n.n.value('.', 'nvarchar(max)')
FROM x1.x.nodes('row/*') AS n(n)) y1(c, v)
CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
n.n.value('.', 'nvarchar(max)')
FROM x2.x.nodes('row/*') AS n(n)) y2(c, v)
WHERE y1.c = y2.c
AND EXISTS(SELECT y1.v
EXCEPT
SELECT y2.v)
You can also do this with dynamic SQL without having to manually build out all the column names.
DECLARE @sql NVARCHAR(MAX), @c1 NVARCHAR(MAX), @c2 NVARCHAR(MAX);
SELECT @c1 = N'', @c2 = N'';
SELECT
@c1 = @c1 + ',' + QUOTENAME(name),
@c2 = @c2 + ' AND m.' + QUOTENAME(name) + ' = s.' + QUOTENAME(name)
FROM sys.columns
WHERE name <> 'LocationID'
AND [object_id] = OBJECT_ID('dbo.table1');
SET @sql = ';WITH s AS (
SELECT ' + STUFF(@c1, 1, 1, '') + ' FROM dbo.table1
EXCEPT
SELECT ' + STUFF(@c1, 1, 1, '') + ' FROM dbo.table1_master
)
SELECT m.LocationID
FROM s INNER JOIN dbo.table1 AS m ON 1 = 1
' + @c2;
SELECT @sql;
--EXEC sp_executesql @sql;
You can take the output of this query as is and store the query somewhere, or you can comment out the SELECT
and uncomment the EXEC
and leave it as permanent dynamic SQL - in this case it will automatically adapt to column changes in the two tables.
Another idea (assuming LocationID is unique) - and it occurred to me you may want to include the master row so you can quickly spot the columns that are different:
;WITH c AS
(
SELECT t.LocationID, m.setting1, m.setting2, ...
FROM dbo.table1 AS t CROSS JOIN dbo.table1_master AS m
)
SELECT DISTINCT src = '> master', setting1, setting2, ...
FROM c
UNION ALL
(
SELECT RTRIM(LocationID), setting1, setting2, ...
FROM dbo.table1
EXCEPT
SELECT RTRIM(LocationID), setting1, setting2, ...
FROM c
)
ORDER BY src;
This version is a little cheaper (mostly by avoiding the DISTINCT
against the master table, at the cost of needing to specify all of the columns one more time - which again you can automate as per above):
;WITH m AS
(
SELECT setting1, setting2, ...
FROM dbo.table1_master
),
c AS
(
SELECT src = RTRIM(t.LocationID), m.setting1, m.setting2, ...
FROM dbo.table1 AS t CROSS JOIN m
)
SELECT src = '> master', setting1, setting2, ...
FROM m
UNION ALL
(
SELECT RTRIM(LocationID), setting1, setting2, ...
FROM dbo.table1
EXCEPT
SELECT src, setting1, setting2, ...
FROM c
)
ORDER BY src;
However all of these options are poorer performers with worse plans than Rachel's simple LEFT JOIN
. I tried to stick to the theme of using EXCEPT
even though it is more about syntax than performance.
The key takeaway is that if the column count is too high to deal with manually, you can use the dynamic SQL approach above to construct whatever query you want to use - and you can do that one time and store the result, or have the code generated every time. To generate Rachel's query using dynamic SQL, not much needs to change:
DECLARE @sql NVARCHAR(MAX), @and NVARCHAR(MAX), @anycol NVARCHAR(128);
SELECT @sql = N'', @and = N'';
SELECT @and = @and + ' AND t.' + QUOTENAME(name) + ' = m.' + QUOTENAME(name)
FROM sys.columns
WHERE [object_id] = OBJECT_ID('dbo.table1_master');
SELECT TOP (1) @anycol = QUOTENAME(name)
FROM sys.columns
WHERE [object_id] = OBJECT_ID('dbo.table1_master')
ORDER BY name;
SET @sql = 'SELECT locationID
FROM dbo.table1 AS t
LEFT OUTER JOIN dbo.table1_master AS m ON 1 = 1'
+ @and + ' WHERE m.' + @anycol + ' IS NULL;';
SELECT @sql;
--EXEC sp_executesql @sql;
Best Answer
I wouldn't say that there's a special internal algorithm for
EXCEPT
. ForA EXCEPT B
, the engine takes distinct (if needed) tuples from A and subtracts rows that match in B. There are no special query plan operators. The distinct and the subtraction are implemented through the typical operators that you would see with a sort or with a join. Nested loop join, merge join, and hash join are all supported. To show this, I'll throw 15 million rows into a pair of heaps:The optimizer makes it usual cost-based decisions about how to implement the sort and the join. With two heaps I get a hash join as expected. You can see other join types naturally by adding indexes or by changing the data in either table. Below I force merge and loop joins with hints just for illustrative purposes:
No. It is implemented as any other join. One difference is that NULLs are treated as equal. This is a special type of comparison which you can see in the execution plan:
<Compare CompareOp="IS">
. However, you can get that same plan with T-SQL that does not include theEXCEPT
keyword. For example, the following has the exact same query plan as theEXCEPT
query that uses a hash join:Diffing the XML of the execution plans only reveals superficial differences around aliases and things like that. The probe residuals for the hash joins do the row comparison. They are the same for both queries:
If you still have doubts, I ran PerfView with the highest available sample rate to get call stacks for the query with
EXCEPT
and the query without it. Here are the results side by side:There is no real difference. The call stacks there reference hashing are present because of the hash matches in the plan. If I add indexes to get a natural merge join, you won't see any references to hashing in the call stacks:
Any hashing that occurs is due to the implementation of hash match operators. There isn't anything special about
EXCEPT
which leads to a special, internal hashing comparison.