You don't need 30 join conditions for a FULL OUTER JOIN
here.
You can just Full Outer Join on the PK, preserve rows with at least one difference with WHERE EXISTS (SELECT A.* EXCEPT SELECT B.*)
and use CROSS APPLY (SELECT A.* UNION ALL SELECT B.*)
to unpivot out both sides of the JOIN
ed rows into individual rows.
WITH TableA(Col1, Col2, Col3)
AS (SELECT 'Dog',1,1 UNION ALL
SELECT 'Cat',27,86 UNION ALL
SELECT 'Cat',128,92),
TableB(Col1, Col2, Col3)
AS (SELECT 'Dog',1,1 UNION ALL
SELECT 'Cat',27,105 UNION ALL
SELECT 'Lizard',83,NULL)
SELECT CA.*
FROM TableA A
FULL OUTER JOIN TableB B
ON A.Col1 = B.Col1
AND A.Col2 = B.Col2
/*Unpivot the joined rows*/
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
SELECT 'TableB' AS what, B.*) AS CA
/*Exclude identical rows*/
WHERE EXISTS (SELECT A.*
EXCEPT
SELECT B.*)
/*Discard NULL extended row*/
AND CA.Col1 IS NOT NULL
ORDER BY CA.Col1, CA.Col2
Gives
what Col1 Col2 Col3
------ ------ ----------- -----------
TableA Cat 27 86
TableB Cat 27 105
TableA Cat 128 92
TableB Lizard 83 NULL
Or a version dealing with the moved goalposts.
SELECT DISTINCT CA.*
FROM TableA A
FULL OUTER JOIN TableB B
ON EXISTS (SELECT A.* INTERSECT SELECT B.*)
CROSS APPLY (SELECT 'TableA' AS what, A.* UNION ALL
SELECT 'TableB' AS what, B.*) AS CA
WHERE NOT EXISTS (SELECT A.* INTERSECT SELECT B.*)
AND CA.Col1 IS NOT NULL
ORDER BY CA.Col1, CA.Col2
For tables with many columns it can still be difficult to identify the specific column(s) that differ. For that you can potentially use the below.
(though just on relatively small tables as otherwise this method likely won't have adequate performance)
SELECT t1.primary_key,
y1.c,
y1.v,
y2.v
FROM t1
JOIN t2
ON t1.primary_key = t2.primary_key
CROSS APPLY (SELECT t1.*
FOR xml path('row'), elements xsinil, type) x1(x)
CROSS APPLY (SELECT t2.*
FOR xml path('row'), elements xsinil, type) x2(x)
CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
n.n.value('.', 'nvarchar(max)')
FROM x1.x.nodes('row/*') AS n(n)) y1(c, v)
CROSS APPLY (SELECT n.n.value('local-name(.)', 'sysname'),
n.n.value('.', 'nvarchar(max)')
FROM x2.x.nodes('row/*') AS n(n)) y2(c, v)
WHERE y1.c = y2.c
AND EXISTS(SELECT y1.v
EXCEPT
SELECT y2.v)
Here's what I've done before:
(SELECT 'TableA', * FROM TableA
EXCEPT
SELECT 'TableA', * FROM TableB)
UNION ALL
(SELECT 'TableB', * FROM TableB
EXCEPT
SELECT 'TableB', * FROM TableA)
It's worked well enough on tables that are about 1,000,000 rows, but I'm not sure how well that would work on extremely large tables.
Added:
I've run the query against my system which compares two tables with 21 fields of regular types in two different databases attached to the same server running SQL Server 2005. The table has about 3 million rows, and there's about 25000 rows different. The primary key on the table is weird, however, as it's a composite key of 10 fields (it's an audit table).
The execution plans for the queries has a total cost of 184.25879 for UNION
and 184.22983 for UNION ALL
. The tree cost only differs on the last step before returning rows, the concatenation.
Actually executing either query takes about 42s plus about 3s to actually transmit the rows. The time between the two queries is identical.
Second Addition:
This is actually extremely fast, each one running against 3 million rows in about 2.5s:
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM TableA
SELECT CHECKSUM_AGG(BINARY_CHECKSUM(*)) FROM TableB
If the results of those don't match, you know the tables are different. However, if the results do match, you're not guaranteed that the tables are identical because of the [highly unlikely] chance of checksum collisions.
I'm not sure how datatype changes between tables would affect this calculation. I would run the query against the system
views or information_schema
views.
I tried the query against another table with 5 million rows and that one ran in about 5s, so it appears to be largely O(n).
Best Answer
You need all combinations from the two tables, so a
CROSS JOIN
and then remove the ones in the 3rd table, using eitherNOT EXISTS
orEXCEPT
: