I have a table in SQL server that looks like this:
Id |Version |Name |date |fieldA |fieldB ..|fieldZ
1 |1 |Foo |20120101|23 | ..|25334123
2 |2 |Foo |20120101|23 |NULL ..|NULL
3 |2 |Bar |20120303|24 |123......|NULL
4 |2 |Bee |20120303|34 |-34......|NULL
I am working on a stored procedure to diff, that takes input data and a version number.
The input data has columns from Name uptil fieldZ. Most of the field columns are expected to be NULL, i.e., each row usually has data for only the first few fields, the rest are NULL. The name, date and version form a unique constraint on the table.
I need to diff the data that is input with respect to this table, for a given version. Each row needs to be diffed – a row is identified by the name, date and version, and any change in any of the values in the field columns will need to show in the diff.
Update: all the fields need not be of type decimal. Some of them may be nvarchars. I would prefer the diff to happen without converting the type, although the diff output could convert everything to nvarchar since it is to be used only for display purposed.
Suppose the input is the following, and the requested version is 2,:
Name |date |fieldA |fieldB|..|fieldZ
Foo |20120101|25 |NULL |.. |NULL
Foo |20120102|26 |27 |.. |NULL
Bar |20120303|24 |126 |.. |NULL
Baz |20120101|15 |NULL |.. |NULL
The diff needs to be in the following format:
name |date |field |oldValue |newValue
Foo |20120101|FieldA |23 |25
Foo |20120102|FieldA |NULL |26
Foo |20120102|FieldB |NULL |27
Bar |20120303|FieldB |123 |126
Baz |20120101|FieldA |NULL |15
My solution so far is to first generate a diff, using EXCEPT and UNION.
Then convert the diff to the desired output format using a JOIN and CROSS APPLY.
Although this seems to be working, I am wondering if there is a cleaner and more efficient way to do this. The number of fields is close to a 100, and each place in the code that has a … is actually a large number of lines. Both the input table and existing table are expected to be quite large over time.
I am new to SQL and am still trying to learn performance tuning.
Here is the SQL for it:
CREATE TABLE #diff
( [change] [nvarchar](50) NOT NULL,
[name] [nvarchar](50) NOT NULL,
[date] [int] NOT NULL,
[FieldA] [decimal](38, 10) NULL,
[FieldB] [decimal](38, 10) NULL,
.....
[FieldZ] [decimal](38, 10) NULL
)
--Generate the diff in a temporary table
INSERT INTO #diff
SELECT * FROM
(
(
SELECT
'old' as change,
name,
date,
FieldA,
FieldB,
...,
FieldZ
FROM
myTable mt
WHERE
version = @version
AND mt.name + '_' + CAST(mt.date AS VARCHAR) IN (SELECT name + '_' + CAST(date AS VARCHAR) FROM @diffInput)
EXCEPT
SELECT 'old' as change,* FROM @diffInput
)
UNION
(
SELECT 'new' as change, * FROM @diffInput
EXCEPT
SELECT
'new' as change,
name,
date,
FieldA,
FieldB,
...,
FieldZ
FROM
myTable mt
WHERE
version = @version
AND mt.name + '_' + CAST(mt.date AS VARCHAR) IN (SELECT name + '_' + CAST(date AS VARCHAR) FROM @diffInput)
)
) AS myDiff
SELECT
d3.name, d3.date, CrossApplied.field, CrossApplied.oldValue, CrossApplied.newValue
FROM
(
SELECT
d2.name, d2.date,
d1.FieldA AS oldFieldA, d2.FieldA AS newFieldA,
d1.FieldB AS oldFieldB, d2.FieldB AS newFieldB,
...
d1.FieldZ AS oldFieldZ, d2.FieldZ AS newFieldZ,
FROM #diff AS d1
RIGHT OUTER JOIN #diff AS d2
ON
d1.name = d2.name
AND d1.date = d2.date
AND d1.change = 'old'
WHERE d2.change = 'new'
) AS d3
CROSS APPLY (VALUES ('FieldA', oldFieldA, newFieldA),
('FieldB', oldFieldB, newFieldB),
...
('FieldZ', oldFieldZ, newFieldZ))
CrossApplied (field, oldValue, newValue)
WHERE
crossApplied.oldValue != crossApplied.newValue
OR (crossApplied.oldValue IS NULL AND crossApplied.newValue IS NOT NULL)
OR (crossApplied.oldValue IS NOT NULL AND crossApplied.newValue IS NULL)
Thank you!
Best Answer
Here is another approach:
This is how it works:
The two tables are joined using an outer join,
@diffInput
being on the outer side to match your right join.The result of the join is conditionally unpivoted using CROSS APPLY, where "conditionally" means that each pair of columns is tested individually and returned only if the columns differ.
The pattern of each test condition
is equivalent to your
only more concise. You can read more about this use of INTERSECT in detail in Paul White's article Undocumented Query Plans: Equality Comparisons.
On a different note, since you are saying,
you may want to consider replacing the table variable you are using for the input table with a temporary table. There is a very comprehensive answer by Martin Smith that explores differences between the two:
In short, certain properties of table variables, like e.g. absence of column statistics, may make them less query optimiser-friendly for your scenario than temporary tables.