There is no advantage for the non-NULL cases when using SPARSE, and in fact, there are two stated disadvantages:
- an extra 4 bytes for each non-NULL value
- slightly longer access time
As you pretty much already gathered, the SPARSE option only makes sense for fixed-length datatypes; I can't think of a single reason to use it on variable-length types.
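For illustration, here is a minimal sketch of a sparse column declaration (the table and column names are hypothetical; note that SPARSE columns must be nullable):

    -- SPARSE only pays off on fixed-length types where most values are NULL:
    -- a NULL costs no storage, but every non-NULL int costs 4 + 4 bytes.
    CREATE TABLE dbo.sparseDemo
    (
        id         int NOT NULL PRIMARY KEY,
        rarelyUsed int SPARSE NULL
    );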
I am not sure that the extra 4 bytes implies anything about the value being stored off-row, and the MAX types aren't entirely off-row when they exceed 8,000 bytes, as there is still a 16-byte pointer in the row to that off-row location.
Stick with VARCHAR(4000), no SPARSE, and I would even consider making it NOT NULL DEFAULT('') (an empty string is still 0 bytes, but now you don't need to mess with the NULL indicator, and can a comment really be "unknown" as opposed to "no comment"?).
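As a sketch of that recommendation (assuming, hypothetically, that the column is called comment and lives in dbo.myTable):

    -- No SPARSE; NOT NULL with an empty-string default instead of NULL.
    -- The default also populates the column for existing rows.
    ALTER TABLE dbo.myTable
        ADD comment varchar(4000) NOT NULL
            CONSTRAINT DF_myTable_comment DEFAULT ('');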
Here is another approach:
    SELECT
        di.name,
        di.date,
        x.field,
        x.oldValue,
        x.newValue
    FROM
        @diffInput AS di
        LEFT JOIN dbo.myTable AS mt
            ON  mt.version = @version
            AND mt.name    = di.name
            AND mt.date    = di.date
        CROSS APPLY
        (
            SELECT
                'fieldA',
                mt.fieldA,
                di.fieldA
            WHERE
                NOT EXISTS (SELECT mt.fieldA INTERSECT SELECT di.fieldA)
            UNION ALL
            SELECT
                'fieldB',
                mt.fieldB,
                di.fieldB
            WHERE
                NOT EXISTS (SELECT mt.fieldB INTERSECT SELECT di.fieldB)
            UNION ALL
            SELECT
                'fieldC',
                mt.fieldC,
                di.fieldC
            WHERE
                NOT EXISTS (SELECT mt.fieldC INTERSECT SELECT di.fieldC)
            UNION ALL
            ...
        ) AS x (field, oldValue, newValue)
    ;
This is how it works:
The two tables are joined using an outer join, @diffInput being on the outer side to match your right join.
The result of the join is conditionally unpivoted using CROSS APPLY, where "conditionally" means that each pair of columns is tested individually and returned only if the columns differ.
The pattern of each test condition

    NOT EXISTS (SELECT oldValue INTERSECT SELECT newValue)

is equivalent to your

    oldValue != newValue
    OR (oldValue IS NULL AND newValue IS NOT NULL)
    OR (oldValue IS NOT NULL AND newValue IS NULL)

only more concise. You can read about this use of INTERSECT in detail in Paul White's article Undocumented Query Plans: Equality Comparisons.
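If you want to see the null handling for yourself, here is a small standalone sketch (variable names and values are made up):

    -- The pattern treats NULL vs. non-NULL as "different" and
    -- NULL vs. NULL as "same", with no explicit IS NULL checks.
    DECLARE @oldValue varchar(10) = NULL,
            @newValue varchar(10) = 'x';

    SELECT CASE
               WHEN NOT EXISTS (SELECT @oldValue INTERSECT SELECT @newValue)
               THEN 'different'
               ELSE 'same'
           END AS comparison;  -- returns 'different'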
On a different note, since you are saying:

> Both the input table and existing table are expected to be quite large over time

you may want to consider replacing the table variable you are using for the input table with a temporary table. There is a very comprehensive answer by Martin Smith that explores the differences between the two. In short, certain properties of table variables, such as the absence of column statistics, may make them less query optimiser-friendly for your scenario than temporary tables.
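A minimal sketch of that swap, with hypothetical column types and only the columns the query above touches:

    -- Instead of a table variable:
    --     DECLARE @diffInput TABLE (name varchar(50), [date] date, fieldA int, fieldB int, fieldC int);
    -- a temporary table, which gets column statistics and can be indexed:
    CREATE TABLE #diffInput
    (
        name   varchar(50) NOT NULL,
        [date] date        NOT NULL,
        fieldA int NULL,
        fieldB int NULL,
        fieldC int NULL,
        PRIMARY KEY (name, [date])
    );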