Why is a Key Lookup required to get A, B and C when they are not referenced in the query at all? I assume they are being used to calculate Comp, but why?
Columns A, B, and C
are referenced in the query plan - they are used by the seek on T2
.
Also, why can the query use the index on t2, but not on t1?
The optimizer decided that scanning the clustered index was cheaper than scanning the filtered nonclustered index and then performing a lookup to retrieve the values for columns A, B, and C.
Explanation
The real question is why the optimizer felt the need to retrieve A, B, and C for the index seek at all. We would expect it to read the Comp
column using a nonclustered index scan, and then perform a seek on the same index (alias T2) to locate the Top 1 record.
The query optimizer expands computed column references before optimization begins, to give it a chance to assess the costs of various query plans. For some queries, expanding the definition of a computed column allows the optimizer to find more efficient plans.
When the optimizer encounters a correlated subquery, it attempts to 'unroll it' to a form it finds easier to reason about. If it cannot find a more effective simplification, it resorts to rewriting the correlated subquery as an apply (a correlated join):
It just so happens that this apply unrolling puts the logical query tree into a form that does not work well with project normalization (a later stage that looks to match general expressions to computed columns, among other things).
In your case, the way the query is written interacts with internal details of the optimizer such that the expanded expression definition is not matched back to the computed column, and you end up with a seek that references columns A, B, and C
instead of the computed column, Comp
. This is the root cause.
Workaround
One idea to workaround this side-effect is to write the query as an apply manually:
SELECT
T1.ID,
T1.Comp,
T1.D,
CA.D2
FROM dbo.T AS T1
CROSS APPLY
(
SELECT TOP (1)
D2 = T2.D
FROM dbo.T AS T2
WHERE
T2.Comp = T1.Comp
AND T2.D > T1.D
ORDER BY
T2.D ASC
) AS CA
WHERE
T1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
ORDER BY
T1.Comp;
Unfortunately, this query will not use the filtered index as we would hope either. The inequality test on column D
inside the apply rejects NULLs
, so the apparently redundant predicate WHERE T1.D IS NOT NULL
is optimized away.
Without that explicit predicate, the filtered index matching logic decides it cannot use the filtered index. There are a number of ways to work around this second side-effect, but the easiest is probably to change the cross apply to an outer apply (mirroring the logic of the rewrite the optimizer performed earlier on the correlated subquery):
SELECT
T1.ID,
T1.Comp,
T1.D,
CA.D2
FROM dbo.T AS T1
OUTER APPLY
(
SELECT TOP (1)
D2 = T2.D
FROM dbo.T AS T2
WHERE
T2.Comp = T1.Comp
AND T2.D > T1.D
ORDER BY
T2.D ASC
) AS CA
WHERE
T1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
ORDER BY
T1.Comp;
Now the optimizer does not need to use the apply rewrite itself (so the computed column matching works as expected) and the predicate is not optimized away either, so the filtered index can be used for both data access operations, and the seek uses the Comp
column on both sides:
This would generally be preferred over adding A, B, and C as INCLUDEd
columns in the filtered index, because it addresses the root cause of the problem, and does not require widening the index unnecessarily.
Persisted computed columns
As a side note, it is not necessary to mark the computed column as PERSISTED
, if you don't mind repeating its definition in a CHECK
constraint:
CREATE TABLE dbo.T
(
ID integer IDENTITY(1, 1) NOT NULL,
A varchar(20) NOT NULL,
B varchar(20) NOT NULL,
C varchar(20) NOT NULL,
D date NULL,
E varchar(20) NULL,
Comp AS A + '-' + B + '-' + C,
CONSTRAINT CK_T_Comp_NotNull
CHECK (A + '-' + B + '-' + C IS NOT NULL),
CONSTRAINT PK_T_ID
PRIMARY KEY (ID)
);
CREATE NONCLUSTERED INDEX IX_T_Comp_D
ON dbo.T (Comp, D)
WHERE D IS NOT NULL;
The computed column is only required to be PERSISTED
in this case if you want to use a NOT NULL
constraint or to reference the Comp
column directly (instead of repeating its definition) in a CHECK
constraint.
The problem seems to be related to the fact that [TestGeocode].[ToString]()
returns a max
datatype (nvarchar(max)
).
I also encounter the issue with this simpler version (changing the definition of c1
to varchar(8000)
or using COALESCE
instead of ISNULL
resolves it)
DROP TABLE dbo.Test
CREATE TABLE dbo.Test
(
c1 VARCHAR(
MAX --Fails
-- 8000 --Works fine
) NULL,
comp1 AS CAST(ISNULL(c1, 'ABC') AS VARCHAR(100))
CONSTRAINT UK_Test_comp1 UNIQUE NONCLUSTERED(comp1)
)
GO
DECLARE @comp1 VARCHAR(100)
SELECT comp1
FROM dbo.Test WITH (FORCESEEK)
WHERE comp1 = @comp1
OPTION (QUERYTRACEON 3604, QUERYTRACEON 8606);
Computed column references get expanded out to the underlying definition then matched back to the column later. This allows computed columns to be matched without referencing them by name at all and also allows simplification to operate on the underlying definitions.
ISNULL
returns the datatype of the first parameter (VARCHAR(MAX)
in my example). The return type of COALESCE
will be VARCHAR(MAX)
here too but it seems to be evaluated differently in a way that avoids the problem.
In the cases where the query succeeds the trace flag output includes the following
ScaOp_Convert varchar(max) collate 49160,Null,Var,Trim,ML=65535
ScaOp_Const TI(varchar collate 49160,Var,Trim,ML=3)
XVAR(varchar,Owned,Value=Len,Data = (3,ABC))
Where it fails this is replaced by
ScaOp_Identifier COL: ConstExpr1003
I speculate that in the cases where it fails the (implicit) CAST('ABC' AS VARCHAR(MAX))
is just done once and this is evaluated as a runtime constant (more information). However the reference to this runtime constant label, instead of the actual string literal value itself, prevents it from matching the computed column definition.
This rewrite avoids the issue in your query
CREATE TABLE [dbo].[Test]
(
[test] [VARCHAR](100) NULL,
[TestGeocode] [geography] NULL,
[Hashkey] AS CAST(
( hashbytes
('SHA',
( RIGHT(SPACE(100) + isnull([test], ''), 100) )
+ RIGHT(SPACE(100) + isnull(CAST(RIGHT([TestGeocode].[ToString](),100) AS VARCHAR(100)), ''),100)
)
) AS BINARY(20)
) PERSISTED
CONSTRAINT [UK_Test_HashKey] UNIQUE NONCLUSTERED([Hashkey])
)
Best Answer
As far as I know, your only options are to create the second computed column, contact the vendor of the software to ask for a fix, or to submit an enhancement request to Microsoft to get better support for your scenario. On the surface, the functionality that you're asking for could be viewed as simple: why can't SQL Server figure out that the expressions are equivalent when they obviously are to a programmer? However, it requires at least all of the following:
SQL Server needs to know that
DATEADD(minute, ValMin, DATEADD(hour, ValHour, ValDate))
is equal toDATEADD(hour, ValHour, DATEADD(minute, ValMin, ValDate))
. There are a lot of date equivalences that SQL Server isn't yet aware of, and this is one of them.The query optimizer would need to look for commutative matches during computed column matching.
The query optimizer would need to be able to find your match during the query optimization process, which is designed to very quickly give you a "good enough" plan.
I am sympathetic to your problem, but my guess is that the scenario that you're describing just isn't common enough for Microsoft to make improvements in this area, especially when there is a straightforward workaround that could be accomplished through changing code or by adding another computed column to the table.