SQL Server – How to Investigate Wrong Index Selection

Tags: execution-plan, index, sql-server

I have a Transaction table with about 200 million records, one primary key clustered on Id and 2 indexes:

  • IX_SiloId_ChangedTime_IncludeTime
  • IX_SiloId_Time_IncludeContent

I run these two statements to update statistics before running the actual query:

UPDATE STATISTICS dbo.[Transaction] IX_SiloId_ChangedTime_IncludeTime WITH FULLSCAN;
UPDATE STATISTICS dbo.[Transaction] IX_SiloId_Time_IncludeContent WITH FULLSCAN;
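To confirm the update took effect and to see what the optimizer actually has to work with, you can inspect the statistics objects behind each index (a sketch; by default the statistics object carries the same name as the index):

```sql
-- STAT_HEADER shows when the stats were last updated and how many rows
-- were sampled; HISTOGRAM shows the step boundaries the optimizer uses
-- to estimate row counts for predicates on the leading key column.
DBCC SHOW_STATISTICS ('dbo.[Transaction]', IX_SiloId_Time_IncludeContent);
DBCC SHOW_STATISTICS ('dbo.[Transaction]', IX_SiloId_ChangedTime_IncludeTime);
```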

This is my query:

DECLARE @Query SiloTimeQueryTableType -- (SiloId, Time) with primary key clustered on SiloId
INSERT INTO @Query VALUES 
(1, '2020-12-31') -- 1000 total values, though it's still the same problem with just one

SELECT  t.*
FROM    [Transaction] t
INNER JOIN @Query q
    ON t.SiloId = q.SiloId
WHERE 
    t.Time >= q.Time

Now, for whatever reason, SQL Server chooses IX_SiloId_ChangedTime_IncludeTime, and the query takes forever. If I force the other index with WITH (INDEX(IX_SiloId_Time_IncludeContent)), I get the result right away.

The correct index seems quite obvious here, yet SQL Server chooses the one whose key doesn't even include Time.

I can't understand this behaviour. From what I've read it's best to avoid index hints, but I created this index with exactly this query in mind.

So the question is: what can I do to figure out why SQL Server prefers the "wrong" index, even though a much better one exists and I just ran a full statistics update?

Query plan with the forced index (here using a temp table instead of the TVP, to check whether that changes anything, as the answer suggested; the result seems to be the same):

[query plan screenshot]

Query plan without the forced index:

[query plan screenshot]

https://www.brentozar.com/pastetheplan/?id=rJOt3G00P

https://www.brentozar.com/pastetheplan/?id=ByFshGAAP (this one is a live plan, as the query takes too long to complete)

Best Answer

It could be the lack of column-level statistics on the table variable. Try the same approach, but with a real temp table instead, like this:

CREATE TABLE #Query
(
    SiloId int NOT NULL,
    Time datetime NOT NULL
);

INSERT INTO #Query VALUES 
(1, '2020-12-31');

ALTER TABLE #Query
ADD CONSTRAINT PK_Query PRIMARY KEY (SiloId);

SELECT  t.*
FROM    [Transaction] t
INNER JOIN #Query q
    ON t.SiloId = q.SiloId
WHERE 
    t.Time >= q.Time

However, looking at the actual execution plan provided, the problem is definitely due to a bad estimate. Check this out:

[execution plan screenshot: the join operator's estimated row count vs. its actual row count]

For some reason, the optimizer thinks there will be 25,000,000 rows coming out of that join, when in reality there are only 4,155 rows. The optimizer is not going to choose to do 25,000,000 key lookups (since the index doesn't include all the columns), so you end up with a scan of the clustered index. This is referred to as "the tipping point" on various blogs and such.
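You can see this gap yourself by capturing the actual plan, which makes SQL Server report both estimated and actual row counts on every operator (a sketch, assuming the temp table from above is already populated):

```sql
-- Return the actual execution plan as XML alongside the results, so the
-- estimated vs. actual row counts on the join operator can be compared.
SET STATISTICS XML ON;

SELECT  t.*
FROM    [Transaction] t
INNER JOIN #Query q
    ON t.SiloId = q.SiloId
WHERE   t.Time >= q.Time;

SET STATISTICS XML OFF;
```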

The "simplest" option would be to make the index covering (INCLUDE all the columns in the table), or to select only columns that are already in the index. That avoids the key lookup, and you should get this plan naturally. However, it might not be practical (depending on the data types of the other columns in your table, etc.).
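A sketch of the covering variant, with the caveat that the column names in the INCLUDE list are assumptions based on the existing index names; substitute your table's actual columns:

```sql
-- Hypothetical covering version of the index: key on (SiloId, Time),
-- remaining columns in INCLUDE so the plan needs no key lookup.
-- ChangedTime and Content are assumed column names; adjust to your schema.
CREATE NONCLUSTERED INDEX IX_SiloId_Time_Covering
ON dbo.[Transaction] (SiloId, Time)
INCLUDE (ChangedTime, Content);
```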

The better option would be to somehow fix the estimate. You could try adding an index on Time to the temp table, in case that gives you better stats (after loading the table):

CREATE NONCLUSTERED INDEX IX_Time
ON #Query (Time);

You could also try rewriting the query in different forms to see if you get different results / better estimates, like this:

SELECT  t.*
FROM    [Transaction] t
WHERE EXISTS
(
    SELECT null
    FROM #Query q
    WHERE 
        q.SiloId = t.SiloId 
        AND t.Time >= q.Time
)

Or see how the legacy cardinality estimator does with this query by adding this hint to the end of it:

OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'))
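Applied to the query above, the hint goes in an OPTION clause at the very end of the statement:

```sql
SELECT  t.*
FROM    [Transaction] t
INNER JOIN #Query q
    ON t.SiloId = q.SiloId
WHERE   t.Time >= q.Time
-- Revert to the pre-2014 cardinality estimator for this query only.
OPTION (USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```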

So the question is: what can I do to figure out why SQL Server prefers the "wrong" index, even though a much better one exists and I just ran a full statistics update?

In general, it's useful to use an index hint to force the index you want, and then compare the costs and estimates of that plan against the natural one. That can provide clues about why the desired index isn't being chosen.
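For example, run the query once naturally and once with the index forced via a table hint, with the actual execution plan captured, and compare the root operator's Estimated Subtree Cost in each plan:

```sql
-- Forced-index version for comparison against the natural plan.
-- If its estimated cost is higher than the natural plan's, the bad
-- estimate is making the optimizer price this (actually faster) plan out.
SELECT  t.*
FROM    [Transaction] t WITH (INDEX (IX_SiloId_Time_IncludeContent))
INNER JOIN #Query q
    ON t.SiloId = q.SiloId
WHERE   t.Time >= q.Time;
```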

FORCE_LEGACY_CARDINALITY_ESTIMATION chose the correct index right away, though! I don't really know what that means... Is there some bug in the new one?

I wouldn't say it's a bug - they just have very different approaches to producing estimates. The legacy CE still works better in certain situations compared to the new CE.
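If the legacy CE consistently wins for this workload, it can also be enabled more broadly than per-query. A sketch (SQL Server 2016+; this is database-scoped and affects every query in the database, so test carefully before using it):

```sql
-- Database-wide alternative to the per-query USE HINT:
-- make the legacy cardinality estimator the default for this database.
ALTER DATABASE SCOPED CONFIGURATION
SET LEGACY_CARDINALITY_ESTIMATION = ON;
```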

If you want a deep dive into one specific difference, which also provides an approach for looking at the stats and how estimates are produced, check out Paul White's blog post: SQL Server Join Estimation using Histogram Coarse Alignment