Probably a straight forward answer, but I'm too bleary eyed to see it.
Have a simple query that is pulling the top 100 rows based on a char column.
- The column has an index on the where column.
- Returning all the columns (*)
- There are ~100 millions rows
- Statistics have been updated
The optimizer thinks a table scan is more efficient than running an index scan.
Sample Table:
CREATE TABLE [dbo].[tst](
[Mon] [char](6) NULL,
COL1 [varchar](50) NULL,
COL2 [varchar](50) NULL,
COL3 [varchar](50) NULL,
COL4 [varchar](50) NULL,
COL5 [varchar](50) NULL
)
Sample Query:
select
top 100 *
from
<table>
where
mon = '201509'
This will take > 30 seconds to return since the optimizer chooses a table scan vs using the index seek. An index scan is < 1 sec.
Interestingly if a symbol is added in the where
e.g. '2015.09' or '2015 09'
It will use the index.
Searching straight '201509', it uses a table scan.
Add in a dot or space, it uses an index scan.
Can someone explain to me why the optimizer is choosing a table scan vs using the index?
Edit: Thanks all for the excellent feedback and information. I didn't know about traceflag 9130 and from that I can see the estimated rows are a lot less than the actual rows which is likely why the optimizer is choosing the table scan vs the index.
Here is the Statistics Output:
Without index hint:
Table 'TestTable'. Scan count 1, logical reads 982046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 17437 ms, elapsed time = 17792 ms.
With Index hint:
Table 'TestTable'. Scan count 1, logical reads 104, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 4 ms.
Best Answer
The optimizer is convinced that if it's going to have to go back to the disk for retrieving column data anyway, it might as well scan the table in the first place, since that'll be less work for it to do. It'll use the seek with the
CHAR( 7 )
scalar because the statistics for the index know it's not going to find anything, but if data needs to be returned, it has to consider both CPU and I/O weights.Specifying the hint, in both cases, does reduce the time required for the query to resolve, but the index seek + RID lookup actually results in a significant increase in the number of reads necessary ( my test indicated a 60% increase ). Obviously it's not a 1:1 trade off, since the time difference is about 6x, but regardless, the optimizer is choosing the scan instead.
If you can
INCLUDE
the columns you need in the index, you'll get the best of both worlds, eliminating the RID lookup and the additional reads.