Sql-server – Why is the optimizer doing a table scan vs Index Seek

index-tuningsql server

Probably a straight forward answer, but I'm too bleary eyed to see it.

Have a simple query that is pulling the top 100 rows based on a char column.

  • The column has an index on the where column.

  • Returning all the columns (*)

  • There are ~100 millions rows

  • Statistics have been updated

The optimizer thinks a table scan is more efficient than running an index scan.

Sample Table:

CREATE TABLE [dbo].[tst](
[Mon] [char](6) NULL,
COL1 [varchar](50) NULL,
COL2 [varchar](50) NULL,
COL3 [varchar](50) NULL,
COL4 [varchar](50) NULL,
COL5 [varchar](50) NULL
)

Sample Query:

select
   top 100 *
from
   <table>
where
   mon = '201509'

This will take > 30 seconds to return since the optimizer chooses a table scan vs using the index seek. An index scan is < 1 sec.

Interestingly if a symbol is added in the where

e.g. '2015.09' or '2015 09'

It will use the index.

Searching straight '201509', it uses a table scan.

Add in a dot or space, it uses an index scan.
Index

Can someone explain to me why the optimizer is choosing a table scan vs using the index?

Edit: Thanks all for the excellent feedback and information. I didn't know about traceflag 9130 and from that I can see the estimated rows are a lot less than the actual rows which is likely why the optimizer is choosing the table scan vs the index.

Traceflag 9130

Here is the Statistics Output:
Without index hint:

Table 'TestTable'. Scan count 1, logical reads 982046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 17437 ms, elapsed time = 17792 ms.

With Index hint:

Table 'TestTable'. Scan count 1, logical reads 104, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 4 ms.

Best Answer

The optimizer is convinced that if it's going to have to go back to the disk for retrieving column data anyway, it might as well scan the table in the first place, since that'll be less work for it to do. It'll use the seek with the CHAR( 7 ) scalar because the statistics for the index know it's not going to find anything, but if data needs to be returned, it has to consider both CPU and I/O weights.

USE tempdb;
GO

IF NOT EXISTS ( SELECT  1
                FROM    sys.objects
                WHERE   name = 'tst'
                    AND type = 'U' )
BEGIN
    --DROP TABLE dbo.tst;
    CREATE TABLE [dbo].[tst] 
    (
        Mon                     [char](6) NULL,
        COL1                    [varchar](50) NULL,
        COL2                    [varchar](50) NULL,
        COL3                    [varchar](50) NULL,
        COL4                    [varchar](50) NULL,
        COL5                    [varchar](50) NULL
    );

    INSERT INTO dbo.tst ( [Mon] )
    SELECT  TOP 100000000
            CONVERT( CHAR( 6 ), DATEADD( DAY, ( ABS( CHECKSUM( NEWID() ) ) % 10000 + 1 ),
                '20000101' ), 112 )         
    FROM    sys.all_objects so
    CROSS APPLY sys.all_objects sp;

    CREATE NONCLUSTERED INDEX IX__tst__Mon
        ON dbo.tst ( Mon )
    WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR = 100 );
END;

SELECT  Mon, COUNT( 1 )
FROM    dbo.tst
GROUP BY Mon
ORDER BY Mon;

SET STATISTICS IO, TIME ON;

SELECT  Mon, COL1
FROM    dbo.tst
WHERE   Mon = '201509'

SELECT  Mon, COL1
FROM    dbo.tst WITH ( INDEX = IX__tst__Mon )
WHERE   Mon = '201509'

SELECT  *
FROM    dbo.tst
WHERE   Mon = '201509'

SELECT  *
FROM    dbo.tst WITH ( INDEX = IX__tst__Mon )
WHERE   Mon = '201509'

SET STATISTICS IO, TIME OFF;

Specifying the hint, in both cases, does reduce the time required for the query to resolve, but the index seek + RID lookup actually results in a significant increase in the number of reads necessary ( my test indicated a 60% increase ). Obviously it's not a 1:1 trade off, since the time difference is about 6x, but regardless, the optimizer is choosing the scan instead.

If you can INCLUDE the columns you need in the index, you'll get the best of both worlds, eliminating the RID lookup and the additional reads.

--DROP INDEX dbo.tst.IX__tst__Mon
CREATE NONCLUSTERED INDEX IX__tst__Mon
    ON dbo.tst ( Mon )
INCLUDE ( COL1, COL2, COL3, COL4, COL5 )
WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR = 100 );

SET STATISTICS IO, TIME ON;

SELECT  *
FROM    dbo.tst
WHERE   Mon = '201509'

SET STATISTICS IO, TIME OFF;