Sql-server – Getting Index Scan instead of a possible Index Seek

indexoptimizationsql server

Currently learning a few things on query optimization, and I've been trying out different queries and stumbled upon this "issue".

I am using the AdventureWorks2014 database, I ran this simple query:

table structure (taken from https://www.sqldatadictionary.com/AdventureWorks2014.pdf):

enter image description here

SELECT C.CustomerID
FROM Sales.Customer AS C
WHERE C.CustomerID > 100

returns 19,720 rows

total number of rows in Sales.Customer = 19,820

And after checking to make sure CustomerID is in fact not just the PK of the table but also has a clustered index on it (yet it uses a non clustered index), that indeed is the case:

EXEC SP_HELPINDEX 'Sales.Customer'

enter image description here

Here is the execution plan ↓

https://www.brentozar.com/pastetheplan/?id=B1g1SihGr

I have read that when faced with huge amounts of data and/or when it returns more than 50% of the data set, the query optimizer will favor an index scan. But that table as a whole barely has 20,000 rows (19,820 to be exact), it's not a big table by any means.

When I run this query:

SELECT C.CustomerID
FROM Sales.Customer AS C
WHERE C.CustomerID > 30000

returns 118 rows

https://www.brentozar.com/pastetheplan/?id=Byyux32MS

I get an index seek instead, so I thought it was due to that "more than 50% case" however, I also ran this query:

SELECT C.CustomerID
FROM Sales.Customer AS C
WHERE C.CustomerID > 20000

returns 10,118 rows

https://www.brentozar.com/pastetheplan/?id=HJ9oV33zr

And it also used an index seek, even though it was returning more than 50% of the data set.

So what's going on here?

EDIT:

With IO Statistics turned on, the >100 query returns:

Table 'Customer'. Scan count 1, logical reads 37, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

While the >20,000 one returned:

Table 'Customer'. Scan count 1, logical reads 65, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So added WITH(FORCESCAN) option to the >20,000 one to see what would happen:

Table 'Customer'. Scan count 1, logical reads 37, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So it ends up running better with an Index Scan (less logical reads), even though the query optimizer chose to run an Index Seek for this particular query.

Best Answer

You use a nonequality predicate so your "seek" operations are in fact scans which just start from some value (not from "first") and then go to the end of the clustered index leaf level.

In the other hand you return only one column which is the clustered index key so using any of indexes won't get any key lookup operations. The optimizer has to estimate what would be cheaper: scanning a nonclustered index (two int columns on the leaf level) or part scanning of your clustered index (all columns on the leaf level).

It estimates it depending on current statistics (how many rows) and metadata (what is one row size). We saw the optimizer made a mistake on >20,000 predicate.

When faced with huge amounts of data and/or when it returns more than 50% of the data set, the query optimizer will favor an index scan.

That is a fact when the optimizer has to choose performing clustered index or table scan versus nonclustered index seek + key lookups.

In your case if your index on CustomerID were nonclustered you would always see a seek operation on that index, but if you then added another column to your output you would see the index seek + RID lookups on short resultsets and table scan on the big ones.