Sql-server – Getting Index Scan instead of a possible Index Seek

indexoptimizationsql server

Currently learning a few things on query optimization, and I've been trying out different queries and stumbled upon this "issue".

I am using the AdventureWorks2014 database, I ran this simple query:

table structure (taken from https://www.sqldatadictionary.com/AdventureWorks2014.pdf):

SELECT C.CustomerID
FROM Sales.Customer AS C
WHERE C.CustomerID > 100

returns 19,720 rows

total number of rows in Sales.Customer = 19,820

And after checking to make sure CustomerID is in fact not just the PK of the table but also has a clustered index on it (yet it uses a non clustered index), that indeed is the case:

EXEC SP_HELPINDEX 'Sales.Customer'

Here is the execution plan ↓

https://www.brentozar.com/pastetheplan/?id=B1g1SihGr

I have read that when faced with huge amounts of data and/or when it returns more than 50% of the data set, the query optimizer will favor an index scan. But that table as a whole barely has 20,000 rows (19,820 to be exact), it's not a big table by any means.

When I run this query:

SELECT C.CustomerID
FROM Sales.Customer AS C
WHERE C.CustomerID > 30000

returns 118 rows

https://www.brentozar.com/pastetheplan/?id=Byyux32MS

I get an index seek instead, so I thought it was due to that "more than 50% case" however, I also ran this query:

SELECT C.CustomerID
FROM Sales.Customer AS C
WHERE C.CustomerID > 20000

returns 10,118 rows

https://www.brentozar.com/pastetheplan/?id=HJ9oV33zr

And it also used an index seek, even though it was returning more than 50% of the data set.

So what's going on here?

EDIT:

With IO Statistics turned on, the >100 query returns:

Table 'Customer'. Scan count 1, logical reads 37, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

While the >20,000 one returned:

Table 'Customer'. Scan count 1, logical reads 65, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So added WITH(FORCESCAN) option to the >20,000 one to see what would happen:

Table 'Customer'. Scan count 1, logical reads 37, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

So it ends up running better with an Index Scan (less logical reads), even though the query optimizer chose to run an Index Seek for this particular query.

Best Answer

You use a nonequality predicate so your "seek" operations are in fact scans which just start from some value (not from "first") and then go to the end of the clustered index leaf level.

In the other hand you return only one column which is the clustered index key so using any of indexes won't get any key lookup operations. The optimizer has to estimate what would be cheaper: scanning a nonclustered index (two int columns on the leaf level) or part scanning of your clustered index (all columns on the leaf level).

It estimates it depending on current statistics (how many rows) and metadata (what is one row size). We saw the optimizer made a mistake on >20,000 predicate.

When faced with huge amounts of data and/or when it returns more than 50% of the data set, the query optimizer will favor an index scan.

That is a fact when the optimizer has to choose performing clustered index or table scan versus nonclustered index seek + key lookups.

In your case if your index on CustomerID were nonclustered you would always see a seek operation on that index, but if you then added another column to your output you would see the index seek + RID lookups on short resultsets and table scan on the big ones.

Related Solutions

Sql-server – New query slower even if it has less logical reads than the old one

The percentage costs on an execution plan are from the optimizer estimates, even when an actual execution plan produced. The actual execution plan does use the exact plan and include both the estimated rows and the actual row counts. Discrepancies between the row counts can be useful to determine how accurate the estimate was.

Somewhere between the subquery and comparing it to the column in derived, the optimizer wasn't able to correctly estimate how many rows would match. It guessed that there would be 18 rows from derived when there were actually over 220,000. An additional clue is the warning message Cardnality Estimate: CONVERT(nvarchar(35),[mssqlsystemresource].[sys].[spt_values].[name],0) on the SELECT node.

If you were to check the query run length with something else, such as STATISTICS TIME, I would expect them to be much closer, and likely the second query running faster.

Here's another plan analysis with a somewhat similar situation. (SQL Server Plan Explorer) (with hat tip to Kendra Little on how to fool the optimizer)

The estimate shows a 93%/7% cost split, but by looking at the actual CPU, time, or IO, the difference is not that extreme. IO is about 75%/25% and CPU is roughly 60%/40%. (I tried to come up with something more even, but wasn't able to.)

Sql-server – Procedure showing different duration / reads in Profiler vs. SSMS

The difference in logical reads leads me to believe that the two executions are getting different plans. This could be because:

there are different set settings between the app that is calling the procedure that profiler is picking up and your instance of SSMS.
you have a different default schema (you really should say exec dbo.SM_SP...).
see this post for more details

You can see if there are two or more different plans for the procedure using something like this, which will also identify if this is caused by either (or both) of the two most likely factors for different plan choice:

SELECT p.plan_handle, p.usecounts, p.size_in_bytes, 
  set_options = MAX(CASE a.attribute WHEN N'set_options' THEN a.value END),
  dflt_schema = MAX(CASE a.attribute WHEN N'user_id'     THEN a.value END)
FROM sys.dm_exec_cached_plans AS p
CROSS APPLY sys.dm_exec_sql_text(p.plan_handle) AS t
CROSS APPLY sys.dm_exec_plan_attributes(p.plan_handle) AS a
WHERE t.objectid = OBJECT_ID(N'dbo.SM_SP_MAILEVENTNAMES')
AND a.attribute IN (N'user_id', N'set_options')
GROUP BY p.plan_handle, p.usecounts, p.size_in_bytes;

Best Answer

Related Solutions

Sql-server – New query slower even if it has less logical reads than the old one

Sql-server – Procedure showing different duration / reads in Profiler vs. SSMS

Related Question