Sql-server – Estimated vs. Actual Row counts in a trivial case

cardinality-estimatesoptimizationperformancequery-performancesql server

I'm working on optimizing some queries in a database I've inherited. I'm not permitted to disclose the query, but I can present an anonymized version of a query plan that is showing some very strange behavior (to me):

enter image description here
enter image description here

My question is about the 'Clustered Index Scan' denoted (1) in the images. As I understand it, the 'Clustered Index Scan' highlighted in yellow doesn't have any dependencies. The operation doesn't have any predicates (although I don't show it here) and it's sibling in the tree (2) is executed the same number of times as the number of rows (1) returns — I take this to mean that (1) is executed first, and (2) is executed row by row for the join (which makes sense, since the join is a 'Nested Loop').

Here's the issue: The 'Estimated Rows' for (1) is 46. The 'Actual Rows' is 283972, which is the number of rows in that table exactly…which makes sense since there's nothing to trim it down. I know the query is very complex (4 levels of nested views), but how in the world does SQL Server get this wrong? Why isn't the estimate for the number of rows that come back in an Index Scan with no predicate equal to the number of rows in the table?

I've updated all of the relevant table statistics with no success. I've also noticed that reducing the nesting level of the queries has the effect of the problem occurring less often, though the results are 'unstable'. There are a number of queries that reference this view, and if I partially reduce the nesting of this view, a few queries may be fixed, but it may break a few more.

Best Answer

It's important to realize that a clustered index scan doesn't always mean that SQL Server expects to read all of the rows from the table. For some queries SQL Server is able to stop early after it finds all of the rows that it needs. The low number of estimated rows that you're seeing in the plan is likely due to a row goal.

Let's walk through a simple example. Suppose that I create two heap tables without any shared values:

CREATE TABLE X_HEAP_1 (ID INT NOT NULL);

INSERT INTO X_HEAP_1 WITH (TABLOCK)
SELECT N FROM dbo.GetNums(1000000)
WHERE N % 10 <> 5;

CREATE TABLE X_HEAP_2 (ID INT NOT NULL);

INSERT INTO X_HEAP_2 WITH (TABLOCK)
SELECT N FROM dbo.GetNums(1000000)
WHERE N % 10 = 5;

I know that the following query will not return any rows:

SELECT TOP (10) H1.ID
FROM X_HEAP_1 h1
INNER JOIN X_HEAP_2 h2 ON h1.ID = h2.ID;

Because there are no indexes on the tables, SQL Server will be forced to read all of the rows from both tables. However, SQL Server does not know that the query will not return any rows. In fact, based on statistics and modeling assumptions it expects the query will return the full 10 rows. It also expects that it will not need to scan every row in both tables to get to 10 rows. Here is an estimated plan:

estimated plan

The query optimizer thinks that after it gets 89 rows from X_HEAP_1 that it will have a total of 10 matching rows from both tables. However, we already know that in the actual plan it will need to scan all 900000 rows from X_HEAP_1:

actual plan

That's a simple example of how a row goal can lead to an estimated number of rows which is 10000X off.