Sql-server – Estimated Rows Higher than Actual Despite Updating Statistics

cardinality-estimatesexecution-plansql serversql server 2014

I have a (anonymised) query plan

There is a table (Object8) which has a high variation on the number of Actual Rows vs Estimated Rows on two columns (Column2 and Column6) which can be seen in the hash match and sort operators

I have updated all the statistics on the Object8 table (they were only a few days out of date and the table doesn't change much)

What else could the problem be?

Best Answer

Cardinality estimates are just numbers. They don't matter unless you have a performance problem. You state in your question that there's a difference between the actual number of rows and estimated number of rows at some steps in the query plan. That's certainly true, but it's pretty uncommon to see non-trivial query plans for which estimates match the actual number of rows exactly. I wouldn't focus on a difference of a thousand rows unless you've identified that difference to be the root cause of a query performance issue.

Let's consider updating statistics: why might you see inaccurate cardinality estimation even after updating statistics? Statistics often don't fully describe the column's data. They could be sampled, 201 steps could be insufficient for some data sets, density isn't a good metric if there's skew in the data, and so on. Join cardinality estimation and combining filters on multiple columns is difficult even with perfect statistics objects because histograms are only available on one column. SQL Server often doesn't have information about the level of correlation between columns, so it needs to make a guess.

If you'd like to learn more I recommend taking a look at Optimizing Your Query Plans with the SQL Server 2014 Cardinality Estimator by Joe Sack. Here's a quote on one of the many differences between the legacy CE and the new CE:

Multiple Join Conditions

For joins with a conjunction of equality predicates, the legacy CE computes the selectivity of each equality predicate, assumes independence, and combines them. In contrast, the new CE estimates the join cardinality based on multi-column frequencies computed on the join columns.

The CEs use different models. One model may be a better fit than the other for your data or even for just a single join. In this case, it has very little to do with statistics. Updating statistics just isn't enough to result in good estimates for all cases.

On the other hand, if you have a specific performance that you'd like help with, you'll need to give more information. I don't see anything alarming in the query plan that you posted. What is the performance problem that you have with the query?