Sql-server – SQL server 2012 inner join estimated number of rows issue

cardinality-estimatesexecution-plansql serversql-server-2012

I have a simple join like the one below:

select * 
from fact_sales f
join dim_company d on f.company_SK = d.company_SK

The fact table contains a little over 500 million records (with NC columnstore), the query will return all of them, however, the estimated number of rows after the join is only 300 million. Up till the hash join, the estimated number of rows are correct, only after the join it drops to 300 million. Here's the estimated plan for the query:

I've updated the stats on the SK column used in the join in both the fact table (with fullscan) and the dimension table, here's the histogram for each:

This issue only seems to occur for a couple dimension tables in the database, joining the other dimension tables doesn't produce the same type of cardinality estimate issue – any suggestions on how to possibly fix this or further investigate?

If I add a where clause to the query, it correctly estimates the number of rows before/after the join, e.g.

select * 
from fact_sales f
join dim_company d on f.company_SK = d.company_SK where company_SK = 1

will estimate 467,583,000 rows coming out of the join, which matches what's in the histogram.

The issue seems to only occur when I don't have any filter in the query. It's causing a problem in a bigger query (sort spill). I've narrowed it down to this particular join.

I do have an FK constraint, but they have been turned off (WITH NOCHECK) on the fact table (we were told to turn them off so ETL can go faster). Unfortunately, turning the FK back on is not an option 🙁

Update: enabling trace flag 2301 solved the issue :p

Best Answer

I was able to reproduce your issue in SQL Server 2014 with the legacy CE and TF 4199 enabled. I used a rowstore fact table because I have little experience with columnstore.

For your query with the filter, the query optimizer rewrites the query to something like this:

select * 
from fact_sales f
join dim_company d on f.company_SK = d.company_SK 
where f.company_SK = 1 and d.company_SK = 1

For that query the optimizer can directly uses the histogram step containing 1 for both tables and you get 467,583,000 X 1 as the estimated number of rows.

The query without a filter will use linear interpolation:

For joins with a single equality or inequality predicate, the legacy CE joins the histograms on the join columns by aligning the two histograms step-by-step using linear interpolation.

Note that the statistics for your dimension table do not have a histogram step with a RANGE_HI_KEY value of 1, which is where almost all of the data is in your fact table. Something is going wrong with the linear interpolation step for that value. If I create your dimension table without a 0 value then the histogram gets a step with a RANGE_HI_KEY value of 1. That fixes the estimate. In short, you got unlucky with your histogram. I don't think there is a supported way to force the histogram on the dimension table to contain a step for 1, unless perhaps you are willing to add additional filters to your query.

I don't know the exact rules of your data, but I can give you two workarounds. The first workaround is to enable trace flag 2301 for the query. That fixes the estimate for this part of the query but may have other negative effects for the rest of your query. This trace flag is documented by MS, but talk to your DBA before using it.

The second workaround is to encourage the query optimizer to use a different method of cardinality estimation for the join. The following query gave a good estimate for me and only does a single scan on your fact table:

select * 
from FACT_SALES f
join (SELECT COMPANY_SK FROM DIM_COMPANY UNION SELECT 1) d
on f.company_SK = d.company_SK;

Related Solutions

Sql-server – How does SQL Server’s optimizer estimate the number of rows in a joined table

I think my understanding of the second estimate is incorrect and the differing numbers seems to indicate that. What am I missing?

Using the SQL Server 2012 cardinality estimator, the selectivity of the join drives the estimated number of rows on the inner side of the nested loops join, and not the other way around.

The 11.4867 number is derived (for display in showplan) by dividing the computed estimated cardinality of the join output (30.0919) by the number of iterations (2.61972). The result, using single-precision floating-point arithmetic, is 11.4867.

It really is as simple as that. Note that the (logical) join selectivity is independent of the choice of physical join operator. It remains the same whether the join is ultimately performed using a Nested Loops, Hash, or Merge Join physical operator.

In SQL Server 2012 and earlier, the join selectivity (as a whole) is estimated using the SalesOrderID histograms from each table (computed for each histogram step, after step boundary alignment using linear interpolation as necessary). The SalesOrderID histogram associated with the SalesOrderHeader table is also adjusted for the scaling effect of the independent CustomerID filter.

That is not to say there is anything fundamentally 'wrong' with the alternate calculation proposed in the question; it just makes a different set of assumptions. There will always be different ways to compute or combine estimates for a given sequence of logical operations. There is no general guarantee that different statistical methods applied to the same data will produce the same answers, or that one method would always be superior to the other. Inconsistencies resulting from the application of different statistical methods can even appear within a single final execution plan, though they are rarely noticed.

As a side-note, the SQL Server 2014 cardinality estimator takes a different approach to combining the independent-filter-adjusted histogram information ("coarse alignment"), which results in a different final estimate of 10.1006 rows for this query:

Plan for computation:

  CSelCalcExpressionComparedToExpression
  (QCOL: [s].SalesOrderID x_cmpEq QCOL: [d].SalesOrderID)

Loaded histogram for column QCOL: [s].SalesOrderID from stats with id 1
Loaded histogram for column QCOL: [d].SalesOrderID from stats with id 1

Stats collection generated: 

  CStCollJoin(ID=4, **CARD=10.1006** x_jtInner)
      CStCollFilter(ID=3, CARD=2.61972)
          CStCollBaseTable(ID=1, CARD=31465 TBL: Sales.SalesOrderHeader AS TBL: s)
      CStCollBaseTable(ID=2, CARD=121317 TBL: Sales.SalesOrderDetail AS TBL: d)

This happens to be the same result as the calculation in the question, though the detailed reasoning is different (i.e. it is not based on an assumed nested loops implementation).

Sql-server – Actual Number of rows is too high even with new tables

The number of records in TransmittedManifests table is much less than the number of records in LWTest. In such scenario, good solution is to use NOT EXISTS approach (as in Query 2) to reduce the number of actual rows. Refer Joins without JOIN - Rob Farley

Now, Martin Smith’s comment helped me demystifying the count per execution (estimate) and total count (actual). For Query 1, the ActualExecutions="904" for LWTest table. So 128385/904 = 142 is the actual rows per execution which is somewhat close to the estimate 104.

In SQL Server 2005, the ActualExecutions can be seen by making a profiler trace adding performance events as shown in the diagram below. [In SQL Server 2012, I could see this information captured in execution plan diagram itself].

XML

SQL Server 2005 Trace Settings

Best Answer

Related Solutions

Sql-server – How does SQL Server’s optimizer estimate the number of rows in a joined table

Sql-server – Actual Number of rows is too high even with new tables

Related Question