I was able to reproduce your issue in SQL Server 2014 with the legacy CE and TF 4199 enabled. I used a rowstore fact table because I have little experience with columnstore.
For your query with the filter, the query optimizer rewrites the query to something like this:
select *
from fact_sales f
join dim_company d on f.company_SK = d.company_SK
where f.company_SK = 1 and d.company_SK = 1
For that query the optimizer can directly use the histogram step containing 1 for both tables, and you get 467,583,000 × 1 as the estimated number of rows.
The query without a filter will use linear interpolation:
For joins with a single equality or inequality predicate, the legacy CE joins the histograms on the join columns by aligning the two histograms step-by-step using linear interpolation.
Note that the statistics for your dimension table do not have a histogram step with a RANGE_HI_KEY value of 1, which is where almost all of the data is in your fact table. Something is going wrong with the linear interpolation step for that value. If I create your dimension table without a 0 value then the histogram gets a step with a RANGE_HI_KEY value of 1. That fixes the estimate. In short, you got unlucky with your histogram. I don't think there is a supported way to force the histogram on the dimension table to contain a step for 1, unless perhaps you are willing to add additional filters to your query.
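If you want to confirm that on your own system, you can look at the histogram directly. A minimal sketch, with hypothetical object names (swap in your real table and statistics names):

-- list the statistics objects on the dimension table
exec sp_helpstats 'dbo.DIM_COMPANY', 'ALL';

-- show just the histogram; check whether any step has a RANGE_HI_KEY of 1
dbcc show_statistics ('dbo.DIM_COMPANY', PK_DIM_COMPANY) with histogram;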
I don't know the exact rules of your data, but I can offer two workarounds. The first is to enable trace flag 2301 for the query. That fixes the estimate for this part of the query, but it may have negative effects on the rest of your query. The trace flag is documented by Microsoft, but talk to your DBA before using it.
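If you want to scope the trace flag to just this statement instead of turning it on server-wide, you can attach it with a QUERYTRACEON hint. A sketch against the same tables (keep in mind that QUERYTRACEON normally requires sysadmin membership, which is another reason to involve your DBA):

select *
from fact_sales f
join dim_company d on f.company_SK = d.company_SK
option (querytraceon 2301);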
The second workaround is to encourage the query optimizer to use a different method of cardinality estimation for the join. The following query gave a good estimate for me and only does a single scan on your fact table:
select *
from FACT_SALES f
join (select COMPANY_SK from DIM_COMPANY union select 1) d
    on f.company_SK = d.company_SK;
Based on your description of the problem, it is unlikely that the UDF is the cause of the cardinality estimation errors. If you want to test that, try commenting out the UDF and checking whether the join estimate changes.
It is possible to experience cardinality estimation errors even when statistics have been updated with FULLSCAN. The histograms created by SQL Server may not model your data well enough for the optimizer to create good plans, SQL Server may not have histograms that exactly match the joined data (you could be filtering on the inner and/or outer table), you may be joining on multiple columns, which complicates estimation, and so on. For example, Microsoft changed some of the join cardinality estimate calculations with the new CE released in SQL Server 2014. Some tables have data that better matches the assumptions of the legacy CE and some have data that better matches the assumptions of the new CE.
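If you want to see which estimator handles a particular join better, on SQL Server 2014 you can compare the two per query with trace flags 9481 (force the legacy CE) and 2312 (force the new CE). A rough sketch with placeholder table names:

-- estimate under the legacy cardinality estimator
select *
from FACT_SALES f
join DIM_COMPANY d on f.company_SK = d.company_SK
option (querytraceon 9481);

-- estimate under the new (2014) cardinality estimator
select *
from FACT_SALES f
join DIM_COMPANY d on f.company_SK = d.company_SK
option (querytraceon 2312);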
For the first estimation issue (38,999 estimated rows versus 701,680 actual rows), based on what you've provided I can only recommend reading about join cardinality calculations. Microsoft released a white paper on the new cardinality estimator in SQL Server 2014, and I also know of a blog post that dives fairly deep into the internals of join cardinality. As the question is written, there are too many possible causes to give a good answer. Please post an actual execution plan if you'd like further guidance.
For the second estimation issue (1 estimated row versus 1,546,894 actual rows), note that the estimated number of rows is per execution of the inner side of the nested loop join, while the actual number of rows is the total returned over all executions, so the two will rarely match for nested loop joins. For example, if the outer input drives roughly 1.5 million executions and each returns about one row, an estimate of 1 is consistent with an actual total of 1,546,894. What you are seeing is common and not necessarily a sign of a problem.
Best Answer
You could create additional statistics for the data. Include a WHERE clause so each new statistics object references only a subset of the table. By bucketing the rows appropriately (big, medium, small?) the filtered statistics will be less bad than the whole-table ones, leading to better plans.
The predicate must match that used in queries. Depending on your settings there may be more system activity for stats maintenance. Of course this assumes the table contains (or could change to contain) some combination of columns on which rows can be meaningfully aggregated.
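A sketch of what that could look like, with hypothetical table, column, and bucket choices (adjust to whatever grouping makes sense for your data):

-- filtered statistics for the handful of huge companies
create statistics stats_fact_sales_big on dbo.FACT_SALES (company_SK)
where company_SK = 1
with fullscan;

-- filtered statistics for everything else
create statistics stats_fact_sales_small on dbo.FACT_SALES (company_SK)
where company_SK > 1
with fullscan;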