SQL Server – Multicolumn statistics on large table yielding poor estimations

execution-plan, sql-server, statistics, t-sql

I have a rather large table (~200 million rows). Although its statistics are up to date (updated WITH FULLSCAN), is it possible that the histogram (limited to 200 steps) is simply too broad for the optimizer to make good estimates – in other words, that it is no longer "selective" enough? With this particular customer's database/table, my query plan estimates are way off compared to other customers'.

The particular statistic that I am concerned with comes from the table's PK/CLUSTERED INDEX. It is a multi-column statistic containing an int (ParentId) and a smalldatetime (TimeStamp).

When I issue a DBCC SHOW_STATISTICS('SomeTable', 'PK_SomeTable'), I get the following output (histogram omitted – but I can post it if it will help):

Name                    Updated              Rows       Rows Sampled    Steps   Density         Average key length  String Index    Filter Expression   Unfiltered Rows
PK_SomeTable            Jan 31 2014 10:59AM  181170887  181170887       200     2.022617E-05    8                   NO              NULL                181170887

All density     Average Length  Columns
0.0004892368    4               ParentId
5.519651E-09    8               ParentId, TimeStamp
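For reference, each result set above can also be requested on its own, which makes it easier to post just the histogram later if needed (using the table/statistic names from above):

-- Histogram only (the part question 1 refers to)
DBCC SHOW_STATISTICS ('SomeTable', 'PK_SomeTable') WITH HISTOGRAM;

-- Header and density vector only
DBCC SHOW_STATISTICS ('SomeTable', 'PK_SomeTable') WITH STAT_HEADER;
DBCC SHOW_STATISTICS ('SomeTable', 'PK_SomeTable') WITH DENSITY_VECTOR;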

Most of my queries filter on the combination of both columns (ParentId and TimeStamp). The very small "All density" value for the pair reflects how selective that combination is – as expected, since it's the PK and therefore unique.
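As a rough sanity check on those numbers, multiplying "All density" by the row count gives the average number of rows the optimizer expects per distinct key:

-- Average rows per distinct key value, from the density vector above
SELECT 0.0004892368 * 181170887 AS RowsPerParentId,             -- ~88,600 rows per ParentId
       5.519651E-09 * 181170887 AS RowsPerParentIdTimeStamp;    -- ~1.0 row, i.e. unique (the PK)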

(1) The histogram appears to only show the ParentId column. Am I missing something here? Are both columns being accounted for?

(2) If I take 200,000,000 rows / 200 steps, each histogram step covers roughly 1,000,000 rows on average. That seems large enough to cause estimation problems, right? What about knock-on effects such as sort spills to tempdb?

(3) Would manually created statistics/filtered statistics be an avenue to explore? How does one go about deciding what type of filter to apply?
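For illustration only (the statistic names and filter values below are made up), a filtered statistic scoped to one busy ParentId, or to a recent date range, would look something like this:

-- Hypothetical examples – pick filters that match your common predicates
CREATE STATISTICS st_SomeTable_Parent42
    ON dbo.SomeTable (TimeStamp)
    WHERE ParentId = 42
    WITH FULLSCAN;

CREATE STATISTICS st_SomeTable_Recent
    ON dbo.SomeTable (ParentId, TimeStamp)
    WHERE TimeStamp >= '2014-01-01'
    WITH FULLSCAN;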

Best Answer

With a table that large I would consider partitioning. Unfortunately I can't answer your specific questions (1-3), but in general one of the benefits of partitioned views (as opposed to native partitioning) is that the individual tables within the partitioned view are treated as separate objects, each with its own statistics and its own 200-step histogram. Here is a post where Kimberly Tripp of SQLSkills recommends that for large tables you consider not only partitioning or partitioned views, but a combination of the two.

In case you aren't familiar with them: a partitioned view is one where multiple tables each hold a portion of the data, with a view on top that uses UNION ALL to combine them.
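A minimal sketch of the idea, with made-up yearly ranges on TimeStamp (the CHECK constraints are what let the optimizer skip member tables that can't match):

-- Member tables, one per year (names and ranges are illustrative)
CREATE TABLE dbo.SomeTable_2013 (
    ParentId  int           NOT NULL,
    TimeStamp smalldatetime NOT NULL
        CONSTRAINT CK_SomeTable_2013
        CHECK (TimeStamp >= '2013-01-01' AND TimeStamp < '2014-01-01'),
    CONSTRAINT PK_SomeTable_2013 PRIMARY KEY CLUSTERED (ParentId, TimeStamp)
);

CREATE TABLE dbo.SomeTable_2014 (
    ParentId  int           NOT NULL,
    TimeStamp smalldatetime NOT NULL
        CONSTRAINT CK_SomeTable_2014
        CHECK (TimeStamp >= '2014-01-01' AND TimeStamp < '2015-01-01'),
    CONSTRAINT PK_SomeTable_2014 PRIMARY KEY CLUSTERED (ParentId, TimeStamp)
);
GO

-- The view unions the members; each member keeps its own 200-step histogram
CREATE VIEW dbo.SomeTable_PV
AS
SELECT ParentId, TimeStamp FROM dbo.SomeTable_2013
UNION ALL
SELECT ParentId, TimeStamp FROM dbo.SomeTable_2014;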

Here is another of Kimberly's blogs on statistics in case you are interested. It should help you answer some of your other questions.

Here's an article by Conor Cunningham on statistics: Statistics, Damned Lies, and Statistics – What is Statman?