SQL Server – How to effectively handle very skewed data? Statistics up to date but don't seem to help

cardinality-estimates, performance, query-performance, sql-server, sql-server-2012, statistics

I have a table, dbo.ClaimBilling, with 130,000 rows. In this table, the column OperatorID is a varchar(max) and is heavily skewed: 125,000 rows are 'user1', and the remaining 5,000 rows are split among six other values, with 'user2' having only 3 rows.

There is a non-clustered index on OperatorID, and the clustered index is the primary key, IDClaimBilling.
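For reference, a minimal sketch of the schema as described (the column definitions are assumptions; note that SQL Server does not allow varchar(max) as an index key column, so the sketch narrows OperatorID to varchar(50) purely so the index can be created):

CREATE TABLE dbo.ClaimBilling
(
    -- Clustered primary key, as described in the question
    IDClaimBilling int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_ClaimBilling PRIMARY KEY CLUSTERED,
    -- Heavily skewed column; varchar(50) here because varchar(max)
    -- cannot be an index key column
    OperatorID varchar(50) NOT NULL
);

-- Non-clustered index on the skewed column
CREATE NONCLUSTERED INDEX IX_ClaimBilling_OperatorID
    ON dbo.ClaimBilling (OperatorID);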

I have the following query currently:

SELECT DISTINCT cb.IDClaimBilling
FROM dbo.ClaimBilling AS cb
INNER JOIN dbo.BillingItem AS bi
    ON cb.IDClaimBilling = bi.ClaimID
WHERE cb.OperatorID = @operator;

No matter what value @operator holds, the estimated number of rows from ClaimBilling is ~4,000, which is not close to what any value actually returns, and the plan is always a clustered index scan; it never uses the OperatorID index. If I remove the join and run

SELECT DISTINCT IDClaimBilling
FROM dbo.ClaimBilling
WHERE OperatorID = @operator

then it does use the OperatorID index, but the estimates are still wrong regardless of the value of @operator, this time always around 18,000 rows.

I did an UPDATE STATISTICS dbo.ClaimBilling WITH FULLSCAN before running the queries.
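To confirm what the optimizer can see, the statistics histogram can be inspected directly (the index name below is a placeholder; substitute the actual one):

UPDATE STATISTICS dbo.ClaimBilling WITH FULLSCAN;

-- Shows the per-value row counts available to the optimizer;
-- after a FULLSCAN these should match the real distribution exactly
DBCC SHOW_STATISTICS ('dbo.ClaimBilling', IX_ClaimBilling_OperatorID)
    WITH HISTOGRAM;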

Why are these estimates so wrong when the statistics know exactly how many rows there are per value?

In testing, I'm declaring @operator and assigning it a value directly. The query was originally part of a stored procedure, and I assumed that was the problem, but it behaves the same way in an ad hoc statement as well.
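A minimal repro of that test, as I understand it (the declaration is assumed):

-- Even in an ad hoc batch, @operator is a local variable, so its
-- value is unknown to the optimizer at compile time
DECLARE @operator varchar(max) = 'user2';

SELECT DISTINCT IDClaimBilling
FROM dbo.ClaimBilling
WHERE OperatorID = @operator;
-- Estimate: ~18,000 rows, no matter which value is assigned above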

The query only runs when the user first logs in, so only maybe a couple times a day per user.

Best Answer

Community Wiki answer generated from comments left on the question

If you are running the query with @operator as a local variable, SQL Server cannot 'sniff' the value in the variable, so it uses the statistics' average density value to calculate an estimate: 130,000 rows × (1/7 distinct values) ≈ 18,571, which matches the ~18,000 you are seeing. No matter what value you assign to the variable, this average-density estimate will always be the same.
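The density value itself is visible in the statistics (index name assumed, as above), and multiplying it by the table cardinality reproduces the fixed estimate:

-- The "All density" value for OperatorID is
-- 1 / (number of distinct values) = 1/7 ~= 0.1429
DBCC SHOW_STATISTICS ('dbo.ClaimBilling', IX_ClaimBilling_OperatorID)
    WITH DENSITY_VECTOR;

-- Estimate = density * cardinality = (1.0 / 7) * 130000 ~= 18,571 rows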

One way to resolve this is the OPTION (RECOMPILE) query hint. It compiles a fresh plan on each execution, optimized for the specific value in the variable at that time, at the cost (usually small) of a statement-level recompilation every time the query runs.
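Applied to the original query, it would look like this:

SELECT DISTINCT cb.IDClaimBilling
FROM dbo.ClaimBilling AS cb
INNER JOIN dbo.BillingItem AS bi
    ON cb.IDClaimBilling = bi.ClaimID
WHERE cb.OperatorID = @operator
-- Recompiling per execution lets the optimizer embed the runtime
-- value of @operator, so the histogram drives the estimate instead
-- of the average density
OPTION (RECOMPILE);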

You could also modularize the code: use an IF statement to check the value of @operator, and if it's 'user1' call one stored procedure, say sp_user1; if it is not 'user1', call a different procedure. The first procedure would be optimized for 'user1' and the second for the remaining values. You could also use OPTION (RECOMPILE) inside the second procedure for the non-'user1' values if required.
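A hypothetical dispatcher following that pattern (all procedure names here are placeholders):

CREATE PROCEDURE dbo.GetClaimBillingIDs
    @operator varchar(max)
AS
BEGIN
    IF @operator = 'user1'
        EXEC dbo.sp_user1;                  -- plan optimized for the dominant value
    ELSE
        EXEC dbo.sp_other_users @operator;  -- plan optimized for the small values,
                                            -- optionally with OPTION (RECOMPILE) inside
END;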

It might also be a decent use case for dynamic SQL. That would turn your @operator variable into a literal value in the statement text, giving each user a plan customized for their value. Since there are only 7 users in that table, the extra cached plans shouldn't pose an issue for you.
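A sketch of that approach, with the value's quotes doubled before it is embedded as a literal:

DECLARE @sql nvarchar(max) = N'
SELECT DISTINCT cb.IDClaimBilling
FROM dbo.ClaimBilling AS cb
INNER JOIN dbo.BillingItem AS bi
    ON cb.IDClaimBilling = bi.ClaimID
WHERE cb.OperatorID = ''' + REPLACE(@operator, '''', '''''') + N''';';

-- Each distinct literal compiles its own plan, customized to that value
EXEC sys.sp_executesql @sql;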
