SQL Server – query optimization for highly skewed data distributions

performance | query-performance | sql-server | sql-server-2008-r2 | statistics

I am trying to optimize a query which looks similar to this:

select top(1)
    t1.Table1ID,
    t1.Column1,
    t1....
    ....
    t2.Table2ID,
    t2....
    ....
    c.FirstName,
    c.LastName,
    c....
from BigTable1 t1
join BigTable2 t2
    on t1.Table1ID = t2.Table1ID
join Customer c
    on t2.CustomerID = c.CustomerID
join Table4 t4
    on t4.Table4ID = t2.Table4ID
join Table5 t5
    on t5.Table5ID = t1.Table5ID
join Table6 t6
    on t6.Table6ID = t5.Table6ID
where 
    t4.Column1 = @p1
    and t1.Column1 = @p2
    and c.FirstName = @FirstName
    and c.LastName = @LastName
    and t6.Column1 = @p5
    and (@p6 is null or t2.Column6 = @p6)
order by t2.Table2ID desc
option(recompile);

BigTable1, BigTable2, and Customer are big transactional tables (hundreds of millions of rows); Table4, Table5, and Table6 are relatively static, small lookup tables.
The problem is that the data in those big tables has a pretty skewed distribution, and because of that this query very often performs poorly (the estimated numbers of rows in the execution plan are very different from the actual ones). Updating statistics on those big tables does not help (the 200 steps in the histogram are not enough to cover all the skews in the data distribution). For example, in the Customer table there are a few (FirstName, LastName) combinations that correspond to about 500k records.
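
For illustration, the skew is easy to confirm by inspecting the statistics histogram of an index over the name columns (the index name here is just for illustration):

-- With only ~200 histogram steps, many frequent LastName values end up
-- folded into RANGE_ROWS / AVG_RANGE_ROWS rather than getting their own
-- EQ_ROWS step, which is where the bad estimates come from.
DBCC SHOW_STATISTICS ('dbo.Customer', IX_Customer_LastName_FirstName)
WITH HISTOGRAM;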

I see two options to improve the performance of such queries:

  1. Split this query into smaller ones and save the intermediate results
    into temporary tables. By materializing intermediate results into
    temporary tables we give the optimizer a better opportunity for
    cardinality estimation, but we also add a substantial additional load
    on tempdb (this query is executed rather frequently, up to several
    times a second). Another drawback is that it is not faster for all
    parameter values: it is significantly better for atypical parameter
    values (where the original query can take minutes, this takes just a
    few seconds), but for parameters that correspond to more or less
    unique (or infrequent) rows the approach with temp tables is slower.
    A sketch of this split is shown below, after the filtered-statistics
    example.
  2. Create filtered statistics for the spike values; for example, for
    the Customer table it would look like this:

    -- Find the 1000 most frequent (FirstName, LastName) combinations
    declare @sql nvarchar(max) = N'', @i int, @N int;

    select top (1000)
        identity(int, 1, 1) as id,
        FirstName,
        LastName,
        count(*) as cnt
    into #FL
    from Customer
    where
        FirstName is not null
        and LastName is not null
    group by
        FirstName,
        LastName
    order by cnt desc;

    set @N = @@ROWCOUNT;
    set @i = 1;

    -- Create one filtered statistics object per frequent combination;
    -- embedded single quotes in names are doubled to keep the dynamic SQL valid
    while @i <= @N
    begin
        select @sql = 'CREATE STATISTICS Customer_FN_LN_' + cast(id as varchar(10))
            + ' ON dbo.Customer(CustomerID)'
            + ' WHERE FirstName = ''' + replace(FirstName, '''', '''''')
            + ''' AND LastName = ''' + replace(LastName, '''', '''''')
            + ''' WITH FULLSCAN, NORECOMPUTE'
        from #FL
        where id = @i;

        exec sp_executesql @sql;
        set @i = @i + 1;
    end;
    

So if we create these filtered statistics for those three big tables and re-create them, let's say, every week, we should be fine with the row estimates in the original query.
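
For reference, the temp table split from option 1 looks roughly like this (only a sketch; the column lists are abbreviated and the choice of which filters to materialize first depends on the data):

-- 1. Materialize the customers matching the skewed predicate
select c.CustomerID, c.FirstName, c.LastName
into #MatchedCustomers
from Customer c
where c.FirstName = @FirstName
  and c.LastName = @LastName;

-- 2. Reduce BigTable2 to the rows for those customers
select t2.Table2ID, t2.Table1ID, t2.CustomerID
into #MatchedT2
from BigTable2 t2
join #MatchedCustomers mc
    on mc.CustomerID = t2.CustomerID
join Table4 t4
    on t4.Table4ID = t2.Table4ID
where t4.Column1 = @p1
  and (@p6 is null or t2.Column6 = @p6);

-- 3. The final query now joins much smaller intermediate sets, and the
--    optimizer has accurate cardinalities (and statistics) for the temp tables
select top (1)
    t1.Table1ID,
    t1.Column1,
    m.Table2ID,
    mc.FirstName,
    mc.LastName
from #MatchedT2 m
join #MatchedCustomers mc
    on mc.CustomerID = m.CustomerID
join BigTable1 t1
    on t1.Table1ID = m.Table1ID
join Table5 t5
    on t5.Table5ID = t1.Table5ID
join Table6 t6
    on t6.Table6ID = t5.Table6ID
where t1.Column1 = @p2
  and t6.Column1 = @p5
order by m.Table2ID desc
option (recompile);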

In my previous experience I have usually used the temp table approach, but in this case the approach with filtered statistics looks more attractive. I haven't used it in production yet, though, and I am curious what the drawbacks could be.

So my questions are:

  1. Are there any other approaches to help the optimizer deal with highly skewed data distributions?
  2. What are the drawbacks of manually created and maintained filtered statistics in the second approach?

Best Answer

  1. Are there any other approaches to help the optimizer deal with highly skewed data distributions?

Filtered statistics and breaking the query up using intermediate temporary tables (correctly!) are the main options, but you could also consider how an indexed view might be used to help. Properly implemented, an indexed view should have approximately the same impact as an extra nonclustered index on the base table(s).

Using WITH (NOEXPAND) instead of relying on automatic matching (Enterprise Edition only) will allow the optimizer to create and use statistics on the indexed view as well.
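
A minimal sketch of that idea, assuming Table2ID is the unique key of BigTable2 (the view and index names are made up):

-- Schema-bound view over the skewed join
CREATE VIEW dbo.CustomerOrders
WITH SCHEMABINDING
AS
SELECT
    t2.Table2ID,
    t2.CustomerID,
    c.FirstName,
    c.LastName
FROM dbo.BigTable2 AS t2
JOIN dbo.Customer AS c
    ON c.CustomerID = t2.CustomerID;
GO

-- Materialize the view (a unique clustered index is required first)
CREATE UNIQUE CLUSTERED INDEX CUQ_CustomerOrders
ON dbo.CustomerOrders (Table2ID);
GO

-- Support the name search directly
CREATE NONCLUSTERED INDEX IX_CustomerOrders_Name
ON dbo.CustomerOrders (LastName, FirstName);
GO

-- NOEXPAND lets the optimizer use (and create statistics on) the view's indexes
SELECT v.Table2ID, v.CustomerID
FROM dbo.CustomerOrders AS v WITH (NOEXPAND)
WHERE v.FirstName = @FirstName
  AND v.LastName = @LastName
ORDER BY v.Table2ID DESC;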

More generally, if you are able to identify the 'safe' or 'unsafe' values ahead of time, you could consider a hybrid approach (including dynamic SQL) as detailed in Building High Performance Stored Procedures by Kimberly L. Tripp.

You could also consider multiple separate procedures optimized for different cases, using the appropriate approach in each, including hints like OPTIMIZE FOR.
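
A rough sketch of that dispatch pattern (the lookup table and procedure names are hypothetical):

-- Route known heavy-hitter names to a procedure compiled for big row counts,
-- everything else to one compiled for the typical selective case
IF EXISTS (
    SELECT 1
    FROM dbo.HeavyHitterNames AS h
    WHERE h.FirstName = @FirstName
      AND h.LastName = @LastName
)
    EXEC dbo.SearchOrders_FrequentName
        @p1 = @p1, @p2 = @p2,
        @FirstName = @FirstName, @LastName = @LastName,
        @p5 = @p5, @p6 = @p6;
ELSE
    -- this procedure could add OPTION (OPTIMIZE FOR (@FirstName UNKNOWN, @LastName UNKNOWN))
    EXEC dbo.SearchOrders_Typical
        @p1 = @p1, @p2 = @p2,
        @FirstName = @FirstName, @LastName = @LastName,
        @p5 = @p5, @p6 = @p6;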

Finally, you have plan guides and/or forced plans via Query Store (where available).
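
Where Query Store is available (SQL Server 2016 and later), forcing a known-good plan is a one-liner once the query and plan have been captured (the IDs below are placeholders):

EXEC sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;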

  2. What are the drawbacks of manually created and maintained filtered statistics in the second approach?

Mainly issues around the filtered statistics not being updated as frequently as one might hope. You can work around this by refreshing the statistics manually.
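
A minimal sketch of such a refresh, assuming the Customer_FN_LN_<n> naming pattern from the question:

-- Rebuild every manually created filtered statistic on dbo.Customer
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'UPDATE STATISTICS dbo.Customer ' + QUOTENAME(name)
    + N' WITH FULLSCAN, NORECOMPUTE;' + NCHAR(10)
FROM sys.stats
WHERE object_id = OBJECT_ID(N'dbo.Customer')
  AND name LIKE N'Customer[_]FN[_]LN[_]%';

EXEC sys.sp_executesql @sql;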

You're already recompiling, so that should address the use of filtered statistics with parameterized queries.