Sql-server – Nested Loop has very low estimates due to skewed data

cardinality-estimatesperformancequery-performancesql serversql-server-2016

On SQL Server 2016 SP2 we have a query that has a very low estimate on the nested loop operator. Due to the low estimate this query also spills to tempdb.

If I'm correct SQL Server 2014+ uses Coarse Histogram Estimation to calculate the estimated number of rows on a join.
But when I execute the query, SQL Server uses the density vector to calculate the number of estimated rows.
Is SQL Server only using the Coarse Histogram Estimation if there is no where clause?

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

Is there a way to improve the estimations on the nested loop?

Using following code you can reproduce the data:

create table MyTable
(
    id int identity,
    field varchar(50),
    constraint  pk_id primary  key clustered (id)
)
go

create table SkewedTable
(
    id int identity,
    startdate datetime,
    myTableId int,
    remark varchar(50),
    constraint  pk_id primary  key clustered (id)
)

set nocount on

insert into MyTable select top 1000 [name] from master..spt_values
go

insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go 1000

insert into SkewedTable select GETDATE(),FLOOR(RAND()*(1000))+1,REPLICATE(N'A',FLOOR(RAND()*(40))+1)
go 

CREATE NONCLUSTERED INDEX [ix_field] ON [dbo].[MyTable]([field] ASC)
go

CREATE NONCLUSTERED INDEX [ix_mytableid] ON [dbo].[SkewedTable]([myTableId] ASC)
go

--95=varchar in sys.messages
set nocount off

;with cte as
( 
    select GETDATE() as startdate ,95 as myTableId, REPLICATE(N'B',FLOOR(RAND()*(40))+1) as remark
    union all
    select * from cte
)
insert into skewedtable select top 40000 * from cte
option(maxrecursion 0)
go

update statistics mytable with fullscan
go

update statistics skewedtable with fullscan
go

Best Answer

Normally I would use filtered statistics to improve estimations when I have a table with skewed data. But in this case that doesn't seem to work.

You should find the following filtered statistic helpful:

CREATE STATISTICS [stats id (field=varchar)]
ON dbo.MyTable (id)
WHERE field = 'varchar'
WITH FULLSCAN;

This gives the optimizer information about the distribution of id values that match field = 'varchar', giving a much better selectivity estimate for the join:

Actual execution plan

The execution plan above shows exactly correct estimates with the filtered statistic, leading the optimizer to choose a hash join (for cost reasons).

This distribution information is much more important than the exact method used by the estimator to match the join histograms (fine or coarse alignment), or even the general assumptions (e.g. simple join, base containment).

If you can't do that, your options are broadly as outlined in answer to your previous question Sort spills to tempdb due to varchar(max). My preference would probably be an intermediate temporary table.