SQL Server 2012 – Hint Remote Scan Operator to Estimate More Than 10000 Rows

execution-planquery-performancesql serversql-server-2012

I need to read data from a linked server and insert into a local table. I need to remove duplicates in data and I need to do it on a local server, because remote server is overloaded. So, I added DISTINCT clause which does Distinct Sort as I want it to do.

The problem is that Remote Scan operator always estimates the number of rows as 10000, while the real number of rows is around 3M. So, the sorting spills to disk and becomes slow.

If there a way to hint to the optimizer that the real number of rows is much more than 10K?

Should I load raw data into a local staging table and then run DISTINCT off the local table? I didn't want to write to disk twice.

The number of rows that are duplicates is small – few hundred out of 3M.
I mean by this that before the duplicates are removed there are ~3,000,000 rows; after the duplicated are removed there are ~2,999,800 rows. So, removing the duplicates on the remote server would not noticeably reduce the amount of data that is transferred over the network.

The destination table is truncated before insert, so I'm always inserting into an empty table. Also, the destination table doesn't have any indexes, triggers or constraints. There are many columns in the table. About 110 columns. In the query below I wrote ManyManyColumns instead.

The query:

WITH
CTE_Raw
AS
(
SELECT
    [ManyManyColumns]
FROM OpenQuery([remote_server],'
SELECT
    [ManyManyColumns]
FROM
    [DB].[dbo].[remote_view]
')
)
,CTE_Converted
AS
(
    SELECT DISTINCT
        [ManyManyColumns]
    FROM
        CTE_Raw
)
INSERT INTO [dbo].[TestVBFast2]
    ([ManyManyColumns]
    )
SELECT
    [ManyManyColumns]
FROM
    CTE_Converted
;

execution plan

remote scan

SQL Server version:

Microsoft SQL Server 2012 (SP4) (KB4018073) - 11.0.7001.0 (X64) 
    Aug 15 2017 10:23:29 
    Copyright (c) Microsoft Corporation
    Standard Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: ) (Hypervisor)

Best Answer

I assume ManyManyColumns is really multiple columns and not one column?...I see your comment states it's 110 actually.

10,000 rows is the default cardinality estimation for a Remote Scan operation in your version of SQL Server, so I don't think you can do much to change that, unfortunately.

How slow is slow currently? Keep in mind even with perfectly accurate cardinality estimates, 3 million rows is always going to be a lot of data to pipe across the network / linked server, especially if you have many columns.

The only general ideas I have at the moment is to either pre-stage the DISTINCT data on your remote server, or use a data synchronization feature like replication to copy it over to your local server instead of using a linked server. If I think of anything else, I'll update my answer accordingly.