SQL Server 2012 – Hint Remote Scan Operator to Estimate More Than 10000 Rows

execution-plan, query-performance, sql-server, sql-server-2012

I need to read data from a linked server and insert it into a local table. I need to remove duplicates from the data, and I need to do it on the local server because the remote server is overloaded. So I added a DISTINCT clause, which produces a Distinct Sort, as intended.

The problem is that the Remote Scan operator always estimates the number of rows as 10,000, while the real number of rows is around 3M. As a result, the sort spills to disk and becomes slow.

Is there a way to hint to the optimizer that the real number of rows is much more than 10K?

Should I load raw data into a local staging table and then run DISTINCT off the local table? I didn't want to write to disk twice.

The number of duplicate rows is small: a few hundred out of 3M. That is, before the duplicates are removed there are ~3,000,000 rows; after the duplicates are removed there are ~2,999,800 rows. So removing the duplicates on the remote server would not noticeably reduce the amount of data that is transferred over the network.

The destination table is truncated before the insert, so I'm always inserting into an empty table. Also, the destination table doesn't have any indexes, triggers or constraints. The table has about 110 columns; in the query below I wrote ManyManyColumns instead.

The query:

WITH
CTE_Raw
AS
(
SELECT
    [ManyManyColumns]
FROM OpenQuery([remote_server],'
SELECT
    [ManyManyColumns]
FROM
    [DB].[dbo].[remote_view]
')
)
,CTE_Converted
AS
(
    SELECT DISTINCT
        [ManyManyColumns]
    FROM
        CTE_Raw
)
INSERT INTO [dbo].[TestVBFast2]
    ([ManyManyColumns]
    )
SELECT
    [ManyManyColumns]
FROM
    CTE_Converted
;

SQL Server version:

Microsoft SQL Server 2012 (SP4) (KB4018073) - 11.0.7001.0 (X64) 
    Aug 15 2017 10:23:29 
    Copyright (c) Microsoft Corporation
    Standard Edition (64-bit) on Windows NT 6.3 <X64> (Build 9600: ) (Hypervisor)

Best Answer

I assume ManyManyColumns is really multiple columns and not one column?...I see your comment states it's 110 actually.

10,000 rows is the default cardinality estimation for a Remote Scan operation in your version of SQL Server, so I don't think you can do much to change that, unfortunately.
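The estimate itself may be fixed, but the spill comes from the memory grant that is sized from it. One thing that might be worth testing is the MIN_GRANT_PERCENT query hint, which as far as I know became available in SQL Server 2012 with SP3 (the build in the question is SP4). The sketch below is the question's own statement with the hint appended; the value 25 is an arbitrary placeholder, not a recommendation, and [ManyManyColumns] is still the question's stand-in for the real column list.

```sql
-- Sketch only: asks for a larger minimum memory grant so the Distinct Sort
-- is less likely to spill. It does NOT change the 10,000-row estimate.
WITH CTE_Raw AS
(
    SELECT [ManyManyColumns]
    FROM OpenQuery([remote_server], '
        SELECT [ManyManyColumns]
        FROM [DB].[dbo].[remote_view]
    ')
)
INSERT INTO [dbo].[TestVBFast2]
    ([ManyManyColumns])
SELECT DISTINCT
    [ManyManyColumns]
FROM CTE_Raw
OPTION (MIN_GRANT_PERCENT = 25);   -- placeholder value, tune by testing
```

Whether this removes the spill depends on how much memory the instance can actually grant, so compare the actual execution plans with and without the hint.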

How slow is slow currently? Keep in mind even with perfectly accurate cardinality estimates, 3 million rows is always going to be a lot of data to pipe across the network / linked server, especially if you have many columns.

The only general ideas I have at the moment are to either pre-stage the DISTINCT data on your remote server, or use a data synchronization feature like replication to copy it over to your local server instead of using a linked server. If I think of anything else, I'll update my answer accordingly.
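For comparison, the local staging approach raised in the question could look roughly like the sketch below. The staging table dbo.Staging_Raw is a hypothetical name, assumed to be a heap with the same columns as the destination, and [ManyManyColumns] remains a placeholder for the real column list. The idea is that the first insert needs no sort, so the 10,000-row guess does little harm, and the second statement reads a local table whose row count the optimizer knows.

```sql
-- Step 1: land the raw rows locally. No DISTINCT here, so no sort is needed.
TRUNCATE TABLE dbo.Staging_Raw;            -- hypothetical staging heap

INSERT INTO dbo.Staging_Raw WITH (TABLOCK) -- TABLOCK can allow minimal logging
    ([ManyManyColumns])                    -- on a heap, depending on recovery model
SELECT [ManyManyColumns]
FROM OpenQuery([remote_server], '
    SELECT [ManyManyColumns]
    FROM [DB].[dbo].[remote_view]
');

-- Step 2: de-duplicate from the local table, where the row count is known.
INSERT INTO [dbo].[TestVBFast2] WITH (TABLOCK)
    ([ManyManyColumns])
SELECT DISTINCT
    [ManyManyColumns]
FROM dbo.Staging_Raw;
```

The cost is the one the question already names: the data is written to disk twice. Whether that is cheaper than a spilling sort is something only a test on the real data can show.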

Related Solutions

SQL Server – Efficiently Transfer Large Amounts of Data (84 Million Rows)

I would add that, however you decide to approach this, you'll need to batch these transactions. I've had very good luck with the linked article lately, and I appreciate the way it takes advantage of indexes as opposed to most batched solutions I see.

Even minimally logged, those are big transactions, and you could spend a lot of time dealing with the ramifications of abnormal log growth (VLFs, truncating, right-sizing, etc.).
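As a rough illustration of batching that leans on an index, here is a minimal sketch. It is not taken from the linked article; the names dbo.SourceTable, dbo.TargetTable, Id and Payload are hypothetical, and it assumes Id is an indexed, positive integer key in the source.

```sql
DECLARE @BatchSize int    = 100000;  -- arbitrary batch size, tune by testing
DECLARE @LastId    bigint = 0;       -- assumes all Id values are greater than 0
DECLARE @NextId    bigint;

WHILE 1 = 1
BEGIN
    -- Find the upper key of the next batch with an ordered seek on the index.
    SELECT @NextId = MAX(b.Id)
    FROM (
        SELECT TOP (@BatchSize) s.Id
        FROM dbo.SourceTable AS s
        WHERE s.Id > @LastId
        ORDER BY s.Id
    ) AS b;

    IF @NextId IS NULL BREAK;        -- no rows left to copy

    -- Each INSERT autocommits, so every batch is its own transaction.
    INSERT INTO dbo.TargetTable (Id, Payload)
    SELECT s.Id, s.Payload
    FROM dbo.SourceTable AS s
    WHERE s.Id > @LastId
      AND s.Id <= @NextId;

    SET @LastId = @NextId;
END;
```

Because each batch commits separately, the log can be reused or backed up between batches instead of growing to hold one very large transaction.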

Thanks

SQL Server Performance – Full Outer Join vs Union for Finding Distinct Rows

The semantics of the two queries are not the same - UNION removes duplicates, whereas the FULL OUTER JOIN will not:

DECLARE @T1 AS table (id bigint NULL, val integer NULL);
DECLARE @T2 AS table (id bigint NULL, val integer NULL);

INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T1 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);
INSERT @T2 (id, val) VALUES (1, 1);

-- Variable names match the DECLAREs above so this also runs on case-sensitive collations
SELECT COALESCE(t1.id, t2.id) AS id, COALESCE(t1.val, t2.val) AS val
FROM @T1 t1
FULL OUTER JOIN @T2 t2
    ON t2.id = t1.id
    AND t2.val = t1.val;

SELECT t1.id, t1.val
FROM @T1 t1
UNION 
SELECT t2.id, t2.val
FROM @T2 t2;

Output:

╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║  1 ║   1 ║
║  1 ║   1 ║
║  1 ║   1 ║
║  1 ║   1 ║
╚════╩═════╝

╔════╦═════╗
║ id ║ val ║
╠════╬═════╣
║  1 ║   1 ║
╚════╩═════╝

That said, the optimizer does not know many FOJN tricks, so it is always possible that there is a better way to express the query than the natural UNION. Only commonly-useful and always-correct transformations are implemented.

Note that with a unique constraint only on the larger table, the optimizer chooses a Hash Union. This avoids the expensive duplicate removal on the probe input that makes it choose Concat Union All in the question example:

ALTER TABLE #t2 
ADD CONSTRAINT UQ2 
UNIQUE CLUSTERED (id);

SELECT COUNT(*), AVG(x.id), AVG(x.val)
FROM (
    SELECT t1.id, t1.val
    FROM #t1 t1
    UNION
    SELECT t2.id, t2.val
    FROM #t2 t2
) AS x;

The FOJN rewrite may well be a useful one in cases where you know there cannot be duplicates within each input set, but this condition is not enforced with a unique constraint or index (particularly on the large input).
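To make that condition concrete, here is a small sketch with hypothetical table variables @A and @B, each free of duplicates and free of NULLs. Under those assumptions the two forms return the same set of rows.

```sql
DECLARE @A AS table (id bigint NOT NULL, val integer NOT NULL);
DECLARE @B AS table (id bigint NOT NULL, val integer NOT NULL);

INSERT @A (id, val) VALUES (1, 1), (2, 2);
INSERT @B (id, val) VALUES (2, 2), (3, 3);

-- FULL OUTER JOIN form: the matched pair (2, 2) collapses into one row.
SELECT COALESCE(a.id, b.id) AS id, COALESCE(a.val, b.val) AS val
FROM @A AS a
FULL OUTER JOIN @B AS b
    ON b.id = a.id
    AND b.val = a.val;

-- UNION form: returns the same three rows, (1, 1), (2, 2) and (3, 3).
SELECT a.id, a.val FROM @A AS a
UNION
SELECT b.id, b.val FROM @B AS b;
```

Neither query guarantees row order. Also note that the equality join does not match NULL to NULL, whereas UNION treats NULLs as equal, so with nullable columns the two forms can differ even when each input has no duplicates.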

If such a uniqueness guarantee does exist, and yet the optimizer does not select a Hash Union, you might try an OPTION (HASH UNION) hint, to see how it compares.
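A sketch of what that hint looks like in practice, using hypothetical temporary tables #u1 and #u2 (not the tables from the question) so that it can be run on its own:

```sql
CREATE TABLE #u1 (id bigint NOT NULL, val integer NOT NULL);
CREATE TABLE #u2 (id bigint NOT NULL UNIQUE CLUSTERED, val integer NOT NULL);

INSERT #u1 (id, val) VALUES (1, 1), (2, 2);
INSERT #u2 (id, val) VALUES (2, 2), (3, 3);

-- The query hint applies to the whole statement and goes at the very end.
SELECT u1.id, u1.val
FROM #u1 AS u1
UNION
SELECT u2.id, u2.val
FROM #u2 AS u2
OPTION (HASH UNION);

DROP TABLE #u1, #u2;
```

With tables this small the plan choice hardly matters for speed; the point is only the syntax. On real data, compare the actual plans and timings with and without the hint, since a hint limits the optimizer's choices rather than guaranteeing a faster plan.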
