Sql-server – Updating a local table with a per-row count(*) which is an aggregate of inner joins on remote server

join;sql serversql-server-2005t-sqlupdate

I have a table of around 350m rows on a linked server in which I added an additional INT column to serve as a count(external_identification) of records as the result of a join on PACKAGE and DOC2. Since the tables are so large I'd like to process the update in batches, both so I can gauge progress and to avoid creating huge temp tables. Each column is indexed.

Would this be a good example where a CTE comes into play? Quite honestly they confuse me with the way they need to be written, it's hard to visualize…

The tables are structured as:

ServerA (utility SQL server)
Table: CLIP_IDs
Columns: Package_UUID nvarchar(255), MessageExtractState tinyint, [count] int (350m rows)

ServerB (main database server)
Table: PACKAGE
Columns: Package_UUID nvarchar(255), Package_id bigint (650m rows)
Table: DOC2
Columns: External_Identification nvarchar(255), Package_id bigint (2b rows)

Both SQL servers are linked both ways, if initiating the query from one is more efficient. I have a feeling issuing the query from ServerA will be, as it seems like the execution plan offers less remote queries.

I stopped the query below after 26 hours because I think I have a syntax logic error. Can someone explain what it is and offer any suggestions please?

Executed from ServerA:

DECLARE @rowsUpdated INT

SET @rowsUpdated = 1

WHILE (@rowsUpdated > 0)
BEGIN
    UPDATE CLIP_IDs
    SET [Count] = x.[count]
    FROM (
        SELECT TOP 50000 c.package_uuid
            ,count(d.external_identification) AS [count]
        FROM CLIP_IDs c
        INNER JOIN ServerB.DATABASE.dbo.package p(NOLOCK) ON c.package_uuid = p.package_uuid
        INNER JOIN ServerB.DATABASE.dbo.doc2 d(NOLOCK) ON p.package_id = d.package_id
        WHERE c.messageextractstate = 1
            AND c.[count] IS NULL
        GROUP BY c.package_uuid
        ) x

    SET @rowsUpdated = @@rowcount

    PRINT N'Finished set of rows: ' + convert(VARCHAR, getdate(), 120)
END

Best Answer

Depending on your permissions, the linked server could be trying to stream all the data over locally and then doing filtering. References

You might be able to skip that pain by computing the total aggregate count first into a table on the local server and then beat against that.

CREATE TABLE #LOCAL
(
    package_uuid nvarchar(255) NOT NULL PRIMARY KEY CLUSTERED
,   [count] bigint
);

INSERT INTO
    #LOCAL
SELECT 
    p.package_uuid
,   count(d.external_identification) AS [count]
FROM 
    ServerB.DATABASE.dbo.package p
    INNER JOIN 
        ServerB.DATABASE.dbo.doc2 d
        ON p.package_id = d.package_id
GROUP BY 
    p.package_uuid;

Try running that query locally on ServerB first to get an understanding of the theoretical throughput without factoring in your network. You can then do some quick and dirty estimates based on data sizes (500 + 8 per row in temporary table) and then it depends on your network. Hopefully this is all local network.

If the time is significantly different between the run on ServerB and pulling it back over, then you might need to use the OPENQUERY syntax to force the join on the remote server. Code approximately

CREATE TABLE #LOCAL
(
    package_uuid nvarchar(255) NOT NULL PRIMARY KEY CLUSTERED
,   [count] bigint
);

INSERT INTO
    #LOCAL
SELECT
    OQ.package_uuid
,   OQ.[count]
FROM
    OPENQUERY(ServerB,
    N'
    SELECT 
        p.package_uuid
    ,   count(d.external_identification) AS [count]
    FROM 
        DATABASE.dbo.package p
        INNER JOIN 
            DATABASE.dbo.doc2 d
            ON p.package_id = d.package_id
    GROUP BY 
        p.package_uuid
    ) AS OQ;