Sql-server – Remote queries and linked servers

linked-serversql server

I have 2 servers with SQL Server 2008 R2: INT & CRM.

I need to run an SSIS package on INT, which uses data from both an INT database and a CRM database. It's painfully slow.

I tried this to test a simple query (on INT):

EXEC sp_addlinkedserver @server='CRM', 
                        @srvproduct='', 
                        @provider='SQLNCLI', 
                        @datasrc='<IP of CRM>'

EXEC sp_addlinkedsrvlogin
                        @useself='FALSE',
                        @rmtsrvname='CRM',
                        @rmtuser='<username>',
                        @rmtpassword='<userpassword>'

I also gave <username> ddl_admin role on the CRM server, as recommended on this answer:
SQL performance issues with remote query across linked server

I then run this query:

UPDATE [CRM].DB.dbo.table1
SET field = 1
WHERE id = (SELECT id FROM [CRM].DB.dbo.table2 WHERE secondary_id = 9999)

This query takes 40 ms to run when on CRM itself. It takes 30 seconds to run in the above example. There are no results returned, it can't be network issues, so why does it take so long?

Best Answer

Several possible reasons for this.

Lack of data distribution statistics. It looks like you have this one covered in your post, but i'll leave it for completeness.

If the linked server login does not have sufficient privileges to obtain all useful statistics on CRM, a suboptimal plan will be used. *If the linked server is an instance of SQL Server, to obtain all available statistics, the user must own the table or be a member of the sysadmin fixed server role, the db_owner fixed database role, or the db_ddladmin fixed database role on the linked server.* http://msdn.microsoft.com/en-us/library/ms175537%28v=sql.105%29.aspx

Mike Walsh also has a writeup on this: http://www.straightpathsql.com/archives/2010/07/linked-server-query-running-slow/

Query is too complex to perform well across the linked servers.

  1. Replication: Replicate the minimum necessary columns from one server to the other. Use the local replicated data for joins.
  2. Use temp tables: Insert the minimum necessary columns from one server to a table on the other. Use the local data for joins.

For both of these scenarios, all data is on the same machine. This enables the optimizer to have complete data distribution statistics. You can even index columns for further optimization. Always identify the minimum columns required to complete the necessary tasks. Sometimes it's sufficient just to replicate or insert 1 or 2 columns.

OPENQUERY is also an option, but I prefer all of the data on the same machine when possible.

If the cross server joins are INNER JOINS, you might also benefit from the use of the REMOTE hint. I've seen this work very well when the remote table has significantly more records than the local table.