SQL Server – Alternatives to joins using Linked Server and OPENQUERY

sql-server, sql-server-2008

In my environment, we have a server that stores sales data and another server that stores our HR data. For some reports, we need to combine sales data with employee data. Our sales server contains various tables with millions of rows. The employee table on the HR server contains about 12,000 rows.

On the sales server, we have the HR server set up as a Linked Server. For employee sales reporting, we run the main queries on the sales server. Typically this involves joining multiple tables on the sales server plus a single join across to the employee table on the HR server.

As you probably suspect, we are having performance issues. I've tried using both four-part naming and OPENQUERY.
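To be concrete, the two forms look roughly like this (HRSERVER, HRDB, and the table and column names here are placeholders):

    -- Four-part naming: the optimizer decides how the remote table is accessed.
    SELECT s.OrderID, s.Amount, e.EmployeeName
    FROM dbo.Sales AS s
    JOIN HRSERVER.HRDB.dbo.Employee AS e
        ON e.EmployeeID = s.EmployeeID;

    -- OPENQUERY: the inner query runs entirely on the remote server,
    -- and only its result set comes back over the wire.
    SELECT s.OrderID, s.Amount, e.EmployeeName
    FROM dbo.Sales AS s
    JOIN OPENQUERY(HRSERVER,
        'SELECT EmployeeID, EmployeeName FROM HRDB.dbo.Employee') AS e
        ON e.EmployeeID = s.EmployeeID;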

I haven't seen a major difference in performance between the two. Both are bad. If the query optimizer picks a plan where the remote table is queried once, performance is acceptable, even if all 12,000 rows are returned. Unfortunately, I have seen execution plans where the remote table was hit thousands of times in a single query.

I've considered simply creating a local table and running a SQL Agent job to refresh it from the HR employee table every hour or so. Although not ideal, this would be acceptable because the HR employee table doesn't change all that often.
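As a sketch, the refresh step could be as simple as the following (table and column names are illustrative):

    -- Hypothetical hourly job step: rebuild the local copy of the HR employee table.
    TRUNCATE TABLE dbo.EmployeeLocal;

    INSERT INTO dbo.EmployeeLocal (EmployeeID, EmployeeName, Department)
    SELECT EmployeeID, EmployeeName, Department
    FROM OPENQUERY(HRSERVER,
        'SELECT EmployeeID, EmployeeName, Department FROM HRDB.dbo.Employee');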

Are there any other approaches I should be considering to solve my performance issue?

Best Answer

You can perhaps play with EXECUTE AT vs. OPENQUERY, but what I see as critical is:

If the query optimizer picks a plan where the remote table is queried once, performance is acceptable, even if all 12,000 rows are returned.

If that holds true now and will still hold after a few years of growth, first restrict the amount of data returned as much as possible: push every join and WHERE clause restriction you can into your OPENQUERY statement, and even RTRIM any CHAR columns that carry trailing spaces. Then, at the beginning of your stored procedure or query batch, create a #temp table for the results you need and fill it from the OPENQUERY results. Use that #temp table in your actual queries and joins!
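A minimal sketch of that pattern, with assumed column names and the RTRIM and WHERE pushdown included:

    -- Hit the linked server exactly once, with the remote side doing the filtering.
    SELECT EmployeeID, EmployeeName, Department
    INTO #Employee
    FROM OPENQUERY(HRSERVER,
        'SELECT EmployeeID, RTRIM(EmployeeName) AS EmployeeName, Department
         FROM HRDB.dbo.Employee
         WHERE Active = 1');

    -- All subsequent joins are purely local.
    SELECT s.OrderID, s.Amount, e.EmployeeName
    FROM dbo.Sales AS s
    JOIN #Employee AS e
        ON e.EmployeeID = s.EmployeeID;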

Use Profiler (the reads, writes, CPU, and duration columns) to check overall performance with the SQL:BatchCompleted and RPC:Completed events; in particular, see whether various ways of indexing the #temp table yield a net gain.
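For example, one indexing variation to measure against the un-indexed #temp table (the key column is assumed):

    -- A clustered index on the join key, created before the #temp table is used.
    CREATE CLUSTERED INDEX IX_Employee_EmployeeID
        ON #Employee (EmployeeID);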

Alternatively, instead of having an agent job refresh a permanent table on a blind schedule, consider whether there is a guaranteed way to tell that the remote server's table has changed. For example, a row count will find deletes, and pairing it with MAX(updatedate), or with MAX() of an INT IDENTITY column that always auto-populates, will also catch changes, inserts, and even inserts equal in quantity to the deletes. Is there such a check, or another one, available to you?
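A sketch of such a check, assuming the remote table has an updatedate column and an INT IDENTITY key:

    -- One remote round trip that summarizes the table's current state.
    DECLARE @RemoteRows INT, @RemoteMaxDate DATETIME, @RemoteMaxID INT;

    SELECT @RemoteRows    = RowCnt,
           @RemoteMaxDate = MaxUpdateDate,
           @RemoteMaxID   = MaxEmployeeID
    FROM OPENQUERY(HRSERVER,
        'SELECT COUNT(*)        AS RowCnt,
                MAX(updatedate) AS MaxUpdateDate,
                MAX(EmployeeID) AS MaxEmployeeID
         FROM HRDB.dbo.Employee');

    -- Compare these against the values captured at the last refresh;
    -- any difference means the local copy is stale.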

If so, then in your stored procedure or batch, instead of fetching all the data, just check whether it has changed. If it has changed, update your permanent local table; if not, use your permanent local table as-is. WARNING: be very careful not to deadlock yourself with this code if several people run it at nearly the same time.
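One way to serialize the refresh so concurrent callers can't trip over each other is an application lock; this is a sketch of my own, not something the technique above requires, and the resource name is arbitrary:

    BEGIN TRAN;

    -- Only one session at a time gets past this point for the refresh check.
    DECLARE @rc INT;
    EXEC @rc = sp_getapplock
        @Resource    = 'EmployeeLocalRefresh',
        @LockMode    = 'Exclusive',
        @LockOwner   = 'Transaction',
        @LockTimeout = 10000;  -- wait up to 10 seconds

    IF @rc >= 0  -- 0 or 1 means the lock was granted
    BEGIN
        -- ...compare remote state to the last-refresh values here,
        -- and rebuild the permanent local table only if it is stale...
    END

    COMMIT;  -- committing releases the transaction-owned applock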

Also alternatively, put a trigger on the remote table that simply sets a flag, a one-row table with a single BIT column (call it YourDataIsOutOfDate), to 1 on every DML statement against that remote table. Then use the same "check for updates" technique as above, except that you must lock the remote table, update your local table, reset YourDataIsOutOfDate on the remote flag table to 0, and only then unlock the remote table by finishing your transaction. Again, beware of deadlocking yourself, and this may run into distributed-transaction intricacies that become a real headache. I don't recommend it, but include it for completeness.
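For completeness, the remote-side pieces of that scheme might look like this (names are purely illustrative):

    -- On the HR server: a one-row flag table and a trigger that dirties it.
    CREATE TABLE dbo.EmployeeSyncFlag (YourDataIsOutOfDate BIT NOT NULL);
    INSERT INTO dbo.EmployeeSyncFlag VALUES (0);
    GO

    CREATE TRIGGER trg_Employee_MarkDirty
    ON dbo.Employee
    AFTER INSERT, UPDATE, DELETE
    AS
        UPDATE dbo.EmployeeSyncFlag
        SET YourDataIsOutOfDate = 1;
    GO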