SQL Server – Azure SQL cross-database queries return too many rows with joins

azure-sql-database, sql-server

I have a problem with Azure SQL cross-database queries: it looks like when we join data across databases, the join predicate is not pushed down to the external database.

To try this out, I implemented a simple example based on this getting-started guide:
https://docs.microsoft.com/en-us/azure/sql-database/sql-database-elastic-query-getting-started-vertical.

As preparation I created 199999 CustomerInformation entries and 200 OrderInformation entries.

The following query works as expected; the CustomerInformation table returns just one row:

SELECT o.OrderId, c.CustomerID, c.CustomerName
  FROM OrderInformation o
  JOIN CustomerInformation c ON o.CustomerID = c.CustomerID
 WHERE c.CustomerID = 2

Expected query plan

But any query with a join that does not specify the CustomerID returns all rows from the customers database.

SELECT o.OrderId, c.CustomerID, c.CustomerName
  FROM OrderInformation o
  JOIN CustomerInformation c ON o.CustomerID = c.CustomerID
 WHERE o.OrderId = 1870

External database returns too many rows

As a result, the queries are really slow because the servers transfer too many rows. Am I missing a setting, or is it not possible to separate the data with the suggested queries?

Thanks for your help.

Best Answer

Recreating the issue

Test data below

SELECT o.OrderId, c.CustomerID, c.CustomerName
FROM OrderInformation o
JOIN CustomerInformation c ON o.CustomerID = c.CustomerID
WHERE o.OrderId = 155;

(screenshot: execution plan)

All records are returned by the remote query operator and are only filtered down to one row locally, when the inner join is applied.
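To confirm this yourself, you can capture the actual execution plan and look at the `RemoteQuery` attribute of the remote query operator; it shows the exact statement sent to the external database, including which predicates (if any) were pushed down. A minimal sketch, using the tables from the test data in this answer:

```sql
-- Capture the actual showplan XML for the cross-database query.
SET STATISTICS XML ON;

SELECT o.OrderId, c.CustomerID, c.CustomerName
FROM OrderInformation o
JOIN CustomerInformation c ON o.CustomerID = c.CustomerID
WHERE o.OrderId = 155;

SET STATISTICS XML OFF;

-- In the returned showplan XML, find the Remote Query operator and
-- inspect its RemoteQuery="..." text: if it contains no WHERE clause,
-- the whole remote table is being transferred before the join.
```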


Some more background on pushing down the predicate to a remote database:

Previously, elastic database query could not push parameterized operations to remote databases. As a result, sometimes large row sets had to be unnecessarily brought local to evaluate these operations. With the recent improvements, parameterized operations can now be pushed to remote databases and be evaluated remotely. For a query over an external table and a local table like the following, this can now avoid transferring millions of rows by evaluating the selective filter in the WHERE clause on the remote database:

DECLARE @low int
DECLARE @high int
SET @low = 100
SET @high = 200

SELECT c.CustomerId, c.Name, count(OrderId) 
FROM remote_customers c
JOIN local_orders o
ON c.CustomerId = o.CustomerId 
WHERE c.CustomerId > @low and c.CustomerId < @high
GROUP BY c.CustomerId, c.Name;

Source

The query above shows an example of using parameters to define the range of data to read. The difference from your problem is that there the filter is applied to a column of the remote table, whereas your filter (o.OrderId) is on the local table only, so there is no remote predicate to push down.


We know that a predicate on the join column can be pushed down to the remote table:

SELECT o.OrderId, c.CustomerID, c.CustomerName
FROM CustomerInformation c
JOIN OrderInformation o ON o.CustomerId = c.CustomerID
WHERE o.CustomerId > 1 AND o.CustomerId < 10;

(screenshot: execution plan with the predicate pushed down)

You could therefore look up a range of CustomerIDs and filter on that range: MAX() and MIN() on o.CustomerId give the upper and lower boundaries of the CustomerIds matching WHERE o.OrderId = 155.

DECLARE @MAX int, @MIN int;

SELECT @MAX = MAX(o.CustomerId), @MIN = MIN(o.CustomerId)
FROM OrderInformation o
WHERE o.OrderId = 155;

SELECT c.CustomerID, c.CustomerName
FROM CustomerInformation c
WHERE c.CustomerID >= @MIN AND c.CustomerID <= @MAX;

The complete query:

DECLARE @MAX int, @MIN int;

SELECT @MAX = MAX(o.CustomerId), @MIN = MIN(o.CustomerId)
FROM OrderInformation o
WHERE o.OrderId = 155;

WITH CTE AS (
    SELECT c.CustomerID, c.CustomerName
    FROM CustomerInformation c
    WHERE c.CustomerID >= @MIN
      AND c.CustomerID <= @MAX
)
SELECT o.OrderId, c.CustomerID, c.CustomerName
FROM CTE c
INNER JOIN OrderInformation o
    ON o.CustomerID = c.CustomerID
WHERE o.OrderId = 155;

Pushing down the predicates:

(screenshot: execution plan with both predicates pushed down)

Execution plan of this query

RemoteQuery="SELECT [c].[CustomerID] [Col1005],[c].[CustomerName] [Col1007] FROM [CustomerInformation] [c] WHERE [c].[CustomerID]>=? AND [c].[CustomerID]<=?"

How much this filters out on the remote table depends on your data and the parameters passed. YMMV.
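If rewriting the join is not an option, another approach worth mentioning (not shown in the screenshots above) is `sp_execute_remote`, which ships an entire parameterized statement to the remote database so the filter is guaranteed to run there. A sketch, assuming the `Stackoverflow` external data source from the test setup below:

```sql
-- Runs the whole statement on the remote database; only the matching
-- rows are returned over the wire. Parameter values are placeholders.
EXEC sp_execute_remote
    @data_source_name = N'Stackoverflow',
    @stmt = N'SELECT CustomerID, CustomerName
              FROM dbo.CustomerInformation
              WHERE CustomerID >= @min AND CustomerID <= @max',
    @params = N'@min int, @max int',
    @min = 100,
    @max = 200;
```

The trade-off is that the result set comes back as a plain rowset, so to join it with local tables you would first have to materialize it, e.g. with INSERT ... EXEC into a temp table.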


Test data

CustomerInformation (remote database)

CREATE TABLE dbo.CustomerInformation (
    CustomerID int IDENTITY(1,1),
    CustomerName nvarchar(255)
);

DECLARE @i int = 1;
WHILE @i < 20000
BEGIN
    INSERT INTO CustomerInformation (CustomerName)
    VALUES ('Frodo');

    SET @i += 1;
END

On the database that will use the external table

CREATE TABLE OrderInformation (
    OrderId int IDENTITY(1,1),
    CustomerId int,
    Ordervalue int
);

SET NOCOUNT ON;
DECLARE @i int = 1;
WHILE @i < 200
BEGIN
    INSERT INTO OrderInformation (CustomerId, Ordervalue)
    VALUES (@i, 111);

    SET @i += 1;
END

CREATE EXTERNAL TABLE [dbo].CustomerInformation (
    CustomerID int,
    CustomerName nvarchar(255)
)
WITH (
    DATA_SOURCE = Stackoverflow
);
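For completeness: the external table above assumes an external data source named `Stackoverflow` already exists. A minimal sketch of that setup for vertical elastic query, where the server name, database name, login, and passwords are placeholders and not from the original post:

```sql
-- Run once in the database that queries the external table.
-- All names and secrets below are placeholders.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

CREATE DATABASE SCOPED CREDENTIAL ElasticQueryCred
WITH IDENTITY = '<remote_sql_login>',
     SECRET = '<remote_sql_password>';

CREATE EXTERNAL DATA SOURCE Stackoverflow
WITH (
    TYPE = RDBMS,                                   -- vertical elastic query
    LOCATION = '<yourserver>.database.windows.net', -- remote logical server
    DATABASE_NAME = '<RemoteDb>',                   -- database holding CustomerInformation
    CREDENTIAL = ElasticQueryCred
);
```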