Sql-server – How to force a merge join without a hint

execution-planjoin;optimizationsql servert-sql

Some time ago I used query hints but then realized that most of the time SQL Server is smarter than I. And it's better to build an additional index/reorganize data or the query and get a much better result than forcing the server to use a plan that is inefficient in general but speedy enough for some subset of data. But now I am in a situation where I don't know how to organize data better.

I have two tables. First table, T1, is (Id, CustomerId) and second, T2, has the same columns. I want to join T1 to T2 on CustomerId. And get top N rows. In this case the optimizer sees that I want only N top and says, "Hey, I will use loops and find N matches very quick, especially when I use an index seek." But it's not working, because there is no data satisfying the conditions. Thus it uses a loop join to join a 25m table and a 100k table which is pretty slow.

enter image description here

When I force SQL Server to use a merge join, I get the following plan, which is executed in one second:

enter image description here

I don't want to force it for two reasons:

  • First, SQL Server is smart enough, as I said earlier.
  • Second, I'm using an ORM and it's quite hard to inject hints into
    the generated query, so I'd like to avoid it.

What should I do in this situation?

Best Answer

To switch from a nested loop join to a merge join without a hint you need the plan with the merge join to have a lower estimated cost than the plan with the nested loop. Technically, you also need the query optimizer to find the lower cost plan during plan exploration but there isn't a lot that you can do about that.

First, a note about merge join costing. In my experience, SQL Server is pretty pessimistic about merge join costing. This seems understandable when you consider how much the IO requirements for a merge join can change when small amounts of data are changed. Consider a simple example in which one table has the integers from 1 - 10000 and the other has the integers from 100001 to 1100000:

DROP TABLE IF EXISTS X_SMALL_TABLE;

CREATE TABLE X_SMALL_TABLE (ID INT NOT NULL PRIMARY KEY (ID));

INSERT INTO X_SMALL_TABLE WITH (TABLOCK) (ID)
SELECT N
FROM dbo.GetNums(10000);

UPDATE STATISTICS X_SMALL_TABLE WITH FULLSCAN;

DROP TABLE IF EXISTS X_LARGE_TABLE;

CREATE TABLE X_LARGE_TABLE (ID INT NOT NULL PRIMARY KEY (ID));

INSERT INTO X_LARGE_TABLE WITH (TABLOCK) (ID)
SELECT N + 100000
FROM dbo.GetNums(1000000);

UPDATE STATISTICS X_LARGE_TABLE WITH FULLSCAN;

If I join the tables together I will get 0 rows returned. If I force a MERGE JOIN then SQL Server will scan all of the rows from the small table and only one of the rows from the large table. By default, SQL Server will traverse through the tables in ascending order according to the join key. However, the query optimizer estimates that all 1000000 rows will be scanned from the large table and subsequently assigns a relatively large cost to that operator:

plan 1

As we can see from the actual plan, only one row was scanned. What happens if we insert just one new row into the small table?

INSERT INTO X_SMALL_TABLE
SELECT 1200000;

UPDATE STATISTICS X_SMALL_TABLE WITH FULLSCAN;

Since the merge join algorithm goes in ID order once we hit ID 10000 in the small table we should loop through all of the rows in the large table until we hit ID 1200000.

plan 2

The actual plan confirms that all rows were read from both tables. So small changes in data can lead to large differences in merge join performance. Based on that it seems reasonable for SQL Server to be pessimistic about merge join costing, which will work against you if you are trying to get a merge join to appear in your query. In your second query you scan all of the rows from CustomerDiscountCards. That operation alone could have a higher estimated cost than the entire nested loop join plan.

When using hints to force a MERGE JOIN note that the MERGE JOIN join hint also adds a FORCE ORDER hint to the query. Based on the information in the question, your second query doesn't run quickly strictly because of the MERGE JOIN. It runs quickly because of the new join order. The optimizer joins together CustomerDiscountCards and DiscountCards using full scans on both tables and finds 0 matching rows. Any inner join against a 0 row result set is going to be fast because query execution can stop at that point (disregarding blocking operators which occur before that). So instead of focusing on forcing a merge join, which will probably be difficult without hints, I would focus on getting the join order correct.

A simple way to force join order is to materialize the first result set that you want into a temporary table. You could join CustomerDiscountCards and DiscountCards in an initial query and put the matching relevant rows into a temp table. This should run in under a second for your test case because you won't actually be inserting any rows into the temp table. If your second query joins to the empty table it should finish almost instantly. You will need to test the performance of the query for other parameter values in which you actually get rows back because you could be putting a fair number of rows and columns into tempdb.

You may be able to rewrite your SQL to effectively force the join order without using a temp table. Of course using the FORCE ORDER hint is the easiest way to do that but you said that you don't want to use hints and that hint can be a difficult one to use. It applies to the entire query so if your data ever changes you could end up with poor performance. For an idea on how to force join order without the hint, it looks like pairing together the tables that you want to be joined first in a derived table and adding a superfluous TOP operator seems to push SQL Server in the right direction:

INNER JOIN 
(
    SELECT TOP 9223372036854775807 ...
    FROM 
    CustomerDiscountCards
    INNER JOIN DiscountCards ON ...
) t ON ...

However, I cannot conclude if SQL Server is actually forced to evaluate that join separately from the rest of the query. As before you will need to test.

To give you a final option to consider, since the initial problem appears to relate to row goals you could hide the row goal from the optimizer. One way to accomplish that is to use the OPTIMIZE FOR query hint and to use a variable for your TOP statement. Consider the query snippet below:

DECLARE @TOP BIGINT = 10;

SELECT TOP (@top)
...
OPTION (OPTIMIZE FOR (@TOP = 9223372036854775807))

That query will only return the first 10 rows that it finds in the result set. However, the plan will be created as if it returns the first 9223372036854775807 rows. This may disfavor the nested loop join plan that you experienced the issues with, but it's hard to say how well this will perform in general.

Ultimately, it's hard to say more with the information that we have in the question. Presumably your query can return rows in some cases (there isn't much reason to continually run a query that always returnns no rows) so you'll need to test your options for all of the different cases.