Sql-server – SQL Server chooses Nested Loop join with dimensional table and make seek for each row

execution-planjoin;sql serversql-server-2017

I face an issue where SQL Server generates a un-optimal execution plan: a
Nested Loop join and seek to the dimensional table and executes 2M reads on it.

The Sort operation Estimation is 100 rows instead of 450 K rows and maybe affects the plan selection:

NestedLoop: https://www.brentozar.com/pastetheplan/?id=B110MZ2Pm or
NestedLoop plan

This is within a test DB. We have an additional DB with the same schema and almost the same data.

Running exactly the same query (both from SSMS) generates a different plan using Hash Join and dimensional table scan (32K reads):

HashJoin: https://www.brentozar.com/pastetheplan/?id=r1Jm7b2D7 or
Hash plan

I need a help to understand and solve the issue.

I can work around it by hint Hash Joint, but it does not make any sense that 2 similar DBs on the same instance generate different plans.

Update #1: I found that the estimated cost is different
So when SQL Server executes parallel it will choose a hash join.

With a single thread will be nested loop.

Update #2: The same issue occurred during a SELECT from the same table.
Depend on the number of columns (estimated cost). When I reduce the number of columns, the execution plan falls to a nested loop and seeks the dimension table.

Best Answer

There appear to be three reasons why you get a serial nested loop join plan in one of your environments and a hash join in the other. Based on the information that you've provided, the best fixes involve query hints or splitting the query into two parts.

Differences between your environments

One environment has 480662 rows in your CCI and the other has 686053 rows. I wouldn't call that nearly identical. There also appears to be a hardware or configuration difference between your environments, or at the very least you're getting very unlucky. The serial sort of 251 MB of estimated data has an IO cost of 0.0037538 units. The parallel sort of 351 MB of estimated data has an IO cost of 23.1377 units, even though it is discounted by parallelism. The engine expects to spill a relatively significant amount of data for the parallel plan. Differences like that can lead to different plans between environments.
The optimizer misapplies a row goal cost reduction which can favor a nested loop join plan

The nested loop plan is costed as if only 100 rows need to be output from the sort:

However, the query contains the following in the SELECT clause: COUNT(*) OVER ()

The engine must read all rows in order to produce the correct result for the aggregate. This is indeed what happens in the actual plan, and the index seek is executed 450k times instead of 100 times. This cost reduction appears to happen on a variety of versions (I tested back to 2016 SP1 base), on both CEs, with many different window functions, and for both batch mode and row mode. It is a limitation in the product which results in a suboptimal query plan here.
The nested loop join plan is serial due to a limitation with batch mode sorts

It's possible that your serial nested loop join qualifies for parallelism (depends on your CTFP) and you may be wondering why the optimizer did not find a lower costed parallel plan. The optimizer has heuristics which prevent a parallel batch mode sort from being the first child of a nested loop join (which must run in row mode). The problem is that the parallel batch mode sort will put all rows on a single thread which won't work well with a parallel nested loop join. Moving the sort to be a parent of the loop join won't result in a decrease in the number of estimated executions for the index seek (due to the optimizer problem). As a result you're very likely to end up with a serial plan, even if CTFP was set to the default of 5.

Here is a reproduction of your issue, which I can't upload to PasteThePlan because it doesn't support my version of SQL Server:

drop table if exists cci_216665;

create table cci_216665 (
    SORT_ID BIGINT,
    JOIN_ID BIGINT,
    COL1 BIGINT,
    COL2 BIGINT,
    COL3 BIGINT,
    INDEX CCI CLUSTERED COLUMNSTORE
);

INSERT INTO cci_216665 WITH (TABLOCK)
SELECT TOP (500000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 50
, 0, 0, 0
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

drop table if exists YEAH_NAH;

CREATE TABLE dbo.YEAH_NAH (ID INT, FILLER VARCHAR(20), PRIMARY KEY (ID));

INSERT INTO dbo.YEAH_NAH WITH (TABLOCK)
SELECT TOP (50) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, 'CHILLY BIN'
FROM master..spt_values t1;

GO

-- takes 780 ms of CPU with nested loops
SELECT TOP (100)
*, COUNT(*) OVER ()
FROM cci_216665 c
INNER JOIN YEAH_NAH y ON c.JOIN_ID = y.ID
ORDER BY SORT_ID;

-- takes 111 ms of CPU with hash join
SELECT TOP (100)
*, COUNT(*) OVER ()
FROM cci_216665 c
INNER JOIN YEAH_NAH y ON c.JOIN_ID = y.ID
ORDER BY SORT_ID
OPTION (HASH JOIN);

The most straightforward way to solve your problem is to split your queries in two. Here's one way to do it:

SELECT COUNT(*)
FROM cci_216665 c
INNER JOIN YEAH_NAH y ON c.JOIN_ID = y.ID;

SELECT TOP (100) *
FROM cci_216665 c
INNER JOIN YEAH_NAH y ON c.JOIN_ID = y.ID
ORDER BY SORT_ID;

On my machine this is actually faster than the hash join plan but you may not see the same results. In general, my first attempt for a query like yours would be to avoid a window aggregate without an OVER clause when only the first 100 rows are needed.

A reasonable alternative is to use the DISABLE_OPTIMIZER_ROWGOAL use hint introduced in SQL Server 2016 SP1. For this type of query, there's a problem with row goals so this hint directly addresses the problem without any dependence on statistics or anything like that. I would consider it to be a relatively safe hint to employ.

SELECT TOP (100)
*, COUNT(*) OVER ()
FROM cci_216665 c
INNER JOIN YEAH_NAH y ON c.JOIN_ID = y.ID
ORDER BY SORT_ID
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));

This results in a hash join plan on my machine.

Related Solutions

Sql-server – Query slower after upgrade from sql server 2008R2 to 2014sp2

I understand your disappointment with the query plan regressions that you experienced. However, Microsoft changed some core assumptions about the cardinality estimator. They could not avoid some query plan regression. To quote Juergen Thomas:

However to state it pretty clearly as well, it was NOT a goal to avoid any regressions compared to the existing CE. The new SQL Server 2014 CE is NOT integrated following the principals of QFEs. This means our expectation is that the new SQL Server 2014 CE will create better plans for many queries, especially complex queries, but will also result in worse plans for some queries than the old CE resulted in.

To answer your first question, the optimizer appears to pick a worse plan with the new CE because of a 1 row cardinality estimate from Object2. This makes a nested loop join very attractive to the optimizer. However, the actual number of rows returned from Object2 was 34182. This means that the estimated cost for the nested loop plan was an underestimate by about 30000X.

The legacy CE gives a 208.733 cardinality estimate from Object2. This is still very far off, but it's enough to give a plan that uses a merge join a lower estimated cost than a nested loop join plan. SQL Server gave the nonclustered index seek on Object3 a cost of 0.0032831. With a nested loop plan under the legacy CE, we could expect a total cost for 208 index seeks to be about 0.0032831 * 208.733 = 0.68529 which is much higher than the final estimated subtree cost for the merge join plan, 0.0171922.

To answer your second question, the cardinality estimate formulas for a query as simple as yours are actually published by Microsoft. I recommend referencing the excellent white paper on differences between the legacy and new CE found here. Focus on why the cardinality estimates are 1 for the new CE and 208.733 for the legacy CE. That's unexpected because the legacy CE assumes independence of filters but the new CE uses exponential backoff. In general for such a query I would expect the new CE to give a larger cardinality estimate for Object2. You should be able to figure what's going on by looking at the statistics on Object2.

To answer your third question, we can get general strategies from the white paper. The following is an abbreviated quote:

Retain the new CE setting if specific queries still benefit, and “design around” performance issues using alternative methods.

Retain the new CE, and use trace flag 9481 for those queries that had performance degradations directly caused by the new CE.

Revert to an older database compatibility level, and use trace flag 2312 for queries that had performance improvements using the new CE.

Use fundamental cardinality estimate skew troubleshooting methods.

Revert to the legacy CE entirely.

For your problem in particular, first I would focus on the statistics. It's not clear to me why the cardinality estimate for an index scan on Object3 would be so far off. I recommend updating statistics with FULLSCAN on all of the involved objects and indexes before doing more tests. Updating the statistics again after changing the CE is also a good step. You should be able to use the white paper to figure out exactly why you're seeing the cardinality estimates that you're seeing.

I can give you more detailed help if you provide more information. I understand wanting to protect your IP but what you have there is a pretty simple query. Can you change the table and column names and provide the exact query text, relevant table DDL, index DDL, and information about the statistics?

If all else fails and you need to fix it without hints or trace flags, you could try updating to SQL Server 2016 or changing the indexes on your tables. You're unlikely to get the bad nested loop plan if you remove the index, but of course removing an index could affect other queries in a negative way.

Sql-server – SQL Server OR operator causing massive loop join

There are some redundant JOINs between the 2 tables, and using too many OR conditions with nested queries sometimes overwhelms the SQL Query Optimiser.

After cleaning up and organising the code, I've come up with the following (hopefully equivalent) alternative:

DECLARE @a INT
    ,@b BIT
    ,@c INT
    ,@d INT
    ,@e INT;

SELECT SUM(CNT) AS [C]
FROM (
    SELECT COUNT(1) AS [CNT]
    FROM [dbo].[EmailTable]
    WHERE 1 = 1
        AND [ACTIVE] = 1
        AND [INS_ID] = @a
        AND [QUEUED_TO_SEND] = @b
        AND [OWNER_USER_ID] <> @c
        AND (
            [OWNER_USER_ID] IN (- 1)
            OR N'Allusers' = [VISIBLE_TO]
            OR [OWNER_USER_ID] = @d
            )

    UNION ALL

    SELECT COUNT(1) AS [CNT]
    FROM [dbo].[EmailTable] AS [table1]
    WHERE 1 = 1
        AND [table1].[ACTIVE] = 1
        AND [table1].[INS_ID] = @a
        AND [table1].[QUEUED_TO_SEND] = @b
        AND [table1].[OWNER_USER_ID] <> @c
        AND EXISTS (
            SELECT 1 AS [C]
            FROM [dbo].[TeamMembers] AS [table2]
            WHERE [table2].[INS_ID] = [table1].[INS_ID]
                AND [table2].[TEAM_ID] = [table1].[VISIBLE_TEAM_ID]
                AND [table2].[MEMBER_USER_ID] = @d
            )
    ) XYZ
OPTION(RECOMPILE);

Let us know how it works and maybe post the Actual Execution Plan, as some commenters requested.

Best Answer

Related Solutions

Sql-server – Query slower after upgrade from sql server 2008R2 to 2014sp2

Sql-server – SQL Server OR operator causing massive loop join

Related Question