Sql-server – Strange query plan when using OR in JOIN clause – Constant scan for every row in table

execution-plansql serversql-server-2017

I'm trying to produce a sample query plan to show why UNIONing two result-sets can be better than using OR in a JOIN clause. A query plan I've written has me stumped. I'm using the StackOverflow database with a nonclustered index on Users.Reputation.

The query is

CREATE NONCLUSTERED INDEX IX_NC_REPUTATION ON dbo.USERS(Reputation)
SELECT DISTINCT Users.Id
FROM dbo.Users
INNER JOIN dbo.Posts  
    ON Users.Id = Posts.OwnerUserId
    OR Users.Id = Posts.LastEditorUserId
WHERE Users.Reputation = 5

The query plan is at https://www.brentozar.com/pastetheplan/?id=BkpZU1MZE, query duration for me is 4:37 min, 26612 rows returned.

I haven't seen this style of constant-scan being created from an existing table before – I'm unfamiliar with why there's a constant scan ran for every single row, when a constant scan is usually used for a single row inputted by the user for example SELECT GETDATE(). Why is it used here? I would really appreciate some guidance in reading this query plan.

If I split that OR out into a UNION, it produces a standard plan running in 12 seconds with the same 26612 rows returned.

SELECT Users.Id
FROM dbo.Users
    INNER JOIN dbo.Posts
       ON Users.Id = Posts.OwnerUserId
WHERE Users.Reputation = 5
UNION 
SELECT Users.Id
FROM dbo.Users
    INNER JOIN dbo.Posts
       ON  Users.Id = Posts.LastEditorUserId
WHERE Users.Reputation = 5

I interpret this plan as doing this:

Get all 41782500 rows from Posts (the actual number of rows matches the CI scan on Posts)
For each 41782500 rows in Posts:
- Produce scalars:
- Expr1005: OwnerUserId
- Expr1006: OwnerUserId
- Expr1004: The static value 62
- Expr1008: LastEditorUserId
- Expr1009: LastEditorUserId
- Expr1007: The static value 62
In the concatenate:
- Exp1010: If Expr1005 (OwnerUserId) is not null, use that else use Expr1008 (LastEditorUserID)
- Expr1011: If Expr1006 (OwnerUserId) is not null, use that, else use Expr1009 (LastEditorUserId)
- Expr1012: If Expr1004 (62) is null use that, else use Expr1007 (62)
In the Compute scalar: I don't know what an ampersand does.
- Expr1013: 4 [and?] 62 (Expr1012) = 4 and OwnerUserId IS NULL (NULL = Expr1010)
- Expr1014: 4 [and?] 62 (Expr1012)
- Expr1015: 16 and 62 (Expr1012)
In the Order By sort by:
- Expr1013 Desc
- Expr1014 Asc
- Expr1010 Asc
- Expr1015 Desc
In the Merge Interval it removed Expr1013 and Expr1015 (these are inputs but not outputs)
In the Index seek below the nested loops join it is using Expr1010 and Expr1011 as seek predicates, but I don't understand how it has access to these when it hasn't done the nested loop join from IX_NC_REPUTATION to the subtree containing Expr1010 and Expr1011.
The Nested Loops join returns only the Users.IDs that have a match in the earlier subtree. Because of predicate pushdown, all rows returned from the index seek on IX_NC_REPUTATION are returned.
The last Nested Loops join: For each Posts record, output Users.Id where a match is found in the below dataset.

Best Answer

The plan is similar to the one I go into in more detail here.

The Posts table is scanned.

For each row it extracts the OwnerUserId and LastEditorUserId. This is in a similar manner to the way UNPIVOT works. You see a single constant scan operator in the plan for the below creating the two output rows for each input row.

SELECT *
FROM dbo.Posts
UNPIVOT (X FOR U IN (OwnerUserId,LastEditorUserId)) Unpvt

In this case the plan is a bit more complex as the semantics for or are that if both column values are the same only one row should be emitted from the join on Users (not two)

These are then put through the merge interval so that in the event the values are the same the range is collapsed down and only one seek is executed against Users - otherwise two seeks are executed against it.

The value 62 is a flag meaning that the seek should be an equality seek.

Regarding

I don't understand how it has access to these when it hasn't done the nested loop join from IX_NC_REPUTATION to the subtree containing Expr1010 and Expr1011

These are defined in the yellow highlighted concatenation operator. This is on the outer side of the yellow highlighted nested loops. So this runs before the yellow highlighted seek on the inside of that nested loops.

A rewrite that gives a similar plan (though with the merge interval replaced by a merge union) is below in case this helps.

SELECT DISTINCT D2.UserId
FROM   dbo.Posts p
       CROSS APPLY (SELECT Users.Id AS UserId
                    FROM   (SELECT p.OwnerUserId
                            UNION /*collapse duplicate to single row*/
                            SELECT p.LastEditorUserId) D1(UserId)
                           JOIN Users
                             ON Users.Id = D1.UserId) D2
OPTION (FORCE ORDER)

Dependant on what indexes are available on the Posts table a variant of this query may be more efficient than your proposed UNION ALL solution. (the copy of the database I have has no useful index for this and the proposed solution does two full scans of Posts. The below does it in one scan)

WITH Unpivoted AS
(
SELECT UserId
FROM dbo.Posts
UNPIVOT (UserId FOR U IN (OwnerUserId,LastEditorUserId)) Unpivoted
)
SELECT DISTINCT Users.Id
FROM dbo.Users INNER HASH JOIN Unpivoted
       ON  Users.Id = Unpivoted.UserId
WHERE Users.Reputation = 5

Related Solutions

Sql-server – SQL Server Index Scan Actual Executions

Schema and indexes are only one aspect of query plan and performance. Your statement "but with different data" is likely the source of the difference. The number of rows and the distribution of data is essential to the query optimizer. If you have significantly more rows in D2, or if the data is of entirely different characteristics (wider or narrower range of values), then you should expect to see different performance and execution plans.

For each set of statistics, SQL Server keeps a maximum of 200 samples. As the rows in the tables grow and the more irregular the distribution of values the more likely it is that SQL Server will not have enough information to generate optimal execution plans. That's where the use of filtered indexes and statistics comes into play.

If this is a parameterized query you may also be running into a parameter sniffing problem. Note that if you're using local variables the calculation changes also.

Different execution plan for the same query if I change a value in the predicate

The problem was histograms, I ran statistics and disabled histogram creation and the execution plan used nested loops:

BEGIN
  DBMS_STATS.GATHER_table_STATS (OWNNAME => 'MIDAS', TABNAME => 'MINCISOC', 
  METHOD_OPT => 'FOR ALL COLUMNS SIZE 1');
END;

If I run it with FOR ALL COLUMNS SIZE AUTO again the same problem because it uses hash join. Thanks to Phil for the suggestion.

Best Answer

Related Solutions

Sql-server – SQL Server Index Scan Actual Executions

Different execution plan for the same query if I change a value in the predicate

Related Question