I have the following [fairly meaningless, just for the purpose of demonstration] query in the StackOverflow database:
SELECT *
FROM Users u
LEFT JOIN Comments c
ON u.Id = c.UserId OR
u.Id = c.PostId
WHERE u.DisplayName = 'alex'
The only index on the Users
table is a clustered index on ID.
The Comments
table has the following Non-Clustered indexes as well as Clustered Index on ID:
CREATE INDEX IX_UserID ON Comments
(
UserID,
PostID
)
CREATE INDEX IX_PostID ON Comments
(
PostID,
UserID
)
The estimated plan for the query is here:
I can see the first thing the optimizer will do is perform a CI scan on the users table to filter only those users where DisplayName = Alex
, effectively doing this:
SELECT *
FROM Users u
WHERE u.DisplayName = 'alex'
ORDER BY Id
and retreiving results as such:
Then it will scan the comments CI and for every row, look to see if the row satisfies the predicate
u.Id = c.UserId OR u.Id = c.PostId
Despite the two indexes, this CI scan is performed.
Wouldn't it be more efficient if the optimizer did a separate seek on each of the indexes in the Comments table above and join them together?
If I visualise what that would look like, in the screenshot above we can see the first result of the Users CI scan is ID 420
I can visualize what the IX_UserID
Index looks like using
SELECT UserID,
PostID
FROM Comments
ORDER BY UserID,
PostID
so if I seek to the rows for user ID 420 as an index seek would:
for every row where UserID = 420
, I can look if u.Id = c.UserId OR u.Id = c.PostId
of, course they all match the u.Id = c.UserId
part of our predicate,
So for the second part of our index seek, we can seek through our index IX_PostID
which can be visualised as follows:
SELECT PostID,
UserID
FROM Comments
ORDER BY PostID,
UserID
If I seek to Post ID 420 I can see nothing is there:
So we then go back to the results of the CI scan, move to the next row (userId 447) and repeat the process.
The behaviour I have described above is possible using in a WHERE
clause:
SELECT UserID,
PostID
FROM Comments
WHERE UserID = 420 OR PostID = 420
ORDER BY UserID,
PostID
My question therefore is, why isn't an OR
condition in a JOIN
clause able to perform an index seek on appropriate indexes?
Best Answer
Rather than focusing on how to improve a query like this, which is what the other answers are doing, I'm going to try to answer the question being asked: why doesn't the optimizer produce a plan like the one you've described (that scans the Users table, and then seeks into the two indexes on the Comments table).
Here's your original query again (note I'm using
MAXDOP 2
just to simulate what I saw in your execution plans):And the plan:
dbo.Users
with residual predicate to get just the "alex" usersdbo.Comments
table and filter matches in the join operatorOne attempt to get the plan you want would be to try and force a seek on the
dbo.Comments
table:The plans looks like this:
dbo.Users
table (with a residual predicate to only get users named "alex"),So the answer is that the optimizer is definitely capable of producing such a plan. And it doesn't seem to be a cost-based decision (the seek plan looks much cheaper).
My best guess is that this is just some kind of limitation in the optimizer's exploration process - it doesn't seem to favor converting a left join with an or clause into an apply. This is really unfortunate in this particular case, as performance is dismal in the scan plan (the query takes 45 seconds on my machine) vs the apply plan (less than 1 second).
Side note: You can override the heuristic that disfavors index union plans with undocumented trace flag 8726. See https://dba.stackexchange.com/a/23779 for additional details on that front!
As Rob Farley helpfully pointed out, using
APPLY
directly (potentially with aUNION
as well) is a better approach to get the plan you're looking for - both of those produce the "better" version of this plan (theFORCESEEK
version). I would say that "OR
in aJOIN
" is kind of a known anti-pattern, and should be avoided since it doesn't seem like the optimizer has great support for that type of query directly.