I am attempting to update a query that utilizes the IN
operator within a WHERE
clause predicate with EXISTS
to compare potential performance improvements and better understand what is happening behind the scenes when the two are interchanged. It is my understanding that in practice, the query optimizer treats EXISTS
and IN
the same way whenever it can.
I'm noticing that when the query is ran with the IN
operator, it returns the desired result set. However, when I replace it with the EXISTS
equivalent, it pulls in all values from the primary table I want to filter. It is ignoring the provided input values passed to EXISTS (SELECT ...
and returning all possible distinct values.
The use case is relatively straightforward: the query accepts a pipe delimited @Series
string that can contain up to 4 values, e.g. S1|S2|S3|S4
or S2|S4
. From here I string_split
the input into a table variable @SeriesSplit
to determine the corresponding internal [SeriesId]
for the [Series]
. The result set returned is then filtered to exclude the [Series]
that were not passed.
To illustrate, here is a similar table definition:
DROP TABLE IF EXISTS [dbo].[Document]
IF OBJECT_ID('[dbo].[Document]', 'U') IS NULL
BEGIN
CREATE TABLE [dbo].[Document] (
[DocumentId] bigint IDENTITY(1,1) NOT NULL
,[DocumentSeriesId] [tinyint] NOT NULL
,CONSTRAINT [PK_Document] PRIMARY KEY CLUSTERED ([DocumentId] ASC)
,INDEX [IX_Document_SeriesId] NONCLUSTERED ([DocumentSeriesId] ASC)
);
END;
GO
Populate the test table with dummy data.
SET IDENTITY_INSERT [dbo].[Document] ON;
;WITH [DocumentSeed] AS (
SELECT
1 AS [DocumentId]
UNION ALL
SELECT
[DocumentId] + 1
FROM
[DocumentSeed]
WHERE
[DocumentId] < 2048)
INSERT INTO [dbo].[Document] ([DocumentId], [DocumentSeriesId])
SELECT
[DocumentId]
,ABS(CHECKSUM(NEWID()) % 4) + 1
FROM
[DocumentSeed] OPTION (MAXRECURSION 2048);
SET IDENTITY_INSERT [dbo].[Document] OFF;
First, the query that utilizes the IN
operator and returns the desired results.
-- Specify the Document Series to be returned.
DECLARE @Series varchar(12) = 'S1|S2'
-- Split the user input and insert it into a table variable.
DECLARE @SeriesSplit table ([SeriesId] tinyint, [Series] varchar(2))
BEGIN
INSERT INTO @SeriesSplit ([SeriesId], [Series])
SELECT
CASE [value] WHEN 'S1' THEN 1 WHEN 'S2' THEN 2 WHEN 'S3' THEN 3 WHEN 'S4' THEN 4 ELSE 5 END AS [SeriesId]
,LTRIM(RTRIM([value])) AS [Series]
FROM
string_split(@Series,'|')
WHERE
[value] <> ''
END;
-- Return the result set of desired [DocumentSeriesId]
-- In the real use case, DISTINCT is not used and more columns are returned.
-- However, to illustrate the issue at hand, return only the [DocumentSeriesId] as this is what we are filtering off.
SELECT DISTINCT
D1.[DocumentSeriesId]
FROM
[dbo].[Document] D1
WHERE
D1.[DocumentSeriesId] IN (SELECT SS.[SeriesId] FROM @SeriesSplit SS)
The outputs two rows as desired. The execution plan (PasteThePlan) shows it performing a Distinct Sort
to filter the appropriate rows.
If I change the WHERE
clause to utilize EXISTS
, I am greeted with all four possible outcomes even though I have set to return only two.
-- Specify the Document Series to be returned.
DECLARE @Series varchar(14) = 'S1|S2'
-- Split the user input and insert it into a table variable.
DECLARE @SeriesSplit table ([SeriesId] tinyint, [Series] varchar(4))
BEGIN
INSERT INTO @SeriesSplit ([SeriesId], [Series])
SELECT
CASE [value] WHEN 'S1' THEN 1 WHEN 'S2' THEN 2 WHEN 'S3' THEN 3 WHEN 'S4' THEN 4 ELSE 5 END AS [SeriesId]
,LTRIM(RTRIM([value])) AS [Series]
FROM
string_split(@Series,'|')
WHERE
[value] <> ''
END;
-- Return the result set of desired [DocumentSeriesId]
-- In the real use case, DISTINCT is not used and more columns are returned.
-- However, to illustrate the issue at hand, return only the [DocumentSeriesId] as this is what we are filtering off.
SELECT DISTINCT
D1.[DocumentSeriesId]
FROM
[dbo].[Document] D1
WHERE
EXISTS (SELECT SS.[SeriesId] FROM @SeriesSplit SS JOIN [dbo].[Document] D2 ON SS.[SeriesId] = D2.[DocumentSeriesId])
The execution plan (PasteThePlan) is vastly different with this small change. It is tossing a warning that there is No Join Predicate
within the Left Semi Join
predicate. It was my assumption that WHERE EXISTS ...
would implicitly satisfy this argument as it does with WHERE [DocumentSeriesId] IN ...
. However, given that the Left Semi Join
operator returns each row from the initial input when there is a matching row in the second input, and because the No Join Predicate
warning exists, query optimizer assumes that each row is a matching row. Therefore all subsequent batch operations are ran against all 2048 rows in the table.
What can I look into to better interpret what the execution plan is describing so that I can understand how to properly resolve the issue?
Or, alternatively, am I simply missing the purpose of the EXISTS
operator when filtering result sets based on a static input?
Best Answer
You didn't rewrite the query properly. When converting an
IN
subqquery to anEXISTS
one, yourEXISTS
subquery should be correlated. But you defined theDocument
table again in the subquery, making it uncorrelated.Your query basically says
It should be: