Sql-server – Interchanging IN with EXISTS produces different result sets

I am attempting to update a query that utilizes the IN operator within a WHERE clause predicate with EXISTS to compare potential performance improvements and better understand what is happening behind the scenes when the two are interchanged. It is my understanding that in practice, the query optimizer treats EXISTS and IN the same way whenever it can.

I'm noticing that when the query is ran with the IN operator, it returns the desired result set. However, when I replace it with the EXISTS equivalent, it pulls in all values from the primary table I want to filter. It is ignoring the provided input values passed to EXISTS (SELECT ... and returning all possible distinct values.

The use case is relatively straightforward: the query accepts a pipe delimited @Series string that can contain up to 4 values, e.g. S1|S2|S3|S4 or S2|S4. From here I string_split the input into a table variable @SeriesSplit to determine the corresponding internal [SeriesId] for the [Series]. The result set returned is then filtered to exclude the [Series] that were not passed.

To illustrate, here is a similar table definition:

DROP TABLE IF EXISTS [dbo].[Document]
IF OBJECT_ID('[dbo].[Document]', 'U') IS NULL
BEGIN
    CREATE TABLE [dbo].[Document] (
        [DocumentId] bigint IDENTITY(1,1) NOT NULL
        ,[DocumentSeriesId] [tinyint] NOT NULL
        ,CONSTRAINT [PK_Document] PRIMARY KEY CLUSTERED ([DocumentId] ASC)
        ,INDEX [IX_Document_SeriesId] NONCLUSTERED ([DocumentSeriesId] ASC)
    );
END;
GO

Populate the test table with dummy data.

SET IDENTITY_INSERT [dbo].[Document] ON;
;WITH [DocumentSeed] AS (
    SELECT
        1 AS [DocumentId]
    UNION ALL
    SELECT
        [DocumentId] + 1
    FROM
        [DocumentSeed]
    WHERE
        [DocumentId] < 2048)

INSERT INTO [dbo].[Document] ([DocumentId], [DocumentSeriesId])
SELECT
    [DocumentId]
    ,ABS(CHECKSUM(NEWID()) % 4) + 1
FROM
    [DocumentSeed] OPTION (MAXRECURSION 2048);
SET IDENTITY_INSERT [dbo].[Document] OFF;

First, the query that utilizes the IN operator and returns the desired results.

-- Specify the Document Series to be returned.
DECLARE @Series varchar(12) = 'S1|S2'
-- Split the user input and insert it into a table variable.
DECLARE @SeriesSplit table ([SeriesId] tinyint, [Series] varchar(2))
BEGIN
INSERT INTO @SeriesSplit ([SeriesId], [Series])
SELECT
    CASE [value] WHEN 'S1' THEN 1 WHEN 'S2' THEN 2 WHEN 'S3' THEN 3 WHEN 'S4' THEN 4 ELSE 5 END AS [SeriesId]
    ,LTRIM(RTRIM([value])) AS [Series]
FROM
    string_split(@Series,'|')
WHERE
    [value] <> ''
END;

-- Return the result set of desired [DocumentSeriesId]
-- In the real use case, DISTINCT is not used and more columns are returned.
-- However, to illustrate the issue at hand, return only the [DocumentSeriesId] as this is what we are filtering off.
SELECT DISTINCT
    D1.[DocumentSeriesId]
FROM
    [dbo].[Document] D1
WHERE
    D1.[DocumentSeriesId] IN (SELECT SS.[SeriesId] FROM @SeriesSplit SS)

The outputs two rows as desired. The execution plan (PasteThePlan) shows it performing a Distinct Sort to filter the appropriate rows.

If I change the WHERE clause to utilize EXISTS, I am greeted with all four possible outcomes even though I have set to return only two.

-- Specify the Document Series to be returned.
DECLARE @Series varchar(14) = 'S1|S2'
-- Split the user input and insert it into a table variable.
DECLARE @SeriesSplit table ([SeriesId] tinyint, [Series] varchar(4))
BEGIN
INSERT INTO @SeriesSplit ([SeriesId], [Series])
SELECT
    CASE [value] WHEN 'S1' THEN 1 WHEN 'S2' THEN 2 WHEN 'S3' THEN 3 WHEN 'S4' THEN 4 ELSE 5 END AS [SeriesId]
    ,LTRIM(RTRIM([value])) AS [Series]
FROM
    string_split(@Series,'|')
WHERE
    [value] <> ''
END;

-- Return the result set of desired [DocumentSeriesId]
-- In the real use case, DISTINCT is not used and more columns are returned.
-- However, to illustrate the issue at hand, return only the [DocumentSeriesId] as this is what we are filtering off.
SELECT DISTINCT
    D1.[DocumentSeriesId]
FROM
    [dbo].[Document] D1
WHERE
    EXISTS (SELECT SS.[SeriesId] FROM @SeriesSplit SS JOIN [dbo].[Document] D2 ON SS.[SeriesId] = D2.[DocumentSeriesId])

The execution plan (PasteThePlan) is vastly different with this small change. It is tossing a warning that there is No Join Predicate within the Left Semi Join predicate. It was my assumption that WHERE EXISTS ... would implicitly satisfy this argument as it does with WHERE [DocumentSeriesId] IN .... However, given that the Left Semi Join operator returns each row from the initial input when there is a matching row in the second input, and because the No Join Predicate warning exists, query optimizer assumes that each row is a matching row. Therefore all subsequent batch operations are ran against all 2048 rows in the table.

What can I look into to better interpret what the execution plan is describing so that I can understand how to properly resolve the issue?

Or, alternatively, am I simply missing the purpose of the EXISTS operator when filtering result sets based on a static input?

Best Answer

You didn't rewrite the query properly. When converting an IN subqquery to an EXISTS one, your EXISTS subquery should be correlated. But you defined the Document table again in the subquery, making it uncorrelated.

Your query basically says

Give me all data from Documents if @SeriesSplit has at least one row - and nothing if @SeriesSplit is empty.

It should be:

SELECT DISTINCT
    D1.[DocumentSeriesId]
FROM
    [dbo].[Document] D1
WHERE
    EXISTS (SELECT 1 FROM @SeriesSplit SS 
            WHERE SS.[SeriesId] = D1.[DocumentSeriesId]) ;

Best Answer

Related Solutions

Sql-server – How to efficiently check EXISTS on multiple columns

Sql-server – Connecting Linked Server

Related Question