Sql-server – Delete Joined to Derived Table Causes Index Scan and Table Spool

deletederived-tablesentity-frameworksql serversql server 2014

Our application uses a popular 3rd-party extension to Entity Framework that allows for valuable things that native Entity Framework is not capable of or designed to do, like bulk deletes and updates.

It seems that the query pattern produced by the bulk delete API method from this library is not optimized well by the SQL Server query optimizer.

I am using SQL Server 2014. Consider the following table and index (names have been changed to protect the innocent):

CREATE TABLE [dbo].[a_staging_table]
(
    [staging_ky] [BIGINT] IDENTITY(1,1) NOT NULL,
    [col_a] [INT] NOT NULL,
    [col_b] [VARCHAR](15) NOT NULL,
    [col_c] [VARCHAR](256) NOT NULL,
    [col_d] [INT] NULL,
    [col_e] [INT] NOT NULL,
    [col_f] [VARCHAR](3) NOT NULL,
    [col_g] [VARCHAR](3) NULL,
    [col_h] [VARCHAR](20) NULL,
    [col_i] [CHAR](4) NOT NULL,
    [col_j] [INT] NOT NULL,
    [col_k] [VARCHAR](10) NULL,
    [col_l] [DATETIME] NOT NULL,
    [col_m] [VARCHAR](25) NOT NULL,
    [col_n] [DATETIME] NOT NULL,
    [col_o] [VARCHAR](25) NOT NULL,
    [col_p] [BIGINT] NOT NULL,
    [col_q] [BIT] NOT NULL,
    [col_r] [DATETIME] NULL,
    [col_s] [VARCHAR](25) NULL,
PRIMARY KEY CLUSTERED ([staging_ky] ASC)
);
GO

CREATE NONCLUSTERED INDEX [IDX_dbo_a_staging_table_col_p]
    ON [dbo].[a_staging_table]([col_p] ASC)
    WITH (FILLFACTOR = 100);
GO

The application is trying to delete all records from the table where col_p is equal to a certain value, and the generated SQL query from Entity Framework looks like this:

DELETE [dbo].[a_staging_table]
FROM [dbo].[a_staging_table] AS j0 INNER JOIN (
SELECT 
    1 AS [C1], 
    [Extent1].[staging_ky] AS [staging_ky]
    FROM [dbo].[a_staging_table] AS [Extent1]
    WHERE [Extent1].[col_p] = @p__linq__0
) AS j1 ON (j0.[staging_ky] = j1.[staging_ky])

Looking at the query plan, SQL Server does not optimize away the join between the table and itself, and instead, the plan does an index scan with a nested loops join and seek to the same index, followed by a table spool, prior to the delete.

A more traditional and optimal delete statement is something like this statement, which just does an index seek, prior to the clustered index delete:

DELETE [dbo].[a_staging_table]
FROM [dbo].[a_staging_table] AS j0 
WHERE [j0].[col_p] = @p__linq__0

The actual query plans for both statements are posted at:https://www.brentozar.com/pastetheplan/?id=SygZoD6um

Is this query pattern generated by this library a known query anti-pattern for a delete statement in SQL Server?

Best Answer

Is this query pattern generated by this library a known query anti-pattern for a delete statement in SQL Server?

This strikes me as a slightly curious question to ask. Why does the answer to that question matter? What would you do with that information? If the library that you're using doesn't meet your business needs, then four options come to mind:

Use a different library.
Implement a workaround, for example creating a stored procedure to do the deletes with more efficient code.
Providing feedback to the author of the library.
Opening a UserVoice feedback item or support ticket with Microsoft to get better optimizer support for this type of query.

With all of that said, I think that most SQL Server people would view that query with suspicion. I also think that it's fair to say that joining to tables that you don't need is an anti-pattern.

All of the relational databases that I've worked with have limitations around the inferences that they can make for eliminating unnecessary joins. One of the reasons that you get that ugly plan in SQL Server is that there isn't optimizer support for eliminating an inner self-join on a table's primary key. This can be observed with the following simple demo:

CREATE TABLE [dbo].[a_staging_table]
(
    [staging_ky] [BIGINT] IDENTITY(1,1) NOT NULL,
    [col_p] [BIGINT] NOT NULL,
    [filler] [VARCHAR](100) NULL,
PRIMARY KEY CLUSTERED ([staging_ky] ASC)
);

INSERT INTO [dbo].[a_staging_table] WITH (TABLOCK)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), REPLICATE('Z', 100)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);

CREATE NONCLUSTERED INDEX [IDX_dbo_a_staging_table_col_p]
    ON [dbo].[a_staging_table]([col_p] ASC)
    WITH (FILLFACTOR = 100);

The following query:

SELECT j0.[staging_ky]
FROM [dbo].[a_staging_table] AS j0
LEFT OUTER JOIN [dbo].[a_staging_table] AS j1
ON j0.[staging_ky] = j1.[staging_ky];

Results in join elimination:

But a similar query with an INNER JOIN does not:

The optimizer doesn't support every possible query transformation. In this example, as programmers we can deduce that a join isn't necessary but the optimizer cannot. This might be unsupported in the product because Microsoft viewed this scenario as not common enough to address, the check was not worth the compile time overhead, or because it would be a difficult check to implement.

For completeness, the spool is present in the query plan due to halloween protection.

Related Solutions

Sql-server – Why is SQL Server using a clustered index scan for a self referencing FK cascade delete

It needs to validate that the row you are trying to delete is not a parent of an existing row.

You don't have an index on ParentTestId.

So it must do the scan.

CREATE NONCLUSTERED INDEX ix ON  [dbo].[Test](ParentTestId)

Then you see a seek.

BTW: The 20% estimated cost of the scan is likely to be an underestimate in this case.

The FK validation is under a left semi join and SQL Server costs it as though only a partial scan will be needed and it will find a matching row and the delete will fail.

Presumably the rows you are actually deleting will succeed more often than not and so a full scan will be required in order to validate that there are no conflicting rows.

Using trace flag 4138 to turn off row goals

DELETE FROM dbo.Test
WHERE  TestId = 200 
OPTION (querytraceon 4138 )

The re-costed plan shows the CI scan at 100% rather than 20% (as it now assumes a full scan will be needed)

This difference in estimated cost is sufficient for the missing index suggestion to show up.

The costs shown in this plan are still not very representative however. You might notice that they add up to 219%.

Also the overall plan cost of the queries with and without the trace flag are both identical at 0.0168268. The full CI scan ought, in fact, to be costed at 0.152373 (0.0485075 + 0.103866)

enter image description here

but it seems to be capped at no more than the original plan cost (and the overall plan cost doesn't get adjusted upwards either hence incorrect percentages)

Sql-server – Delete from table where two of the three keys in another select

Your code is valid SQL but not in SQL-Server. They haven't yet implemented the syntax (a,b) IN ((1,11), (2,12), ...) or (a,b) IN (SELECT expa, expb....).

You can work around with either an EXISTS subquery:

DELETE FROM tableA AS a
WHERE EXISTS
      ( SELECT 1
        FROM tableA AS b
        WHERE b.IDA = a.IDA AND b.IDB = a.IDB 
        HAVING COUNT(*) > 1
      ) 
  AND a.comment = 'temp' ;

or probably more efficiently with a join:

DELETE a
FROM tableA AS a
  JOIN
    ( SELECT 
      FROM tableA 
      GROUP BY IDA, IDB
      HAVING COUNT(*) > 1
   ) AS b
     ON  b.IDA = a.IDA AND b.IDB = a.IDB 
WHERE a.comment = 'temp' ;

and even better with a derived table or a cte (no self-join):

WITH dups AS
  ( SELECT comment, 
           cnt = COUNT(*) OVER (PARTITION BY IDA, IDB)
     FROM tableA 
  ) 
DELETE 
FROM dups
WHERE comment = 'temp' 
  AND cnt > 1 ;

Best Answer

Related Solutions

Sql-server – Why is SQL Server using a clustered index scan for a self referencing FK cascade delete

Sql-server – Delete from table where two of the three keys in another select

Related Question