Need some help in understanding the slowness in one of the UPDATE
statements like below:-
UPDATE TOP (100) xyz
SET xyz.flag = 1
OUTPUT inserted.Rcode, inserted.EDR, inserted.id, abc.EID,abc.CID,abc.ENID,abc.Cdate
FROM dbo.table1 xyz WITH (UPDLOCK, READPAST)
INNER JOIN dbo.table2 abc WITH (NOLOCK)
on xyz.id=abc.id
WHERE xyz.flag = 0
Table1 has approx. 0.5 million rows and Table 2 has approx. 5 million rows
Slow Plan
Hash Match distinct flow operator shows a yellow alarm and message is:
Operator used Tempdb to spill data for executed with spill level 4 with 1 spilled thread"
Build residual:
database.dbo.table2.id as abc.id = database.dbo.table2.id as abc.id
I took a screenshot. Unfortunately due to security reasons I can't provide more than that, not even an anonymized plan. From my working station I cannot access the internet so there is no way I can get plan explorer to run there.
Generally for a smaller subset of rows it's under sec like when we have just matching 10K rows or something. But with higher amount of data this seems to be tipping point and app cannot afford 1 min of run time. From SSMS I get 30 secs but from app we have avg. 50 secs approx. RCSI is in testing phase.
My good plan does not have that Hash Match Flow Distinct operator visible as shown in my screenshot, while rest of plan remains same. Good one completes under 3 secs or so. As seen nearly 16 seconds are spent on that operator. Can we eliminate it via proper indexing or query re-write?
Table schema
CREATE TABLE dbo.table1
(
Recid VARCHAR(128) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
Cdate DATETIME NULL,
flag BIT NULL DEFAULT (0),
Rcode INT NULL,
EDR VARCHAR(255) COLLATE SQL_Latin1_general_CP1_CI_AS NULL,
id BIGINT NULL
);
CREATE TABLE dbo.table2
(
ENID BIGINT IDENTITY(1,1) NOT NULL,
EID VARCHAR(50) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
CID VARCHAR(350) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
CDate DATETIME NOT NULL DEFAULT(getdate()),
id BIGINT NOT NULL,
CONSTRAINT PK_ENID PRIMARY KEY (ENID ASC, EID ASC),
);
-- table1
CREATE INDEX ix_Cdate on dbo.table1 (Cdate) WITH (FILLFACTOR=100);
CREATE CLUSTERED INDEX ix_Recid on dbo.table1 (Recid) WITH (FILLFACTOR=80);
-- table2
CREATE INDEX ix_ENID_id on dbo.table2 (ENID,id) WITH (FILLFACTOR=100);
Changes
Changes I made and some numbers:
-
Added hint
OPTION (QUERYTRACEON 4138)
– avg. execution 7 secs down
from original 50secs, but app team seems not have access to perform
this in code. Need to check further on this. -
OPTION (ORDER GROUP)
gave same results of avg. 50secs so no improvement
there. -
Added index as suggested:
CREATE INDEX i ON dbo.table2 (id) INCLUDE (CID, CDate);
Not much improvements there. Avg. 45 secs and plan was similar to one attached in this question (top plan).
Before and after each test I made sure plan was not generated from previous cached plan.
Fast plan
Attaching the plan which is faster and without any change in data or query is still fast for same amount of rows in both tables. App team continuously submit above query throughout the day to finish the batch by completing those TOP 100. There is a plan change based on some tipping number and below is how the good plan looks:
Edit:- With everything unchanged, no code change or any index being added, as suggested when i am trying to add hint (FORCESEEK) its giving me below error
Query processor could not produce a query plan because of the hints
defined in this query. Resubmit the query without specifying any hints
and without using SET FORCEPLAN.
Best Answer
You have three main problems:
id
.TOP (100)
introduces a row goal, so estimations may be too low.UPDATE
is non-deterministic.Multiple rows from table2 could match on
id
, so it is not clear which matching row from table2 should be used to provide values for theOUTPUT
clause. The aggregate is there to group on table2id
and chooseANY
matching values for the other columns. The aggregate is a Flow Distinct because of the row goal.One needs to be very careful with
ANY
aggregates in non-deterministicUPDATE
statements because you may get incorrect results.There is not enough detail in the question to make high quality recommendations, but:
CREATE INDEX i ON dbo.table2 (id) INCLUDE (CID, CDate);
OPTION (QUERYTRACEON 4138)
to disable the row goal, orOPTION (ORDER GROUP)
to use a Stream Aggregate instead of Hash.UPDATE
depends on the data relationships. The key point is to identify at most one row from the source that matches each target row. Typically, this will involve a unique index or constraint, or usingROW_NUMBER
orTOP (1)
.Step 2 may or may not be necessary. I add it for completeness.
You may find it easier to visualize the issues and tune the query by writing it in this form:
Execution plan:
I probably wouldn't bother with a filtered index on table1, but if you did want to try it, this appears to be suitable:
If you want to continue with the update syntax given in the question without addressing all the underlying issues properly, you may find this is faster: