Sql-server – Query tuning update with hash flow distinct operator

execution-plan, query-performance, sql-server, sql-server-2014, update

I need help understanding why an UPDATE statement like the one below is slow:

UPDATE TOP (100) xyz
SET xyz.flag = 1
OUTPUT inserted.Rcode, inserted.EDR, inserted.id, abc.EID,abc.CID,abc.ENID,abc.Cdate
FROM dbo.table1 xyz WITH (UPDLOCK, READPAST)
INNER JOIN dbo.table2 abc WITH (NOLOCK)
on xyz.id=abc.id
WHERE xyz.flag = 0

table1 has approximately 0.5 million rows and table2 has approximately 5 million rows.

Slow Plan

The Hash Match (Flow Distinct) operator shows a yellow warning icon, and the message is:

Operator used tempdb to spill data during execution with spill level 4 and 1 spilled thread(s)

Build residual:

database.dbo.table2.id as abc.id = database.dbo.table2.id as abc.id

I took a screenshot. Unfortunately, for security reasons I can't provide more than that, not even an anonymized plan. My workstation has no internet access, so I cannot get Plan Explorer to run there.

For a smaller set of matching rows (around 10K, for example) the statement generally runs in under a second. With more data there seems to be a tipping point, and the app cannot afford a one-minute run time. From SSMS it takes about 30 seconds, but from the app it averages about 50 seconds. RCSI is in the testing phase.

My good plan does not have the Hash Match (Flow Distinct) operator shown in my screenshot, while the rest of the plan remains the same. The good plan completes in under 3 seconds or so. As the screenshot shows, nearly 16 seconds are spent on that operator. Can we eliminate it with proper indexing or a query rewrite?

Table schema

CREATE TABLE dbo.table1
(
    Recid  VARCHAR(128) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
    Cdate DATETIME NULL,
    flag BIT   NULL DEFAULT (0),
    Rcode INT NULL,
    EDR VARCHAR(255) COLLATE SQL_Latin1_general_CP1_CI_AS NULL,
    id BIGINT NULL
);

CREATE TABLE dbo.table2
(
    ENID BIGINT IDENTITY(1,1) NOT NULL,
    EID VARCHAR(50) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
    CID VARCHAR(350) COLLATE SQL_Latin1_general_CP1_CI_AS NOT NULL,
    CDate DATETIME NOT NULL DEFAULT(getdate()),
    id BIGINT NOT NULL,

    CONSTRAINT PK_ENID PRIMARY KEY (ENID ASC, EID ASC),
);

-- table1
CREATE INDEX ix_Cdate on dbo.table1 (Cdate) WITH (FILLFACTOR=100);
CREATE CLUSTERED INDEX ix_Recid on dbo.table1 (Recid) WITH (FILLFACTOR=80);

-- table2
CREATE INDEX ix_ENID_id on dbo.table2 (ENID,id) WITH (FILLFACTOR=100);

Changes

Changes I made and some numbers:

Added the hint OPTION (QUERYTRACEON 4138): average execution time dropped to 7 seconds from the original 50 seconds, but the app team does not seem to have the access needed to use this in code. I need to check further on this.

OPTION (ORDER GROUP) gave the same average of 50 seconds, so no improvement there.

Added the index as suggested:

CREATE INDEX i ON dbo.table2 (id) INCLUDE (CID, CDate);

Not much improvement there: the average was 45 seconds and the plan was similar to the one attached to this question (top plan).

Before and after each test I made sure the plan was not reused from a previously cached plan.

Fast plan

I am attaching the faster plan. With no change in data or query, it stays fast for the same number of rows in both tables. The app team submits the above query continuously throughout the day, finishing the batch 100 rows at a time via TOP (100). The plan changes at some tipping point, and this is how the good plan looks:

Edit: With everything unchanged (no code change and no index added), when I try adding the FORCESEEK hint as suggested, I get the error below:

Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

Best Answer

You have three main problems:

1. There is no useful index to support the join on id.
2. The TOP (100) introduces a row goal, so estimations may be too low.
3. The UPDATE is non-deterministic.

Multiple rows from table2 could match on id, so it is not clear which matching row from table2 should be used to provide values for the OUTPUT clause. The aggregate is there to group on table2 id and choose ANY matching values for the other columns. The aggregate is a Flow Distinct because of the row goal.

One needs to be very careful with ANY aggregates in non-deterministic UPDATE statements because you may get incorrect results.
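As a quick way to see whether the non-determinism described above actually applies to the data, a minimal sketch along these lines lists the id values that occur more than once in table2. It uses only the table and column names from the question; the match_count alias is made up for illustration.

```sql
-- Sketch: find id values in table2 that match more than one row.
-- Any row returned means the UPDATE's OUTPUT values from table2
-- are chosen arbitrarily for that id.
SELECT
    abc.id,
    COUNT_BIG(*) AS match_count   -- illustrative alias
FROM dbo.table2 AS abc
GROUP BY
    abc.id
HAVING
    COUNT_BIG(*) > 1;
```

If this returns no rows, id is currently unique in table2, although nothing in the schema shown enforces that.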

There is not enough detail in the question to make high quality recommendations, but:

1. Add an index like CREATE INDEX i ON dbo.table2 (id) INCLUDE (CID, CDate);
2. Use OPTION (QUERYTRACEON 4138) to disable the row goal, or OPTION (ORDER GROUP) to use a Stream Aggregate instead of Hash.
3. How you fix the non-deterministic UPDATE depends on the data relationships. The key point is to identify at most one row from the source that matches each target row. Typically, this will involve a unique index or constraint, or using ROW_NUMBER or TOP (1).

Step 2 may or may not be necessary. I add it for completeness.
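To illustrate the "unique index or constraint" option from the list above, here is a sketch that assumes id really is unique in table2. That is an assumption the question does not confirm, and the statement fails if duplicates exist. The index name UQ_table2_id is hypothetical.

```sql
-- Sketch, assuming table2.id holds no duplicate values.
-- A unique index guarantees at most one source row per target row
-- and also supports the join on id.
-- EID and ENID are the clustered key columns, so they are carried
-- in the nonclustered index without being listed explicitly.
CREATE UNIQUE INDEX UQ_table2_id      -- hypothetical name
ON dbo.table2 (id)
INCLUDE (CID, CDate);
```

With uniqueness enforced, the optimizer would have no reason to group table2 on id before the join. Whether that removes the aggregate in this particular plan would need to be checked against the real data.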

You may find it easier to visualize the issues and tune the query by writing it in this form:

UPDATE TOP (100) 
    xyz WITH (UPDLOCK, READPAST)
SET xyz.flag = 1
OUTPUT 
    inserted.Rcode, inserted.EDR, inserted.id, 
    abc.EID, abc.CID, abc.ENID, abc.Cdate
FROM dbo.table1 AS xyz
CROSS APPLY
(
    -- At most one source row per target row
    SELECT TOP (1) 
        abc.* 
    FROM dbo.table2 AS abc
    WHERE
        abc.id = xyz.id
    -- ORDER BY something to choose the one row
) AS abc
WHERE 
    xyz.flag = 0;

Execution plan:

I probably wouldn't bother with a filtered index on table1, but if you did want to try it, this appears to be suitable:

CREATE INDEX i 
ON dbo.table1 (Recid) 
INCLUDE (id, flag) 
WHERE flag = 0;

If you want to continue with the update syntax given in the question without addressing all the underlying issues properly, you may find this is faster:

UPDATE TOP (100) xyz
SET xyz.flag = 1
OUTPUT inserted.Rcode, inserted.EDR, inserted.id, abc.EID,abc.CID,abc.ENID,abc.Cdate
FROM dbo.table1 xyz WITH (UPDLOCK, READPAST)
INNER JOIN dbo.table2 abc WITH (NOLOCK, FORCESEEK)
on xyz.id=abc.id
WHERE xyz.flag = 0;
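The FORCESEEK hint can only be honored if table2 has an index whose leading key column is id. With only the indexes shown in the question, keyed on (ENID, EID) and (ENID, id), no seek on id is possible, which is consistent with the "could not produce a query plan" error reported there. A sketch for checking the leading key column of each index, using the standard catalog views:

```sql
-- Sketch: list each index on dbo.table2 with its leading key column.
-- FORCESEEK on the join needs at least one index that leads with id.
SELECT
    i.name AS index_name,
    c.name AS leading_key_column
FROM sys.indexes AS i
JOIN sys.index_columns AS ic
    ON  ic.object_id = i.object_id
    AND ic.index_id  = i.index_id
JOIN sys.columns AS c
    ON  c.object_id = ic.object_id
    AND c.column_id = ic.column_id
WHERE
    i.object_id = OBJECT_ID(N'dbo.table2')
    AND ic.key_ordinal = 1;   -- first key column only
```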

Related Solutions

MySQL looking up more rows than needed (indexing issue)

Your indexes are fine for the two types of queries you mentioned.

This query will be satisfied by traversing the clustered index on the primary key...

[...] WHERE participant_id = x AND question_id = y AND given_answer_id = z;

...and this one is satisfied by the index on 'question_id':

[...] WHERE question_id = x;

The output of EXPLAIN SELECT is not telling you what you think it is telling you, because the value shown in rows is an estimate of the number of rows the server will need to consider, not the actual rows it will examine. For InnoDB these are based on index statistics.

rows

The rows column indicates the number of rows MySQL believes it must examine to execute the query.

For InnoDB tables, this number is an estimate, and may not always be exact.

Source: http://dev.mysql.com/doc/refman/5.5/en/explain-output.html#explain_rows

The optimizer gathers information about different possible query plans, and chooses the one with the lowest cost. The information shown in EXPLAIN is the information the optimizer gathered about the plan it selected.

When type is ref and key is not NULL, this means that the name listed in the key column is the name of the index that the optimizer has chosen to use to find the desired rows, so your query plan looks exactly as it should.

Note, sometimes you will see Using index in the Extra column and a lot of people assume that this means an index is being used, or that no index is being used when that doesn't appear, but that's not correct, either. Using index describes a special case called a "covering index" -- it does not indicate whether an index is being used to locate the rows of interest.

It's possible that running ANALYZE [LOCAL] TABLE would cause the numbers in rows shown by EXPLAIN to differ, but this is a simple query and selecting this index is an obvious choice for the optimizer to make, so ANALYZE TABLE is unlikely to make any actual difference in performance.

It is possible, however, that your overall performance might see some marginal improvement with an occasional OPTIMIZE [LOCAL] TABLE, because you are not inserting rows in primary key order (as would be the case with an auto_increment primary key)... but on large tables this can be time-consuming because it rebuilds a new copy of the table... but, again, I wouldn't expect any significant change.

Sql-server – Create a table dynamically in SQL Server

You can try a different date format. I also changed AUTO_INCREMENT to IDENTITY:

DECLARE @tableToDump nvarchar(100); 

SET @tableToDump = 'backupCdc'+CONVERT(varchar(10),getdate(),112);

DECLARE @DynamicSQL nvarchar(1000);

SET @DynamicSQL=N'create table '+ @tableToDump +' ('+'cid int primary key IDENTITY(1,1), employeeno Varchar(100), fieldName Varchar(100) NOT NULL, fieldValue Varchar(1000))';
/* The parentheses around @DynamicSQL are important, as suggested by ypercube */
exec (@DynamicSQL);
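A slightly more defensive variant of the same idea is sketched below. It quotes the generated name with QUOTENAME and runs the statement through sp_executesql. The dbo schema is an assumption; the original does not name one.

```sql
-- Sketch: same table definition, with the dynamic name quoted.
-- Style 112 yields yyyymmdd, so the name looks like backupCdc20240131.
DECLARE @tableToDump sysname =
    N'backupCdc' + CONVERT(nchar(8), GETDATE(), 112);

DECLARE @DynamicSQL nvarchar(max) =
    N'CREATE TABLE dbo.' + QUOTENAME(@tableToDump) + N' ('   -- dbo is assumed
    + N'cid int IDENTITY(1,1) PRIMARY KEY, '
    + N'employeeno varchar(100), '
    + N'fieldName varchar(100) NOT NULL, '
    + N'fieldValue varchar(1000))';

EXEC sys.sp_executesql @DynamicSQL;
```

QUOTENAME wraps the name in square brackets and escapes any closing bracket inside it, which matters if the name is ever built from input rather than a fixed prefix and a date.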