SQL Server – Inconsistent Execution Plan for Stored Procedure

execution-plan, sql-server, sql-server-2014, stored-procedures

We have a business-critical stored procedure that normally runs daily at 2am from a scheduled job (in an SSIS package) on the production DB server. The same procedure/package is called 15 minutes later by a second SQL Agent job running on a different server (as an emergency failover, in case anything goes awry with the first job).

The procedure is defined WITH RECOMPILE.
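For reference, a minimal sketch of what that looks like (the procedure name and body here are placeholders, not the real code):

-- Hypothetical definition; WITH RECOMPILE discards the plan after every
-- execution, so each run is optimized from scratch.
CREATE PROCEDURE dbo.usp_NightlyProcess
WITH RECOMPILE
AS
BEGIN
    SET NOCOUNT ON;
    -- ...procedure body elided...
END;

Note that procedure-level WITH RECOMPILE compiles the whole batch at the start of execution; it does not cause statement-level recompiles partway through, so row counts loaded into table variables during the run are still unknown at compile time.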

The procedure normally executes in about 45 seconds. Last Wednesday, and again this morning (also Wednesday, coincidence?!?), the 2am procedure took 90 minutes to execute. While it was executing, the 2:15am job ran and that execution took the usual 45 seconds.

I have the execution plans from both situations. There are some table variable operations that should carry estimated row counts in the neighborhood of 200K rows. The faulty plan reports these table variables with an estimated 130 billion rows. [Side note: I have already rewritten the code to use temp tables instead of table variables, based on this discussion, and will be moving it to production in the near future.]
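For what it's worth, a minimal sketch of that kind of rewrite (the table and column names are illustrative, not the real ones):

-- Before: table variable; no statistics, so cardinality is guessed
DECLARE @Patients TABLE (PatientID integer NOT NULL, FirstTestDate date NULL);

-- After: temp table; supports statistics, indexes, and accurate estimates
CREATE TABLE #Patients (PatientID integer NOT NULL, FirstTestDate date NULL);
CREATE INDEX IX_Patients ON #Patients (PatientID, FirstTestDate);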

Our monitoring software (Solar Winds DPA) reports excessive CXPACKET waits for the 2am execution. This seems to indicate issues with parallelism and is likely related to the table variables being used in the procedure.
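If you want to confirm the waits outside DPA, the wait-stats DMV is a quick check; a minimal sketch:

-- Cumulative parallelism-related waits since the last service restart
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'CXPACKET';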

There is still user activity on the server during this time, and some scheduled jobs, but nothing that I see that would affect this procedure or its execution plan. An index maintenance job is run at 2:30am.

I understand the poorly-performing execution plan is related to the table variables, but why would this same procedure executed 15 minutes later have such a drastically different execution plan (and why does the 2am execution run fine the rest of the week)?

Here are links to the .sqlplan files: The Bad Plan and the Good Plan.

Best Answer

For the 'good' plan, all the table variable cardinality estimates are 1 row. This is the most common outcome when using table variables, unless trace flag 2453 is enabled, or a statement-level recompilation occurs (for example because OPTION (RECOMPILE) is used, or one of the regular tables in the query has passed its recompilation threshold).
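A minimal repro of the difference (the object names here are made up):

DECLARE @T TABLE (ID integer PRIMARY KEY);

INSERT @T (ID)
SELECT TOP (200000)
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns AS AC1
CROSS JOIN sys.all_columns AS AC2;

-- Estimated rows from @T: 1 (the standard table variable guess)
SELECT COUNT_BIG(*) FROM @T;

-- Estimated rows from @T: 200,000 (statement-level recompile sees the count)
SELECT COUNT_BIG(*) FROM @T OPTION (RECOMPILE);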

For the 'bad' plan, table variable cardinalities are accurate, implying one of the conditions mentioned above was in play. This may seem counter-intuitive, since better information usually leads to better plans, but table variables do not support statistics, so the extra information is rather limited. The optimizer knows there are 'x' rows, but has no idea about the distribution of values within those rows. A different kind of incomplete information, perhaps, but still.

Anyway, it just so happens that the plan built when the table variables are assumed to contain one row produces good performance. There is more than a little luck involved in this. Unless you enjoy debugging rare plan regressions, I would avoid relying on luck too much.

Specifics

The faulty plan reports these table variables with an estimated 130 billion rows.

The part of the plan you are referring to is:

[Image: performance spool plan fragment]

As you can see, it is the Table Spool that is estimated to produce ~130 billion rows; the table variable itself emits only 198,411 rows.

The sort and spool combination is designed to optimize repeated scans by caching the result from one iteration of the nested loops join and replaying the saved result on the next iteration if the correlated parameter(s) have not changed. The sort ensures any potential duplicates arrive together, since the spool only caches the most recent result. The estimate from the spool is the total number of rows across all iterations (198,411 rows from the table variable * 653,969 iterations).
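That arithmetic checks out: 198,411 * 653,969 ≈ 129.75 billion, which is the ~130 billion figure reported on the spool.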

The useful predicate relating the rows from the sort with the table variable is stuck on the nested loops left outer join iterator:

[Image: join predicate]

Looking at this in conjunction with the output columns from the table variable, we can conclude that an index on the table variable keyed on (PatientID, FirstTestDate) would almost certainly eliminate this problem.
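Since this is SQL Server 2014, that index can be declared inline with the table variable (a sketch only; the real column types are unknown, so these are assumptions):

DECLARE @Results TABLE
(
    PatientID     integer NOT NULL,
    FirstTestDate date    NULL,
    -- Inline index syntax for table variables is available from SQL Server 2014
    INDEX IX_Patient_FirstTest (PatientID, FirstTestDate)
);

On earlier versions, the only way to index a table variable is through a PRIMARY KEY or UNIQUE constraint in the declaration.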

A suitable index on sub_PSTRules could also remove the index and table spools seen there, though these are not having much of an effect on performance at this stage:

[Image: index spool plan fragment]

Nevertheless, it is wasteful to have SQL Server build a temporary nonclustered index each time, then throw it away at the end. The missing (filtered) index is likely:

CREATE INDEX give_me_a_good_name
ON dbo.sub_PSTRules
    (SubscriberSID, CinicSID, OfficeSID)
INCLUDE
    (PSTQuestionGroupSID)
WHERE
    OfficeSID IS NULL;