Sql-server – Query cycles between suspended and runnable without ever completing

blockingquery-performancesql serversql-server-2012

I have an insert statement, extracting data through multiple inner joins from around 8 tables. It is getting into suspended state then runnable and vice versa forever; never coming back to running state.

Here is the estimated execution plan: https://www.brentozar.com/pastetheplan/?id=Sy6gPl-6L

While running the trace on this particular process, I see, it holds lock on tempdb and also on one user database but not running any statements on them. Simply its state is changing in between suspended & runnable. When I kill and re-run, it is getting executed normally. I see there are no processes or jobs conflicting with this insert statement.

The wait types are "io_completion" and "sos_scheduler_yield" while the query got stuck. I have used sys.syprocesses and sp_who2 active also dm_exec_requests to monitor. I have run trace while the process is running, once the process went into suspended state showing up "io_completion" to "sos_scheduler_yield" and vice versa.

I am not able to figure out why it is happening. Could you please put some light on it and advise any solutions.

Best Answer

Explanation of behavior

Some of the causes of the IO_COMPLETION wait type are:

  • Writing intermediate sort buffers to disk (these are called ‘Bobs’)
  • Reading and writing sort results from/to disk during a sort spill

There are two sorts that could be spilling, which uses tempdb.

A source of the slowness could also be one the unfortunate "many to many merge join" in the middle of the execution plan (which also uses tempdb, although it doesn't cause IO_COMPLETION waits):

screenshot of many to many merge join in the estimated plan

The estimated plan shows ~2 GB of data coming out of that merge join - and that amount of data could be even higher if the estimates are off.

You mentioned the problem is intermittent, which could be because of tempdb contention (if other queries are running at the time).

Suggestions for improvement

Temp table rewrite

The best option I can think of would be to break the query up into smaller chunks. For instance, you could just select the trans_header rows that meet your where clause into a #temp table. Then use that temp table in the main query instead of trans_header.

This could improve estimates and let the optimizer produce a better plan - potentially avoiding any spills or other tempdb activity.

That, generally, looks like this. Sorry if there are any slight typos, it's tough not having intellisense ?

SELECT 
    a.[term_id],
    a.[trans_ref_no],
    a.supplier_no,
    a.cust_no,
    a.trans_id,
    a.carrier,
    a.folio_yr, 
    a.folio_mo, 
    a.folio_no,
    a.transaction_date
INTO #trans_header_temp
FROM [TOPHAT].[trans_header] AS a with (nolock)
WHERE
    a.folio_yr=year(getdate())-1
    AND a.transaction_date <>' 00000';

INSERT INTO [TOPHAT].[terminal_volume]
           ([term_id]
           ,[terminal_name]
           ,[folio_yr]
           ,[folio_mo]
           ,[folio_no]
     ,[folio_year_month]
           ,[product_id]
           ,[prod_name]
           ,[product_id_name]
           ,[supplier_no]
           ,[supplier_name]
           ,[supplier_no_name]
           ,[customer_no]
           ,[customer_name]
     ,[carrier_no]
     ,[carrier_name]
           ,[bbl]
           ,[gallon]
           ,[customer_no_name]
           ,[trans_code]
           ,[transaction_description]
           ,[transaction_code_description]
           ,[transaction_year]
           ,[transaction_month]
           ,[transaction_date]
           ,[getdate])
select a.[term_id],
 tp.name as terminal_name,
 a.folio_yr, 
 a.folio_mo, 
 a.folio_no,
 cast(a.folio_yr + '-' + a.folio_mo +'-01' as date) as folio_year_month,
 p.prod_id as product_id, 
 p.prod_name as prod_name,
 p.prod_id+' '+p.prod_name  as product_id_name, 
 s.supplier_no,
 s.supplier_name,
 s.supplier_no+' '+s.supplier_name as supplier_no_name,
 cu.cust_no as customer_no, 
 cu.cust_name as customer_name, 
 a.carrier as carrier_no, 
 ca.name as carrier_name,
 sum((convert(decimal(12,2),b.net)*(case when b.sign is null then 1 else -1 end))/42) as 'bbl',
 sum(convert(decimal(12,2),b.net)*(case when b.sign is null then 1 else -1 end)) as gallon,
 cu.cust_no+' '+ cu.cust_name as customer_no_name,
 tv.trans_code, 
 tv.trans_desc as transaction_description, 
 tv.trans_code+' '+tv.trans_desc as transaction_code_description, 
 '20'+ left(a.transaction_date, 2) as transaction_year, 
 substring(a.transaction_date, 3, 2) as transaction_month,
  case when isdate(a.transaction_date) = 0 then null
   else 
   cast(a.transaction_date as date) end as transaction_date, 
 cast(getdate() as date)  as [getdate]
   from
#trans_header_temp as a with (nolock) 
 inner join [TOPHAT].[terminal_profile] as tp  with (nolock) 
  on a.term_id = tp.term_id
 inner join [TOPHAT].[trans_products] as b with (nolock) 
  on a.[trans_ref_no] = b.[trans_ref_no] 
 inner join [TOPHAT].[product] as p with (nolock) 
  on p.term_id = a.term_id
  and p.prod_id = b.prod_id
 inner join TOPHAT.supplier as s with (nolock) 
  on s.supplier_no = a.supplier_no
 inner join TOPHAT.customer as cu with (nolock) 
  on cu.cust_no = a.cust_no 
  and cu.supplier_no = a.supplier_no
 inner join TOPHAT.trans_value as tv with (nolock) 
  on tv.trans_code = a.trans_id
 inner join [TOPHAT].[carrier] as ca with (nolock)
  on ca.term_id = a.term_id
  and ca.carr_no = a.carrier

group by 
a.[term_id],
tp.name ,
a.folio_yr, 
a.folio_mo, 
a.folio_no,
p.prod_id,
p.prod_name,
p.prod_id+' '+p.prod_name,
s.supplier_no,
s.supplier_name,
s.supplier_no+' '+s.supplier_name,
cu.cust_no, 
cu.cust_name, 
cu.cust_no+' '+ cu.cust_name,
a.carrier, 
ca.name,
tv.trans_code,
tv.trans_desc,
tv.trans_code+' '+tv.trans_desc,
'20'+ left(a.transaction_date, 2),
substring(a.transaction_date, 3, 2) ,
a.transaction_date,
cast(a.folio_yr + '-' + a.folio_mo +'-01' as date)

Fix implicit conversions

Speaking of estimates, you have several implicit conversion warnings. It's difficult to advise on how to deal with them, since we don't have table and index definitions, but you should review them to see if they can be avoided. Especially the ones in the WHERE clause:

screenshot of plan warnings

Join hint

One low-effort approach would be to try and avoid that specific merge join with a join hint. This might cause performance to get worse, though, because it will force the order of the joins as written, which limits the optimizer can do quite a bit:

 inner HASH join [TOPHAT].[product] as p with (nolock) 
  on p.term_id = a.term_id
  and p.prod_id = b.prod_id

Note that this wouldn't likely help with sort spills.