SQL Server – Job waiting on CXCONSUMER

sql-server, sql-server-2016, wait-types

We have a procedure that runs as a SQL Server Agent job, and this job has started to hang in a transaction for many hours with CXCONSUMER waits and never completes. After we noticed it was hanging, we killed the job and started it manually, and it ran fine and completed. But right now it never completes and just hangs on CXCONSUMER waits, whether it is started manually or from the SQL Agent job.

According to sp_who2, sp_whoisactive and SQL Server Profiler, there are no active locks it is waiting on. It just keeps accumulating CXCONSUMER waits and burning processor time while doing basically nothing. The execution plan after 3 hours is in the screenshot below.

The other symptom is that tempdb is inaccessible through Properties while this job runs. I get the error "Property SpaceAvailable is not available for Database '[tempdb]'. This property may not exist for this object, or may not be retrievable due to insufficient access rights", even though there is space available in tempdb.

I would appreciate your help with this matter.

Thank you and best regards.

(screenshot: execution plan and sp_whoisactive output)

Best Answer

There are really two questions in this question:

  1. Why is the query slow?
  2. Why are tempdb properties inaccessible?

Let's start with #2.

Why are tempdb properties inaccessible?

You get this error:

Cannot show requested dialog. (SqlMgmt) Property SpaceAvailable is not available for Database '[tempdb]'. This property may not exist for this object, or may not be retrievable due to insufficient access rights. (Microsoft.SqlServer.Smo)

(screenshot: SSMS error dialog)

This will happen when you create a temp table inside a transaction. The CREATE TABLE holds metadata locks on system tables in tempdb, which block the queries that populate the database properties window and cause the opening of that dialog to time out & show this error.

If you run this statement in SSMS, and try to open the tempdb properties window at the same time, I expect it will open just fine:

SELECT o1.object_id
INTO #blep
FROM sys.objects o1
CROSS JOIN sys.objects o2
CROSS JOIN sys.objects o3
CROSS JOIN sys.objects o4
CROSS JOIN sys.objects o5;

However, drop the same query into a transaction, and you will get the error that you reported:

BEGIN TRANSACTION;
SELECT o1.object_id
INTO #blep
FROM sys.objects o1
CROSS JOIN sys.objects o2
CROSS JOIN sys.objects o3
CROSS JOIN sys.objects o4
CROSS JOIN sys.objects o5;
-- Leave the transaction open, then try to open the tempdb Properties dialog.
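
While that transaction is still open, a second session can see the metadata locks it is holding in tempdb. Here's a rough sketch using sys.dm_tran_locks; add a request_session_id filter if you already know which session is the culprit:

SELECT request_session_id,
       resource_type,
       resource_description,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID('tempdb')
  AND resource_type IN ('OBJECT', 'METADATA');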

While you didn't provide the code beyond what is included in the plan, the behavior would indicate that your SELECT...INTO is happening within a transaction, similar to the above example.

I would recommend never creating a temp table within a transaction, due to the blocking on system tables within tempdb. This affects concurrency, in addition to operational issues like the one with the database properties window.

If your procedure requires the use of transactions, you should create the table OUTSIDE the transaction, then begin the transaction & do the necessary work. If you depend on the behavior of SELECT...INTO to define the schema (e.g., with a dynamic/unknown schema), use SELECT...INTO...WHERE 1=2; to create the table empty, then begin your transaction and do the work.
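
A minimal sketch of that pattern (the table and column names here are made-up placeholders, not your actual code):

-- Define the temp table's schema before the transaction starts,
-- so the tempdb metadata locks are released right away.
SELECT SomeColumn, SomeOtherColumn
INTO #work
FROM dbo.SomeSourceTable
WHERE 1 = 2;    -- copies the schema, but no rows

BEGIN TRANSACTION;

INSERT INTO #work (SomeColumn, SomeOtherColumn)
SELECT SomeColumn, SomeOtherColumn
FROM dbo.SomeSourceTable;

-- ...do the rest of the work that actually needs the transaction...

COMMIT TRANSACTION;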

Now, on to the other question:

Why is the query slow?

I'm not sure which line in the sp_whoisactive output correlates to your query. All of them? Session_id 797 is showing high CXCONSUMER waits and has been running for 2 hours, so I'll assume that is the problematic session. It has burned almost 57.9 million milliseconds of CPU time (967 minutes) in 2h 12m of "real" time (132 minutes). That's about 7x more CPU time than elapsed time, which certainly indicates that it is busy doing parallel work.
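
If you want to sanity-check that ratio outside of sp_whoisactive, sys.dm_exec_requests reports both numbers in milliseconds; a quick sketch (I'm assuming session 797 from your screenshot):

SELECT session_id,
       status,
       cpu_time / 60000.0           AS cpu_minutes,
       total_elapsed_time / 60000.0 AS elapsed_minutes,
       cpu_time * 1.0 / NULLIF(total_elapsed_time, 0) AS cpu_to_elapsed_ratio
FROM sys.dm_exec_requests
WHERE session_id = 797;    -- the session I'm guessing is the problematic one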

I can tell from the schema that this is an SAP database. The SAP schema is a bit of a quirky one, with the MANDT column being the leading column of every clustered composite index. This column is essentially a "company code", used to separate data for various subsidiaries. SAP supplies this column in the join/filter predicate on EVERY query to ensure it's working with data for the right company. If your company doesn't keep separate books for separate subsidiaries, then the MANDT column always has the exact same value.

Because that non-selective predicate is supplied, it can trick SQL Server into showing a seek on the clustered primary key when it's really doing a table scan. If every row has a value of AMTWO for MANDT, and you query AND MANDT = 'AMTWO', the optimizer might think "Hey, that's the leading column on the PK, I can seek to that!" But in reality, it seeks to the first occurrence of that value (i.e., the first row) and reads through to the last occurrence of that value (i.e., the last row). Because every row has exactly the same value, it has to scan every row, even though it shows as a seek in the plan.
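
You can confirm how (non-)selective MANDT really is with a quick count per value. I'm using MSEG here because it shows up in your plan; adjust the schema prefix to match your SAP database:

SELECT MANDT, COUNT_BIG(*) AS row_count
FROM MSEG
GROUP BY MANDT
ORDER BY row_count DESC;

If that comes back with a single value (or one value covering nearly every row), then the "seek" on the clustered PK is effectively reading the whole table.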

The reality is that there's a better index to use, but SQL Server didn't realize that another index (or combination of indexes) would perform better.

Additionally, depending on your SQL Server version and whether you are using Trace Flag 2389, SAP often struggles with the "Ascending key problem" related to their heavy use of dates to key transactional data.

Both of these problems can be resolved by aggressively updating statistics on your SAP database.
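
As a sketch, assuming MSEG is the main offender (adjust the table/schema names for your system, and weigh the cost of FULLSCAN against a sampled update on a 65-million-row table):

-- How stale are the stats right now?
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('MSEG');

-- Then refresh them; FULLSCAN gives the best histogram, at a cost.
UPDATE STATISTICS MSEG WITH FULLSCAN;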

There certainly could be more going on with your query than stats, but based on my experience with SAP, ensuring you have up-to-date stats would be at the top of my priority list.

You say it "started to hang on transaction for many hours"... how fast was it running before? Was it processing a similar amount of data when it was fast? The plan indicates it's working off just over a million rows in #repositioning, 200k rows in #stock and 65 million rows in MSEG, so I wouldn't expect the query to be extremely fast to begin with.

After you update stats, I'd be keen to capture actual IO & CPU costs for an execution of this code, using something like Plan Explorer. That often gives much more insight into the problem, particularly if SQL Server has selected a "bad" plan and a better/faster plan exists but wasn't chosen.
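
If you want those numbers without extra tooling, SSMS can capture them directly; here's a rough sketch, where dbo.YourProcedure stands in for whatever the Agent job actually runs:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

EXEC dbo.YourProcedure;    -- placeholder for the job's stored procedure

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;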