SQL Server – Why does the parallel query sometimes process billions of extra rows?

parallelism, performance, query-performance, sql-server, sql-server-2016

I was running some data prep queries when I noticed that one of them sometimes ran much longer than expected: a query that should finish in seconds would occasionally take minutes. Here is the query:

CREATE TABLE dbo.X_HEAP (ID INT NOT NULL, ID2 INT NOT NULL);

-- The three-way cross join of spt_values provides far more rows than the
-- TOP needs, so the row goal should cut the scan short.
INSERT INTO dbo.X_HEAP WITH (TABLOCK)
SELECT TOP (20000000)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, 100000 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3;

Here's an actual execution plan for when the query runs quickly. I can't find anything out of the ordinary in the plan.

However, there is something wrong in the query plan when the query takes longer than expected. The final (reading from right to left) nested loop operator generates around 13.3 billion rows. I would expect it to generate only around 20 million rows because of the row goal introduced by TOP. I've observed between one and three threads spin out of control, and this seems directly responsible for the long runtime.

too many rows

Why does my parallel query sometimes process billions of extra rows? What can I do about it?

I am on SQL Server 2016 SP1. I've been unable to reproduce the issue with a compatibility level of 120.
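For reference, this is how the compatibility level can be flipped for testing (a minimal sketch; TestDb is a placeholder name for the database I ran the repro in):

-- Sketch: test under the SQL Server 2014 compatibility level, then revert.
-- TestDb is a placeholder database name.
ALTER DATABASE TestDb SET COMPATIBILITY_LEVEL = 120; -- SQL Server 2014 behavior
ALTER DATABASE TestDb SET COMPATIBILITY_LEVEL = 130; -- back to SQL Server 2016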

Microsoft SQL Server 2016 (SP1) (KB3182545) – 13.0.4001.0 (X64) Oct 28 2016 18:17:30 Copyright (c) Microsoft Corporation Developer Edition (64-bit) on Windows 10 Home 6.3 (Build 14393: )

Best Answer

Let's start by discussing why the fast query processes a little more than exactly 20 million rows. Here's a screenshot:

few extra rows

In my testing, the final nested loop operator running with MAXDOP 4 always generates 20008861 rows before the rows are reduced to exactly 20 million by the TOP operator. This is because parallel queries in SQL Server work with packets of data (it's the "PACKET" in the common CXPACKET wait event). We can see evidence of this by running a similar query that returns just 100 rows and with trace flag 8649 to force a parallel plan:

SELECT TOP (100) 
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, 100000 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (QUERYTRACEON 8649); -- undocumented trace flag that forces a parallel plan

Most of the time I see around 9000 total rows processed:

packet sizes

The rows processed by each thread are usually close to multiples of 1023. So perhaps the packet size for this query is 1023? If threads are working with packets of data and the other threads are already in the process of getting more data when we finally get the last row, it seems reasonable that we could see more rows processed than strictly necessary due to parallelism.

For another way to analyze the packet size, we can vary the MAXDOP of the query. At MAXDOP 4 with a row goal of 100000, I consistently see 108442 rows processed by the nested loop join. The number of rows processed by each thread can change, but the overall total was always the same in my tests. Reducing MAXDOP to 3 cuts the number of rows processed to 106395, and reducing MAXDOP to 2 cuts it to 104348. For this query and a few other similar ones, I always observed a reduction of 2047 rows per DOP. Suspiciously, that is nearly double the 1023 value from before. I'm not sure why the two numbers don't line up, but at the very least we can conclude that more extra rows are processed at higher DOP.
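A sketch of the test query, in case anyone wants to repeat the experiment (the 100000 row goal and the MAXDOP values are the ones described above; trace flag 8649 keeps the plan parallel even at low DOP):

-- Sketch of the MAXDOP experiment described above; I reran this with
-- MAXDOP 2, 3, and 4 and compared the nested loop join's row counts.
SELECT TOP (100000)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, 100000 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (MAXDOP 2, QUERYTRACEON 8649);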

That was somewhat educational, but it doesn't explain why the query sometimes processes billions of rows. To get insight into that, we can use the sys.dm_exec_query_profiles DMV introduced in SQL Server 2014. Here's a snapshot of some of the rows from the DMV during one of the slow runs:

profiles 1
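The snapshot itself can be taken with a query along these lines (a sketch: session_id 56 is a placeholder for the session running the insert, and the DMV only returns rows if that session is collecting an actual plan, for example via SET STATISTICS XML ON):

-- Sketch: per-thread progress of the running query.
-- session_id 56 is a placeholder; substitute the session running the INSERT.
SELECT node_id
     , physical_operator_name
     , thread_id
     , row_count
     , last_row_time
FROM sys.dm_exec_query_profiles
WHERE session_id = 56
ORDER BY node_id, thread_id;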

Here's the query plan with the relevant nodes labeled to give some context to the data:

labeled nodes

There are a few interesting things to note here. We've inserted 19997604 rows into the table, so the query is very close to done. Threads 1-3 for the nested loop join at node_id 8 have finished all of their work. The TOP operator at node 4 has already met its row goal of 20000000 rows and has closed. However, thread 4 for the nested loop join continues to fetch rows, and according to the last_row_time column those rows aren't being sent anywhere. Also, the parallelism operator at node 3 is still open. If I let the query run, nothing changes except the row count for thread 4:

profiles 2

Eventually it will run out of packets to send and the query will finish.

It seems as if the problem is directly related to parallelism. The query gets "stuck" at node 3 in the plan, which is the distribute streams operator for the parallel insert. SQL Server 2016 supports parallel inserts into existing heap tables; SQL Server 2014 does not, which could explain why I couldn't reproduce the issue at the lower compatibility level. So perhaps the problem is related to the parallel insert.

The most obvious workaround is to make the entire query run serially. However, suppose I still need part of the query to run in parallel for performance reasons. Parallel insert requires a TABLOCK hint, and if I remove that hint I'm unable to reproduce the issue, but then I also lose minimal logging. If I don't want to lose minimal logging, I can set trace flag 9495 at the global level (sketched after the index example below). It's important to also include a RECOMPILE hint or to wipe the plan from the cache, because a cached plan can still do a parallel insert even with trace flag 9495 enabled. An alternative to the trace flag is to do something else that prevents parallel inserts, such as adding a nonclustered index to the table. Any index will do, including a filtered one that can never contain any rows:

-- The contradictory filter means this index can never contain a row,
-- but its presence alone is enough to rule out a parallel insert.
CREATE INDEX X_DUMMY_INDEX ON dbo.X_HEAP (ID) WHERE (ID = -1 AND ID = -2);
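For completeness, the trace flag route looks like this (a sketch; DBCC TRACEON with -1 sets the flag globally, and the RECOMPILE hint prevents reuse of an already-cached parallel-insert plan):

DBCC TRACEON (9495, -1); -- globally disable parallel insert; minimal logging is kept

INSERT INTO dbo.X_HEAP WITH (TABLOCK)
SELECT TOP (20000000)
  ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, 100000 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
CROSS JOIN master..spt_values t3
OPTION (RECOMPILE); -- don't reuse a cached parallel-insert plan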

In summary, I can't say for sure why the query sometimes processes billions of rows, but it definitely seems like unintended behavior. Perhaps it is some kind of parallelism-related race condition. For this query, disabling parallel insert seems sufficient to prevent the bug from reappearing.