Sql-server – Can adding a columnstore index to a table affect read performance of a query that uses a rowstore index on the same table

columnstoreindexnonclustered-indexsql serversql-server-2016

I'm doing some testing of columnstore indexing on a single table that has about 500 million rows.
The performance gains on aggregate queries have been awesome (a query that previously took about 2 minutes to run now runs in 0 seconds to aggregate the entire table).

But I also noticed another test query that leverages seeking on an existing rowstore index on the same table is now running 4x as slow as it previously did before creating the columnstore index. I can repeatedly demonstrate when dropping the columnstore index the rowstore query runs in 5 seconds, and by adding back in the columnstore index the rowstore query runs in 20 seconds.

I'm keeping an eye on the actual execution plan for the rowstore index query, and it's almost exactly the same in both cases, regardless if the columnstore index exists. (It uses the rowstore index in both cases.)

The rowstore test query is:

SELECT *
INTO #TEMP
FROM Table1 WITH (FORCESEEK)
WHERE IntField1 = 571
    AND DateField1 >= '6/01/2020'

The rowstore index used in this query is: CREATE NONCLUSTERED INDEX IX_Table1_1 ON Table1 (IntField1, DateField1) INCLUDE (IntField2)

The columnstore test query is:

SELECT COUNT(DISTINCT IntField2) AS IntField2_UniqueCount, COUNT(1) AS RowCount
FROM Table1
WHERE IntField1 = 571 -- Some other test columnstore queries also don't use any WHERE predicates on this table
    AND DateField1 >= '1/1/2019' 

The columnstore index is: CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Table1_2 ON Table1 (IntField2, IntField1, DateField1)

Here is the execution plan for the rowstore index query before I create the columnstore index:
Execution Plan - Rowstore Index - Pre-Columnstore Index Creation

Here is the execution plan for the rowstore index query after I create the columnstore index:
Execution Plan - Rowstore Index - Post-Columnstore Index Creation

The only differences I notice between the two plans is the Sort operation's warning goes away after creating the columnstore index, and the Key Lookup and Table Insert (#TEMP) operators take significantly longer.

Here is the Sort operation's info with the warning (before creating the columnstore index):
Sort Operation - Warning

Here's the Sort operation's info without the warning (after creating the columnstore index):
Sort Operation

I would've thought a read query that is specifically leveraging the same rowstore index and execution plan in both cases should have roughly the same performance on every run, regardless of what other indexes exist on that table. What gives here?

Edit: Here's the TIME and IO stats before creating the index:
Stats - Before Columnstore Index Creation

Here's the stats after creating the columnstore index:
Stats - After Columnstore Index Creation

Best Answer

Adding the nonclustered columnstore index allows for a batch mode sort in the second execution plan. This causes all of the processing to be done on one thread - so even though the query has a parallel plan, it's essentially running serially. You can see that by looking at the details of the different operators.

I reproduced your problem locally, here's the sort operator with per-thread counts - as you can see, everything is on thread 1:

screenshot of SSMS showing thread skew and execution mode for the sort operator

Notice the "Actual Execution Mode" is "Batch."

Everything after the sort (the nested loops join, key lookup, etc) is essentially serial, which is what slows the query down.

See this KB article for details and possible solutions:

Adds trace flag 9358 to disable batch mode sort operations in a complex parallel query in SQL Server 2016

Batch mode sorts were introduced in SQL Server 2016 under compatibility level 130. If a query execution plan contains parallel batch mode sorts in conjunction with directly-upstream parallel operators, you may encounter degraded performance compared to row mode sort plan equivalents.

This occurs due to a parallel batch sort outputting fully sorted data via a single thread to the upstream parallel operator (for example, a parallel merge join operator). The performance degradation occurs when the upstream parallel operator uses single-threaded processing due to the incoming single-threaded batch mode sort operator.

For completeness, the options outlined there are either:

  • enable TF 9358
  • enable query optimizer hotfixes (through TF 4199, the QUERY_OPTIMIZER_HOTFIXES database option, or the ENABLE_QUERY_OPTIMIZER_HOTFIXES query hint)

Getting rid of the sort is another solution for this problem. The sort is only present to try and prevent too much random I/O from the nested loops join, which is using unordered prefetch, as mentioned in this article by Craig Freedman:

Optimizing I/O Performance by Sorting – Part 1

The plan uses the non-clustered index to avoid unnecessarily touching many rows. Yet, performing 64,000 random I/Os is still rather expensive so SQL Server adds a sort. By sorting the rows on the clustered index key, SQL Server transforms the random I/Os into sequential I/Os.

You can get rid of the sort by:

  • eliminating the need for the key lookup (by selecting less columns, or creating a covering nonclustered index)
  • disabling nested loops prefetching by adding (undocumented, unsupported trace flag) OPTION (QUERYTRACEON 9115) to the query