SQL Server – Eager Spool Operator Usefulness for Clustered Columnstore Delete

columnstore, sql-server, sql-server-2014, sql-server-2016

I'm testing deleting data from a clustered columnstore index.

I noticed that there is a large eager spool operator in the execution plan:

[Execution plan screenshot showing the Eager Spool]

This completes with the following characteristics:

  • 60 million rows deleted
  • 1.9 GiB TempDB used
  • 14 minutes execution time
  • Serial plan
  • 1 rebind on spool
  • Estimated cost for scan: 364.821

If I trick the estimator into underestimating, I get a faster plan that avoids use of TempDB:

[Estimated execution plan screenshot, no spool]

Estimated cost of scan: 56.901

(This is an estimated plan, but the numbers in comments are correct.)
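One way to produce this kind of underestimate (a sketch, not necessarily what my script does; the variable name, column type, and anchor date are illustrative) is to filter on a local variable and use OPTIMIZE FOR to anchor the estimate at a value that matches few rows:

-- Sketch: the optimizer estimates for 1900-01-01 (few or no rows),
-- while the run-time value still deletes the intended range.
DECLARE @cutoff datetime2(7) = DATEADD(day, -1, GETUTCDATE());

DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < @cutoff
OPTION (OPTIMIZE FOR (@cutoff = '19000101'));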

Interestingly, the spool disappears again if I flush the delta stores by running the following:

ALTER INDEX IX_Clustered ON Fact.RecordedMetricsDetail REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

The spool appears to only be introduced when there are more than some threshold of pages in the delta stores.

To check the size of the delta stores, I'm running the following query, which counts the in-row pages for the table (delta store pages are stored row-wise, so they count as in-row):

SELECT
    SUM(pstats.in_row_used_page_count) AS in_row_used_pages,
    SUM(pstats.in_row_data_page_count) AS in_row_data_pages
FROM sys.dm_db_partition_stats AS pstats
JOIN sys.partitions AS p
    ON pstats.partition_id = p.partition_id
WHERE p.[object_id] = OBJECT_ID(N'Fact.RecordedMetricsDetail');
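The open delta store row groups can also be inspected directly; here is a sketch using sys.column_store_row_groups (available from SQL Server 2014):

SELECT
    rg.partition_number,
    rg.row_group_id,
    rg.state_description,
    rg.total_rows,
    rg.size_in_bytes
FROM sys.column_store_row_groups AS rg
WHERE rg.[object_id] = OBJECT_ID(N'Fact.RecordedMetricsDetail')
AND rg.state_description = N'OPEN';  -- open row groups = delta stores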

Is there any plausible benefit to the spool iterator in the first plan? I have to assume it is intended as a performance enhancement rather than Halloween Protection, because its presence is not consistent.

I'm testing this on 2016 CTP 3.1, but I see the same behavior on 2014 SP1 CU3.

I've posted a script that generates the schema and data and walks through reproducing the problem here.

At this point the question is mostly curiosity about the optimizer's behavior, since I have a workaround for the issue that prompted it (a large spool filling TempDB): I'm now deleting by using partition switching instead, as sketched below.
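For completeness, the partition-switching approach looks roughly like this. It is a sketch: the staging table Fact.RecordedMetricsDetail_Stage and partition number 123 are hypothetical, and the staging table must match the source table's schema, clustered columnstore index, and partition scheme:

-- Move the target partition's rows to a hypothetical staging table,
-- then discard them; both tables must be partition-aligned.
ALTER TABLE Fact.RecordedMetricsDetail
SWITCH PARTITION 123
TO Fact.RecordedMetricsDetail_Stage PARTITION 123;

TRUNCATE TABLE Fact.RecordedMetricsDetail_Stage;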

Best Answer

Is there any plausible benefit to the spool iterator in the first plan?

This depends on what you regard as "plausible", but the answer according to the cost model is yes. In one sense that is trivially true, since the optimizer always chooses the cheapest plan it finds.

The real question is why the cost model considers the plan with the spool so much cheaper than the plan without. Consider estimated plans created for a fresh table (from your script) before any rows have been added to the delta store:

DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE);

The estimated cost for this plan is a huge 771,734 units:

Original plan

The cost is almost all associated with the Clustered Index Delete, because the deletions are expected to result in a great deal of random I/O. This is just the generic logic that applies to all data modifications. For example, an unordered set of modifications to a b-tree index is assumed to result in largely random I/O, with an associated high I/O cost.

Data-changing plans may feature a Sort to present rows in an order that will promote sequential access, for exactly these costing reasons. The impact is exacerbated in this case because the table is partitioned. Heavily partitioned, in fact: your script creates 15,000 partitions. Random updates to a heavily partitioned table are costed especially high, because switching partitions (rowsets) mid-stream is itself modelled as expensive.
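The partition count is easy to confirm; a quick sketch against sys.partitions:

SELECT COUNT(DISTINCT p.partition_number) AS partition_count
FROM sys.partitions AS p
WHERE p.[object_id] = OBJECT_ID(N'Fact.RecordedMetricsDetail')
AND p.index_id IN (0, 1);  -- heap or clustered index only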

The last major factor to consider is that the simple update query above (where 'update' means any data-changing operation, including a delete) qualifies for an optimization called "rowset sharing", where the same internal rowset is used for both scanning and updating the table. The execution plan still shows two separate operators, but nevertheless, there is only one rowset used.

I mention this because being able to apply this optimization means the optimizer takes a code path that simply does not consider the potential benefits of explicitly sorting to reduce the cost of random I/O. Where the table is a b-tree, this makes sense, because the structure is inherently ordered, so sharing the rowset provides all the potential benefits automatically.

The important consequence is that the costing logic for the update operator does not consider this ordering benefit (promoting sequential I/O, among other optimizations) where the underlying object is a column store. This is because column store modifications are not performed in-place; they use a delta store. The cost model therefore reflects a difference between shared-rowset updates on b-trees and on column stores.
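As an aside, the not-in-place behaviour is easy to observe. Rows deleted from compressed row groups are merely flagged in a delete bitmap, which shows up as deleted_rows in sys.column_store_row_groups (a sketch):

SELECT
    rg.state_description,
    SUM(rg.total_rows) AS total_rows,
    SUM(rg.deleted_rows) AS deleted_rows
FROM sys.column_store_row_groups AS rg
WHERE rg.[object_id] = OBJECT_ID(N'Fact.RecordedMetricsDetail')
GROUP BY rg.state_description;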

Nevertheless, in the special case of a (very!) partitioned columnstore, there might still be a benefit to preserved ordering, in that performing all updates to one partition before moving to the next might still be advantageous from an I/O point of view.

The standard cost logic is reused for column stores here, so a plan that preserves partition ordering (though not order within each partition) is costed lower. We can see this on the test query by using undocumented trace flag 2332 to require sorted input to the update operator. This sets the DMLRequestSort property to true at the update, and forces the optimizer to produce a plan that provides all rows for one partition before moving to the next:

DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE, QUERYTRACEON 2332);

The estimated cost for this plan is very much lower, at 52.5174 units:

DMLRequestSort=true plan

This reduction in cost is all due to the lower estimated I/O cost at the update. The introduced Spool performs no useful function, except it can guarantee output in partition order, as required by the update with DMLRequestSort = true (the serial scan of a column store index cannot provide this guarantee). The cost of the spool itself is considered to be relatively low, especially compared with the (probably unrealistic) reduction in cost at the update.

The decision about whether to require ordered input to the update operator is made very early in query optimization. The heuristics used in this decision have never been documented, but they can be determined through trial and error. The size of any delta stores appears to be an input to this decision. Once made, the choice is permanent for the query compilation: no USE PLAN hint will succeed, because the target plan either has ordered input to the update, or it does not.
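The property itself is easy to check: it appears as DMLRequestSort on the update operator in showplan XML. A sketch (run in SSMS; GO is the batch separator):

SET SHOWPLAN_XML ON;
GO
-- Look for DMLRequestSort="true" on the columnstore index delete
-- operator in the XML returned for this statement.
DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE, QUERYTRACEON 2332);
GO
SET SHOWPLAN_XML OFF;
GO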

There is another way to get a low-cost plan for this query without artificially limiting the cardinality estimate. A sufficiently low estimate to avoid the Spool will probably result in DMLRequestSort being false, and with it a very high estimated plan cost due to the expected random I/O. An alternative is to use trace flag 8649 (parallel plan) in conjunction with 2332 (DMLRequestSort = true):

DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE, QUERYTRACEON 2332, QUERYTRACEON 8649);

This results in a plan that uses per-partition batch-mode parallel scan and an order-preserving (merging) Gather Streams exchange:

Ordered Delete

Depending on the run-time effectiveness of partition ordering on your hardware, this may perform best of the three. That said, large modifications are not a great idea on column store, so the partition-switching idea is almost certainly better, if you can cope with the long compilation times and quirky plan choices often seen with partitioned objects, especially when the number of partitions is large.

Combining many relatively new features, especially near their limits, is a great way to obtain poor execution plans. The depth of optimizer support tends to improve over time, but using 15,000 columnstore partitions will likely always mean you live in interesting times.