SQL Server – Eager Spool Operator Usefulness for Clustered Columnstore Delete

columnstore, sql-server, sql-server-2014, sql-server-2016

I'm testing deleting data from a clustered columnstore index.

I noticed that there is a large eager spool operator in the execution plan:

[Execution plan screenshot showing the Eager Spool]

This completes with the following characteristics:

  • 60 million rows deleted
  • 1.9 GiB TempDB used
  • 14 minutes execution time
  • Serial plan
  • 1 rebind on spool
  • Estimated cost for scan: 364.821

If I trick the estimator into underestimating, I get a faster plan that avoids use of TempDB:

[Estimated execution plan screenshot, no spool]

Estimated cost of scan: 56.901

(This is an estimated plan, but the numbers in comments are correct.)
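One way to produce this kind of underestimate (a sketch, not necessarily what my script does; the variable name, column type, and anchor date are illustrative) is to filter on a local variable and use OPTIMIZE FOR to anchor the estimate at a value that matches few rows:

-- Sketch: the optimizer estimates for 1900-01-01 (few or no rows),
-- while the run-time value still deletes the intended range.
DECLARE @cutoff datetime2(7) = DATEADD(day, -1, GETUTCDATE());

DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < @cutoff
OPTION (OPTIMIZE FOR (@cutoff = '19000101'));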

Interestingly, the spool disappears again if I flush the delta stores by running the following:

ALTER INDEX IX_Clustered ON Fact.RecordedMetricsDetail REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);

The spool appears to only be introduced when there are more than some threshold of pages in the delta stores.

To check the size of the delta stores, I'm running the following query, which counts the in-row pages for the table (delta store pages are stored row-wise, so they count as in-row):

SELECT
    SUM(pstats.in_row_used_page_count) AS in_row_used_pages,
    SUM(pstats.in_row_data_page_count) AS in_row_data_pages
FROM sys.dm_db_partition_stats AS pstats
JOIN sys.partitions AS p
    ON pstats.partition_id = p.partition_id
WHERE p.[object_id] = OBJECT_ID(N'Fact.RecordedMetricsDetail');
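The open delta store row groups can also be inspected directly; here is a sketch using sys.column_store_row_groups (available from SQL Server 2014):

SELECT
    rg.partition_number,
    rg.row_group_id,
    rg.state_description,
    rg.total_rows,
    rg.size_in_bytes
FROM sys.column_store_row_groups AS rg
WHERE rg.[object_id] = OBJECT_ID(N'Fact.RecordedMetricsDetail')
AND rg.state_description = N'OPEN';  -- open row groups = delta stores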

Is there any plausible benefit to the spool iterator in the first plan? I have to assume it is intended as a performance enhancement rather than Halloween Protection, because its presence is not consistent.

I'm testing this on 2016 CTP 3.1, but I see the same behavior on 2014 SP1 CU3.

I've posted a script that generates the schema and data and walks through reproducing the problem here.

At this point the question is mostly curiosity about the optimizer's behavior, since I have a workaround for the issue that prompted it (a large spool filling TempDB): I'm now deleting by using partition switching instead, as sketched below.
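For completeness, the partition-switching approach looks roughly like this. It is a sketch: the staging table Fact.RecordedMetricsDetail_Stage and partition number 123 are hypothetical, and the staging table must match the source table's schema, clustered columnstore index, and partition scheme:

-- Move the target partition's rows to a hypothetical staging table,
-- then discard them; both tables must be partition-aligned.
ALTER TABLE Fact.RecordedMetricsDetail
SWITCH PARTITION 123
TO Fact.RecordedMetricsDetail_Stage PARTITION 123;

TRUNCATE TABLE Fact.RecordedMetricsDetail_Stage;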

Best Answer

Is there any plausible benefit to the spool iterator in the first plan?

This depends on what you regard as "plausible", but the answer according to the cost model is yes. In one sense that is trivially true, since the optimizer always chooses the cheapest plan it finds.

The real question is why the cost model considers the plan with the spool so much cheaper than the plan without. Consider estimated plans created for a fresh table (from your script) before any rows have been added to the delta store:

DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE);

The estimated cost for this plan is a huge 771,734 units:

Original plan

The cost is almost all associated with the Clustered Index Delete, because the deletions are expected to result in a great deal of random I/O. This is just the generic logic that applies to all data modifications. For example, an unordered set of modifications to a b-tree index is assumed to result in largely random I/O, with an associated high I/O cost.

Data-changing plans may feature a Sort to present rows in an order that will promote sequential access, for exactly these costing reasons. The impact is exacerbated in this case because the table is partitioned. Heavily partitioned, in fact: your script creates 15,000 partitions. Random updates to a heavily partitioned table are costed especially high, because switching partitions (rowsets) mid-stream is itself modelled as expensive.
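The partition count is easy to confirm; a quick sketch against sys.partitions:

SELECT COUNT(DISTINCT p.partition_number) AS partition_count
FROM sys.partitions AS p
WHERE p.[object_id] = OBJECT_ID(N'Fact.RecordedMetricsDetail')
AND p.index_id IN (0, 1);  -- heap or clustered index only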

The last major factor to consider is that the simple update query above (where 'update' means any data-changing operation, including a delete) qualifies for an optimization called "rowset sharing", where the same internal rowset is used for both scanning and updating the table. The execution plan still shows two separate operators, but nevertheless, there is only one rowset used.

I mention this because being able to apply this optimization means the optimizer takes a code path that simply does not consider the potential benefits of explicitly sorting to reduce the cost of random I/O. Where the table is a b-tree, this makes sense, because the structure is inherently ordered, so sharing the rowset provides all the potential benefits automatically.

The important consequence is that the costing logic for the update operator does not consider this ordering benefit (promoting sequential I/O, among other optimizations) where the underlying object is a column store. This is because column store modifications are not performed in-place; they use a delta store. The cost model therefore reflects a difference between shared-rowset updates on b-trees and on column stores.
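As an aside, the not-in-place behaviour is easy to observe. Rows deleted from compressed row groups are merely flagged in a delete bitmap, which shows up as deleted_rows in sys.column_store_row_groups (a sketch):

SELECT
    rg.state_description,
    SUM(rg.total_rows) AS total_rows,
    SUM(rg.deleted_rows) AS deleted_rows
FROM sys.column_store_row_groups AS rg
WHERE rg.[object_id] = OBJECT_ID(N'Fact.RecordedMetricsDetail')
GROUP BY rg.state_description;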

Nevertheless, in the special case of a (very!) partitioned columnstore, there might still be a benefit to preserved ordering, in that performing all updates to one partition before moving to the next might still be advantageous from an I/O point of view.

The standard cost logic is reused for column stores here, so a plan that preserves partition ordering (though not order within each partition) is costed lower. We can see this on the test query by using undocumented trace flag 2332 to require sorted input to the update operator. This sets the DMLRequestSort property to true at the update, and forces the optimizer to produce a plan that provides all rows for one partition before moving to the next:

DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE, QUERYTRACEON 2332);

The estimated cost for this plan is very much lower, at 52.5174 units:

DMLRequestSort=true plan

This reduction in cost is all due to the lower estimated I/O cost at the update. The introduced Spool performs no useful function, except it can guarantee output in partition order, as required by the update with DMLRequestSort = true (the serial scan of a column store index cannot provide this guarantee). The cost of the spool itself is considered to be relatively low, especially compared with the (probably unrealistic) reduction in cost at the update.

The decision about whether to require ordered input to the update operator is made very early in query optimization. The heuristics used in this decision have never been documented, but they can be determined through trial and error. The size of any delta stores appears to be an input to this decision. Once made, the choice is permanent for the query compilation: no USE PLAN hint will succeed, because the target plan either has ordered input to the update, or it does not.
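The property itself is easy to check: it appears as DMLRequestSort on the update operator in showplan XML. A sketch (run in SSMS; GO is the batch separator):

SET SHOWPLAN_XML ON;
GO
-- Look for DMLRequestSort="true" on the columnstore index delete
-- operator in the XML returned for this statement.
DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE, QUERYTRACEON 2332);
GO
SET SHOWPLAN_XML OFF;
GO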

There is another way to get a low-cost plan for this query without artificially limiting the cardinality estimate. A sufficiently low estimate to avoid the Spool will probably result in DMLRequestSort being false, and with it a very high estimated plan cost due to the expected random I/O. An alternative is to use trace flag 8649 (parallel plan) in conjunction with 2332 (DMLRequestSort = true):

DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE, QUERYTRACEON 2332, QUERYTRACEON 8649);

This results in a plan that uses per-partition batch-mode parallel scan and an order-preserving (merging) Gather Streams exchange:

Ordered Delete

Depending on the run-time effectiveness of partition ordering on your hardware, this may perform best of the three. That said, large modifications are not a great idea on column store, so the partition-switching idea is almost certainly better, if you can cope with the long compilation times and quirky plan choices often seen with partitioned objects, especially when the number of partitions is large.

Combining many relatively new features, especially near their limits, is a great way to obtain poor execution plans. The depth of optimizer support tends to improve over time, but using 15,000 columnstore partitions will likely always mean you live in interesting times.