I'm finding very limited reading on how a large XML column affects table performance in SQL Server 2014/2016. For reference, the table is 36 GB; minus one XML column, it is 36 MB. Some queries request the column (usually single-row requests), but most do not. Is there a significant performance benefit to breaking that XML column out into its own linked table?
SQL Server – XML column performance in T-SQL
sql-server-2014, sql-server-2016, t-sql, xml
Related Solutions
I would use DATETIME2 as opposed to DATETIME for any new development work at this point. If you don't need sub-second granularity, use DATETIME2(0) and you'll see some space savings as well.
From here I would start experimenting. If performance is poor just querying off the single column, I'd then try adding persisted computed columns to separate out the date and time components. This will ensure that the separated DATE and TIME columns will always be the same as the "master" DATETIME2(n) column.
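A minimal sketch of that approach, assuming a hypothetical dbo.Events table with an EventDateTime DATETIME2 column (the names here are illustrative, not from the original answer):

-- Persisted computed columns derived from the "master" DATETIME2 column;
-- being computed, they can never drift out of sync with it.
ALTER TABLE dbo.Events
ADD EventDate AS CAST(EventDateTime AS date) PERSISTED,
    EventTime AS CAST(EventDateTime AS time(0)) PERSISTED;

-- The persisted date column can then be indexed for date-range queries.
CREATE NONCLUSTERED INDEX IX_Events_EventDate ON dbo.Events (EventDate);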
Is there any plausible benefit to the spool iterator in the first plan?
This depends on what you regard as "plausible", but the answer according to the cost model is yes. Of course this is true, because the optimizer always chooses the cheapest plan it finds.
The real question is why the cost model considers the plan with the spool so much cheaper than the plan without. Consider estimated plans created for a fresh table (from your script) before any rows have been added to the delta store:
DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE);
The estimated cost for this plan is a huge 771,734 units.
The cost is almost all associated with the Clustered Index Delete, because the deletions are expected to result in a great deal of random I/O. This is just the generic logic that applies to all data modifications. For example, an unordered set of modifications to a b-tree index are assumed to result in largely random I/O, with an associated high I/O cost.
Data-changing plans may feature a Sort to present rows in an order that will promote sequential access, for exactly these cost reasons. The impact is exacerbated in this case because the table is partitioned. Very partitioned, in fact: your script creates 15,000 partitions. Random updates to a heavily partitioned table are costed especially high, since switching partitions (rowsets) mid-stream also carries a high price.
The last major factor to consider is that the simple update query above (where 'update' means any data-changing operation, including a delete) qualifies for an optimization called "rowset sharing", where the same internal rowset is used for both scanning and updating the table. The execution plan still shows two separate operators, but nevertheless, there is only one rowset used.
I mention this because being able to apply this optimization means the optimizer takes a code path that simply does not consider the potential benefits of explicitly sorting to reduce the cost of random I/O. Where the table is a b-tree, this makes sense, because the structure is inherently ordered, so sharing the rowset provides all the potential benefits automatically.
The important consequence is that the costing logic for the update operator does not consider this ordering benefit (promoting sequential I/O or other optimizations) where the underlying object is column store. This is because column store modifications are not performed in-place; they use a delta store. The cost model is therefore reflecting a difference between shared-rowset updates on b-trees versus columnstores.
Nevertheless, in the special case of a (very!) partitioned columnstore, there might still be a benefit to preserved ordering, in that performing all updates to one partition before moving to the next might still be advantageous from an I/O point of view.
The standard cost logic is reused for column stores here, so a plan that preserves partition ordering (though not order within each partition) is costed lower. We can see this on the test query by using undocumented trace flag 2332 to require sorted input to the update operator. This sets the DMLRequestSort property to true at the update, and forces the optimizer to produce a plan that provides all rows for one partition before moving to the next:
DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE, QUERYTRACEON 2332);
The estimated cost for this plan is very much lower, at 52.5174 units.
This reduction in cost is all due to the lower estimated I/O cost at the update. The introduced Spool performs no useful function, except that it can guarantee output in partition order, as required by the update with DMLRequestSort = true (the serial scan of a column store index cannot provide this guarantee). The cost of the spool itself is considered relatively low, especially compared with the (probably unrealistic) reduction in cost at the update.
The decision about whether to require ordered input to the update operator is made very early on in query optimization. The heuristics used in this decision have never been documented, but can be determined through trial and error. It seems that the size of any delta stores is an input to this decision. Once made, the choice is permanent for the query compilation. No USE PLAN hint will succeed: the target of the plan either has ordered input to the update, or it does not.
There is another way to get a low-cost plan for this query without artificially limiting the cardinality estimate. A sufficiently low estimate to avoid the Spool will probably result in DMLRequestSort being false, resulting in a very high estimated plan cost due to the expected random I/O. An alternative is to use trace flag 8649 (parallel plan) in conjunction with 2332 (DMLRequestSort = true):
DELETE Fact.RecordedMetricsDetail
WHERE MeasurementTime < DATEADD(day,-1,GETUTCDATE())
OPTION (RECOMPILE, QUERYTRACEON 2332, QUERYTRACEON 8649);
This results in a plan that uses per-partition batch-mode parallel scan and an order-preserving (merging) Gather Streams exchange.
Depending on the run-time effectiveness of partition ordering on your hardware, this may perform best of the three. That said, large modifications are not a great idea on column store, so the partition-switching idea is almost certainly better, if you can cope with the long compilation times and quirky plan choices often seen with partitioned objects, especially when the number of partitions is large. A sketch of the switching approach follows.
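As a hedged sketch only: the staging table name and partition number below are assumptions, and the staging table must exactly match the source's structure, indexes (including the columnstore), and filegroup for the switch to succeed:

-- Switch the expired partition out in a metadata-only operation,
-- then discard its rows; far cheaper than a large DELETE.
ALTER TABLE Fact.RecordedMetricsDetail
SWITCH PARTITION 1 TO Fact.RecordedMetricsDetail_Staging;

TRUNCATE TABLE Fact.RecordedMetricsDetail_Staging;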
Combining many, relatively new, features, especially near their limits, is a great way to obtain poor execution plans. The depth of optimizer support tends to improve over time, but using 15,000 partitions of column store will likely always mean you live in interesting times.
Best Answer
The database engine is pretty smart about this: as far as I can tell, it does not read LOB pages unless it has to. But I created a test rig to double-check.
Test Rig
This creates a test table with 1 million rows: the non-XML data is around 36 MB, and the XML data is about 23 GB.
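A minimal sketch of such a rig (the table and column names are assumptions; scale the TOP and REPLICATE counts down for a quicker test):

CREATE TABLE dbo.XmlTest
(
    ID        int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    SomeValue int NOT NULL,
    XmlData   xml NULL
);

-- 1,000,000 rows with a roughly 20 KB XML value per row, large enough
-- that the XML is stored out of row in LOB pages.
INSERT dbo.XmlTest (SomeValue, XmlData)
SELECT TOP (1000000)
       ABS(CHECKSUM(NEWID())),
       CAST('<r>' + REPLICATE(CAST('<e>filler element text</e>' AS varchar(max)), 800) + '</r>' AS xml)
FROM sys.all_columns AS ac1
CROSS JOIN sys.all_columns AS ac2;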
Test Queries
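The four queries analysed below, sketched against the assumed dbo.XmlTest table (STATISTICS IO reports the logical and LOB read counts discussed):

SET STATISTICS IO, TIME ON;

-- Query 1: select * reads the XML column, forcing LOB reads
SELECT * FROM dbo.XmlTest;

-- Query 2: column list excluding the XML column, so no LOB reads
SELECT ID, SomeValue FROM dbo.XmlTest;

-- Non-clustered index added before Queries 3 and 4
CREATE NONCLUSTERED INDEX IX_XmlTest_SomeValue ON dbo.XmlTest (SomeValue);

-- Query 3: select * still scans the clustered index with LOB reads
SELECT * FROM dbo.XmlTest;

-- Query 4: now covered by the non-clustered index; no LOB reads
SELECT ID, SomeValue FROM dbo.XmlTest;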
Test Results
Analysis
Query 1: A select * causes a full table scan, so the database engine performs a high number of LOB reads to fetch the XML, which is stored out of row in LOB pages. (XML can be stored in-row if it is small enough, but the default is out-of-row.) This query takes 5 minutes, i.e. if you are reading LOB pages on a table like this, you will know about it.

Query 2: A select with a column list that excludes the XML column has the same number of logical reads in the top window (11,154), but no LOB reads. The query takes less than a second.

Query 3: After the non-clustered index has been added, a select * still causes a full table scan, as expected. This query behaves the same as Query 1.

Query 4: After the non-clustered index has been added, it covers the column-list query, so there is a non-clustered index scan with no LOB reads and lower logical reads overall.

The LOB Reads counter looks like a broad categorisation, as the XML storage pages are classed as TEXT_MIX_PAGE in the sys.dm_db_database_page_allocations DMF; a sketch of that check follows the summary.

In summary, there appears to be little benefit to splitting the XML column out, as the database engine can differentiate when and when not to perform LOB reads, at least in this admittedly simple test rig. See here for some further details.
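For reference, a sketch of how the page types can be inspected for the test table above (again assuming the hypothetical dbo.XmlTest name):

-- Count pages by type; the XML LOB pages show up as TEXT_MIX_PAGE.
SELECT page_type_desc, COUNT(*) AS page_count
FROM sys.dm_db_database_page_allocations
     (DB_ID(), OBJECT_ID('dbo.XmlTest'), NULL, NULL, 'DETAILED')
GROUP BY page_type_desc
ORDER BY page_count DESC;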