I've often read that checking for the existence of a row should always be done with EXISTS instead of with COUNT.
It's very rare for anything to always be true, especially when it comes to databases. There are any number of ways to express the same semantic in SQL. If there is a useful rule of thumb, it might be to write queries using the most natural syntax available (and, yes, that is subjective) and only consider rewrites if the query plan or performance you get is unacceptable.
For what it's worth, my own take on the issue is that existence queries are most naturally expressed using EXISTS. It has also been my experience that EXISTS tends to optimize better than the OUTER JOIN reject-NULL alternative. Using COUNT(*) and filtering on = 0 is another alternative that happens to have some support in the SQL Server query optimizer, but I have personally found it to be unreliable in more complex queries. In any case, EXISTS just seems much more natural (to me) than either of those alternatives.
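For illustration, here is how the three alternatives might look when checking for customers with no orders. This is only a sketch: dbo.customers and dbo.orders are hypothetical tables, and the point is simply that the same question can be phrased three ways that may optimize differently.

-- Hypothetical tables: dbo.customers (id) and dbo.orders (customer_id)

-- 1. NOT EXISTS: the most natural expression of the question
SELECT c.id
FROM dbo.customers AS c
WHERE NOT EXISTS (SELECT * FROM dbo.orders AS o WHERE o.customer_id = c.id);

-- 2. OUTER JOIN, reject NULL
SELECT c.id
FROM dbo.customers AS c
LEFT JOIN dbo.orders AS o ON o.customer_id = c.id
WHERE o.customer_id IS NULL;

-- 3. COUNT(*) filtered on = 0
SELECT c.id
FROM dbo.customers AS c
WHERE (SELECT COUNT(*) FROM dbo.orders AS o WHERE o.customer_id = c.id) = 0;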
I was wondering if there was an unheralded flaw with EXISTS that would explain the measurements I've done.
Your particular example is interesting, because it highlights the way the optimizer deals with subqueries in CASE expressions (and EXISTS tests in particular).
Subqueries in CASE expressions
Consider the following (perfectly legal) query:
DECLARE @Base AS TABLE (a integer NULL);
DECLARE @When AS TABLE (b integer NULL);
DECLARE @Then AS TABLE (c integer NULL);
DECLARE @Else AS TABLE (d integer NULL);
SELECT
CASE
WHEN (SELECT W.b FROM @When AS W) = 1
THEN (SELECT T.c FROM @Then AS T)
ELSE (SELECT E.d FROM @Else AS E)
END
FROM @Base AS B;
The semantics of CASE are that WHEN/ELSE clauses are generally evaluated in textual order. In the query above, it would be incorrect for SQL Server to return an error because the ELSE subquery returned more than one row, if the WHEN clause was satisfied. To respect these semantics, the optimizer produces a plan that uses pass-through predicates.
The inner side of each nested loops join is only evaluated when the pass-through predicate returns false. The overall effect is that CASE expressions are tested in order, and subqueries are only evaluated if no previous expression was satisfied.
CASE expressions with an EXISTS subquery
Where a CASE subquery uses EXISTS, the logical existence test is implemented as a semi-join, but rows that would normally be rejected by the semi-join have to be retained in case a later clause needs them. Rows flowing through this special kind of semi-join acquire a flag to indicate whether the semi-join found a match or not. This flag is known as the probe column.
In implementation terms, the logical subquery is replaced by a correlated join ('apply') with a probe column. The work is performed by a simplification rule in the query optimizer called RemoveSubqInPrj (remove subquery in projection). We can see the details using trace flag 8606:
SELECT
T1.ID,
CASE
WHEN EXISTS
(
SELECT 1
FROM #T2 AS T2
WHERE T2.ID = T1.ID
) THEN 1
ELSE 0
END AS DoesExist
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
OPTION (QUERYTRACEON 3604, QUERYTRACEON 8606);
Part of the input tree showing the EXISTS test is shown below:
ScaOp_Exists
    LogOp_Project
        LogOp_Select
            LogOp_Get TBL: #T2
            ScaOp_Comp x_cmpEq
                ScaOp_Identifier [T2].ID
                ScaOp_Identifier [T1].ID
This is transformed by RemoveSubqInPrj to a structure headed by:
LogOp_Apply (x_jtLeftSemi probe PROBE:COL: Expr1008)
This is the left semi-join apply with probe described previously. This initial transformation is the only one available in SQL Server query optimizers to date, and compilation will simply fail if this transformation is disabled.
One of the possible execution plan shapes for this query is a direct implementation of that logical structure:
The final Compute Scalar evaluates the result of the CASE expression using the probe column value.
The basic shape of the plan tree is preserved when the optimizer considers other physical join types for the semi-join. Of these, only merge join supports a probe column, so a hash semi-join, though logically possible, is not considered.
Notice that the merge join outputs an expression labelled Expr1008 (that the name is the same as before is a coincidence), though no definition for it appears on any operator in the plan. This is just the probe column again. As before, the final Compute Scalar uses this probe value to evaluate the CASE expression.
The problem is that the optimizer doesn't fully explore alternatives that only become worthwhile with a merge (or hash) semi-join. In the nested loops plan, there is no advantage to checking whether rows in T2 match the range on every iteration. With a merge or hash plan, this could be a useful optimization.
If we add a matching BETWEEN predicate to T2 in the query, all that happens is that the check is performed for each row as a residual predicate on the merge semi-join (tough to spot in the execution plan, but it is there):
SELECT
T1.ID,
CASE
WHEN EXISTS
(
SELECT 1
FROM #T2 AS T2
WHERE T2.ID = T1.ID
AND T2.ID BETWEEN 5000 AND 7000 -- New
) THEN 1
ELSE 0
END AS DoesExist
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000;
We would hope that the BETWEEN predicate would instead be pushed down to T2, resulting in a seek. Normally, the optimizer would consider doing this (even without the extra predicate in the query). It recognizes implied predicates (the BETWEEN on T1 and the join predicate between T1 and T2 together imply the BETWEEN on T2) without them being present in the original query text. Unfortunately, the apply-probe pattern means this is not explored.
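As a point of contrast, when the EXISTS test appears in the WHERE clause rather than inside a CASE expression, it becomes an ordinary semi-join with no probe column, so the implied-predicate derivation described above is available. A sketch using the same temporary tables (the exact plan shape will still depend on indexes and statistics):

SELECT T1.ID
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000
AND EXISTS
(
    -- No probe column is needed here, so the optimizer is free to
    -- derive the implied BETWEEN on T2 and consider seeks on both inputs
    SELECT 1
    FROM #T2 AS T2
    WHERE T2.ID = T1.ID
);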
There are ways to write the query to produce seeks on both inputs to a merge semi-join. One way involves writing the query in quite an unnatural way (defeating the reason I generally prefer EXISTS):
WITH T2 AS
(
SELECT TOP (9223372036854775807) *
FROM #T2 AS T2
WHERE ID BETWEEN 5000 AND 7000
)
SELECT
T1.ID,
DoesExist =
CASE
WHEN EXISTS
(
SELECT * FROM T2
WHERE T2.ID = T1.ID
) THEN 1 ELSE 0 END
FROM #T1 AS T1
WHERE T1.ID BETWEEN 5000 AND 7000;
I wouldn't be happy writing that query in a production environment; it's just to demonstrate that the desired plan shape is possible. If the real query you need to write uses CASE in this particular way, and performance suffers because there is no seek on the probe side of a merge semi-join, you might consider writing the query using different syntax that produces the right results and a more efficient execution plan. One possibility is sketched below.
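For example, a left join against a pre-filtered, duplicate-free derived table expresses the same result without the apply-probe pattern. This is only a sketch, not necessarily the optimal choice; as always, verify the plan and the results against your own data:

SELECT
    T1.ID,
    DoesExist = CASE WHEN T2.ID IS NULL THEN 0 ELSE 1 END
FROM #T1 AS T1
LEFT JOIN
(
    -- DISTINCT guarantees at most one match per T1 row, preserving
    -- the one-row-per-outer-row behaviour of EXISTS
    SELECT DISTINCT T2.ID
    FROM #T2 AS T2
    WHERE T2.ID BETWEEN 5000 AND 7000
) AS T2
    ON T2.ID = T1.ID
WHERE T1.ID BETWEEN 5000 AND 7000;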
Sometimes I wonder if SHORT scripts really are the best thing to focus on.
The size of a script has little to do with how efficiently the query will execute. A more compact statement will likely consume fewer resources in terms of compilation, but (re)compilation is usually a rare occurrence in a live system.
Fewer table accesses is usually desirable, though, and this does lead to more compact code.
Very generally speaking, a smaller execution plan will yield better results, and a lower estimated cost will yield better results. Again, though, it's highly situational. Cost estimates in particular can be way off in some cases. It's important to measure the actual execution time, because at the end of the day, that's what matters.
With left joins I can achieve what I want in just a few lines. But then I tried a longer script using unions. Which is the better method?
First of all, we need to know how much data will be in these tables in a real system. Right now there's so little that it will be difficult to use the STATISTICS TIME performance metrics to figure out a winner; the results that come back will be dominated by factors other than the query execution. With more data, it's likely the plans will change, rendering the comparison here moot.
Having said that, by looking at the query plans as they are now from a logical point of view, the first one is the winner.
You can see that the Clustered Index Scan of quantities appears once in the first plan, while it appears four times in the second one. The second plan also contains an expensive Distinct Sort as a result of using UNION (this operator could be eliminated by using UNION ALL instead, which won't change the results here, as illustrated below).
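A minimal sketch of the difference, using a hypothetical table. Note that UNION ALL is only a safe substitute when the branches cannot return overlapping or duplicate rows, which is the case in your query:

DECLARE @quantities TABLE (product_id integer, size_id integer, qty integer);

-- UNION must deduplicate, typically via a Distinct Sort or an aggregate
SELECT product_id FROM @quantities WHERE size_id = 1
UNION
SELECT product_id FROM @quantities WHERE size_id = 2;

-- UNION ALL simply concatenates its inputs, with no Distinct Sort
SELECT product_id FROM @quantities WHERE size_id = 1
UNION ALL
SELECT product_id FROM @quantities WHERE size_id = 2;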
The first query could also probably be improved by getting index seeks on the colors and sizes tables instead of table scans. It might be worth trying a hash match plan as well (which is what you'll probably see when quantities and products are larger), but for tables this small, the startup cost may be too much overhead to be of benefit.
What I would suggest you do is run each of the statements you want to test 10,000+ times in a loop, figure out the average execution time, and then compare.
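A minimal sketch of such a timing harness; the statement under test is a placeholder:

DECLARE @i integer = 0;
DECLARE @start datetime2 = SYSDATETIME();

WHILE @i < 10000
BEGIN
    -- Place the statement under test here, for example:
    -- SELECT ... FROM quantities ...;
    SET @i += 1;
END;

-- Average elapsed time per execution, in microseconds
SELECT DATEDIFF(MICROSECOND, @start, SYSDATETIME()) / 10000.0 AS avg_microseconds;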
Best Answer
Merely having the XML column in the table does not have that effect. It is the presence of XML data that, under certain conditions, causes some portion of a row's data to be stored off-row, on LOB_DATA pages. And while one (or maybe several ;-) might argue that, duh, the XML column implies that there will indeed be XML data, it is not guaranteed that the XML data will need to be stored off-row: unless the row is pretty much already filled before adding any XML data, small documents (up to 8000 bytes) might fit in-row and never go to a LOB_DATA page.

Scanning refers to looking at all of the rows. Of course, when a data page is read, all of the in-row data is read, even if you selected a subset of the columns. The difference with LOB data is that if you don't select that column, then the off-row data won't be read. Hence it is not really fair to draw a conclusion about how efficiently SQL Server can scan this Clustered Index, since you didn't exactly test that (or you tested half of it). You selected all columns, which includes the XML column, and as you mentioned, that is where most of the data is located.
So we already know that the SELECT TOP 1000 * test wasn't merely reading a series of 8k data pages, all in a row, but instead jumping to other locations per each row. The exact structure of that LOB data can vary based on how large it is. Based on research shown here (What is the Size of the LOB Pointer for (MAX) Types Like Varchar, Varbinary, Etc?), there are two types of off-row LOB allocations: an Inline Root (in-row pointers directly to the LOB pages, used for values up to roughly 42,000 bytes, space permitting) and a TEXT_TREE (a single in-row pointer to a separate page of pointers to the LOB pages, used for larger values).

One of these two situations is occurring each time you retrieve LOB data that is over 8000 bytes or that just didn't fit in-row. I posted a test script on PasteBin.com (T-SQL script to test LOB allocations and reads) that shows the 3 types of LOB allocations (based on the size of the data) as well as the effect each of those has on logical and physical reads. In your case, if the XML data really is less than 42,000 bytes per row, then none of it (or very little of it) should be in the least-efficient TEXT_TREE structure.
If you wanted to test how quickly SQL Server can scan that Clustered Index, do the SELECT TOP 1000 but specify one or more columns, not including that XML column. How does that affect your results? It should be quite a bit faster.

Given that we have an incomplete description of the actual table structure and data pattern, any answer might not be optimal depending on what those missing details are. With that in mind, I would say that there is nothing obviously unreasonable about your table structure or data pattern.
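For example (a sketch; the column names here are placeholders):

-- Reads only IN_ROW data pages; the XML column's LOB pages are never touched
SELECT TOP 1000 tab.ID, tab.CreatedDate
FROM dbo.[TABLE] AS tab;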
That made selecting all columns, or even just the XML data (now in VARBINARY), faster, but it actually hurts queries that don't select the "XML" data. Assuming you have about 50 bytes in the other columns and a FILLFACTOR of 100, then:

- No Compression: 15k of XML data should require 2 LOB_DATA pages, which then requires 2 pointers for the Inline Root. The first pointer is 24 bytes and the second is 12, for a total of 36 bytes stored in-row for the XML data. The total row size is 86 bytes, and you can fit about 93 of those rows onto an 8060-byte data page. Hence, 1 million rows requires 10,753 data pages.
- Custom Compression: 2.5k of VARBINARY data will fit in-row. The total row size is 2610 bytes (2.5 * 1024 = 2560, plus the 50 bytes of other columns), and you can fit only 3 of those rows onto an 8060-byte data page. Hence, 1 million rows requires 333,334 data pages.

Ergo, implementing custom compression results in a 30x increase in data pages for the Clustered Index. Meaning, all queries using a Clustered Index scan now have about 322,500 more data pages to read. Please see the detailed section below for additional ramifications of doing this type of compression.
I would caution against doing any refactoring based on the performance of SELECT TOP 1000 *. That is not likely to be a query that the application will even issue, and it should not be used as the sole basis for potentially needless optimization(s).

For more detailed info and more tests to try, please see the section below.
This Question cannot be given a definitive answer, but we can at least make some progress and suggest additional research to help move us closer to figuring out the exact issue (ideally based on evidence).
What we know:
- The table has an XML column and several other columns of types INT, BIGINT, UNIQUEIDENTIFIER, "etc".
- The XML column "size" is, on average, approximately 15k.
- After running DBCC DROPCLEANBUFFERS, it takes 20 - 25 seconds for the following query to complete: SELECT TOP 1000 * FROM TABLE
What we think we know:
XML Compression might help. How exactly would you do the compression in .NET? Via the GZipStream or DeflateStream classes? This is not a zero-cost option. It will certainly compress some of the data by a large percentage, but it will also require more CPU, as you will need an additional process to compress / decompress the data each time. This plan would also completely remove your ability to:

- query the XML data using the .nodes, .value, .query, and .modify XML functions;
- index the XML data.
Please keep in mind (since you mentioned that the XML is "highly redundant") that the XML datatype is already optimized in that it stores the element and attribute names in a dictionary, assigning an integer index ID to each item, and then using that integer ID throughout the document (hence it does not repeat the full name per each usage, nor does it repeat it again as a closing tag for elements). The actual data also has extraneous white space removed. This is why extracted XML documents don't retain their original structure, and why empty elements extract as <element /> even if they went in as <element></element>. So any gains from compressing via GZip (or anything else) will only be found by compressing the element and/or attribute values, which is a much smaller surface area to improve than most would expect, and most likely not worth the loss of capabilities noted directly above.

Please also keep in mind that compressing the XML data and storing the VARBINARY(MAX) result won't eliminate LOB access; it will just reduce it. Depending on the size of the rest of the data on the row, the compressed value might fit in-row, or it might still require LOB pages.

That information, while helpful, is not nearly enough. There are a lot of factors that influence query performance, so we need a much more detailed picture of what is going on.
What we don't know, but need to:
- Does SELECT * matter? Is this a pattern that you use in code? If so, why?
- What is the performance of selecting just the XML column: SELECT TOP 1000 XmlColumn FROM TABLE;?
- How much of the 20 - 25 seconds it takes to return these 1000 rows is related to network factors (getting the data across the wire), and how much is related to client factors (rendering that approximately 15 MB plus the rest of the non-XML data into the grid in SSMS, or possibly saving to disk)?
Factoring out these two aspects of the operation can sometimes be done by simply not returning the data. Now, one might think to select into a Temporary Table or Table Variable, but this would just introduce a few new variables (i.e. disk I/O for tempdb, Transaction Log writes, possible auto-growth of the tempdb data and/or log file, and the need for space in the Buffer Pool, etc). All of those new factors can actually increase the query time. Instead, I typically store the columns into variables (of the appropriate datatype; not SQL_VARIANT) that get overwritten with each new row (i.e. SELECT @Column1 = tab.Column1, ...).

HOWEVER, as was pointed out by @PaulWhite in this DBA.StackExchange Q & A, Logical reads different when accessing the same LOB data, with additional research of my own posted on PasteBin (T-SQL script to test various scenarios for LOB reads), LOBs are not accessed consistently between SELECT, SELECT INTO, SELECT @XmlVariable = XmlColumn, SELECT @XmlVariable = XmlColumn.query(N'/'), and SELECT @NVarCharVariable = CONVERT(NVARCHAR(MAX), XmlColumn). So our options are a little more limited here, but here is what can be done: you could execute the query via SQLCMD.EXE and direct the output to go to nowhere via: -o NUL:.
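A sketch of the variable-assignment pattern mentioned above (the column names and datatypes are placeholders, and the LOB-access caveat just described still applies):

DECLARE @ID int;
DECLARE @Xml xml;

-- Each variable is overwritten on every row, so nothing is rendered
-- in SSMS or sent across the network to a client grid
SELECT TOP 1000
       @ID  = tab.ID,
       @Xml = tab.XmlColumn
FROM   dbo.[TABLE] AS tab;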
What is the actual data size for the XML columns being returned? The average size of that column across the entire table doesn't really matter if the "TOP 1000" rows contain a disproportionately large portion of the total XML data. If you want to know about the TOP 1000 rows, then look at those rows; a measurement along those lines is sketched below. Please also post the full CREATE TABLE statement, including all indexes, and the exact results of the following query:
UPDATE
It occurred to me that I should try to reproduce this scenario to see if I experience similar behavior. So, I created a table with several columns (similar to the vague description in the Question), and then populated it with 1 million rows, and the XML column has approximately 15k of data per row (see code below).
What I found is that doing a SELECT TOP 1000 * FROM TABLE completed in 8 seconds the first time, and in 2 - 4 seconds each time thereafter (yes, executing DBCC DROPCLEANBUFFERS before each run of the SELECT * query). And my several-year-old laptop is not fast: SQL Server 2012 SP2 Developer Edition, 64 bit, 6 GB RAM, dual 2.5 GHz Core i5, and a 5400 RPM SATA drive. I am also running SSMS 2014, SQL Server Express 2014, Chrome, and several other things.

Based on the response time of my system, I will repeat that we need more info (i.e. specifics about the table and data, results of the suggested tests, etc) in order to help narrow down the cause of the 20 - 25 second response time that you are seeing.
And, because we want to isolate the time taken to read just the non-LOB pages, I ran the following query to select all but the XML column (one of the tests I suggested above). It returns in 1.5 seconds fairly consistently.
Conclusion (for the moment)
Based on my attempt to recreate your scenario, I don't think we can point to either the SATA drive or non-sequential I/O as the main cause of the 20 - 25 seconds, especially because we still don't know how fast the query returns when not including the XML column. Also, I was not able to reproduce the large number of Logical Reads (non-LOB) that you are showing, but I have a feeling that I need to add more data to each row in light of that and of the statement that roughly 90% of the pages are LOB_DATA.
My table has 1 million rows, each having just over 15k of XML data, and sys.dm_db_index_physical_stats shows that there are 2 million LOB_DATA pages. The remaining 10% would then be 222k IN_ROW data pages, yet I only have 11,630 of those. So once again, we need more info regarding the actual table schema and actual data.