Sql-server – Why does minimal logging appear to result in a smaller table size

sql server

I'm testing using the publicly available AdventureworksDW2016CTP3 database.

If I insert the first 100k rows from FactResellerSalesXL_PageCompressed into a page compressed heap without minimal logging I get a table size of 12552 KB. Doing a rebuild of the table brings down the size to 6024 KB. If I do the same insert with minimal logging I get a table size of 5832 KB. Why is the initial table size over 50% smaller when minimal logging is used?

Below is code to reproduce the issue. I'm using SQL Server 2016 SP1:

Microsoft SQL Server 2016 (SP1) (KB3182545) – 13.0.4001.0 (X64) Oct 28 2016 18:17:30 Copyright (c) Microsoft Corporation Developer Edition (64-bit) on Windows 10 Home 6.3 (Build 14393: )

No trace flags are enabled.

The actual execution plans are effectively the same. I've verified that minimal logging occurs for the second insert but not the first. Changing the order of inserts does not make a difference. Clearing the buffer cache and plan cache also does not make a difference.

Table creation script:

-- create page compressed heap copy of FactResellerSalesXL_PageCompressed
SELECT * INTO FactResellerSalesXL_PageCompressed_Heap
FROM FactResellerSalesXL_PageCompressed
WHERE 1 = 0;
ALTER TABLE FactResellerSalesXL_PageCompressed_Heap REBUILD WITH (DATA_COMPRESSION = PAGE);

Code for the first insert:

-- insert 100k rows without minimal logging
INSERT INTO FactResellerSalesXL_PageCompressed_Heap
SELECT TOP 100000 *
FROM FactResellerSalesXL_PageCompressed
ORDER BY SalesOrderNumber, SalesOrderLineNumber
OPTION (MAXDOP 1);

-- check space
EXEC sp_spaceused FactResellerSalesXL_PageCompressed_Heap;

Result:

╔═════════════════════════════════════════╦════════╦══════════╦══════════╦════════════╦════════╗
║                  name                   ║  rows  ║ reserved ║   data   ║ index_size ║ unused ║
╠═════════════════════════════════════════╬════════╬══════════╬══════════╬════════════╬════════╣
║ FactResellerSalesXL_PageCompressed_Heap ║ 100000 ║ 12808 KB ║ 12496 KB ║ 8 KB       ║ 304 KB ║
╚═════════════════════════════════════════╩════════╩══════════╩══════════╩════════════╩════════╝

Code for the second insert:

TRUNCATE TABLE FactResellerSalesXL_PageCompressed_Heap;

-- insert 100k rows with minimal logging
INSERT INTO FactResellerSalesXL_PageCompressed_Heap WITH (TABLOCK)
SELECT TOP 100000 *
FROM FactResellerSalesXL_PageCompressed
ORDER BY SalesOrderNumber, SalesOrderLineNumber
OPTION (MAXDOP 1);

-- check space
EXEC sp_spaceused FactResellerSalesXL_PageCompressed_Heap;

Result:

╔═════════════════════════════════════════╦════════╦══════════╦═════════╦════════════╦════════╗
║                  name                   ║  rows  ║ reserved ║  data   ║ index_size ║ unused ║
╠═════════════════════════════════════════╬════════╬══════════╬═════════╬════════════╬════════╣
║ FactResellerSalesXL_PageCompressed_Heap ║ 100000 ║ 5832 KB  ║ 5776 KB ║ 8 KB       ║ 48 KB  ║
╚═════════════════════════════════════════╩════════╩══════════╩═════════╩════════════╩════════╝

Best Answer

When I initially looked over your question, I thought that perhaps the second insert was benefitting from the first insert. Now that I look more closely, my initial though was that, due to the exclusive table lock, you were getting much denser pages than without the lock. This is why the rebuild on the heap created without the lock gets you down to almost identical table size.

To test my theory, I did things slightly differently. I generated two separate tables in advance, and then ran each of your queries so that I created two tables that both existed at the same time (this eliminates quasi-related things, like page reuse from the truncate you performed). This way I could compare them directly.

SELECT * INTO dbo.Heap_Tablock 
FROM FactResellerSalesXL_PageCompressed WHERE 1 = 0;
ALTER TABLE dbo.Heap_Tablock REBUILD WITH (DATA_COMPRESSION = PAGE);

SELECT * INTO dbo.Heap_NoTablock 
FROM dbo.FactResellerSalesXL_PageCompressed WHERE 1 = 0;
ALTER TABLE dbo.Heap_NoTablock REBUILD WITH (DATA_COMPRESSION = PAGE);

My first observation was that the population without tablock, even if run second (so benefitting from data already in cache), execution time was about 4X (only about 2X with clean BP each time). This is partly due to minimal logging, of course. But all of it? Not sure yet.

-- 3 seconds; 9 seconds after DROPCLEANBUFFERS
INSERT INTO dbo.Heap_Tablock  WITH (TABLOCK)
SELECT TOP 100000 * FROM dbo.FactResellerSalesXL_PageCompressed
ORDER BY SalesOrderNumber, SalesOrderLineNumber
OPTION (MAXDOP 1);

-- 13 seconds; 17 seconds after DROPCLEANBUFFERS
INSERT INTO dbo.Heap_NoTablock 
SELECT TOP 100000 * FROM dbo.FactResellerSalesXL_PageCompressed
ORDER BY SalesOrderNumber, SalesOrderLineNumber
OPTION (MAXDOP 1);

Rather than look at sp_spaceused, which is rather high level and doesn't answer any questions, I looked at sys.dm_db_database_page_allocations instead. First, I just looked at the counts:

SELECT COUNT(*) FROM sys.dm_db_database_page_allocations(
  DB_ID(), OBJECT_ID(N'dbo.Heap_Tablock'), 0, NULL, 'LIMITED');

-- 737

SELECT COUNT(*) FROM sys.dm_db_database_page_allocations(
  DB_ID(), OBJECT_ID(N'dbo.Heap_NoTablock'), 0, NULL, 'LIMITED');

-- 1,569

So, without tablock, a lot more pages are associated with the table. But are they all allocated? Are the pages in the smaller table denser, contain ghost records, more likely to be compressed? Inquiring minds want to know!

SELECT is_allocated, is_page_compressed, has_ghost_records, COUNT(*), 
  AVG(1.0*page_free_space_percent)
FROM sys.dm_db_database_page_allocations(
  DB_ID(), OBJECT_ID(N'dbo.Heap_Tablock'), 0, NULL, 'DETAILED')
WHERE page_type = 1
GROUP BY page_type, is_allocated, is_page_compressed, has_ghost_records;

-- is_allocated  is_page_compressed  has_ghost_records  #     %
-- ------------  ------------------  -----------------  ----  ----------
--            0                   2                  0     2   50.000000
--            1                   1                  0   721  100.000000

SELECT is_allocated, is_page_compressed, has_ghost_records, COUNT(*),
  AVG(1.0*page_free_space_percent)
FROM sys.dm_db_database_page_allocations(
  DB_ID(), OBJECT_ID(N'dbo.Heap_NoTablock'), 0, NULL, 'DETAILED')
WHERE page_type = 1
GROUP BY page_type, is_allocated, is_page_compressed, has_ghost_records;

-- is_allocated  is_page_compressed  has_ghost_records  #     %
-- ------------  ------------------  -----------------  ----  ----------
--            1                   0                  0  1562   99.967989
---- *****************************^  DANGER!

I think this is an additional, unadvertised benefit of the exclusive lock (and it also explains the faster execution time: less I/O).

Again, just a theory based on zero source code access and a reluctance to fire up a debugger, but it seems like populating a heap, without the exclusive lock, does not page compress the data initially - it waits for the rebuild.

This is somewhat confirmed in Data Compression: Strategy, Capacity Planning and Best Practices, under "Manipulating Compressed Data", in the heap section of the table:

The newly inserted row is page-compressed:
- if new row goes to an existing page with page compression
- if the new row is inserted through BULK INSERT with TABLOCK
- if the new row is inserted through INSERT INTO ... (TABLOCK) SELECT ... FROM
Otherwise, the row is row-compressed.*

* The resulting row-compressed pages can be page-compressed by running a heap rebuild with page compression.

That is not a verbatim confirmation of what we're seeing, but it seems to jive, since you are inserting rows into new pages that can't already have been page compressed, and without the TABLOCK hint, you don't meet any of the criteria for page compression at insert time. (And without deeply investigating the data, there may not be much benefit to row compression alone, which is common for certain data patterns.) So, you don't see the page compression savings until you rebuild.

Just another argument for clustered indexes, IMHO.