SQL Server – rebuild on clustered index, why does data size shrink

clustered-index, sql-server

When we did a rebuild of a clustered index on a table that has about 15 GB of data in it, the data size shrank to 5 GB. How can this be? What kind of "data" is removed?

By data size I mean the "data" column reported by sp_spaceused.
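
The figures below come from running sp_spaceused against the table, along these lines (database and table names as in the question):

USE [DAX5TEST]
GO
-- rows, reserved, data, index_size and unused for the table
EXEC sp_spaceused @objname = N'dbo.LEDGERJOURNALTRANS'
GO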

Before Rebuild on clustered index:

name                  rows        reserved    data        index_size  unused
LEDGERJOURNALTRANS    43583730    39169656 KB 15857960 KB 22916496 KB 395200 KB

After Rebuild on clustered index:

name                  rows        reserved    data        index_size  unused
LEDGERJOURNALTRANS    43583730    29076736 KB 5867048 KB  22880144 KB 329544 KB

TSQL for rebuild:

USE [DAX5TEST]
GO
ALTER INDEX [I_212RECID] ON [dbo].[LEDGERJOURNALTRANS]
REBUILD PARTITION = ALL
WITH ( PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON,
       ONLINE = ON, SORT_IN_TEMPDB = OFF, DATA_COMPRESSION = PAGE, FILLFACTOR = 85 )
GO

Best Answer

When a table has a clustered index, the index is the table data (otherwise you have a heap-type table). A rebuild of the clustered index (any index in fact, though the space wouldn't be counted as "data" for a non-clustered index) will result in partially used pages being merged into fewer, fuller pages.
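
One quick way to confirm this is the metadata: a heap shows up in sys.indexes with index_id = 0, while a clustered index has index_id = 1. For example, using the table from the question:

-- index_id 0 = heap, 1 = clustered index, >1 = non-clustered indexes
SELECT index_id, name, type_desc
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.LEDGERJOURNALTRANS')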

As you insert data into an index (clustered or otherwise) in index order, leaf pages are created as needed and you will only ever have one partial page: the one at the end. As you insert data out of index order, a page needs to be split for the new data to fit in the right place: you end up with two pages which are each approximately half full, and the new row goes into one of them. Over time this can happen a lot, consuming a fair amount of extra space, though to an extent future inserts will fill some of the gaps. Non-leaf pages see a similar effect too, but the actual data pages are far more significant in size than they are.

Also, deletes may result in partial pages. If you remove all the rows in a page it gets counted as "unused", but if it has one or more rows of data left it is still counted as in use. Even if there is only one row using 10 bytes in a page, that page counts as 8,192 bytes in the count of space used. Again, future inserts might fill some of the gap.

For variable-length rows, updates can have the same effect: as a row gets smaller it may leave space in its page that is not easy to reuse later, and if a row in a nearly full page grows longer it can force a page split.
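
If you want to see how much of this slack a particular index is carrying, sys.dm_db_index_physical_stats reports average page fullness (the avg_page_space_used_in_percent column is only populated in SAMPLED or DETAILED mode). A sketch, reusing the names from the question:

-- Leaf-level page count, fullness and fragmentation for the clustered index (index_id = 1)
SELECT index_level, page_count, avg_page_space_used_in_percent, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(N'DAX5TEST'),
         OBJECT_ID(N'dbo.LEDGERJOURNALTRANS'),
         1,      -- index_id of the clustered index
         NULL,   -- all partitions
         'SAMPLED')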

SQL Server does not spend time trying to normalise the data by rearranging how pages are used until explicitly told to (such as by your index rebuild), as such garbage collection exercises could be a performance nightmare.

I suspect this is what you are seeing, though I'd say that having ~2.7 times as much space allocated as the data strictly needs is a particularly bad case. It might imply that you have something random as one of the significant keys in the index (a UUID column, perhaps), meaning that new rows are unlikely to ever be added in index order, and/or that a significant number of deletes have happened recently.

Page Split Example

Inserting in index order with fixed length rows of which four fit into a page:

Start with one empty page: 
        [__|__|__|__]
Add the first item in index order:
        [00|__|__|__]
Add the next three
        [00|02|04|06]
Adding the next will result in a new page:
        [00|02|04|06] [08|__|__|__]
And so on...
        [00|02|04|06] [08|10|12|14] [16|18|__|__]

Now for adding rows out of index order (this is why I used even numbers only above): adding 11 would mean either extending that second page (not possible, as pages are of fixed size), moving everything above 11 up one (far too expensive on a large index), or splitting the page like so:

[00|02|04|06] [08|10|11|__] [12|14|__|__] [16|18|__|__]

From here, adding 13 and 17 won't result in a split as there is currently room in the relevant pages:

[00|02|04|06] [08|10|11|__] [12|13|14|__] [16|17|18|__]

but adding 03 will:

[00|02|03|__] [04|06|__|__] [08|10|11|__] [12|13|14|__] [16|17|18|__]

As you can see, after those insert operations we currently have 5 data pages allocated, which could fit a total of 20 rows, but we only have 14 rows there ("wasting" 30% of the space).

A rebuild with default options (see below about "fill factor") would result in:

[00|02|03|04] [06|08|10|11] [12|13|14|16] [17|18|__|__]

saving one page in this simple example. It is easy to see how deletes can have a similar effect as out-of-index-order inserts.
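
If you want to reproduce the effect end to end, a rough demo along these lines works (the SplitDemo table is made up purely for illustration): random NEWID() keys force out-of-order inserts and hence page splits, and the rebuild packs the pages back up.

-- Hypothetical demo table: NEWID() keys arrive in effectively random order
CREATE TABLE dbo.SplitDemo
(
    Id      uniqueidentifier NOT NULL DEFAULT NEWID(),
    Payload char(400)        NOT NULL DEFAULT 'x',
    CONSTRAINT PK_SplitDemo PRIMARY KEY CLUSTERED (Id)
)
GO
-- Insert in batches; each batch's random GUIDs land in the middle of existing pages,
-- forcing page splits and leaving many partially filled pages behind
DECLARE @i int = 0
WHILE @i < 20
BEGIN
    INSERT INTO dbo.SplitDemo (Id)
    SELECT TOP (5000) NEWID()
    FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
    SET @i += 1
END
GO
EXEC sp_spaceused @objname = N'dbo.SplitDemo'   -- note the "data" figure

ALTER INDEX PK_SplitDemo ON dbo.SplitDemo REBUILD

EXEC sp_spaceused @objname = N'dbo.SplitDemo'   -- "data" should now be noticeably smaller
GO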

Mitigation

If you are expecting data to come in a fairly random order with respect to index order, you can use the FILLFACTOR option when creating or rebuilding an index to tell SQL Server to artificially leave gaps to fill in later, reducing page splits in the long run at the cost of more space initially. Of course, getting this value wrong can make things far worse rather than better, so handle with care.

Page splitting, particularly on the clustered index, can have a performance impact on inserts/updates, so in databases that see a lot of write activity FILLFACTOR is sometimes tweaked for that reason rather than for the space-use issue. For most apps, though, where reads outweigh writes by several orders of magnitude, you are generally better off leaving the fill factor at 100%, except in specific cases such as indexes over columns with effectively random content.
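
As a sketch of what that looks like (reusing the index from the question; 90 is purely an illustrative value, not a recommendation):

-- Rebuild leaving ~10% free space in each leaf page to absorb out-of-order inserts
ALTER INDEX [I_212RECID] ON [dbo].[LEDGERJOURNALTRANS] REBUILD WITH (FILLFACTOR = 90)

-- The fill factor stored for each index on the table can be checked afterwards
SELECT name, fill_factor
FROM sys.indexes
WHERE object_id = OBJECT_ID(N'dbo.LEDGERJOURNALTRANS')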

I assume other big name DBs have a similar option, if you need this level of control in them too.

Update

Regarding the ALTER INDEX statement added to the question after I started typing the above: I assume the options are the same as when the index was first built (or last rebuilt), but if not then the compression option could be very significant if it was added this time around. Also, in that statement the fill factor is set to 85%, not 100%, so each leaf page will be ~15% empty immediately after the rebuild.
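
If the PAGE compression setting is indeed new, sp_estimate_data_compression_savings will give a rough idea of how much of the shrink it accounts for, e.g. (assuming the clustered index is index_id 1):

-- Estimated size of the clustered index with the requested compression applied
EXEC sp_estimate_data_compression_savings
    @schema_name      = N'dbo',
    @object_name      = N'LEDGERJOURNALTRANS',
    @index_id         = 1,
    @partition_number = NULL,
    @data_compression = N'PAGE'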