Perhaps you've got a maintenance job, such as reindexing, that runs once a week and causes a massive jump in differential backup size. Or perhaps it's a weekly application job that imports or changes data?
If you're also taking regular transaction log backups (say, once an hour), the sizes of those backup files may also give you a clue about when the change activity occurs.
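To spot the pattern, you can query the backup history that SQL Server records in msdb. This is a generic sketch using the standard `msdb.dbo.backupset` catalog table; the database name is a placeholder:

```sql
-- Backup sizes per backup, most recent first.
-- type: D = full, I = differential, L = transaction log.
SELECT  bs.backup_finish_date,
        bs.type,
        bs.backup_size / 1048576.0 AS size_mb
FROM    msdb.dbo.backupset AS bs
WHERE   bs.database_name = 'YourWarehouseDB'   -- placeholder name
ORDER BY bs.backup_finish_date DESC;
```

A log-backup size that spikes at the same time each week points straight at the job responsible.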
My approach is to disable all the nonclustered indexes [...] then build a clustered index for the table using this partition scheme.
Creating a clustered index on a heap automatically rebuilds all nonclustered indexes (even disabled ones). The nonclustered indexes are rebuilt but not partitioned. Assuming the desired end state is a partitioned clustered table with aligned indexes, rebuilding the nonclustered indexes to be non-aligned is entirely wasted effort.
What worries me slightly is that tempdb is now pushing nearly 1 TB and steadily climbing, despite the current table being around half that size. The MS docs I've read suggest the tempdb space usage should be about the size of the final table/clustered index.
The question of sort space is very complex. To understand all the details (including the effect of parallelism) you would need to carefully read an entire series of posts by the SQL Server Query Processing Team. Converting a heap to a partitioned clustered table with parallelism enabled is probably pretty close to worst case.
At its most basic (neglecting most of the important information in the QP Team's posts), you are asking SQL Server to run a query like:
SELECT *
FROM DailyTable
ORDER BY
$partition.monthly_on_primary(LoadDate),
LoadDate,
SeqNumber;
This query will not execute quickly, regardless of where you choose to write the sort runs that do not fit in memory. Add to that the work of actually building a complete new copy of the entire data set in separate rowsets, and the work involved in pointlessly rebuilding the nonclustered indexes...
Advice
There are many considerations in getting this change to work efficiently. The important ones are to avoid sorting at all where possible, and to use parallel minimally-logged bulk load wherever possible.
The details of that depend on details not contained in the question, and a full solution is beyond an answer here. Nevertheless, the outline of an approach that has worked well for me personally in the past is:
- Extract the existing data using `bcp`, to one file per final partition
- Drop the existing table and create the new one
- Load the new table using parallel minimally-logged bulk load
The per-partition data extract needs to be ordered on `(LoadDate, SeqNumber)`. Ideally, you would avoid a sort operation entirely. If you have an existing nonclustered index on `(LoadDate, SeqNumber)`, you can extract data in the right order without sorting, provided you construct the query correctly.
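As a sketch, one extract query per partition might look like the following (the index name is a placeholder; the `$PARTITION` predicate limits each extract to a single target partition, and the `ORDER BY` can be satisfied by the hinted index without a sort):

```sql
-- Extract for target partition 1 only; repeat with = 2, = 3, ... for the rest.
SELECT  d.*
FROM    dbo.DailyTable AS d
        WITH (INDEX (IX_DailyTable_LoadDate_Seq))   -- placeholder index name
WHERE   $PARTITION.monthly_on_primary(d.LoadDate) = 1
ORDER BY d.LoadDate, d.SeqNumber;
```

Each such query can then be run through `bcp ... queryout` in native format to produce one pre-sorted file per partition.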
Once per-partition data has been extracted to separate files (this can be done in parallel if your hardware is up to it), the source table can then be dropped, freeing space. A new partitioned heap or clustered table is then created, and bulk loaded with the pre-sorted data, possibly also in parallel.
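Loading the pre-sorted files with minimal logging might then look like this sketch (file path and table name are placeholders; minimal logging assumes the bulk-logged or simple recovery model and suitable target conditions, such as an empty table or partition):

```sql
-- TABLOCK enables a minimally-logged bulk load; the ORDER hint tells
-- SQL Server the file is already sorted on the clustered key, so no
-- sort is needed during the load.
BULK INSERT dbo.DailyTableNew
FROM 'D:\extract\partition_001.dat'    -- placeholder path
WITH (
    DATAFILETYPE = 'native',           -- matches a bcp -n extract
    TABLOCK,
    ORDER (LoadDate, SeqNumber)
);
```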
Done right, the entire process requires no more than 1x data size and achieves the fastest possible data transfer rates in both directions, with the least amount of log use.
Best Answer
That sounds perfectly expected. If you are updating most of your data and rebuilding all of your indexes, then all the affected extents need to be in the differential backup; there is simply no way around that, because a differential must contain every extent that has changed since the last full backup.
If you are not already doing so, add the `COMPRESSION` option to your backups, which can result in considerable savings (and speed up the backup process too, by reducing IO).

Also, if the size really is a massive issue, this might be one of the few occasions to question whether a backup is needed quite so regularly. If you are regenerating all of the data in your warehouse database every day from other sources, then you might not need to back it up as often as you do the source data, assuming of course that the source databases have good backup and restore-testing processes in place, because you can reconstruct the data from the other sources (or backups thereof) anyway if disaster strikes. Of course, this reconstruction process could be significantly slower than restoring a backup, so consult your users and your service restore time SLAs before changing the warehouse's backup plan.
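For reference, the option goes in the `WITH` clause of the backup statement (database name and path are placeholders):

```sql
-- Compressed differential backup; COMPRESSION typically shrinks the
-- backup file considerably and reduces the IO needed to write it.
BACKUP DATABASE YourWarehouseDB
TO DISK = 'D:\backup\warehouse_diff.bak'   -- placeholder path
WITH DIFFERENTIAL, COMPRESSION;
```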
Possible Remediation:
Two possible options to reduce the amount of data that you need to transfer spring to mind:
After each update, make a copy of the data without any non-clustered indexes (backup, restore locally, drop all non-clustered indexes, backup again and transfer this version) and recreate the indexes after restoring at the other end. This avoids transferring all the index data.
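A sketch of generating the `DROP INDEX` statements for every nonclustered index in the restored copy, using the standard catalog views (review the generated output, and script out the matching `CREATE INDEX` statements first so the indexes can be recreated after restoring at the other end):

```sql
-- Build one DROP INDEX statement per nonclustered index.
SELECT  'DROP INDEX ' + QUOTENAME(i.name)
        + ' ON ' + QUOTENAME(SCHEMA_NAME(t.schema_id))
        + '.' + QUOTENAME(t.name) + ';'
FROM    sys.indexes AS i
JOIN    sys.tables  AS t ON t.object_id = i.object_id
WHERE   i.type = 2;    -- 2 = nonclustered
```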
Make sure the daily changes are as minimal as possible by only `INSERT`ing new rows and `UPDATE`ing those that need it (leaving identical rows untouched), and by not rebuilding any indexes (this will mean not disabling them, so it may slow down the ETL process).

Depending on the complexity of your ETL process, refactoring it this way may be a massive job. One way around this may be to add an extra stage to the end of the process: keep a second copy of the warehouse, rebuild the first using the normal process, then update the second using one `MERGE` statement per table to `INSERT`/`UPDATE`/`DELETE` as needed, being very careful to leave any unchanged rows alone. Then take your differential backups from this DB. Keep minimal indexes on this copy: just the primary keys and clustered indexes (which you will want for efficiency of the `MERGE` operations), and rebuild the rest after restore. This reduces the number of no-op changes, so it will hopefully markedly reduce the size of each differential, as well as the space in the backups devoted to index data.
Both these methods try to reduce the backup size that you are transferring at the expense of adding a fair chunk of extra processing at the receiving end (rebuilding indexes after restore) and sending end (maintaining the second copy at that location) so you are trading one bottleneck for another pair which may or may not be practical.
Depending on your data sources and the complexity of the ETL process you may have better luck with building the warehouse DB at both locations each time, as Chris suggested in the comments.