SQL Server – Huge Size (still) with Differential Backup

backup | data-warehouse | etl | index | sql-server

I am new to differential backup so please bear with me.

I have a database whose full backup size is ~55 GB.
I perform ETL nightly on this database. The ETL process involves UPDATE, DELETE, and INDEX DISABLE/REBUILD operations.
The first differential backup produced a 48 GB backup file!

Now, I know that this is most likely (if not entirely) caused by the index rebuilds (and they are on big tables, too!).
As index rebuilding is a pretty common technique in ETL processes, what option(s) do I have to minimise the size of differential backups on a data warehouse database?

Best Answer

That sounds perfectly expected. If you are updating most of your data and rebuilding all of your indexes, all the affected extents need to be in the differential backup. There is simply no way around that, because a differential backup must contain every extent that has changed since the last full backup (SQL Server differentials are cumulative relative to the most recent full backup, not to the previous differential).
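Incidentally, on SQL Server 2016 SP2 and later you can estimate how big the next differential will be before taking it, because `sys.dm_db_file_space_usage` exposes the count of pages in extents flagged as modified since the last full backup. A sketch (run it in the context of the warehouse database):

```sql
-- modified_extent_page_count (SQL Server 2016 SP2+) counts pages in
-- extents marked changed since the last full backup, i.e. roughly
-- what the next differential backup will have to contain.
SELECT
    file_id,
    modified_extent_page_count,
    modified_extent_page_count * 8 / 1024 AS modified_mb  -- 8 KB pages -> MB
FROM sys.dm_db_file_space_usage;
```

Running this right after your ETL finishes would confirm whether the rebuilds really are touching most of the database.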

If you are not already doing so, add the COMPRESSION option to your backups, which can result in considerable savings (and speed up the backup process too, by reducing IO).
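For example, a compressed differential backup might look like this (the database name and path are placeholders for your own):

```sql
-- Differential backup with compression; name and path are hypothetical.
BACKUP DATABASE WarehouseDB
TO DISK = N'D:\Backups\WarehouseDB_diff.bak'
WITH DIFFERENTIAL, COMPRESSION, STATS = 10;  -- STATS reports progress every 10%
```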

Also, if the size really is a massive issue, this might be one of the few occasions to question whether such a regular backup is needed at all. If you are regenerating all of the data in your warehouse database every day from other sources, you might not need to back it up as often as you do the source data, because you can reconstruct it from those sources (or backups thereof) if disaster strikes. This assumes, of course, that the source databases have good backup and backup-testing processes in place. Reconstruction could be significantly slower than restoring a backup, though, so consult your users and your service restore-time SLAs before changing the warehouse's backup plan.

Possible Remediation:

Two possible options to reduce the amount of data that you need to transfer spring to mind:

  1. After each update, make a copy of the data without any non-clustered indexes (backup, restore locally, drop all non-clustered indexes, back up again, and transfer this version), then recreate the indexes after restoring at the other end. This avoids transferring all the index data.

  2. Make sure the daily changes are as minimal as possible by only INSERTing new rows and UPDATEing those that need it (leaving identical rows untouched), and not rebuilding any indexes (this means not disabling them, so it may slow down the ETL process).
    Depending on the complexity of your ETL process, refactoring it this way may be a massive job. One way around this is to add an extra stage to the end of the process: keep a second copy of the warehouse, rebuild the first using the normal process, then update the second using one MERGE statement per table to INSERT/UPDATE/DELETE as needed, being very careful to leave any unchanged rows alone. Then take your differential backups from this second database. Keep minimal indexes on that copy: just the primary keys and clustered indexes (which you will want for efficient MERGE operations), and rebuild the rest after restore.
    This reduces the number of no-op changes, so it should markedly reduce the size of each differential, and it also reduces the space in the backups devoted to index data.
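The per-table MERGE in the second option could be sketched like this; the extra predicate on the WHEN MATCHED clause is what leaves identical rows untouched (table and column names are assumptions, and nullable columns would need NULL-safe comparisons):

```sql
-- Hypothetical fact table; only rows that actually differ are touched.
MERGE dbo.FactSales AS target
USING staging.FactSales AS source
    ON target.SaleID = source.SaleID
WHEN MATCHED AND (target.Amount   <> source.Amount
               OR target.Quantity <> source.Quantity) THEN
    UPDATE SET Amount   = source.Amount,
               Quantity = source.Quantity
WHEN NOT MATCHED BY TARGET THEN
    INSERT (SaleID, Amount, Quantity)
    VALUES (source.SaleID, source.Amount, source.Quantity)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;
```

Rows that are identical in both copies generate no writes at all, so their extents are never marked as changed and never land in the differential.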

Both these methods try to reduce the backup size that you are transferring at the expense of adding a fair chunk of extra processing at the receiving end (rebuilding indexes after restore) and the sending end (maintaining the second copy), so you are trading one bottleneck for two others, which may or may not be practical.
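The index-stripping step in the first option, run against the locally restored copy, could be scripted along these lines (database and object names are hypothetical):

```sql
-- Generate and run DROP INDEX statements for every non-clustered index.
-- Non-clustered indexes backing PKs/unique constraints are excluded,
-- since they cannot be dropped with DROP INDEX.
DECLARE @sql nvarchar(max) = N'';
SELECT @sql += N'DROP INDEX ' + QUOTENAME(i.name)
             + N' ON ' + QUOTENAME(s.name) + N'.' + QUOTENAME(t.name) + N';'
FROM sys.indexes AS i
JOIN sys.tables  AS t ON t.object_id = i.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE i.type = 2                    -- non-clustered only
  AND i.is_primary_key = 0
  AND i.is_unique_constraint = 0;
EXEC sys.sp_executesql @sql;
-- Then back up this slimmed-down database and transfer that file.
```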

Depending on your data sources and the complexity of the ETL process you may have better luck with building the warehouse DB at both locations each time, as Chris suggested in the comments.