Sql-server – Copy to Data Warehouse

backupdata-warehousesql server 2014sql-server-2016transaction-log

I have been tasked with copying select databases from the production SQL server to the Data Warehouse (at least nightly). However, no matter how I try to copy the data over, I run into an issue. Below is our setup:

  • Production Enviroment

    • SQL 2014 Enterprise
    • around 20 TB of data
    • Weekly Full backups, Nightly Diff (minus on night of weekly), every 15 minutes Transaction Logs backup
  • Data Warehouse

    • SQL 2016 Enterprise
    • In the same network/datacenter as prod
    • In theory a 10G connection between servers
    • Both in VMware both on same storage device

Here is the options I have found and the reasons it will be troublesome:

  • Mirroring – Breaks Full/Diff backups
  • Log Shipping – Breaks 15 minute transaction logs
  • Restoring from full/diff nightly – Too much data (still a possibility, but MGMT not happy about it)
  • Presenting same virtual drive on both servers – Causes issues with vmware/non-standard setup

All I have to do is get the data over, after that its up to our vendor to do the manipulations. Am I overlooking some options? Has anyone else run into this? I am new to DBA, so any help is appreciated.

Best Answer

The restore from full/diff nightly is actually a good option, because it kills two birds with one stone: you get the Data Warehouse and you're also regularly testing your backups.

The typical method to populate a data warehouse from production is to use an ETL (Extract-Translate-Load) technology, like SQL Server Integration Services (SSIS). SSIS packages can be created in Visual Studio or simple ones can be created with the Import and Export Wizard. Now with 20 TB you obviously can't copy all the data every night - even a 10G connection probably couldn't keep up with that. But if you can copy most of the data over once and then incrementally load the changed/updated data every night, that should work. (The Generate Scripts wizard could help you replicate the table structures if necessary.)

Because you're running Enterprise edition on the source instance, you have several options for finding what's changed on a daily/nightly basis. Your best option would be Change Data Capture (aka CDC). The little brother of CDC is Change Tracking, which is less powerful but also a little easier to work with. If you're lucky enough that most or all of the tables have reliable UpdateDate columns, then that will work as well.

There would be challenges in keeping the structure of your source and Data Warehouse tables in sync, but you should be able to mitigate that by making sure you push changes to the Data Warehouse databases when you push them to the production database. If the Data Warehouse is just a copy of the production stuff, then you should be able to just run the same release package/scripts.

I realize this may seem daunting, but once you've got it set up, it should mostly run itself.