If you have 200 identical sources, you can parameterise an SSIS package with the data source and kick off multiple threads. These can be controlled within the package by a foreach loop, or from an external source that kicks off the extractors with a parameter.
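As a hedged sketch, the list of sources could live in a control table that the foreach loop (or an external scheduler) iterates over; the table, package, and variable names below are illustrative assumptions, not anything from your setup.

```sql
-- Hypothetical control table of the 200 source connections.
CREATE TABLE dbo.SourceSystem (
    SourceId         int           NOT NULL PRIMARY KEY,
    ConnectionString nvarchar(512) NOT NULL
);

-- An external controller could then start one extract per source, e.g.:
--   dtexec /F Extract.dtsx /SET "\Package.Variables[User::SourceId].Properties[Value]";42
```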
You could consider a full load for relatively small dimensional sources and an incremental load for transactional data. This would require you to have persistent dimensions, but this is fairly straightforward to do with MERGE operations, or a pre-load area and dimension handler if you need slowly-changing dimensions.
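For the persistent dimensions, a minimal MERGE sketch might look like the following; `DimCustomer`, `stg.Customer`, and the column names are assumptions.

```sql
MERGE dbo.DimCustomer AS tgt
USING stg.Customer    AS src
    ON  tgt.DataSourceId = src.DataSourceId
    AND tgt.CustomerCode = src.CustomerCode
WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
    UPDATE SET CustomerName = src.CustomerName            -- Type 1 overwrite
WHEN NOT MATCHED BY TARGET THEN
    INSERT (DataSourceId, CustomerCode, CustomerName)
    VALUES (src.DataSourceId, src.CustomerCode, src.CustomerName);
-- No WHEN NOT MATCHED BY SOURCE clause: rows missing from a given extract are
-- kept, which is what makes the dimension persistent.
```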
You may wish to consider giving each source its own staging area (maybe a schema for each source in the staging database). This eliminates locking issues on the staging tables.
Build a set of views over the staging tables (essentially just a set of unions, one branch per source's copy of each table) that includes the data source information. These can be generated fairly easily, so you don't have to manually cut and paste 200 different queries into the union. Once you've staged the data, the ETL process can read the whole lot from the view.
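To give a feel for the generation step, here is a rough sketch that builds the view from the per-source schemas; the `src_*` schema naming, the `Orders` table, and SQL Server 2017+ (for STRING_AGG) are all assumptions.

```sql
DECLARE @sql nvarchar(max);

-- One SELECT per source schema, tagged with the source name,
-- glued together with UNION ALL.
SELECT @sql = STRING_AGG(
    CAST('SELECT ''' + s.name + ''' AS DataSource, o.* FROM '
         + QUOTENAME(s.name) + '.Orders AS o' AS nvarchar(max)),
    CHAR(10) + 'UNION ALL' + CHAR(10))
FROM sys.schemas AS s
WHERE s.name LIKE 'src[_]%';          -- e.g. src_001 .. src_200

EXEC ('CREATE OR ALTER VIEW dbo.vw_StagedOrders AS' + CHAR(10) + @sql);
```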
This allows the ETL to run in one hit, although you will have to come up with a strategy to deal with extract failures from individual systems. For this, you might want to look into an architecture that deals with late arriving data gracefully, so you can catch up individual feeds that had transient issues.
BCP
For 200 simple extracts, BCP is probably a good way to go. The sources are all identical, so the BCP files will be the same across sources; a sketch of one extract/load pair is below. You can build a load controller with SSIS. Getting multiple threads to read the top off a common list would require you to implement synchronised access to the list: the SSIS process has a bunch of loops running in parallel in a sequence container that pop the next item, execute it, and update the corresponding status.
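As a rough illustration, assuming a `dbo.Orders` table and a UNC share for the extract files (both assumptions), each extract/load pair might look like:

```sql
-- On (or against) the source server, dump the table in native format:
--   bcp SourceDb.dbo.Orders out \\staging\extracts\src_042\Orders.dat -n -S SRV042 -T
-- On the staging server, load the file into that source's staging schema:
BULK INSERT src_042.Orders
FROM '\\staging\extracts\src_042\Orders.dat'
WITH (DATAFILETYPE = 'native', TABLOCK);
```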
The 'next' function can be implemented as a sproc running in a serializable transaction that pops the next eligible source off the list and marks it as 'in progress' within the transaction. This is a 'table as queue' problem, but you don't have to implement synchronised inserts - a whole batch can be pushed into the table at the start of the run.
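A minimal sketch of that 'next' sproc, assuming the queue table below (all names are illustrative):

```sql
CREATE TABLE dbo.ExtractQueue (
    SourceId int         NOT NULL PRIMARY KEY,
    Status   varchar(20) NOT NULL DEFAULT 'Pending'  -- Pending / InProgress / Done / Failed
);
GO
CREATE PROCEDURE dbo.GetNextSource
AS
BEGIN
    SET NOCOUNT ON;
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRAN;
        -- Claim the next pending source and flag it in one atomic step;
        -- the UPDLOCK hint stops two parallel loops claiming the same row.
        WITH next_source AS (
            SELECT TOP (1) SourceId, Status
            FROM   dbo.ExtractQueue WITH (UPDLOCK, ROWLOCK)
            WHERE  Status = 'Pending'
            ORDER  BY SourceId
        )
        UPDATE next_source
        SET    Status = 'InProgress'
        OUTPUT inserted.SourceId;   -- empty result set means the queue is drained
    COMMIT;
END;
```

Each parallel loop calls this until it gets an empty result, which tells that loop the run is complete.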
Structure the individual extract process so that it retries once or twice if the first attempt fails. This will mitigate a lot of failures caused by transient errors. Fail the task if the retries are exhausted, and structure the ETL so it is resilient to individual extraction failures.
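A hedged sketch of the retry wrapper, assuming a hypothetical `dbo.RunExtract` procedure:

```sql
DECLARE @SourceId int = 42, @attempt int = 1, @maxAttempts int = 3;

WHILE @attempt <= @maxAttempts
BEGIN
    BEGIN TRY
        EXEC dbo.RunExtract @SourceId = @SourceId;
        BREAK;                             -- success: stop retrying
    END TRY
    BEGIN CATCH
        IF @attempt = @maxAttempts THROW;  -- retries exhausted: fail the task
        SET @attempt += 1;
        WAITFOR DELAY '00:00:30';          -- back off before trying again
    END CATCH
END;
```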
Incremental loads
An incremental loader is probably not worth the bother for dimension tables unless you have a really big dimension that shows real performance issues. For the fact table data sources it probably is worth it. If you can add a row version to the application tables with a timestamp (rowversion) column or some such, you can pick up just what's new or changed. However, you will need to track the last version you extracted locally. If there is an insert or update date on the data, you may be able to use that instead.
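A sketch of the high-water-mark pattern this implies; the watermark table, `dbo.OrderLine`, and its `RowVersion` column are assumptions:

```sql
DECLARE @SourceId int = 42;
DECLARE @last binary(8) =
    (SELECT LastRowVersion FROM dbo.ExtractWatermark WHERE SourceId = @SourceId);

-- Pull only rows changed since the previous run.
SELECT OrderId, LineNumber, LineCost
FROM   dbo.OrderLine
WHERE  RowVersion > @last;

-- Record the new high-water mark for next time.
UPDATE dbo.ExtractWatermark
SET    LastRowVersion = (SELECT MAX(RowVersion) FROM dbo.OrderLine)
WHERE  SourceId = @SourceId;
```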
Full Loads
What could possibly go wrong?
200 processes kicking off to do a full load places a load spike on the network and possibly on the staging database. This could lead to all sorts of transient issues such as timeouts. For small dimension tables it's probably not such a big issue, but for 100 GB there is quite a wide variety of risks: WAN saturation, locking (although the right staging architecture will mitigate that), and availability of sources. The longer the extract process has to run, the more influence environmental factors have on its reliability.
There are quite a lot of imponderables here, so YMMV. I'd suggest an incremental load for the larger tables if possible.
If you want to de-normalize this into a single fact table, the fact table is going to be about line items. Therefore, the facts from DataSourceAHeader need to be split up and distributed to the relevant line items so they are not duplicated. As it is currently presented, that means dropping your total order cost and calculating this by summing the line item costs.
The DataSourceAHeader dimension keys (e.g. order date) can be taken from DataSourceAHeader and applied to the fact rows generated from DataSourceBLine. In the example there doesn't seem to be any info contained on DataSourceALine which isn't already included on either DataSourceAHeader or DataSourceBLine, but if there is this can be mapped across in a similar way.
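To make the mapping concrete, here is a hedged sketch of the line-grain fact query; all column names beyond the source table names are assumed.

```sql
-- Build fact rows at line grain, carrying the header's dimension keys down to
-- each line; the order total is then derived as SUM(LineCost), never stored.
SELECT  l.OrderId,
        l.LineNumber,
        h.OrderDateKey,        -- dimension key taken from the header
        l.LineCost
FROM    DataSourceBLine   AS l
JOIN    DataSourceAHeader AS h
          ON h.OrderId = l.OrderId;
```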
This approach relies on a number of assumptions, the key one being that all the facts from DataSourceAHeader can accurately be distributed among its constituent line items. If this isn't true, loading two separate fact tables (one for the order and one for the line items) might well be a better approach. The same might be true if there are a lot of questions to be asked about orders, which do not consider line item specific info. This is labelled as "Bad Idea #2" in the article which you've referenced, but I have found that in certain circumstances, it's actually a good idea.
Finally this assumes that the two data sources are in sync. If they're not, you'll be limiting yourself to loading data at the pace of the slower data source. This might be fine, but needs to be considered in the context of your needs and the difference between the two data sources.
Edit: De-normalizing into a single fact table may significantly impact performance when counting orders, as it's essentially a distinct count, which would be my main reason for considering two separate fact tables.
Edit 2 (in response to question edit):
Here, the issue is that at the most granular level (line) the data is incomplete, inasmuch as not all rows have a cost value. However, the total cost information is available at the next level up (header).
This presents the situation where you cannot derive the higher level from the lower; let’s consider the resulting options:
- Have a single fact table at the lowest granularity available (line). This is a non-starter, as we would be relying on the incomplete line data to answer questions at the higher level which we know could be answered in full from the header data.
- Have a single fact table at the higher granularity (header). This means we can now answer questions at the higher level with the complete data, but can no longer answer questions at the more granular level at all. This may be considered to be acceptable, but in most cases we are throwing away potentially valuable data.
- Have two related fact tables, one for the incomplete, more granular data (line) and one for the complete, less granular data (header). This is the ideal solution, as we can now answer questions at the higher level in full, and can give the best possible answer to questions at the lower level, given the incompleteness of the source data.
This question was raised because of doubts about having two related fact tables. The doubts stem from the fact that maintaining and joining two large fact tables can be resource intensive. That's true, and if your most granular information can be used to provide a full description of the situation then using a single fact table is preferable. However, in situations like this where that's not possible, two fact tables are required if you want to preserve as much information as possible.
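As a rough sketch of the two-table shape (the names and keys are assumptions):

```sql
CREATE TABLE dbo.FactOrderHeader (
    OrderId      int   NOT NULL PRIMARY KEY,   -- degenerate dimension
    OrderDateKey int   NOT NULL,
    TotalCost    money NOT NULL                -- complete at header grain
);

CREATE TABLE dbo.FactOrderLine (
    OrderId    int   NOT NULL REFERENCES dbo.FactOrderHeader (OrderId),
    LineNumber int   NOT NULL,
    LineCost   money NULL,                     -- incomplete at line grain
    PRIMARY KEY (OrderId, LineNumber)
);
```

This also addresses the distinct-count concern above: counting orders becomes a plain `COUNT(*)` against the header fact, rather than a `COUNT(DISTINCT OrderId)` against the line fact.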
The value of having a separate source database is that if your transformation process fails, you don't have to go to the external sources to pull the data again. You can simply fix the transformation issue and restart at the transformation step, skipping the source extract.
If the time required to pull the data from the external sources is insignificant, then it may not be worth the extra effort; you simply have to weigh the various factors to determine what's best for the project.
Using a source database could also speed up your testing, since you can test the load of the source database and the transformation processes separately.