I have 2 types of tables to populate the data warehouse with every day. Lookup or configuration tables with a few hundred records are easy: I just truncate and refill them.
For transaction tables, which have many records, I load incrementally, i.e. the ETL runs daily to add yesterday's records.
There are 2 problems I always face:
- when the job fails for any reason, I lose that day's transactions
- when the job runs twice for any reason, or I run it twice, I get duplicates
Now I am trying to design a way to overcome these 2 problems, and to develop the ETL so that it can auto-fix itself in case either of these events occurs:
it should check whether there are missing days and run the ETL for those days, and check whether there are duplicates and delete them.
Below are the ways I've thought of:
1. Take in the last 5 days regardless: every day the ETL runs, it deletes the last 5 days and refills them.
2. Check the destination tables for missing dates in the last month, then query the source for just the missing days.
Keep in mind that the source is a huge table in a production environment, so I have to optimize my queries against it as much as possible.
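To make option 1 concrete, here is a minimal sketch of the rolling-window reload, using Python's sqlite3 with an in-memory database. All table and column names (`src`, `dwh`, `tx_date`) are made up for illustration; the point is the delete-then-insert pattern over a fixed window.

```python
import sqlite3

# Illustrative schema: `src` is the production source, `dwh` the warehouse copy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, tx_date TEXT);
    CREATE TABLE dwh (id INTEGER, tx_date TEXT);
""")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "2024-01-05"), (2, "2024-01-08"), (3, "2024-01-10")])
# Pretend the dwh already holds a duplicate from a job that ran twice.
conn.executemany("INSERT INTO dwh VALUES (?, ?)",
                 [(2, "2024-01-08"), (2, "2024-01-08")])

def reload_window(conn, today="2024-01-10", days=5):
    # Delete-then-insert makes the job idempotent inside the window:
    # reruns and duplicates within the last `days` days fix themselves.
    cutoff = conn.execute("SELECT date(?, ?)", (today, f"-{days} day")).fetchone()[0]
    conn.execute("DELETE FROM dwh WHERE tx_date > ?", (cutoff,))
    conn.execute("INSERT INTO dwh SELECT id, tx_date FROM src WHERE tx_date > ?",
                 (cutoff,))
    conn.commit()

reload_window(conn)
```

After the reload, the duplicate for id 2 is gone and the window matches the source exactly; running `reload_window` again changes nothing.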
Thanks.
Best Answer
Do the transactions have an audit timestamp? It has to be one that only ever goes up (no late-arriving facts; insert/update audit timestamps are good for this).
If so, you could use it to define a range to extract. It's a common technique for this type of thing:
select * from source_table where ts > min_ts and ts <= max_ts
(or in chunks if needed). How to determine min_ts and max_ts?
You could take min_ts from the extraction_log, using the max_ts of the last successful run:
select max(max_ts) from extraction_log where status = 'Finished OK'
You could take max_ts from the source db at the start of your extraction:
select max(audit_ts) from source_table
There are alternatives here. If you're extracting into a temporary staging table (a best practice), you could also take too much, i.e. the last 5 days, and deal with duplicates later when upserting the entries into your ODS. For max_ts, if you're absolutely certain that the clocks of your dwh and the source are, and will remain, in sync (a very dangerous assumption; not recommended), you could even use sysdate().
Technically, you can get away with less effort: you don't really need the statuses, or an extraction_log table that keeps track of each batch. But I have found that having it like this helps a lot with debugging and troubleshooting later on. Additionally, if you ever need a routine to remove entries from a load that failed halfway, to find gaps in ranges of past extractions, and so on, the extraction_log will help. You might even want to include the extraction_id as an additional column in your ODS.
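The whole watermark pattern above can be sketched end-to-end. This is a minimal illustration using Python's sqlite3 with an in-memory database; the names `extraction_log`, `source_table`, `audit_ts`, and the `'Finished OK'` status follow the answer, while the `dwh_table` schema and the seed data are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (id INTEGER, audit_ts TEXT);
    CREATE TABLE dwh_table   (id INTEGER, audit_ts TEXT);
    CREATE TABLE extraction_log (
        extraction_id INTEGER PRIMARY KEY AUTOINCREMENT,
        min_ts TEXT, max_ts TEXT, status TEXT
    );
""")
conn.executemany("INSERT INTO source_table VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")])
# Seed: one earlier successful run already covered everything up to 2024-01-01.
conn.execute("INSERT INTO extraction_log (min_ts, max_ts, status) "
             "VALUES ('1900-01-01', '2024-01-01', 'Finished OK')")

def run_extraction(conn):
    # min_ts: high-water mark of the last successful run.
    min_ts = conn.execute("SELECT max(max_ts) FROM extraction_log "
                          "WHERE status = 'Finished OK'").fetchone()[0]
    # max_ts: taken from the source at the start of this run.
    max_ts = conn.execute("SELECT max(audit_ts) FROM source_table").fetchone()[0]
    log_id = conn.execute("INSERT INTO extraction_log (min_ts, max_ts, status) "
                          "VALUES (?, ?, 'Running')", (min_ts, max_ts)).lastrowid
    # Extract the half-open window (min_ts, max_ts].
    conn.execute("INSERT INTO dwh_table SELECT id, audit_ts FROM source_table "
                 "WHERE audit_ts > ? AND audit_ts <= ?", (min_ts, max_ts))
    conn.execute("UPDATE extraction_log SET status = 'Finished OK' "
                 "WHERE extraction_id = ?", (log_id,))
    conn.commit()

run_extraction(conn)  # loads the 2 rows newer than the watermark
run_extraction(conn)  # immediate rerun is a no-op: (max_ts, max_ts] is empty
```

Note that the second call loads nothing, which is exactly the duplicate-protection the question asks for: rerunning the job cannot re-insert rows already covered by a successful extraction.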
Some further thoughts:
If you don't have a good timestamp candidate, a technical key in the source system with the same properties (goes up only, no late-arriving facts) should be fine too.
If the timestamps are generated by the source application and there is a risk that 2 transactions with the exact same timestamp are not inserted at the same time (quite common), then it's safer to use a max_ts that lies slightly in the past:
select max(audit_ts) - interval '5' minute from source_table
Regarding failed loads: if you're only concerned with cleaning up the entries of the last failed load, you can add it as the first step in your ETL flow:
delete from dwh_table where ts > min_ts
This will remove the entries from any failures (if any) after the last successful extraction. It won't deal with failures between earlier successful extractions.
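This cleanup step can be sketched in the same style, again with sqlite3 in memory. The names `dwh_table`, `extraction_log`, `ts`, and `'Finished OK'` come from the answer; the `'Failed'` status value and the seed rows are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dwh_table (id INTEGER, ts TEXT);
    CREATE TABLE extraction_log (min_ts TEXT, max_ts TEXT, status TEXT);
""")
# Last good run ended at 2024-01-05; a later run crashed halfway through.
conn.execute("INSERT INTO extraction_log "
             "VALUES ('1900-01-01', '2024-01-05', 'Finished OK')")
conn.execute("INSERT INTO extraction_log "
             "VALUES ('2024-01-05', '2024-01-07', 'Failed')")
conn.executemany("INSERT INTO dwh_table VALUES (?, ?)",
                 [(1, "2024-01-04"),   # covered by the successful run: keep
                  (2, "2024-01-06")])  # partial row from the failed run: remove

# First step of the ETL flow: delete anything newer than the last
# successful high-water mark, so a half-finished run leaves no debris.
min_ts = conn.execute("SELECT max(max_ts) FROM extraction_log "
                      "WHERE status = 'Finished OK'").fetchone()[0]
conn.execute("DELETE FROM dwh_table WHERE ts > ?", (min_ts,))
conn.commit()
```

Only the row covered by the last successful extraction survives; the partial row from the crashed run is gone before the new load starts.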