SQL Server – best ETL design for transferring transaction table records into the data warehouse

data-warehouse design-pattern etl sql-server ssis

I have two types of tables to populate the data warehouse with every day. Lookup or configuration tables with a few hundred records are easy: I just truncate and refill them.

But for transaction tables, which have many records, I load incrementally: the ETL runs daily and adds yesterday's records.

There are two problems I keep running into:

  1. When the job fails for any reason, I lose that day's transactions.
  2. When the job runs twice for any reason, or I run it twice, I get duplicates.

Now I am trying to design the ETL so that it overcomes both problems and can fix itself automatically if either of these events occurs.

I want it to check for missing days and run the ETL for those days, and to check for duplicates and delete them.

Below are the approaches I have thought of:

  1. Take in the last 5 days regardless: every day the ETL deletes the last 5 days from the destination and reloads them.
  2. Check the destination tables for missing dates in the last month, then query the source for just the missing days.

Keep in mind that the source is a huge table in a production environment, so any query against it has to be as optimized as possible.

Thanks.

Best Answer

Do the transactions have an audit timestamp? It has to be one that only ever goes up (no late-arriving facts; insert/update audit timestamps are good for this).
If so, you can use it to define the range to extract. It's a common technique for this type of thing:

  • For each extraction, at the start, determine the range you want to extract (let's call the minimum timestamp min_ts and the maximum timestamp max_ts)
  • At the start of the extraction, put 1 line in a separate table (let's call it extraction_log and also give it a PK) with fields: extraction_id = a unique key, min_ts, max_ts, status = 'Starting Extraction'.
  • Use min_ts & max_ts to extract the data, either in 1 go (select * from where ts > min_ts and ts <= max_ts) or in chunks if needed.
  • Ad the end of a successful extraction, update the line and set status to 'Finished OK'
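
A rough T-SQL sketch of that flow might look like this. The exact extraction_log schema and the staging.transactions_extract table are illustrative assumptions; source_table and audit_ts are the placeholder names used later in this answer:

    -- illustrative schema; adjust types and names to your environment
    create table dbo.extraction_log (
        extraction_id int identity(1,1) primary key,
        min_ts        datetime2   not null,
        max_ts        datetime2   not null,
        status        varchar(30) not null,
        started_at    datetime2   not null default sysdatetime(),
        finished_at   datetime2   null
    );

    declare @min_ts datetime2, @max_ts datetime2, @extraction_id int;

    -- 1. determine the range (see the queries below)

    -- 2. log the attempt before extracting anything
    insert into dbo.extraction_log (min_ts, max_ts, status)
    values (@min_ts, @max_ts, 'Starting Extraction');
    set @extraction_id = scope_identity();

    -- 3. extract the range into a staging table
    insert into staging.transactions_extract
    select *
    from source_table
    where audit_ts > @min_ts
      and audit_ts <= @max_ts;

    -- 4. mark the attempt as finished
    update dbo.extraction_log
    set status = 'Finished OK', finished_at = sysdatetime()
    where extraction_id = @extraction_id;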

How to determine min_ts and max_ts?

  • You could take min_ts from the extraction_log, using the last successful max_ts. select max(max_ts) from extraction_log where status = 'Finished OK'

  • You could take max_ts from the source db at the start of your extraction: select max(audit_ts) from source_table (both queries are sketched after this list).
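
In T-SQL, continuing the illustrative sketch above, that could look like the following (the fallback date for the very first run is an assumption):

    -- min_ts: where the last successful extraction stopped
    select @min_ts = max(max_ts)
    from dbo.extraction_log
    where status = 'Finished OK';

    -- very first run: fall back to an initial load date of your choosing
    set @min_ts = coalesce(@min_ts, '19000101');

    -- max_ts: the newest audit timestamp currently in the source
    select @max_ts = max(audit_ts)
    from source_table;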

There are alternatives here. If you're extracting into a temporary staging table first (a best practice), you could also deliberately take too much, i.e. the last 5 days, and deal with duplicates later when upserting the entries into your ODS. For max_ts, if you're absolutely certain that the clocks between your DWH and the source are and will remain in sync (a very dangerous assumption, not recommended), you could even use the current system time (e.g. SYSDATETIME() on SQL Server).
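
For the "take too much and dedupe on upsert" variant, the load from staging into the ODS could be keyed on the transaction's business key, for instance with a MERGE. The ods.transactions table and its columns are illustrative:

    -- rows re-extracted by overlapping windows simply update the
    -- existing ODS row instead of creating duplicates
    merge ods.transactions as tgt
    using staging.transactions_extract as src
        on tgt.transaction_id = src.transaction_id
    when matched then
        update set tgt.amount   = src.amount,
                   tgt.audit_ts = src.audit_ts
    when not matched by target then
        insert (transaction_id, amount, audit_ts)
        values (src.transaction_id, src.amount, src.audit_ts);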

Technically, you can get away with less effort. You don't really need the statuses, or an extraction_log table that keeps track of each batch. But I have found that having it like this helps a lot with debugging and troubleshooting later on. Additionally, if you ever need a routine to remove the entries of a load that failed halfway, to find gaps in the ranges of past extractions, and so on, the extraction_log will help. You might even want to include the extraction_id as an additional column in your ODS.
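
For example, assuming the illustrative extraction_log above, a gap check and a failed-batch cleanup could look like this:

    -- gaps: a successful extraction whose min_ts does not line up with
    -- the previous successful extraction's max_ts
    with ok_loads as (
        select extraction_id, min_ts, max_ts,
               lag(max_ts) over (order by max_ts) as prev_max_ts
        from dbo.extraction_log
        where status = 'Finished OK'
    )
    select extraction_id, prev_max_ts as gap_start, min_ts as gap_end
    from ok_loads
    where prev_max_ts is not null
      and min_ts > prev_max_ts;

    -- if extraction_id is carried into the ODS, rows from a batch that
    -- never reached 'Finished OK' can be removed precisely
    delete from ods.transactions
    where extraction_id in (
        select extraction_id
        from dbo.extraction_log
        where status = 'Starting Extraction'
    );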

Some further thoughts:

  • If you don't have a good timestamp candidate, a technical key in the source system with the same properties (only ever goes up, no late-arriving facts) works fine too.

  • If the timestamps are generated by the source application and there is a risk that two transactions with the exact same timestamp are not inserted at the same time (quite common), it's safer to use a max_ts that lies slightly in the past, e.g. select dateadd(minute, -5, max(audit_ts)) from source_table.

  • Regarding failed loads: if you're only concerned with cleaning up the entries of the most recent failed loads, you can add that as a first step in your ETL flow (delete from dwh_table where ts > min_ts; see the sketch below). This removes the entries of any failures that occurred after the last successful extraction. It won't deal with failures between earlier successful extractions.
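
A minimal sketch of that cleanup step, reusing the last successful max_ts as min_ts (dwh_table and ts are the placeholders from the bullet above):

    -- run before extracting: remove leftovers of any load that failed
    -- after the last successful extraction
    declare @min_ts datetime2 =
        (select max(max_ts) from dbo.extraction_log where status = 'Finished OK');

    delete from dwh_table
    where ts > @min_ts;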