I have 2 types of tables to populate the data warehouse with every day. Lookup or configuration tables with a few hundred records are easy: I just truncate and refill them.
For transaction tables, which have many records, I load incrementally, i.e. the ETL runs daily to add yesterday's records.
There are 2 problems I always face:
- when the job fails for any reason, I lose that day's transactions
- when the job runs twice for any reason, or I run it twice, I get duplicates
Now I am trying to design a way to overcome these 2 problems, and to develop the ETL so that it can auto-fix itself in case either of these events occurs:
it should check whether there are missing days and run the ETL for those days, and check whether there are duplicates and delete them.
Below are the ways I've thought of:
1. Take in the last 5 days regardless: every day the ETL runs, it deletes the last 5 days and refills them.
2. Check the destination tables for missing dates in the last month, then query the source for just the missing days.
Keep in mind that the source is a huge table in a production environment, so I have to optimize my queries against it as much as possible.
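To make option 1 concrete, here is a minimal sketch of the rolling-window reload, using Python's sqlite3 with an in-memory database. All table and column names (`src`, `dwh`, `tx_date`) are made up for illustration; the point is the delete-then-insert pattern over a fixed window.

```python
import sqlite3

# Illustrative schema: `src` is the production source, `dwh` the warehouse copy.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src (id INTEGER, tx_date TEXT);
    CREATE TABLE dwh (id INTEGER, tx_date TEXT);
""")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "2024-01-05"), (2, "2024-01-08"), (3, "2024-01-10")])
# Pretend the dwh already holds a duplicate from a job that ran twice.
conn.executemany("INSERT INTO dwh VALUES (?, ?)",
                 [(2, "2024-01-08"), (2, "2024-01-08")])

def reload_window(conn, today="2024-01-10", days=5):
    # Delete-then-insert makes the job idempotent inside the window:
    # reruns and duplicates within the last `days` days fix themselves.
    cutoff = conn.execute("SELECT date(?, ?)", (today, f"-{days} day")).fetchone()[0]
    conn.execute("DELETE FROM dwh WHERE tx_date > ?", (cutoff,))
    conn.execute("INSERT INTO dwh SELECT id, tx_date FROM src WHERE tx_date > ?",
                 (cutoff,))
    conn.commit()

reload_window(conn)
```

After the reload, the duplicate for id 2 is gone and the window matches the source exactly; running `reload_window` again changes nothing.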
Thanks.
Best Answer
Do the transactions have an audit timestamp? It has to be one that only ever goes up (no late-arriving facts; insert/update audit timestamps are good for this).
If so, you could use it to define a range to extract. It's a common technique for this type of thing:
select * from source_table where ts > min_ts and ts <= max_ts
(or in chunks if needed). How to determine min_ts and max_ts?
You could take min_ts from the extraction_log, using the max_ts of the last successful run:
select max(max_ts) from extraction_log where status = 'Finished OK'
You could take max_ts from the source db at the start of your extraction:
select max(audit_ts) from source_table
There are alternatives here. If you're extracting into a temporary staging table (a best practice), you could also take too much, i.e. the last 5 days, and deal with duplicates later when upserting the entries into your ODS. For max_ts, if you're absolutely certain that the clocks of your dwh and the source are, and will remain, in sync (a very dangerous assumption; not recommended), you could even use sysdate().
Technically, you can get away with less effort: you don't really need the statuses, or an extraction_log table that keeps track of each batch. But I have found that having it like this helps a lot with debugging and troubleshooting later on. Additionally, if you ever need a routine to remove entries from a load that failed halfway, to find gaps in ranges of past extractions, and so on, the extraction_log will help. You might even want to include the extraction_id as an additional column in your ODS.
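The whole watermark pattern above can be sketched end-to-end. This is a minimal illustration using Python's sqlite3 with an in-memory database; the names `extraction_log`, `source_table`, `audit_ts`, and the `'Finished OK'` status follow the answer, while the `dwh_table` schema and the seed data are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_table (id INTEGER, audit_ts TEXT);
    CREATE TABLE dwh_table   (id INTEGER, audit_ts TEXT);
    CREATE TABLE extraction_log (
        extraction_id INTEGER PRIMARY KEY AUTOINCREMENT,
        min_ts TEXT, max_ts TEXT, status TEXT
    );
""")
conn.executemany("INSERT INTO source_table VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")])
# Seed: one earlier successful run already covered everything up to 2024-01-01.
conn.execute("INSERT INTO extraction_log (min_ts, max_ts, status) "
             "VALUES ('1900-01-01', '2024-01-01', 'Finished OK')")

def run_extraction(conn):
    # min_ts: high-water mark of the last successful run.
    min_ts = conn.execute("SELECT max(max_ts) FROM extraction_log "
                          "WHERE status = 'Finished OK'").fetchone()[0]
    # max_ts: taken from the source at the start of this run.
    max_ts = conn.execute("SELECT max(audit_ts) FROM source_table").fetchone()[0]
    log_id = conn.execute("INSERT INTO extraction_log (min_ts, max_ts, status) "
                          "VALUES (?, ?, 'Running')", (min_ts, max_ts)).lastrowid
    # Extract the half-open window (min_ts, max_ts].
    conn.execute("INSERT INTO dwh_table SELECT id, audit_ts FROM source_table "
                 "WHERE audit_ts > ? AND audit_ts <= ?", (min_ts, max_ts))
    conn.execute("UPDATE extraction_log SET status = 'Finished OK' "
                 "WHERE extraction_id = ?", (log_id,))
    conn.commit()

run_extraction(conn)  # loads the 2 rows newer than the watermark
run_extraction(conn)  # immediate rerun is a no-op: (max_ts, max_ts] is empty
```

Note that the second call loads nothing, which is exactly the duplicate-protection the question asks for: rerunning the job cannot re-insert rows already covered by a successful extraction.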
Some further thoughts:
If you don't have a good timestamp candidate, a technical key in the source system with the same properties (goes up only, no late-arriving facts) should be fine too.
If the timestamps are generated by the source application and there is a risk that 2 transactions with the exact same timestamp are not inserted at the same time (quite common), then it's safer to use a max_ts that lies slightly in the past:
select max(audit_ts) - interval '5' minute from source_table
Regarding failed loads: if you're only concerned with cleaning up the entries of the last failed load, you can add it as the first step in your ETL flow:
delete from dwh_table where ts > min_ts
This will remove the entries from any failures (if any) after the last successful extraction. It won't deal with failures between earlier successful extractions.
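This cleanup step can be sketched in the same style, again with sqlite3 in memory. The names `dwh_table`, `extraction_log`, `ts`, and `'Finished OK'` come from the answer; the `'Failed'` status value and the seed rows are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dwh_table (id INTEGER, ts TEXT);
    CREATE TABLE extraction_log (min_ts TEXT, max_ts TEXT, status TEXT);
""")
# Last good run ended at 2024-01-05; a later run crashed halfway through.
conn.execute("INSERT INTO extraction_log "
             "VALUES ('1900-01-01', '2024-01-05', 'Finished OK')")
conn.execute("INSERT INTO extraction_log "
             "VALUES ('2024-01-05', '2024-01-07', 'Failed')")
conn.executemany("INSERT INTO dwh_table VALUES (?, ?)",
                 [(1, "2024-01-04"),   # covered by the successful run: keep
                  (2, "2024-01-06")])  # partial row from the failed run: remove

# First step of the ETL flow: delete anything newer than the last
# successful high-water mark, so a half-finished run leaves no debris.
min_ts = conn.execute("SELECT max(max_ts) FROM extraction_log "
                      "WHERE status = 'Finished OK'").fetchone()[0]
conn.execute("DELETE FROM dwh_table WHERE ts > ?", (min_ts,))
conn.commit()
```

Only the row covered by the last successful extraction survives; the partial row from the crashed run is gone before the new load starts.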