I think this must be a fairly common thing to do, but I haven't been able to find an example. It involves merging data that involves dates/times (or more generally any sequential values) without including redundant records. By redundant I mean records that can be implied by other records.
E.g. if the price yesterday was $1, and the price today is $1, and there were no other price changes between yesterday and today, then the price today can be inferred from the price yesterday without storing a price for both days.
Here's the scenario I'm trying to solve. We have a table of historical prices for items. New prices are imported periodically into the table. The import file can contain dates from any time, we don't know that they are always "after" the data in the history table.
To avoid data bloat I only want to insert records if they give new information. So if a new record can be inferred from an old record then I don't want to insert the new record, and if an old record can be inferred from a new record then I want to remove the old record and insert the new record.
Some concrete examples might help, here are the two problem scenarios:
-
An incoming record can be inferred from an existing record, so the incoming record is redundant.
E.g.old record: 2013-04-23 1.00
new record: 2013-04-24 1.00 <– this is implied by the existing record, don't insert it
-
An existing record can be inferred from a new record, so the existing record is redundant.
E.g.new record: 2013-04-23 1.00
old record: 2013-04-24 1.00 <– this is implied by the new record, delete it
-
This just shows an example of when a value is not redundant.
E.g.old record: 2013-04-23 1.00
old record: 2013-04-24 1.20
new record: 2013-04-25 1.00 <– not redundant, the price changed since it was last 1.00
There is a more detailed example here http://sqlfiddle.com/#!3/2ef87/2
Currently I'm leaning towards a multi-step approach of:
- Delete from incoming where there is an existing record that has the same price with an earlier date (scenario 1 above).
- Delete from existing where there is an incoming record that has the same price with an earlier date (scenario 2 above).
- Insert the remaining incoming records.
There must be a nicer way, maybe using MERGE, but it's doing my head in trying to work out how to do it.
How do I efficiently "merge" the existing and incoming records?
Thanks
Best Answer
The issue of records arriving late makes duplicate-removal more complex, but it is not impossible. Using a view (as proposed in your other question) to dynamically remove duplicates is workable, but queries against that view can produce complex and/or inefficient query plans.
An alternative design is to keep duplicated records in a separate table, in case they are needed to properly process a future late-arriving record. This does add a little complexity to the data import process, but each step is not too hard, and the result is a nice clean duplicate-free History table:
Tables
Initial Data
The first step is to remove any redundancies in the input data, storing the removed data in the new holding table:
Classifying input rows
The next step is to decide whether each row in the input table is redundant (w.r.t the History table) or not. The following query sets the
Action
column of the input set data appropriately:Store Redundant Rows
Now we store the rows identified as redundant to the holding table:
New History rows
The non-redundant rows are added to the History table:
Reinstating redundant records
Adding new records can result in redundant rows needing to be reinstated. The following query identifies qualifying redundant rows and moves them to the History table:
Results
History table
History Duplicates table
Processing new data
The preceding steps are quite general. We can process a new batch of rows using exactly the same code. The next script loads the input table with two sample rows, one of which is a duplicate, and the other an example of needing to reinstate a previously-redundant row:
Running the rest of the general script produces this final state:
History:
History Duplicate: