Sql-server – Replace all rows in a table based on date

azure-sql-databasesql servert-sql

I need to copy the values from a staging table into a production table in SQL Azure V12, considering the following rule:

I need to replace all rows for a given day (see datefield column) in the production table with all rows for the same given day in the staging table provided that the extraction date/time (see extract_datetime column) of the rows in the staging table are newer than what is in the production table.

EDIT: I also must include all new rows for days that don't already exist in the production table. See the purple rows in the table examples below.

The reason for this rule is that records tend to be amended and deleted from the system which we are extracting data from, so even though at one time a given day could have 200 entries for it, it might be that subsequently we only end up with 3 rows for that same given day.

If a day is included in a data extract, it will always have the most up-to-date records for that entire day (from 00:00:00 to 23:59:59 hrs).

Here is an example. Consider the following tables:
example of staging and production tables

In this example, the production table has 9 rows in total, as follows:

  • 01/07/2016: 4 rows
  • 02/07/2016: 2 rows
  • 03/07/2016: 3 rows

Then we do a load what is in the staging table into the production table, and the following happens:

  • 01/07/2016: The 4 rows in production for this day are replaced with the 1 row in the staging for this day, because what is in staging has a newer extraction date/time.
  • 02/07/2016: The 2 rows in production for this day are replaced with the 4 rows in the staging for this day, because what is in staging has a newer extraction date/time.
  • 03/07/2016: The 3 rows in production for this day are kept and those in the staging table are discarded, because what is in staging has an older extraction date/time.
  • 07/07/2016: The 1 row in staging for this day is inserted into the production table. This is a brand new row and there is no match for it in the production table.

So in the end we end up with the following production table:

example of production table after load

These data loads are never bigger than a few megabytes, so at other tables with less complex rules what I tend to do is:

  • First delete all rows in staging for when there are newer records in production;
  • Delete all rows in production for when there are newer rows in staging; then
  • Insert the remaining rows from staging into production, then truncate staging for the next load.

In this case I am quick lost as on how to do this matching because it is not really in a row by row basis (so I think), so any advice on how to get this code working is greatly appreciated.

Regards,
P.

Best Answer

This is a potential solution, just based on your sample data only, if your data sample has more complexity, such as in ONE day, you have multiple records of different extract_datetime for both prod and staging tables, we may need to find other ways (but in such scenario, your business rules may need to be modified to remove confusions, currently your business rules and your sample data are excellently matched, i.e. no confusion or grey area)

use tempdb
--drop table dbo.Prod, dbo.Staging, #tmp;
--go

-- create sample tables
create table dbo.Prod  (datefield datetime, field1 varchar(20), extract_datetime datetime);
create table dbo.Staging (datefield datetime, field1 varchar(20), extract_datetime datetime);
create table #tmp (datefield datetime) -- holds the deleted [DateField] value form dbo.prod

go
-- populate sample tables
insert into dbo.Prod (datefield, field1, extract_datetime) values
('20160701', 'P11', '2016-08-02 00:10:00')
,('20160701', 'P12', '2016-08-02 00:10:00')
,('20160701', 'P12', '2016-08-02 00:10:00')
,('20160701', 'P14', '2016-08-02 00:10:00')
,('20160702', 'P21', '2016-08-05 00:10:00')
,('20160702', 'P22', '2016-08-05 00:10:00')
,('20160703', 'P31', '2016-08-05 00:10:00')
,('20160703', 'P32', '2016-08-05 00:10:00')
,('20160703', 'P33', '2016-08-05 00:10:00')
go
insert into dbo.Staging (datefield, field1, extract_datetime) values
('20160701', 'S11', '2016-08-25 00:10:00')
,('20160702', 'S21', '2016-08-25 00:10:00')
,('20160702', 'S22', '2016-08-25 00:10:00')
,('20160702', 'S23', '2016-08-25 00:10:00')
,('20160702', 'S24', '2016-08-25 00:10:00')
,('20160703', 'S31', '2016-07-05 00:10:00')
,('20160703', 'S32', '2016-07-05 00:10:00')
,('20160703', 'S33', '2016-07-05 00:10:00')
,('20160703', 'S34', '2016-07-05 00:10:00')
,('20160703', 'S35', '2016-07-05 00:10:00')
,('20160707', 'S41', '2016-07-05 00:10:00') -- new record as per new requiremnt, 2016/08/15
go

-- delete records from Prod table as per business rules, but log the deleted [DateField] into #tmp for next step use
merge dbo.Prod as p
using dbo.Staging as s
on (p.DateField = s.DateField)
when matched and p.Extract_dateTime < s.Extract_datetime
then 
   delete 
output deleted.datefield into #tmp;

-- re-populate the prod table with records from staging table
insert into dbo.prod (datefield, field1, extract_datetime)
select datefield, field1, extract_datetime 
from dbo.Staging 
where Datefield in (select distinct datefield from #tmp)
or Datefield not in (select datefield from dbo.Prod) -- added as per new requirement, 2016/08/15

-- check final result
select * from dbo.prod
order by  Datefield