Sql-server – How to merge data sets without including redundant rows

sql serversql-server-2008

I think this must be a fairly common thing to do, but I haven't been able to find an example. It involves merging data that involves dates/times (or more generally any sequential values) without including redundant records. By redundant I mean records that can be implied by other records.

E.g. if the price yesterday was $1, and the price today is $1, and there were no other price changes between yesterday and today, then the price today can be inferred from the price yesterday without storing a price for both days.

Here's the scenario I'm trying to solve. We have a table of historical prices for items. New prices are imported periodically into the table. The import file can contain dates from any time, we don't know that they are always "after" the data in the history table.

To avoid data bloat I only want to insert records if they give new information. So if a new record can be inferred from an old record then I don't want to insert the new record, and if an old record can be inferred from a new record then I want to remove the old record and insert the new record.

Some concrete examples might help, here are the two problem scenarios:

An incoming record can be inferred from an existing record, so the incoming record is redundant.
E.g.

old record: 2013-04-23 1.00

new record: 2013-04-24 1.00 <– this is implied by the existing record, don't insert it
An existing record can be inferred from a new record, so the existing record is redundant.
E.g.

new record: 2013-04-23 1.00

old record: 2013-04-24 1.00 <– this is implied by the new record, delete it
This just shows an example of when a value is not redundant.
E.g.

old record: 2013-04-23 1.00

old record: 2013-04-24 1.20

new record: 2013-04-25 1.00 <– not redundant, the price changed since it was last 1.00

There is a more detailed example here http://sqlfiddle.com/#!3/2ef87/2

Currently I'm leaning towards a multi-step approach of:

Delete from incoming where there is an existing record that has the same price with an earlier date (scenario 1 above).
Delete from existing where there is an incoming record that has the same price with an earlier date (scenario 2 above).
Insert the remaining incoming records.

There must be a nicer way, maybe using MERGE, but it's doing my head in trying to work out how to do it.

How do I efficiently "merge" the existing and incoming records?
Thanks

Best Answer

The issue of records arriving late makes duplicate-removal more complex, but it is not impossible. Using a view (as proposed in your other question) to dynamically remove duplicates is workable, but queries against that view can produce complex and/or inefficient query plans.

An alternative design is to keep duplicated records in a separate table, in case they are needed to properly process a future late-arriving record. This does add a little complexity to the data import process, but each step is not too hard, and the result is a nice clean duplicate-free History table:

Tables

-- Original table
CREATE TABLE dbo.History
(
    Effective   datetime NOT NULL,      -- When the value became current
    Product     integer NOT NULL,       -- The product
    Kind        tinyint NOT NULL,       -- The price kind (RRP, special, etc)
    Price       smallmoney NOT NULL,    -- The new price

    CONSTRAINT [PK dbo.History Effective, Product, Kind]
        PRIMARY KEY CLUSTERED
            (Effective, Product, Kind)
);

-- Holding area for duplicates that may be needed in future
CREATE TABLE dbo.HistoryDuplicates
(
    Effective   datetime NOT NULL,
    Product     integer NOT NULL,
    Kind        tinyint NOT NULL,
    Price       smallmoney NOT NULL,

    CONSTRAINT [PK dbo.HistoryDuplicates Product, Kind, Effective]
        PRIMARY KEY CLUSTERED
            (Product, Kind, Effective)
);

Initial Data

CREATE TABLE #NewRows
(
    Effective   datetime NOT NULL,
    Product     integer NOT NULL,
    Kind        tinyint NOT NULL,
    Price       smallmoney NOT NULL,
    [Action]    char(1) NOT NULL DEFAULT 'X',
);

CREATE UNIQUE CLUSTERED INDEX cuq
ON #NewRows (Product, Kind, Effective);

INSERT #NewRows
    (Effective, Product, Kind, Price)
VALUES
    ('2013-04-23T00:23:00', 1234, 1, 1.00),
    ('2013-04-24T00:24:00', 1234, 1, 1.00),
    ('2013-04-25T00:25:00', 1234, 1, 1.50),
    ('2013-04-25T00:25:00', 1234, 2, 2.00),
    ('2013-04-26T00:26:00', 1234, 1, 2.00),
    ('2013-04-27T00:27:00', 1234, 1, 2.00),
    ('2013-04-28T00:28:00', 1234, 1, 1.00);

The first step is to remove any redundancies in the input data, storing the removed data in the new holding table:

-- Remove redundancies in the input set
-- and save in the duplicates table
DELETE NR
OUTPUT
    DELETED.Effective,
    DELETED.Product,
    DELETED.Kind,
    DELETED.Price
INTO dbo.HistoryDuplicates
    (Effective, Product, Kind, Price)
FROM #NewRows AS NR
OUTER APPLY
(
    SELECT TOP (1)
        NR2.Price
    FROM #NewRows AS NR2
    WHERE
        NR2.Product = NR.Product
        AND NR2.Kind = NR.Kind
        AND NR2.Effective < NR.Effective
    ORDER BY
        NR2.Effective DESC
) AS X
WHERE
    EXISTS (SELECT X.Price INTERSECT SELECT NR.Price);

De-duplication execution plan

Classifying input rows

The next step is to decide whether each row in the input table is redundant (w.r.t the History table) or not. The following query sets the Action column of the input set data appropriately:

-- Decide what to do with each row
UPDATE NewRows
SET [Action] = 
    CASE
        WHEN NOT EXISTS (SELECT ExistingRow.Price INTERSECT SELECT NewRows.Price)
            THEN 'I' -- Insert
        WHEN ExistingRow.Price = NewRows.Price
            THEN 'D' -- Duplicate
        ELSE 'X'
    END
FROM #NewRows AS NewRows
OUTER APPLY
(
    SELECT TOP (1)
        H.Price,
        H.Effective
    FROM dbo.History AS H
    WHERE
        H.Product = NewRows.Product
        AND H.Kind = NewRows.Kind
        AND H.Effective <= NewRows.Effective
    ORDER BY
        H.Effective DESC
) AS ExistingRow;

CREATE UNIQUE CLUSTERED INDEX cuq
ON #NewRows 
    ([Action], Product, Kind, Effective) 
WITH (DROP_EXISTING = ON);

Action execution plan

Store Redundant Rows

Now we store the rows identified as redundant to the holding table:

-- Store duplicates
WITH Duplicates AS
(
    SELECT NR.*
    FROM #NewRows AS NR
    WHERE NR.[Action] = 'D'
)
MERGE dbo.HistoryDuplicates AS HD
USING Duplicates AS NR
    ON NR.Product = HD.Product
    AND NR.Kind = HD.Kind
    AND NR.Effective = HD.Effective
WHEN NOT MATCHED BY TARGET THEN 
    INSERT (Product, Kind, Effective, Price) 
    VALUES (Product, Kind, Effective, Price);

Redundant records plan

New History rows

The non-redundant rows are added to the History table:

-- Inserts
WITH Inserts AS
(
    SELECT NR.*
    FROM #NewRows AS NR
    WHERE NR.[Action] = 'I'
)
MERGE dbo.History AS H
USING Inserts AS NR
    ON NR.Product = H.Product
    AND NR.Kind = H.Kind
    AND NR.Effective = H.Effective
    AND NR.Price = H.Price
WHEN NOT MATCHED BY TARGET THEN 
    INSERT (Effective, Product, Kind, Price)
    VALUES (Effective, Product, Kind, Price);

New rows plan

Reinstating redundant records

Adding new records can result in redundant rows needing to be reinstated. The following query identifies qualifying redundant rows and moves them to the History table:

DELETE NextDuplicate
OUTPUT DELETED.Product,
       DELETED.Kind,
       DELETED.Price,
       DELETED.Effective
INTO dbo.History
    (Product, Kind, Price, Effective)
FROM #NewRows AS NR
CROSS APPLY
(
    SELECT TOP (1)
        H.Effective,
        H.Price
    FROM dbo.History AS H
    WHERE
        H.Product = NR.Product
        AND H.Kind = NR.Kind
        AND H.Effective > NR.Effective
    ORDER BY
        H.Effective ASC
) AS NextHistory
CROSS APPLY
(
    SELECT TOP (1)
        HD.Effective,
        HD.Product,
        HD.Kind,
        HD.Price
    FROM dbo.HistoryDuplicates AS HD
    WHERE
        HD.Product = NR.Product
        AND HD.Kind = NR.Kind
        AND HD.Effective > NR.Effective
        AND HD.Effective < NextHistory.Effective
    ORDER BY
        HD.Effective ASC
) AS NextDuplicate
WHERE
    NR.[Action] = 'I'
    AND NOT EXISTS (SELECT NextDuplicate.Price INTERSECT SELECT NR.Price);

Relocate rows plan

Results

History table

╔═════════════════════════╦═════════╦══════╦═══════╗
║        Effective        ║ Product ║ Kind ║ Price ║
╠═════════════════════════╬═════════╬══════╬═══════╣
║ 2013-04-23 00:23:00.000 ║    1234 ║    1 ║ 1.00  ║
║ 2013-04-25 00:25:00.000 ║    1234 ║    1 ║ 1.50  ║
║ 2013-04-25 00:25:00.000 ║    1234 ║    2 ║ 2.00  ║
║ 2013-04-26 00:26:00.000 ║    1234 ║    1 ║ 2.00  ║
║ 2013-04-28 00:28:00.000 ║    1234 ║    1 ║ 1.00  ║
╚═════════════════════════╩═════════╩══════╩═══════╝

History Duplicates table

╔═════════════════════════╦═════════╦══════╦═══════╗
║        Effective        ║ Product ║ Kind ║ Price ║
╠═════════════════════════╬═════════╬══════╬═══════╣
║ 2013-04-24 00:24:00.000 ║    1234 ║    1 ║ 1.00  ║
║ 2013-04-27 00:27:00.000 ║    1234 ║    1 ║ 2.00  ║
╚═════════════════════════╩═════════╩══════╩═══════╝

Processing new data

The preceding steps are quite general. We can process a new batch of rows using exactly the same code. The next script loads the input table with two sample rows, one of which is a duplicate, and the other an example of needing to reinstate a previously-redundant row:

-- If needed
DROP TABLE #NewRows;

CREATE TABLE #NewRows
(
    Effective   datetime NOT NULL,
    Product     integer NOT NULL,
    Kind        tinyint NOT NULL,
    Price       smallmoney NOT NULL,
    [Action]    char(1) NOT NULL DEFAULT 'X',
);

CREATE UNIQUE CLUSTERED INDEX cuq
ON #NewRows (Product, Kind, Effective);

INSERT #NewRows
    (Effective, Product, Kind, Price)
VALUES
    ('2013-04-24 00:24:01.000', 1234, 1, 1.00), -- New duplicate
    ('2013-04-26 00:26:30.000', 1234, 1, 5.00); -- Complex new row

Running the rest of the general script produces this final state:

History:

╔═════════════════════════╦═════════╦══════╦═══════╗
║        Effective        ║ Product ║ Kind ║ Price ║
╠═════════════════════════╬═════════╬══════╬═══════╣
║ 2013-04-23 00:23:00.000 ║    1234 ║    1 ║ 1.00  ║
║ 2013-04-25 00:25:00.000 ║    1234 ║    1 ║ 1.50  ║
║ 2013-04-25 00:25:00.000 ║    1234 ║    2 ║ 2.00  ║
║ 2013-04-26 00:26:00.000 ║    1234 ║    1 ║ 2.00  ║
║ 2013-04-26 00:26:30.000 ║    1234 ║    1 ║ 5.00  ║ <- New
║ 2013-04-27 00:27:00.000 ║    1234 ║    1 ║ 2.00  ║ <- Recovered
║ 2013-04-28 00:28:00.000 ║    1234 ║    1 ║ 1.00  ║
╚═════════════════════════╩═════════╩══════╩═══════╝

History Duplicate:

╔═════════════════════════╦═════════╦══════╦═══════╗
║        Effective        ║ Product ║ Kind ║ Price ║
╠═════════════════════════╬═════════╬══════╬═══════╣
║ 2013-04-24 00:24:00.000 ║    1234 ║    1 ║ 1.00  ║
║ 2013-04-24 00:24:01.000 ║    1234 ║    1 ║ 1.00  ║ <- New duplicate
╚═════════════════════════╩═════════╩══════╩═══════╝

Best Answer

Tables

Initial Data

Classifying input rows

Store Redundant Rows

New History rows

Reinstating redundant records

Results

Processing new data

Related Solutions

Sql-server – How to find out what is causing the Identity column to keep going up

Sql-server – How to Rename Table or Field With Merge Replication

Related Question