SQL Server – Remove Similar and Successive Rows

sql serversql server 2014t-sql

I have a table with many entries each describing the commerce state of an item.

The table has the following structure:

CREATE TABLE CommerceState
(
    Id INT NOT NULL PRIMARY KEY
    , ItemId INT NOT NULL
    , Listings INT NOT NULL
    , UnitPrice BIGINT NOT NULL
    , Quantity INT NOT NULL
    , [Timestamp] DATETIME NOT NULL
);

At certain intervals I'm saving the current state of an item into this table. It can happen that two consecutive updates contain the same data and just differ in the timestamp.

What I want to do now is to remove those kind of duplicates. I couldn't come up with a script so far. I just got solutions to completely delete all duplicates without respect to the actual timeline.

To visualize what I would like to achieve:

Before:

+ Id + ItemId + Listings + UnitPrice + Quantity + Date                    +
| 1  | 1234   | 10       | 100       | 50       | 2015-10-22 15:55:00.000 |
| 2  | 1235   | 10       | 25        | 150      | 2015-10-22 16:00:00.000 |
| 3  | 1234   | 9        | 100       | 50       | 2015-10-22 16:05:00.000 |
| 4  | 1235   | 5        | 25        | 30       | 2015-10-22 16:10:00.000 |
| 5  | 1235   | 10       | 25        | 150      | 2015-10-22 16:15:00.000 |
| 6  | 1235   | 10       | 25        | 150      | 2015-10-22 16:20:00.000 |
| 7  | 1234   | 10       | 100       | 50       | 2015-10-22 16:25:00.000 |
| 8  | 1234   | 10       | 100       | 50       | 2015-10-22 16:30:00.000 |
| 9  | 1234   | 10       | 100       | 50       | 2015-10-22 16:35:00.000 |
| 10 | 1234   | 9        | 100       | 50       | 2015-10-22 16:40:00.000 |

After:

+ Id + ItemId + Listings + UnitPrice + Quantity + Date                    +
| 1  | 1234   | 10       | 100       | 50       | 2015-10-22 15:55:00.000 |
| 2  | 1235   | 10       | 25        | 150      | 2015-10-22 16:00:00.000 |
| 3  | 1234   | 9        | 100       | 50       | 2015-10-22 16:05:00.000 |
| 4  | 1235   | 5        | 25        | 30       | 2015-10-22 16:10:00.000 |
| 5  | 1235   | 10       | 25        | 150      | 2015-10-22 16:15:00.000 |
| 7  | 1234   | 10       | 100       | 50       | 2015-10-22 16:25:00.000 |
| 10 | 1234   | 9        | 100       | 50       | 2015-10-22 16:40:00.000 |

Any ideas/suggestions on how to write a T-SQL script that does exactly this?

Best Answer

One way to do this is to look up the previous row for each item and check if that row has the same data. If so, we can delete the row:

DELETE c
FROM CommerceState c
CROSS APPLY (
    -- For each row, find the previous row for this item
    SELECT TOP 1 prev.id, prev.Listings, prev.UnitPrice, prev.Quantity
    FROM CommerceState prev
    WHERE prev.itemId = c.itemId
        AND prev.[Timestamp] < c.[Timestamp]
    ORDER BY [Timestamp] DESC
) d
-- If that row matches the current row, we'll delete the current row
WHERE d.Listings = c.Listings
    AND d.UnitPrice = c.UnitPrice
    AND d.Quantity = c.Quantity

Here is a full test script that creates your dummy data, runs the query, and confirms the results.

You can optionally add the following index in order to optimize the lookup of the most recent row, but it would depend on your workload whether the index is worth the extra overhead it incurs when inserting and deleting data.

-- Optionally add this index if you deem the tradeoff to be worthwhile
CREATE INDEX IX_CommerceState
    ON CommerceState (ItemId, [Timestamp])
    INCLUDE (Listings, UnitPrice, Quantity)

With the index, here is the query plan:

enter image description here