SQL Server – Remove Similar and Successive Rows

I have a table with many entries each describing the commerce state of an item.

The table has the following structure:

CREATE TABLE CommerceState
(
    Id INT NOT NULL PRIMARY KEY
    , ItemId INT NOT NULL
    , Listings INT NOT NULL
    , UnitPrice BIGINT NOT NULL
    , Quantity INT NOT NULL
    , [Timestamp] DATETIME NOT NULL
);

At certain intervals I'm saving the current state of an item into this table. It can happen that two consecutive updates contain the same data and just differ in the timestamp.

What I want to do now is to remove those kind of duplicates. I couldn't come up with a script so far. I just got solutions to completely delete all duplicates without respect to the actual timeline.

To visualize what I would like to achieve:

Before:

+ Id + ItemId + Listings + UnitPrice + Quantity + Date                    +
| 1  | 1234   | 10       | 100       | 50       | 2015-10-22 15:55:00.000 |
| 2  | 1235   | 10       | 25        | 150      | 2015-10-22 16:00:00.000 |
| 3  | 1234   | 9        | 100       | 50       | 2015-10-22 16:05:00.000 |
| 4  | 1235   | 5        | 25        | 30       | 2015-10-22 16:10:00.000 |
| 5  | 1235   | 10       | 25        | 150      | 2015-10-22 16:15:00.000 |
| 6  | 1235   | 10       | 25        | 150      | 2015-10-22 16:20:00.000 |
| 7  | 1234   | 10       | 100       | 50       | 2015-10-22 16:25:00.000 |
| 8  | 1234   | 10       | 100       | 50       | 2015-10-22 16:30:00.000 |
| 9  | 1234   | 10       | 100       | 50       | 2015-10-22 16:35:00.000 |
| 10 | 1234   | 9        | 100       | 50       | 2015-10-22 16:40:00.000 |

After:

+ Id + ItemId + Listings + UnitPrice + Quantity + Date                    +
| 1  | 1234   | 10       | 100       | 50       | 2015-10-22 15:55:00.000 |
| 2  | 1235   | 10       | 25        | 150      | 2015-10-22 16:00:00.000 |
| 3  | 1234   | 9        | 100       | 50       | 2015-10-22 16:05:00.000 |
| 4  | 1235   | 5        | 25        | 30       | 2015-10-22 16:10:00.000 |
| 5  | 1235   | 10       | 25        | 150      | 2015-10-22 16:15:00.000 |
| 7  | 1234   | 10       | 100       | 50       | 2015-10-22 16:25:00.000 |
| 10 | 1234   | 9        | 100       | 50       | 2015-10-22 16:40:00.000 |

Any ideas/suggestions on how to write a T-SQL script that does exactly this?

DELETE c FROM CommerceState c CROSS APPLY ( -- For each row, find the previous row for this item SELECT TOP 1 prev.id, prev.Listings, prev.UnitPrice, prev.Quantity FROM CommerceState prev WHERE prev.itemId = c.itemId AND prev.[Timestamp] < c.[Timestamp] ORDER BY [Timestamp] DESC ) d -- If that row matches the current row, we'll delete the current row WHERE d.Listings = c.Listings AND d.UnitPrice = c.UnitPrice AND d.Quantity = c.Quantity

Best Answer

One way to do this is to look up the previous row for each item and check if that row has the same data. If so, we can delete the row:

Here is a full test script that creates your dummy data, runs the query, and confirms the results.

You can optionally add the following index in order to optimize the lookup of the most recent row, but it would depend on your workload whether the index is worth the extra overhead it incurs when inserting and deleting data.

-- Optionally add this index if you deem the tradeoff to be worthwhile
CREATE INDEX IX_CommerceState
    ON CommerceState (ItemId, [Timestamp])
    INCLUDE (Listings, UnitPrice, Quantity)

With the index, here is the query plan:

Best Answer

Related Solutions

Sql-server – Automatic ETL between similar tables

Sql-server – How to remove Date from output

Related Question