I have a table with many entries each describing the commerce state of an item.
The table has the following structure:
CREATE TABLE CommerceState
(
Id INT NOT NULL PRIMARY KEY
, ItemId INT NOT NULL
, Listings INT NOT NULL
, UnitPrice BIGINT NOT NULL
, Quantity INT NOT NULL
, [Timestamp] DATETIME NOT NULL
);
At certain intervals I save the current state of an item into this table. It can happen that two consecutive updates contain the same data and differ only in the timestamp.
What I want to do now is remove those kinds of duplicates. I couldn't come up with a script so far; I have only found solutions that delete all duplicates without respecting the actual timeline.
To visualize what I would like to achieve:
Before:
+ Id + ItemId + Listings + UnitPrice + Quantity + Timestamp +
| 1 | 1234 | 10 | 100 | 50 | 2015-10-22 15:55:00.000 |
| 2 | 1235 | 10 | 25 | 150 | 2015-10-22 16:00:00.000 |
| 3 | 1234 | 9 | 100 | 50 | 2015-10-22 16:05:00.000 |
| 4 | 1235 | 5 | 25 | 30 | 2015-10-22 16:10:00.000 |
| 5 | 1235 | 10 | 25 | 150 | 2015-10-22 16:15:00.000 |
| 6 | 1235 | 10 | 25 | 150 | 2015-10-22 16:20:00.000 |
| 7 | 1234 | 10 | 100 | 50 | 2015-10-22 16:25:00.000 |
| 8 | 1234 | 10 | 100 | 50 | 2015-10-22 16:30:00.000 |
| 9 | 1234 | 10 | 100 | 50 | 2015-10-22 16:35:00.000 |
| 10 | 1234 | 9 | 100 | 50 | 2015-10-22 16:40:00.000 |
After:
+ Id + ItemId + Listings + UnitPrice + Quantity + Timestamp +
| 1 | 1234 | 10 | 100 | 50 | 2015-10-22 15:55:00.000 |
| 2 | 1235 | 10 | 25 | 150 | 2015-10-22 16:00:00.000 |
| 3 | 1234 | 9 | 100 | 50 | 2015-10-22 16:05:00.000 |
| 4 | 1235 | 5 | 25 | 30 | 2015-10-22 16:10:00.000 |
| 5 | 1235 | 10 | 25 | 150 | 2015-10-22 16:15:00.000 |
| 7 | 1234 | 10 | 100 | 50 | 2015-10-22 16:25:00.000 |
| 10 | 1234 | 9 | 100 | 50 | 2015-10-22 16:40:00.000 |
Any ideas/suggestions on how to write a T-SQL script that does exactly this?
Best Answer
One way to do this is to look up the previous row for each item and check whether that row contains the same data. If so, we can delete the row.
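A sketch of that approach, assuming SQL Server 2012 or later (where `LAG` is available), compares each row's data columns to the previous row for the same `ItemId` in timestamp order, then deletes through the CTE:

```sql
WITH Ordered AS
(
    SELECT Id,
           Listings,
           UnitPrice,
           Quantity,
           -- Previous values for the same item, ordered by time
           LAG(Listings)  OVER (PARTITION BY ItemId ORDER BY [Timestamp]) AS PrevListings,
           LAG(UnitPrice) OVER (PARTITION BY ItemId ORDER BY [Timestamp]) AS PrevUnitPrice,
           LAG(Quantity)  OVER (PARTITION BY ItemId ORDER BY [Timestamp]) AS PrevQuantity
    FROM dbo.CommerceState
)
DELETE FROM Ordered
WHERE Listings  = PrevListings   -- row repeats its predecessor exactly,
  AND UnitPrice = PrevUnitPrice  -- so it carries no new information
  AND Quantity  = PrevQuantity;
```

For the first row of each item the `LAG` values are NULL, so the `WHERE` comparisons evaluate to unknown and that row is kept. Note the delete is a single pass: a row that repeats an already-deleted duplicate still compares against its original predecessor, which keeps the earliest row of each run, as desired.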
Here is a full test script that creates your dummy data, runs the query, and confirms the results.
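Along those lines, a self-contained test script might look like the following (a sketch: the table and data come from the question, and the `LAG`-based delete is one possible implementation):

```sql
CREATE TABLE dbo.CommerceState
(
    Id INT NOT NULL PRIMARY KEY
    , ItemId INT NOT NULL
    , Listings INT NOT NULL
    , UnitPrice BIGINT NOT NULL
    , Quantity INT NOT NULL
    , [Timestamp] DATETIME NOT NULL
);

INSERT INTO dbo.CommerceState (Id, ItemId, Listings, UnitPrice, Quantity, [Timestamp])
VALUES
  ( 1, 1234, 10, 100,  50, '2015-10-22T15:55:00'),
  ( 2, 1235, 10,  25, 150, '2015-10-22T16:00:00'),
  ( 3, 1234,  9, 100,  50, '2015-10-22T16:05:00'),
  ( 4, 1235,  5,  25,  30, '2015-10-22T16:10:00'),
  ( 5, 1235, 10,  25, 150, '2015-10-22T16:15:00'),
  ( 6, 1235, 10,  25, 150, '2015-10-22T16:20:00'),
  ( 7, 1234, 10, 100,  50, '2015-10-22T16:25:00'),
  ( 8, 1234, 10, 100,  50, '2015-10-22T16:30:00'),
  ( 9, 1234, 10, 100,  50, '2015-10-22T16:35:00'),
  (10, 1234,  9, 100,  50, '2015-10-22T16:40:00');

-- Remove rows that merely repeat the previous state of the same item
WITH Ordered AS
(
    SELECT Id, Listings, UnitPrice, Quantity,
           LAG(Listings)  OVER (PARTITION BY ItemId ORDER BY [Timestamp]) AS PrevListings,
           LAG(UnitPrice) OVER (PARTITION BY ItemId ORDER BY [Timestamp]) AS PrevUnitPrice,
           LAG(Quantity)  OVER (PARTITION BY ItemId ORDER BY [Timestamp]) AS PrevQuantity
    FROM dbo.CommerceState
)
DELETE FROM Ordered
WHERE Listings  = PrevListings
  AND UnitPrice = PrevUnitPrice
  AND Quantity  = PrevQuantity;

-- Expected survivors: Ids 1, 2, 3, 4, 5, 7, 10
SELECT Id, ItemId, Listings, UnitPrice, Quantity, [Timestamp]
FROM dbo.CommerceState
ORDER BY Id;
```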
You can optionally add the following index to speed up the lookup of the most recent row; whether it is worth the extra overhead on inserts and deletes depends on your workload.
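For example (the index name is an assumption; the key columns match the window function's `PARTITION BY`/`ORDER BY`, and the included columns let the comparison be satisfied from the index alone):

```sql
CREATE NONCLUSTERED INDEX IX_CommerceState_ItemId_Timestamp
ON dbo.CommerceState (ItemId, [Timestamp])
INCLUDE (Listings, UnitPrice, Quantity);
```

With `(ItemId, [Timestamp])` as the key, the rows arrive already sorted for the `LAG` computation, so the plan can avoid an explicit sort.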
With the index, here is the query plan: