So I have data that is structured like this:
|ID | Reference| Name | Datetime |
1 abc John 2016-02-08 08:25:35
2 abc John 2016-02-08 08:25:37
3 bnm Eliza 2016-02-08 08:25:39
4 abc John 2016-02-08 08:25:42
.......
351 cdf Eliza 2016-06-01 15:25:42
352 abc John 2016-06-02 15:25:42
Due to an error in a logging system, which for the time being is outside my reach, I get several rows of data for what should just be one. I can have up to 20-30 duplicates within a 20 second timeframe. I want to select only one record of those duplicates where the time only differs by 20 seconds, but at the same time I want to include records, such as id 352, which would be excluded if I used a group by clause. I also have other records inside the that 20 second timeframe that I want to include, such as id 3, so I cant directly filter out records that are too close.
Is an option to make a kind of clustered column that combine Reference, name and datetime down to f.ex 20 second precision, and group by that?
I want to keep the duplicate with the earliest starttime.
Best Answer
You can achieve this goal by using the windowing functionlag and self joining the table onto itself. I've included an example below. You may need to tweak some of the code to get exactly what you want, but I think this comes really close.
The Setup
The Query