Select Only One Duplicate with Different Time Column in SQL Server

sql server

So I have data that is structured like this:

|ID | Reference| Name | Datetime |
 1       abc     John      2016-02-08 08:25:35
 2       abc     John      2016-02-08 08:25:37
 3       bnm     Eliza     2016-02-08 08:25:39
 4       abc     John      2016-02-08 08:25:42
.......
 351     cdf     Eliza     2016-06-01 15:25:42
 352     abc     John      2016-06-02 15:25:42

Due to an error in a logging system, which for the time being is outside my reach, I get several rows of data for what should just be one. I can have up to 20-30 duplicates within a 20 second timeframe. I want to select only one record of those duplicates where the time only differs by 20 seconds, but at the same time I want to include records, such as id 352, which would be excluded if I used a group by clause. I also have other records inside the that 20 second timeframe that I want to include, such as id 3, so I cant directly filter out records that are too close.

Is an option to make a kind of clustered column that combine Reference, name and datetime down to f.ex 20 second precision, and group by that?

I want to keep the duplicate with the earliest starttime.

Best Answer

You can achieve this goal by using the windowing functionlag and self joining the table onto itself. I've included an example below. You may need to tweak some of the code to get exactly what you want, but I think this comes really close.

The Setup

DECLARE @TestData AS TABLE
    (
    ID INT NOT NULL
    , Reference CHAR(3) NULL
    , RefName CHAR(20) NULL
    , RefTime DATETIME2(7) NULL
    );

INSERT INTO @TestData
(ID, Reference, RefName, RefTime)
VALUES (1,'abc', 'John', '2016-02-08 08:25:35')
    , (2, 'abc', 'John', '2016-02-08 08:25:37')
    , (3, 'bnm', 'Eliza', '2016-02-08 08:25:39')
    , (4, 'abc', 'John', '2016-02-08 08:25:42')
    , (351, 'cdf', 'Eliza', '2016-06-01 15:25:42')
    , (352, 'abc', 'John', '2016-06-02 15:25:42');

The Query

WITH CTE_FirstPass AS
    (
    SELECT T.ID, T.Reference, T.RefName, T.RefTime
        , LAG(T.ID) OVER (PARTITION BY T.RefName, T.Reference ORDER BY T.RefTime) AS PrevIDForReference
    FROM @TestData AS T
    )
    , CTE_SecondPass AS
    (
    SELECT P.*
        , PriorT.RefTime AS PriorRefTime
        , ABS(DATEDIFF(SECOND, PriorT.RefTime, P.RefTime)) AS PriorTimeDiff
        , IncludeData = CASE    WHEN P.PrevIDForReference IS NULL THEN 1 
                                WHEN ABS(DATEDIFF(SECOND, PriorT.RefTime, P.RefTime)) <= 20 THEN 0
                                ELSE 1
                                END                         
    FROM CTE_FirstPass AS P
        LEFT OUTER JOIN @TestData AS PriorT ON PriorT.ID = P.PrevIDForReference AND PriorT.Reference = P.Reference AND PriorT.RefName = P.RefName
    )
    SELECT P.ID
        , P.Reference
        , P.RefName
        , P.RefTime
    FROM CTE_SecondPass AS P
    WHERE P.IncludeData = 1
    ORDER BY P.ID;