Select Only One Duplicate with Different Time Column in SQL Server

sql server

So I have data that is structured like this:

|ID | Reference| Name | Datetime |
 1       abc     John      2016-02-08 08:25:35
 2       abc     John      2016-02-08 08:25:37
 3       bnm     Eliza     2016-02-08 08:25:39
 4       abc     John      2016-02-08 08:25:42
.......
 351     cdf     Eliza     2016-06-01 15:25:42
 352     abc     John      2016-06-02 15:25:42

Due to an error in a logging system, which for the time being is outside my reach, I get several rows of data for what should just be one. I can have up to 20-30 duplicates within a 20 second timeframe. I want to select only one record of those duplicates where the time only differs by 20 seconds, but at the same time I want to include records, such as id 352, which would be excluded if I used a group by clause. I also have other records inside the that 20 second timeframe that I want to include, such as id 3, so I cant directly filter out records that are too close.

Is an option to make a kind of clustered column that combine Reference, name and datetime down to f.ex 20 second precision, and group by that?

I want to keep the duplicate with the earliest starttime.

Best Answer

You can achieve this goal by using the windowing functionlag and self joining the table onto itself. I've included an example below. You may need to tweak some of the code to get exactly what you want, but I think this comes really close.

The Setup

DECLARE @TestData AS TABLE
    (
    ID INT NOT NULL
    , Reference CHAR(3) NULL
    , RefName CHAR(20) NULL
    , RefTime DATETIME2(7) NULL
    );

INSERT INTO @TestData
(ID, Reference, RefName, RefTime)
VALUES (1,'abc', 'John', '2016-02-08 08:25:35')
    , (2, 'abc', 'John', '2016-02-08 08:25:37')
    , (3, 'bnm', 'Eliza', '2016-02-08 08:25:39')
    , (4, 'abc', 'John', '2016-02-08 08:25:42')
    , (351, 'cdf', 'Eliza', '2016-06-01 15:25:42')
    , (352, 'abc', 'John', '2016-06-02 15:25:42');

The Query

WITH CTE_FirstPass AS
    (
    SELECT T.ID, T.Reference, T.RefName, T.RefTime
        , LAG(T.ID) OVER (PARTITION BY T.RefName, T.Reference ORDER BY T.RefTime) AS PrevIDForReference
    FROM @TestData AS T
    )
    , CTE_SecondPass AS
    (
    SELECT P.*
        , PriorT.RefTime AS PriorRefTime
        , ABS(DATEDIFF(SECOND, PriorT.RefTime, P.RefTime)) AS PriorTimeDiff
        , IncludeData = CASE    WHEN P.PrevIDForReference IS NULL THEN 1 
                                WHEN ABS(DATEDIFF(SECOND, PriorT.RefTime, P.RefTime)) <= 20 THEN 0
                                ELSE 1
                                END                         
    FROM CTE_FirstPass AS P
        LEFT OUTER JOIN @TestData AS PriorT ON PriorT.ID = P.PrevIDForReference AND PriorT.Reference = P.Reference AND PriorT.RefName = P.RefName
    )
    SELECT P.ID
        , P.Reference
        , P.RefName
        , P.RefTime
    FROM CTE_SecondPass AS P
    WHERE P.IncludeData = 1
    ORDER BY P.ID;

Related Solutions

Sql-server – Get records between time span

Deeptechtons, we really would like to see your answer, but since you haven't posted it yet, here is how it could be done. This SQL is for Oracle not SQL Server, so it would have to be translated. There are details missing from your post, so some assumptions used are likely invalid.

with AllEmployees As 
   (select distinct PersonName from samples)
select * from AllEmployees 
minus
select PersonName from samples 
where trunc(not_avl_date) = trunc(sysdate) 
and (not_avl_for = 1
   or (not_avl_for = 2 and :timespan LIKE '%AM%')
   or (not_avl_for = 3 and :timespan LIKE '%PM%')
);

Sample data:

create table samples (rid Number(3), PersonName Varchar2(20), 
   not_avl_date date, not_avl_for integer);

insert into samples values (1,'Peter',sysdate-2, 1);
insert into samples values (2,'Peter',sysdate-1, 2);
insert into samples values (3,'Paul',sysdate-1, 3);
insert into samples values (4,'Sue',sysdate, 3);
insert into samples values (5,'Bob',sysdate-2, 2);
insert into samples values (6,'Zach',sysdate+1, 1);

SQL Server Database Design – One Table or Duplicate Table with Different Dependencies

Yes, definitely separate them as they represent different concepts. An invevtory item is a count of stock on hand and mirrors physical artefacts. An order item is a clause in a contract of sale. The order item may or may not be honoured, amended or have a different product substituted. It only gains anything like physical representation when a shipment and corresponding shipment items are created.

Each of these "item" tables will have a foreign key to a Product table. Each represents the intersection of Product and some header table resolving a many-to-many relationship.

Best Answer

Related Solutions

Sql-server – Get records between time span

SQL Server Database Design – One Table or Duplicate Table with Different Dependencies

Related Question