MySQL – How to Remove Duplicates in Sequence Based on Keys

duplicationMySQL

I am looking to remove duplicates on a row by row comparison based on keys Member and Group and data sorted ASC on Member, Group and StartDate. Just return the records in the select query and not delete in the table.

I am basically looking at a row by row comparison to eliminate the duplicates based on the 2 keys and value of the Status column with the data sorted by Member, Group and Start_date.

Sample data:

Member  Group   Status        Start_date        End Date
1         2     Active       4/12/2011 01:19    NULL
1         2     Active       4/12/2011 02:19    NULL
1         2     Inactive     4/12/2011 03:19    NULL
1         2     Inactive     4/15/2011 04:19    NULL
1         2     Inactive     4/16/2011 05:19    NULL
1         2     Pending      4/16/2011 11:23    NULL
1         2     Active       4/16/2011 11:25    NULL
1         2     Pending      4/16/2011 11:27    NULL
11        22    Inactive     4/16/2011 11:27    NULL    

Expected Output:

Member  Group   Status        Start_date        End Date
1         2     Active       4/12/2011 01:19    NULL
1         2     Inactive     4/12/2011 03:19    NULL
1         2     Pending      4/16/2011 11:23    NULL
1         2     Active       4/16/2011 11:25    NULL
1         2     Pending      4/16/2011 11:27    NULL
11        22    Inactive     4/16/2011 11:27    NULL    

DDL and DML:

CREATE TABLE `temp1` (
  `MEMBER_ID` int(10) unsigned NOT NULL,
  `GROUPIE_ID` int(10) unsigned NOT NULL,
  `STATUS` varchar(12) CHARACTER SET utf8 NOT NULL,
  `START_DATE` datetime DEFAULT NULL,
  `end_date` datetime DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'Active', '2011-04-12 01:19:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'Active', '2011-04-12 02:19:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'InActive', '2011-04-12 03:19:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'InActive', '2011-04-15 04:19:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'InActive', '2011-04-16 05:19:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'Pending', '2011-04-16 06:23:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'Active', '2011-04-16 07:25:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'Pending', '2011-04-16 08:27:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (1,2,'InActive', '2011-04-17 09:27:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (11,22,'Active', '2011-04-12 01:19:51 PM', NULL);
Insert into temp1(Member_id, Groupie_id, Status,Start_date, End_Date ) 
values (11,23,'InActive', '2011-04-13 02:19:51 PM', NULL);

Best Answer

This works on consecutive Start_date for each partition (on Member):

SELECT `Member`, `Group`, `Status`, `Start_date`, `End Date`
FROM (
    SELECT @row := CASE WHEN @status=Status AND @member=Member AND @group = `Group`
        THEN @row + 1 ELSE 1 END as row
        , @member:=Member as Member
        , @group:=`Group` as `Group`
        , @status:=Status as Status
        , `Start_date`, `End Date`
    FROM data d
        , (SELECT @row := 0, @status := '', @member := 0, @group := 0) v
    ORDER BY Member, `Group`, Start_date
) as n
WHERE row = 1
ORDER BY Member, Start_date
;

Sample SQL Fiddle.

The behavior of this query is similar to the ROW_NUMBER() available in Oracle (>= 10g), PostgreSQL (>= 8.4) and SQL Server (>= 2012).

Notes on Partitions:

  • This query and SQL Fiddle partition by Members and order Status by Members and Start_date.
  • Partition can be remove by removing @member=Member from CASE
  • Partition can be added or set by both Members and Group like this (SQL Fiddle):
    • Add a test on @group = 'group' to the CASE
    • Add @group := 'Group' as 'Group' in the inner SELECT [n]
    • Add @group := 0 in the variable SELECT [v]
    • Add Group to ORDER BY clauses

In term of performance, an index matching the order and columns used in the ORDER BY and CASE is very likely needed.

Output:

Member  | Group     | Status    | Start_date                | End Date
1       | 2         | Active    | April, 12 2011 03:19:00   | (null)
1       | 2         | Inactive  | April, 12 2011 05:19:00   | (null)
1       | 2         | Pending   | April, 16 2011 13:23:00   | (null)
1       | 2         | Active    | April, 16 2011 13:25:00   | (null)
1       | 2         | Pending   | April, 16 2011 13:27:00   | (null)
11      | 22        | Inactive  | April, 16 2011 13:27:00   | (null)