Mysql – remove almost duplicates based on insert time

MySQL

Have a MySql table that records magnetic stripe swipe times from identity cards:

CREATE TABLE `swipes` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `personalid` varchar(20) NOT NULL,
 `swipe_time` datetime NOT NULL,
 `status` tinyint(1) NOT NULL COMMENT '1=registered,0=not',
 `ride_taken` tinyint(1) NOT NULL DEFAULT '0',
 `source` tinyint(4) NOT NULL DEFAULT '0',
 PRIMARY KEY (`id`),
 UNIQUE KEY `my_key` (`personalid`,`swipe_time`,`status`,`ride_taken`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

The unique key is there to prevent exact duplicate inserts.
However there is another source of almost duplicates where the card gets swiped twice in quick succession or even for some odd reason within a minute or so when there may be other cards swiped in between. In some ways this is like bounce on electronic switch contacts.

Here's a snippet of rows to illustrate:

    id      personalid      swipe_time      status   ride_taken source
    1661    C08877547   2012-10-21 01:12:08     1        1        3
    1662    C09364782   2012-10-21 01:23:38     1        1        3
    1663    C09364782   2012-10-21 01:23:48     1        1        3
    1664    D09490557   2012-10-21 01:24:39     1        1        3

What I want is a query that will remove entries like id 1663, that are within some specified time of rows with the same personalid, status=1, ride_taken=1 and source

Yes I have searched and inspected both here and on stackoverflow
such as https://stackoverflow.com/questions/18932/how-can-i-remove-duplicate-rows
for a similar situation but have not found an answer yet. If there is one already, please point me at it.

I'm at a loss how to apply a time range filter for the swipe_time to delete only similar rows within a specified time range.

Best Answer

Here is something very quick and dirty

Get the minimum datetime for each personalid, status, ride_taken

SELECT B.* FROM
(SELECT personalid,status,ride_taken,MIN(swipe_time) swipe_time
FROM swipes GROUP BY personalid,status,ride_taken) A
INNER JOIN swipes B USING (personalid,status,ride_taken,swipe_time);

First let's load your sample data:

mysql> drop database if exists sdujan;
Query OK, 1 row affected (0.00 sec)

mysql> create database sdujan;
Query OK, 1 row affected (0.00 sec)

mysql> use sdujan;
Database changed
mysql> CREATE TABLE `swipes` (
    ->  `id` int(11) NOT NULL AUTO_INCREMENT,
    ->  `personalid` varchar(20) NOT NULL,
    ->  `swipe_time` datetime NOT NULL,
    ->  `status` tinyint(1) NOT NULL COMMENT '1=registered,0=not',
    ->  `ride_taken` tinyint(1) NOT NULL DEFAULT '0',
    ->  `source` tinyint(4) NOT NULL DEFAULT '0',
    ->  PRIMARY KEY (`id`),
    ->  UNIQUE KEY `my_key` (`personalid`,`swipe_time`,`status`,`ride_taken`)
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.09 sec)

mysql> insert into swipes values
    -> (1661,'C08877547','2012-10-21 01:12:08',1,1,3),
    -> (1662,'C09364782','2012-10-21 01:23:38',1,1,3),
    -> (1663,'C09364782','2012-10-21 01:23:48',1,1,3),
    -> (1664,'D09490557','2012-10-21 01:24:39',1,1,3);
Query OK, 4 rows affected (0.00 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select * from swipes;
+------+------------+---------------------+--------+------------+--------+
| id   | personalid | swipe_time          | status | ride_taken | source |
+------+------------+---------------------+--------+------------+--------+
| 1661 | C08877547  | 2012-10-21 01:12:08 |      1 |          1 |      3 |
| 1662 | C09364782  | 2012-10-21 01:23:38 |      1 |          1 |      3 |
| 1663 | C09364782  | 2012-10-21 01:23:48 |      1 |          1 |      3 |
| 1664 | D09490557  | 2012-10-21 01:24:39 |      1 |          1 |      3 |
+------+------------+---------------------+--------+------------+--------+
4 rows in set (0.00 sec)

mysql>

Here is the output of my proposed query

mysql> SELECT B.* FROM
    -> (SELECT personalid,status,ride_taken,MIN(swipe_time) swipe_time
    -> FROM swipes GROUP BY personalid,status,ride_taken) A
    -> INNER JOIN swipes B USING (personalid,status,ride_taken,swipe_time);
+------+------------+---------------------+--------+------------+--------+
| id   | personalid | swipe_time          | status | ride_taken | source |
+------+------------+---------------------+--------+------------+--------+
| 1661 | C08877547  | 2012-10-21 01:12:08 |      1 |          1 |      3 |
| 1662 | C09364782  | 2012-10-21 01:23:38 |      1 |          1 |      3 |
| 1664 | D09490557  | 2012-10-21 01:24:39 |      1 |          1 |      3 |
+------+------------+---------------------+--------+------------+--------+
3 rows in set (0.00 sec)

mysql>

Give it a Try !!!

SUGGESTION #1

To accommodate the query better, add this index

ALTER TABLE swipes ADD UNIQUE KEY `my_key2`
(`personalid`,`status`,`ride_taken`,`swipe_time`);

SUGGESTION #2 (Optional)

To accommodate more than 255 swipe stations, perhaps source should be smallint unsigned

SUGGESTION #3 (Optional)

Some Transit Systems allow multiple swipes (up to 4 for PATH TransHudson) in one station. You could give some additional thought on this should you have to allow multiple swipes.

UPDATE 2012-10-24 17:30 EDT

If you are trying to limit within a time range, I have another viewpoint for you

Here is the subquery within my answer:

mysql> SELECT
    ->     personalid,status,ride_taken,MIN(swipe_time) swipe_time
    -> FROM
    ->     swipes GROUP BY personalid,status,ride_taken
    -> ;
+------------+--------+------------+---------------------+
| personalid | status | ride_taken | swipe_time          |
+------------+--------+------------+---------------------+
| C08877547  |      1 |          1 | 2012-10-21 01:12:08 |
| C09364782  |      1 |          1 | 2012-10-21 01:23:38 |
| D09490557  |      1 |          1 | 2012-10-21 01:24:39 |
+------------+--------+------------+---------------------+
3 rows in set (0.00 sec)

mysql>

Let's say you want to limit to the Latest Entry Per 10 seconds. You will first have to generate a ten-second interval boundary like this:

mysql> SELECT personalid,status,ride_taken,swipe_time,
    -> (swipe_time - INTERVAL MOD(UNIX_TIMESTAMP(swipe_time),10) SECOND) ten_sec_in
    -> FROM swipes;
+------------+--------+------------+---------------------+---------------------+
| personalid | status | ride_taken | swipe_time          | ten_sec_int         |
+------------+--------+------------+---------------------+---------------------+
| C08877547  |      1 |          1 | 2012-10-21 01:12:08 | 2012-10-21 01:12:00 |
| C09364782  |      1 |          1 | 2012-10-21 01:23:38 | 2012-10-21 01:23:30 |
| C09364782  |      1 |          1 | 2012-10-21 01:23:48 | 2012-10-21 01:23:40 |
| D09490557  |      1 |          1 | 2012-10-21 01:24:39 | 2012-10-21 01:24:30 |
+------------+--------+------------+---------------------+---------------------+
4 rows in set (0.00 sec)

mysql>

Perhaps the thing to do is introduce the ten_second boundary into the table

ALTER TABLE swipes ADD swipe_int DATETIME;
UPDATE swipes SET
swipe_int = swipe_time - INTERVAL MOD(UNIX_TIMESTAMP(swipe_time),10) SECOND;
ALTER TABLE swipes ADD UNIQUE (personalid,swipe_int);

I ran this

mysql> ALTER TABLE swipes ADD swipe_int DATETIME;
Query OK, 4 rows affected (0.05 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> UPDATE swipes SET
    -> swipe_int = swipe_time - INTERVAL MOD(UNIX_TIMESTAMP(swipe_time),10) SECOND;
Query OK, 4 rows affected (0.00 sec)
Rows matched: 4  Changed: 4  Warnings: 0

mysql> ALTER TABLE swipes ADD UNIQUE personalid_swipe_int_ndx (personalid,swipe_int);
Query OK, 4 rows affected (0.05 sec)
Records: 4  Duplicates: 0  Warnings: 0

mysql> select * from swipes;
+------+------------+---------------------+--------+------------+--------+---------------------+
| id   | personalid | swipe_time          | status | ride_taken | source | swipe_int           |
+------+------------+---------------------+--------+------------+--------+---------------------+
| 1661 | C08877547  | 2012-10-21 01:12:08 |      1 |          1 |      3 | 2012-10-21 01:12:00 |
| 1662 | C09364782  | 2012-10-21 01:23:38 |      1 |          1 |      3 | 2012-10-21 01:23:30 |
| 1663 | C09364782  | 2012-10-21 01:23:48 |      1 |          1 |      3 | 2012-10-21 01:23:40 |
| 1664 | D09490557  | 2012-10-21 01:24:39 |      1 |          1 |      3 | 2012-10-21 01:24:30 |
+------+------------+---------------------+--------+------------+--------+---------------------+
4 rows in set (0.00 sec)

mysql> show create table swipes\G
*************************** 1. row ***************************
       Table: swipes
Create Table: CREATE TABLE `swipes` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `personalid` varchar(20) NOT NULL,
  `swipe_time` datetime NOT NULL,
  `status` tinyint(1) NOT NULL COMMENT '1=registered,0=not',
  `ride_taken` tinyint(1) NOT NULL DEFAULT '0',
  `source` tinyint(4) NOT NULL DEFAULT '0',
  `swipe_int` datetime DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `my_key` (`personalid`,`swipe_time`,`status`,`ride_taken`),
  UNIQUE KEY `personalid_swipe_int_ndx` (`personalid`,`swipe_int`)
) ENGINE=MyISAM AUTO_INCREMENT=1665 DEFAULT CHARSET=utf8
1 row in set (0.01 sec)

mysql>

Once you do that, going forward, any INSERT within the same 10-second interval gets immediately rejected. That way, you do not have to code it. It will be part of the table design.

Naturally, you will need a trigger to populate swipe_int. Here is the trigger:

mysql> DELIMITER $$
mysql> CREATE TRIGGER swipes_bi BEFORE INSERT ON swipes
    -> FOR EACH ROW
    -> BEGIN
    ->     SET NEW.swipe_int = (NEW.swipe_time -
    ->     INTERVAL MOD(UNIX_TIMESTAMP(NEW.swipe_time),10) SECOND);
    -> END $$
Query OK, 0 rows affected (0.11 sec)

mysql> DELIMITER ;
mysql>