MySQL – Need to delete consecutive, unchanged records

MySQL

I have a rather large MySQL db that is tracking pricing changes for a large number of products. Each entry has a SKU, a datetime, and a price. Sometimes there is no change in the price from one time period to the next.

I am trying to figure out a way to store this data more efficiently. I would like to delete any entries where the price didn't change. For example, if a product was $10 on March 10, 13, and 16, then went to $12 on the 19th, we have no need to store the duplicate prices on the 13th and 16th. We can just delete those entries, and when presenting data, we will assume no entry for those dates means no change in price.

I am thinking the best way to do this is to run a script to clean up the table and remove these duplicates every day or so. Is there any way to write a MySQL query that can do this? Otherwise, I guess I'll just need to write a script that cycles through the whole table and does it with multiple queries.

edit: more detail

Here is my table:

+----+-----+------+-------+
| id | sku | date | price |
+----+-----+------+-------+

id is the auto-increment primary key.

Best Answer

For this table:

mysql> SELECT * FROM tracking;
+----+-----+---------------------+-------+
| id | sku | date                | price |
+----+-----+---------------------+-------+
|  1 |   1 | 2014-08-11 20:31:40 | 10.00 |
|  2 |   1 | 2014-08-11 20:32:05 | 11.00 |
|  3 |   2 | 2014-08-11 20:32:18 |  1.00 |
|  4 |   1 | 2014-08-11 20:32:22 | 11.00 |
|  5 |   2 | 2014-08-11 20:32:26 |  1.00 |
|  6 |   2 | 2014-08-11 20:32:32 |  1.00 |
|  7 |   1 | 2014-08-11 20:33:00 | 12.00 |
|  8 |   1 | 2014-08-11 20:33:03 | 11.00 |
|  9 |   1 | 2014-08-11 20:33:06 | 12.00 |
| 10 |   1 | 2014-08-11 20:33:10 | 12.00 |
| 11 |   2 | 2014-08-11 20:33:17 |  2.00 |
| 12 |   2 | 2014-08-11 20:33:17 |  2.00 |
+----+-----+---------------------+-------+
12 rows in set (0.01 sec)

You can run the following query to search for unchanged prices on consecutive records for the same sku:

mysql> SELECT t1.id, t1.sku, t1.date, t1.price
       FROM tracking t1
       WHERE t1.price =
          (SELECT t2.price
           FROM tracking t2
           WHERE t2.sku = t1.sku
                 AND t1.date >= t2.date
                 AND t1.id > t2.id
           ORDER BY t2.date DESC, t2.id DESC
           LIMIT 1
          );
+----+-----+---------------------+-------+
| id | sku | date                | price |
+----+-----+---------------------+-------+
|  4 |   1 | 2014-08-11 20:32:22 | 11.00 |
|  5 |   2 | 2014-08-11 20:32:26 |  1.00 |
|  6 |   2 | 2014-08-11 20:32:32 |  1.00 |
| 10 |   1 | 2014-08-11 20:33:10 | 12.00 |
| 12 |   2 | 2014-08-11 20:33:17 |  2.00 |
+----+-----+---------------------+-------+
5 rows in set (0.00 sec)

Those are the records to be deleted. So:

DELETE FROM tracking t1
WHERE ...
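One possible way to spell out that DELETE is sketched below. This is only a sketch under assumptions, not the statement the answer had in mind. It runs against an in-memory SQLite copy of the sample table so it is self-contained, and the helper names (delete_unchanged, demo) and the alias unchanged are made up for illustration. The inner SELECT is wrapped in a derived table because MySQL rejects a DELETE whose subquery reads the target table directly (error 1093); whether that wrapper is enough on your MySQL version is something to test on a copy of the data first. The ORDER BY uses t2.date DESC, t2.id DESC so that earlier rows sharing a timestamp resolve to the nearest lower id.

```python
import sqlite3

# Sample rows copied from the tracking table shown above: (id, sku, date, price).
ROWS = [
    (1, 1, "2014-08-11 20:31:40", 10.00),
    (2, 1, "2014-08-11 20:32:05", 11.00),
    (3, 2, "2014-08-11 20:32:18", 1.00),
    (4, 1, "2014-08-11 20:32:22", 11.00),
    (5, 2, "2014-08-11 20:32:26", 1.00),
    (6, 2, "2014-08-11 20:32:32", 1.00),
    (7, 1, "2014-08-11 20:33:00", 12.00),
    (8, 1, "2014-08-11 20:33:03", 11.00),
    (9, 1, "2014-08-11 20:33:06", 12.00),
    (10, 1, "2014-08-11 20:33:10", 12.00),
    (11, 2, "2014-08-11 20:33:17", 2.00),
    (12, 2, "2014-08-11 20:33:17", 2.00),
]

# Delete every row whose price equals the price of the previous row
# (by date, then id) for the same sku. The extra "SELECT id FROM (...)"
# layer is the derived-table wrapper intended for MySQL.
DELETE_SQL = """
DELETE FROM tracking
WHERE id IN (
    SELECT id FROM (
        SELECT t1.id
        FROM tracking t1
        WHERE t1.price = (
            SELECT t2.price
            FROM tracking t2
            WHERE t2.sku = t1.sku
              AND t2.date <= t1.date
              AND t2.id < t1.id
            ORDER BY t2.date DESC, t2.id DESC
            LIMIT 1
        )
    ) AS unchanged
)
"""


def delete_unchanged(conn):
    """Run the cleanup statement and return how many rows were removed."""
    cur = conn.execute(DELETE_SQL)
    conn.commit()
    return cur.rowcount


def demo():
    """Build the sample table in memory, clean it, and report what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tracking ("
        "id INTEGER PRIMARY KEY, sku INTEGER, date TEXT, price REAL)"
    )
    conn.executemany("INSERT INTO tracking VALUES (?, ?, ?, ?)", ROWS)
    deleted = delete_unchanged(conn)
    remaining = [r[0] for r in conn.execute("SELECT id FROM tracking ORDER BY id")]
    conn.close()
    return deleted, remaining


if __name__ == "__main__":
    print(demo())
```

On the sample data this removes the same five ids the SELECT above reports (4, 5, 6, 10, 12) and leaves the rows where the price actually changed.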

If you have duplicated records (same sku, price and date), it deletes the ones with the higher id. This assumes that a record with a higher id always has the same or a later date, which normally holds when id is auto_increment and rows are never updated.

Please note that this is a correlated subquery, which costs O(n^2) row operations in the worst case, where n is the number of records. You can solve this programmatically in O(n) time with a single table scan if you can afford O(n) memory.
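The single-scan idea could look roughly like the following. It is a minimal sketch, assuming the rows are already fetched in (date, id) order, for example from a SELECT ordered by date and id; the function name find_unchanged and the tuple layout are illustrative choices, not something from the question. It keeps only the last seen price per sku in a dictionary, so the extra memory is one entry per distinct sku.

```python
def find_unchanged(rows):
    """Return ids of rows whose price repeats the previous price for the same sku.

    rows must be an iterable of (id, sku, date, price) tuples that is
    already sorted by date and then by id.
    """
    last_price = {}  # sku -> most recent price seen so far
    redundant = []
    for row_id, sku, _date, price in rows:
        if last_price.get(sku) == price:
            # Same price as the previous record for this sku: safe to delete.
            redundant.append(row_id)
        else:
            # First record for this sku, or the price changed: keep it.
            last_price[sku] = price
    return redundant


# Sample rows copied from the tracking table shown above.
SAMPLE = [
    (1, 1, "2014-08-11 20:31:40", 10.00),
    (2, 1, "2014-08-11 20:32:05", 11.00),
    (3, 2, "2014-08-11 20:32:18", 1.00),
    (4, 1, "2014-08-11 20:32:22", 11.00),
    (5, 2, "2014-08-11 20:32:26", 1.00),
    (6, 2, "2014-08-11 20:32:32", 1.00),
    (7, 1, "2014-08-11 20:33:00", 12.00),
    (8, 1, "2014-08-11 20:33:03", 11.00),
    (9, 1, "2014-08-11 20:33:06", 12.00),
    (10, 1, "2014-08-11 20:33:10", 12.00),
    (11, 2, "2014-08-11 20:33:17", 2.00),
    (12, 2, "2014-08-11 20:33:17", 2.00),
]

if __name__ == "__main__":
    print(find_unchanged(SAMPLE))  # [4, 5, 6, 10, 12]
```

The returned ids can then be removed in batches with a plain DELETE ... WHERE id IN (...), which avoids the correlated subquery entirely.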

EDIT: Changed query for the structure proposed, handling duplicated values.