MySQL – Need to delete consecutive, unchanged records

I have a rather large MySQL database that tracks pricing changes for a large number of products. Each entry has a SKU, a datetime, and a price. Sometimes there is no change in the price from one time period to the next.

I am trying to figure out a way to store this data more efficiently. I would like to delete any entries where the price didn't change. For example, if a product was $10 on March 10th, 13th, and 16th, and then went to $12 on the 19th, we have no need to store the duplicate prices on the 13th and 16th. We can just delete those entries and, when presenting data, assume that no entry for a date means no change in price.

I am thinking the best way to do this is to run a script every day or so that cleans up the table and removes these duplicates. Is there a way to write a MySQL query that can do this? Otherwise, I guess I'll need to write a script that cycles through the whole table and does it with multiple queries.

edit: more detail

Here is my table:

+----+-----+------+-------+
| id | sku | date | price |
+----+-----+------+-------+

id is the auto-increment primary key.

Best Answer

For this table:

mysql> SELECT * FROM tracking;
+----+-----+---------------------+-------+
| id | sku | date                | price |
+----+-----+---------------------+-------+
|  1 |   1 | 2014-08-11 20:31:40 | 10.00 |
|  2 |   1 | 2014-08-11 20:32:05 | 11.00 |
|  3 |   2 | 2014-08-11 20:32:18 |  1.00 |
|  4 |   1 | 2014-08-11 20:32:22 | 11.00 |
|  5 |   2 | 2014-08-11 20:32:26 |  1.00 |
|  6 |   2 | 2014-08-11 20:32:32 |  1.00 |
|  7 |   1 | 2014-08-11 20:33:00 | 12.00 |
|  8 |   1 | 2014-08-11 20:33:03 | 11.00 |
|  9 |   1 | 2014-08-11 20:33:06 | 12.00 |
| 10 |   1 | 2014-08-11 20:33:10 | 12.00 |
| 11 |   2 | 2014-08-11 20:33:17 |  2.00 |
| 12 |   2 | 2014-08-11 20:33:17 |  2.00 |
+----+-----+---------------------+-------+
12 rows in set (0.01 sec)

You can run the following query to search for unchanged prices on consecutive records for the same sku:

mysql> SELECT t1.id, t1.sku, t1.date, t1.price
       FROM tracking t1
       WHERE t1.price =
          (SELECT t2.price
           FROM tracking t2
           WHERE t2.sku = t1.sku
                 AND t1.date >= t2.date
                 AND t1.id > t2.id
           ORDER BY t2.date DESC, t2.id DESC
           LIMIT 1
          );
+----+-----+---------------------+-------+
| id | sku | date                | price |
+----+-----+---------------------+-------+
|  4 |   1 | 2014-08-11 20:32:22 | 11.00 |
|  5 |   2 | 2014-08-11 20:32:26 |  1.00 |
|  6 |   2 | 2014-08-11 20:32:32 |  1.00 |
| 10 |   1 | 2014-08-11 20:33:10 | 12.00 |
| 12 |   2 | 2014-08-11 20:33:17 |  2.00 |
+----+-----+---------------------+-------+
5 rows in set (0.00 sec)

Those are the records to be deleted. So:

DELETE FROM tracking t1
WHERE ...

If you have duplicated records (same sku, price and date), it deletes the ones with the higher id. This assumes that a record with a higher id always has the same or a later date, which is normally the case when rows are only inserted with auto_increment and never updated.

Please note that this is a correlated subquery, with a cost of O(n^2) row operations, where n is the number of records. You can solve this in O(n) programmatically, with a single table scan, if you can afford O(n) memory.
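A minimal sketch of that single-scan idea, in Python. It is not part of the original answer: the function name find_redundant_ids and the in-memory row tuples are assumptions made for illustration, and it assumes the rows are fed in (date, id) order, for example from SELECT id, sku, date, price FROM tracking ORDER BY date, id. It remembers the last kept price per sku in a dictionary, so memory grows with the number of distinct skus (at most n).

```python
from decimal import Decimal
from typing import Dict, Iterable, List, Tuple

# (id, sku, date, price); hypothetical in-memory stand-in for rows of `tracking`
Row = Tuple[int, int, str, Decimal]


def find_redundant_ids(rows: Iterable[Row]) -> List[int]:
    """Return ids of rows whose price equals the previous price of the same sku.

    Assumes `rows` is already ordered by (date, id).
    """
    last_price: Dict[int, Decimal] = {}  # sku -> most recent price seen
    redundant: List[int] = []
    for row_id, sku, _date, price in rows:
        if last_price.get(sku) == price:
            # Same price as the previous record for this sku: safe to delete.
            redundant.append(row_id)
        else:
            # First record for this sku, or the price changed: keep it.
            last_price[sku] = price
    return redundant


# The sample table from the answer, already in (date, id) order.
SAMPLE_ROWS: List[Row] = [
    (1, 1, "2014-08-11 20:31:40", Decimal("10.00")),
    (2, 1, "2014-08-11 20:32:05", Decimal("11.00")),
    (3, 2, "2014-08-11 20:32:18", Decimal("1.00")),
    (4, 1, "2014-08-11 20:32:22", Decimal("11.00")),
    (5, 2, "2014-08-11 20:32:26", Decimal("1.00")),
    (6, 2, "2014-08-11 20:32:32", Decimal("1.00")),
    (7, 1, "2014-08-11 20:33:00", Decimal("12.00")),
    (8, 1, "2014-08-11 20:33:03", Decimal("11.00")),
    (9, 1, "2014-08-11 20:33:06", Decimal("12.00")),
    (10, 1, "2014-08-11 20:33:10", Decimal("12.00")),
    (11, 2, "2014-08-11 20:33:17", Decimal("2.00")),
    (12, 2, "2014-08-11 20:33:17", Decimal("2.00")),
]

if __name__ == "__main__":
    print(find_redundant_ids(SAMPLE_ROWS))  # [4, 5, 6, 10, 12]
```

On the sample table this yields the same five ids as the query. A cleanup script could then delete those ids by primary key, probably in batches; that part is left out here because it depends on the database driver in use.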

EDIT: Changed query for the structure proposed, handling duplicated values.

Related Solutions

MySQL Insert into two tables using new IDs

After your query runs, you'll have a set of copied rows (with configid=41) and an identical set of pasted rows (except for the configid=76 and the auto-created id).

Since these ids are not known in advance, you'll need another way to identify rows of the config table, e.g. a unique key (besides the auto-incrementing one), so that you can match (join) the newly created rows with the old ones.

If, for example, the (configid, optionname) is unique, then the following would work:

INSERT INTO pricing
    ( relid, price, ... )          -- relid and all the other columns,
                                   -- except any autoincrement you may have
SELECT pasted.id, p.price, ....    -- and the same columns here
FROM 
    pricing AS p
  JOIN
    tblproductconfigoptionssub AS copied
      ON  copied.id = p.relid
      AND copied.configid = 41 
  JOIN
    tblproductconfigoptionssub AS pasted
      ON  pasted.optionname = copied.optionname
      AND pasted.configid = 76 ;

MySQL – How to Efficiently Swap a Portion of a Large Table’s Data

After a little bit more digging on the MySQL website, I found out about Exchanging Partitions and Subpartitions with Tables, which seems to be a new MySQL 5.6 feature:

ALTER TABLE x EXCHANGE PARTITION y WITH TABLE z;

It seems to be exactly what I was looking for.

After running some benchmarks on my server, I found that exchanging a partition with a table is not instant, but still very fast: about 1s per 1,700,000 records in the table being integrated into the partitioned table. The number of records in the partition being taken out does not seem to have any influence on the speed. That is very close to the sub-second swapping I was hoping to get for a table with a few million records.

This is to be compared against inserting the rows directly in the partitioned table, where I get about 27,000 inserts / second.

Hence swapping an already filled up table with a partition is roughly 60x faster (for me, YMMV) than inserting directly into the partitioned table.
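The "roughly 60x" figure follows directly from the two measured rates. As a quick check, here is the arithmetic in Python; the variable names and the helper seconds_to_load are made up for illustration, and the 3,000,000-row table is an assumed example size, not a benchmark from the answer.

```python
# Rates quoted in the answer (rows per second on the author's server).
EXCHANGE_ROWS_PER_SECOND = 1_700_000  # EXCHANGE PARTITION
INSERT_ROWS_PER_SECOND = 27_000       # direct inserts into the partitioned table


def seconds_to_load(rows: int, rows_per_second: int) -> float:
    """Estimated time to bring `rows` into the table at a constant rate."""
    return rows / rows_per_second


speedup = EXCHANGE_ROWS_PER_SECOND / INSERT_ROWS_PER_SECOND

if __name__ == "__main__":
    print(f"speedup: about {speedup:.0f}x")
    # Assumed example size of 3,000,000 rows.
    print(f"exchange: {seconds_to_load(3_000_000, EXCHANGE_ROWS_PER_SECOND):.1f} s")
    print(f"inserts:  {seconds_to_load(3_000_000, INSERT_ROWS_PER_SECOND):.1f} s")
```

The ratio comes out at about 63, which the answer rounds down to 60.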

Sweet!
