Mysql – Select consecutive rows that are in a date range of each other

MySQL

I have a list of dates with associated names, and I want to select all rows for names that have multiple dates where the difference between consecutive dates is more than 1 month.

E.g., only the entries marked with /*this*/ below:

CREATE TABLE IF NOT EXISTS myTab (
    id          SERIAL PRIMARY KEY,     
    dateID      DATETIME DEFAULT 0,     
    name        VARCHAR(512)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

INSERT INTO myTab 
    (dateID, name) 
VALUES
    ("20140811","Emmy"),    /*this*/
    ("20140922","Emmy"),    /*this*/
    ("20150920","Emmy"),    /*this*/
    ("20150922","Emmy"),
    ("20140722","Dave"),
    ("20140613","Stan"),
    ("20140622","Stan"),    /*this*/
    ("20151020","Stan"),    /*this*/
    ("20140305","Lora"),
    ("20140310","Lora");

In other words, the criteria are:

Partition by name
Order by date
Compare 2 consecutive rows: IF diff > 1 MONTH THEN select both, ELSE skip

Here's a working example as well as my attempt based on another answer on SO:

Rextester working example and attempt

Additional conditions/hints/…

Rows with the same name are not necessarily inserted one after another, so their ids are not necessarily consecutive. Nor are they inserted in date order. In the example above this is done only for readability. In my real problem it's not the case!
After applying your suggestions to real data, I noticed that an extra condition has to be added, tagged /*this_EXRA*/ in the example above. The 3rd Stan row is less than 1 MONTH from the 2nd but validates with the 4th. Thus, it should only be selected if it validates with both. So I guess this implies looping row by row and comparing each row with the previous and next one.

Best Answer

The logic looks simple at first, but it's quite complicated to get it right.

Let's have a working solution first, and worry about performance later. Tested at rextester.com:

SELECT t.id, t.dateID, t.name 
FROM  myTab AS t
WHERE 
       ( SELECT b.dateID
         FROM myTab AS b
         WHERE t.name = b.name
           AND b.dateID < t.dateID
         ORDER BY b.dateID DESC
         LIMIT 1
       ) + INTERVAL 1 MONTH  <=  t.dateID    
    OR 
       t.dateID + INTERVAL 1 MONTH <= 
       ( SELECT b.dateID
         FROM myTab AS b
         WHERE t.name = b.name
           AND t.dateID < b.dateID
         ORDER BY b.dateID ASC
         LIMIT 1
       )
 ;

Regarding efficiency: the query will perform rather poorly. An index on (name, dateID, id) will help, but the query will still need to run two correlated subqueries for each row of the table.
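As a rough sketch of the two points above: the index could be created as shown below, and if window functions are available, the same partition, order, and compare logic can be expressed with LAG and LEAD so that the table is read once instead of running two subqueries per row. This assumes MySQL 8.0 or later, which the question does not state. The index name and the derived-table alias are made up for illustration. Note one difference from the query above: the subqueries skip over rows with an identical dateID for the same name, while LAG and LEAD simply take the adjacent row, so results may differ if a name has duplicate dates.

```sql
-- Index suggested in the answer; idx_name_date_id is a placeholder name.
-- With VARCHAR(512) utf8, this relies on a row format that allows long
-- index keys; otherwise a prefix length on name may be needed.
CREATE INDEX idx_name_date_id ON myTab (name, dateID, id);

-- Assumption: MySQL 8.0+ (LAG/LEAD do not exist in earlier versions).
SELECT id, dateID, name
FROM (
    SELECT id, dateID, name,
           LAG(dateID)  OVER w AS prev_date,   -- previous date for this name
           LEAD(dateID) OVER w AS next_date    -- next date for this name
    FROM myTab
    -- id is only a tiebreaker so that the ordering is deterministic
    WINDOW w AS (PARTITION BY name ORDER BY dateID, id)
) AS x
-- A NULL prev_date/next_date (first or last row of a name) makes that
-- comparison NULL, so the row is kept only if the other side matches.
WHERE prev_date + INTERVAL 1 MONTH <= dateID
   OR dateID + INTERVAL 1 MONTH <= next_date
ORDER BY name, dateID;
```

On the sample data from the question, this should return the same five rows marked /*this*/.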

Related Solutions

Mysql – fast bulk incrementing in MySQL

ALTERNATE SUGGESTION

Try out your temp table solution using another method

STEP 01) CREATE TABLE foobar_new LIKE foobar;

STEP 02) Do your bulk INSERTs into foobar_new

STEP 03) CREATE TABLE foo_amount_new LIKE foo_amount;

STEP 04) Perform GROUP BY count on the latest bulk INSERT batch

INSERT INTO foo_amount_new
SELECT foo_id,COUNT(1) FROM foobar_new WHERE bar_id = ... 
GROUP BY foo_id;

STEP 05) Perform a bulk INSERT into foobar from foobar_new

INSERT INTO foobar SELECT * FROM foobar_new;

STEP 06) Perform a bulk UPDATE of foo_amount from foo_amount_new

UPDATE foo_amount A INNER JOIN foo_amount_new B
USING (foo_id) SET A.amount = A.amount + B.amount;

STEP 07) Drop the temp tables

DROP TABLE foobar_new;
DROP TABLE foo_amount_new;

Mysql – How to ensure that date-range queries involving multiple time zones are sargable

Based on the query you provided, I would say:

For the date_add field, I would definitely recommend separating the date part and the time part, as this will allow you to group by a column instead of a function.
Assuming that you will always be passing an id_website, I would recommend creating a composite index covering, in this order: id_website, date_add_date, date_add_time.
After that, do not apply DATE_FORMAT in the GROUP BY; simply group by date_add_date.
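The three recommendations above might look roughly like this. It is only a sketch: the original query and schema are not shown here, so it assumes a table named visit with an existing DATETIME column date_add and an id_website column. The index name, the website id, and the date range are placeholders.

```sql
-- Assumed schema: visit(id_website, date_add DATETIME, ...).
-- Add separate date and time columns next to the existing one.
ALTER TABLE visit
    ADD COLUMN date_add_date DATE,
    ADD COLUMN date_add_time TIME;

-- Backfill the new columns from the existing value.
UPDATE visit
SET date_add_date = DATE(date_add),
    date_add_time = TIME(date_add);

-- Composite index in the recommended column order;
-- idx_website_date_time is a placeholder name.
CREATE INDEX idx_website_date_time
    ON visit (id_website, date_add_date, date_add_time);

-- Filter and group on the plain column rather than on
-- DATE_FORMAT(date_add, ...), so the index can be used.
SELECT date_add_date, COUNT(*) AS visits
FROM visit
WHERE id_website = 1                     -- placeholder value
  AND date_add_date >= '2014-01-01'      -- placeholder range
  AND date_add_date <  '2014-02-01'
GROUP BY date_add_date;
```

New rows would also need to populate date_add_date and date_add_time at insert time for this to stay consistent.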

It might also be worth considering partitioning your main tables, such as visit, either by date_add_date or by id_website, depending on your needs. It is worth checking out the pitfalls of table partitioning as well:

http://www.mysqlperformanceblog.com/2010/12/11/mysql-partitioning-can-save-you-or-kill-you/
