Mysql – Select consecutive rows that are in a date range of each other

MySQL

Having a list of associated dates & names and I want to select: all rows with names having multiple dates when the difference between their dates is more than 1 month.

e.g.: only the entries indicated below marked with \*this*\

CREATE TABLE IF NOT EXISTS myTab (
    id          SERIAL PRIMARY KEY,     
    dateID      DATETIME DEFAULT 0,     
    name        VARCHAR(512)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

INSERT INTO myTab 
    (dateID, name) 
VALUES
    ("20140811","Emmy"),    /*this*/
    ("20140922","Emmy"),    /*this*/
    ("20150920","Emmy"),    /*this*/
    ("20150922","Emmy"),
    ("20140722","Dave"),
    ("20140613","Stan"),
    ("20140622","Stan"),    /*this*/
    ("20151020","Stan"),    /*this*/
    ("20140305","Lora"),
    ("20140310","Lora");

In other words the criteria is:

  • Partition by name
  • Order by date
  • Compare 2 consecutive rows: IF diff > 1 MONTH THEN select both, ELSE skip

Here's a working example as well as my attempt based on another answer on SO:

Rextester working example and attempt

Additional conditions/hints/…

  • Rows having the same name are not necessarily inserted once after the other making them spaced with +1 id from eachothers. Nor they are inserted oredered by date. In the example above it's done so just for readability. In my real problem it's not the case!
  • After applying your suggestions on real data I noticed an extra condition to be added tagged /*this_EXRA*/ in the example above. The 3rd stan row is in less than 1 MONTH from the 2nd but validates it with the 4th. Thus, it should only be selected if it validates them both. So I guess this implicates looping row by row and compare with previous and next one each time.

Best Answer

The logic looks simple at first, but it's quite complicated to get it right.

Let's have a working solution first, and worry about performance later. Tested at rextester.com:

SELECT t.id, t.dateID, t.name 
FROM  myTab AS t
WHERE 
       ( SELECT b.dateID
         FROM myTab AS b
         WHERE t.name = b.name
           AND b.dateID < t.dateID
         ORDER BY b.dateID DESC
         LIMIT 1
       ) + INTERVAL 1 MONTH  <=  t.dateID    
    OR 
       t.dateID + INTERVAL 1 MONTH <= 
       ( SELECT b.dateID
         FROM myTab AS b
         WHERE t.name = b.name
           AND t.dateID < b.dateID
         ORDER BY b.dateID ASC
         LIMIT 1
       )
 ;

Regarding efficiency: the query will perform rather poorly. An index on (name, dateID, id) will help but the query will still need to do 2 subqueries for each row of the table.