MySQL – Finding Gaps of Date Ranges – Overlapping Dates

gaps-and-islandsMySQL

I have been struggling for a long time with generating gaps in date ranges, according to a set.

CREATE TABLE `terms` (
  `id` int NOT NULL PRIMARY KEY,
  `term_begin` date NOT NULL,
  `term_end` date DEFAULT NULL,
  `term_served` int DEFAULT NULL,
  `office_type_id` int DEFAULT NULL,
  `person_id` int DEFAULT NULL,
  `state_id` int DEFAULT NULL,
  `district_id` int DEFAULT NULL,
  `removal_reason_id` int DEFAULT NULL,
  `political_party_id` int DEFAULT NULL,
  `is_elected` tinyint(1) DEFAULT NULL,
  `is_holdover` tinyint(1) DEFAULT NULL,
  `neat_race_id` varchar(64) DEFAULT NULL,
  `office_class` int NOT NULL,
  `notes` varchar(255) DEFAULT NULL,
  `is_vacant` tinyint(1) DEFAULT NULL
);

INSERT INTO `terms` (`id`, `term_begin`, `term_end`, `term_served`, `office_type_id`, `person_id`, `state_id`, `district_id`, `removal_reason_id`, `political_party_id`, `is_elected`, `is_holdover`, `neat_race_id`, `office_class`, `notes`, `is_vacant`) VALUES
(2, '2005-01-03', '2011-01-04', NULL, 1, 3, 2, NULL, NULL, 10001, 1, 0, '2004-11-02~WA~S', 3, 'S', NULL),
(44, '2007-01-04', '2013-01-02', NULL, 1, 96, 2, NULL, NULL, 10001, 1, 0, '2006-11-07~WA~S', 1, 'S', NULL),
(113, '2011-01-05', NULL, NULL, 1, 3, 2, NULL, NULL, 10001, 1, 0, '2010-11-02~WA~S', 3, 'S', NULL),
(147, '2013-01-03', NULL, NULL, 1, 96, 2, NULL, NULL, 10001, 1, 0, '2012-11-06~WA~S', 1, 'S', NULL),
(250, '2001-01-03', '2007-01-03', NULL, 1, 96, 2, NULL, NULL, 10001, 1, NULL, '2000-11-07~WA~S', 1, NULL, NULL);

INSERT INTO `terms` (`id`, `term_begin`, `term_end`, `term_served`, `office_type_id`, `person_id`, `state_id`, `district_id`, `removal_reason_id`, `political_party_id`, `is_elected`, `is_holdover`, `neat_race_id`, `office_class`, `notes`, `is_vacant`) VALUES
(26, '2005-01-03', '2011-01-04', NULL, 1, 57, 26, NULL, NULL, 10000, 1, 0, '2004-11-02~MO~S', 3, 'S', NULL),
(129, '2011-01-05', NULL, NULL, 1, 262, 26, NULL, NULL, 10000, 1, 0, '2010-11-02~MO~S', 3, 'S', NULL),
(61, '2007-01-04', '2013-01-02', NULL, 1, 133, 26, NULL, NULL, 10001, 1, 0, '2006-11-07~MO~S', 1, 'S', NULL),
(154, '2013-01-03', NULL, NULL, 1, 133, 26, NULL, NULL, 10001, 1, 0, '2012-11-06~MO~S', 1, 'S', NULL),
(218, '2003-01-03', '2009-01-05', NULL, 1, 134, 26, NULL, NULL, 10000, 1, NULL, '2002-11-05~MO~S', 1, NULL, NULL);

INSERT INTO `terms` (`id`, `term_begin`, `term_end`, `term_served`, `office_type_id`, `person_id`, `state_id`, `district_id`, `removal_reason_id`, `political_party_id`, `is_elected`, `is_holdover`, `neat_race_id`, `office_class`, `notes`, `is_vacant`) VALUES
(24, '2005-01-03', '2011-01-04', NULL, 1, 53, 24, NULL, NULL, 10000, 1, 0, '2004-11-02~GA~S', 3, 'S', NULL),
(124, '2011-01-05', NULL, NULL, 1, 53, 24, NULL, NULL, 10000, 1, 0, '2010-11-02~GA~S', 3, 'S', NULL),
(183, '2015-01-06', NULL, NULL, 1, 345, 24, NULL, NULL, 10000, 1, 0, '2014-11-04~GA~S', 2, 'S', NULL),
(208, '2003-01-03', '2015-01-05', NULL, 1, 193, 24, NULL, NULL, 10000, 1, NULL, '2002-11-05~GA~S', 2, NULL, NULL);

THE GOAL

For every grouping of (office_type_id, state_id, district_id, office_class)

find the term_end of the term, preceding another term (with same above group), as well as the term_begin of the succeeding term.

This will = the date range for which no records are logged

Every data chunk, has the same state as the other rows in the chunk, differentiated by office class

first chunk, has no gaps, and no overlaps (dates are for office class 1)

2001-01-03 -> 2007-01-03
2007-01-04 -> 2013-01-02
2013-01-03 -> NULL

Second Chunk, has no gaps, but has overlaps (dates are for office class 1)

2003-01-03 -> 2009-01-05
2007-01-04 -> 2013-01-02 (overlaps with above)
2013-01-03 -> NULL

Third Chunk, has gap (office class 2)

2003-01-03 -> 2009-01-05
2015-01-06 -> Null)

gap = 2009-01-05 -> 2015-01-06

This database grows forward. District_id can be null, but the otehr values can not be (state_id, office_type_id, office_class)

My attempts have largely been failing. A fiddle is created here

http://sqlfiddle.com/#!9/d3d19/22

I am first trying to merge the overlaps, and (add 1 day to term_end to make all regular 1 day gaps ignored).

This does not work. You see for state_id = 2 and office_class – 1, there are two rows.

I would imagine this should merge as 2001-01-03 -> Null

The merge is just the first step of the process I am trying to accomplish.

I am trying to find date ranges for which the office_type,
office_class, state, district
does not have term data

I am at a complete loss for this. I have been trying for a week now. I have a variety of other versions of queries to post as well if requested.

EXPECTED OUTPUT

[office_type_id ] [ state_id ] [district_id] [office_class] [FROM] [TO]

1                    24            NULL      2     '2009-01-05'  '2015-01-06'

The above means, that no a term exists from … -> '2009-01-05'. and from '2015-01-06' -> …. but no term exists between the two.

Best Answer

If overlaps are only partial, (i.e., a range may partially overlap another, but no range is a subset of another range), I think the following query will do what you want:

SELECT t1.office_type_id, t1.state_id, t1.district_id, t1.office_class,
        t1.term_end, MIN(t2.term_begin) next_begin
FROM terms t1 JOIN terms t2  
  ON t1.office_type_id=t2.office_type_id AND 
     t1.state_id=t2.state_id AND 
     (t1.district_id=t2.district_id OR 
       (t1.district_id IS NULL AND t2.district_id IS NULL)) AND 
     t1.office_class=t2.office_class 
WHERE t1.term_begin < t2.term_begin 
GROUP BY t1.office_type_id, t1.state_id, t1.district_id, t1.office_class, 
         t1.term_end 
HAVING t1.term_end < next_begin - INTERVAL 1 DAY;

If ranges may fully overlap, I suggest to create a view that removes such subranges:

CREATE VIEW terms1 AS
SELECT DISTINCT * FROM terms t3 
WHERE (office_type_id, state_id, office_class) NOT IN
     (SELECT office_type_id, state_id, office_class FROM terms t4 
      WHERE ((t4.term_begin < t3.term_begin AND t3.term_end <=t4.term_end) OR 
             (t4.term_begin = t3.term_begin AND t3.term_end < t4.term_end)) 
        AND (t3.district_id = t4.district_id OR 
             (t3.district_id IS NULL and t4.district_id IS NULL)));

Then you can use this view instead of the table in the query above.

(And life would have been a bit simpler if you had picked some other value than NULL for the default district_id).