Given the following data:
id | user_id | started | closed | dead
-------------------------------------------------------------------------------------------
7714 | 238846 | 2015-01-27 15:14:50 | 2015-02-02 14:14:13 | NULL
7882 | 238846 | 2015-01-28 13:25:58 | NULL | 2015-05-15 12:16:07
13190 | 259140 | 2015-03-17 10:11:44 | NULL | 2015-03-18 07:31:57
13192 | 259140 | 2015-03-17 10:12:17 | NULL | 2015-03-18 11:46:46
13194 | 259140 | 2015-03-17 10:12:53 | NULL | 2015-03-18 11:46:36
14020 | 259140 | 2015-03-23 14:32:16 | 2015-03-24 15:57:32 | NULL
17124 | 242650 | 2015-04-16 16:19:08 | 2015-04-16 16:21:06 | NULL
19690 | 238846 | 2015-05-15 13:17:31 | NULL | 2015-05-27 13:56:43
20038 | 242650 | 2015-05-19 15:38:17 | NULL | NULL
20040 | 242650 | 2015-05-19 15:39:58 | NULL | 2015-05-21 12:01:02
20302 | 242650 | 2015-05-21 13:09:06 | NULL | NULL
20304 | 242650 | 2015-05-21 13:09:54 | NULL | NULL
20306 | 242650 | 2015-05-21 13:10:19 | NULL | NULL
20308 | 242650 | 2015-05-21 13:12:20 | NULL | NULL
21202 | 238846 | 2015-05-29 16:47:29 | NULL | NULL
21204 | 238846 | 2015-05-29 16:47:56 | NULL | NULL
21208 | 238846 | 2015-05-29 17:05:15 | NULL | NULL
21210 | 238846 | 2015-05-29 17:05:55 | NULL | NULL
21918 | 242650 | 2015-06-04 17:04:29 | NULL | 2015-06-12 15:47:23
I need to build a data set that meets the following rules:
- Groups are defined first by
user_id
so we should be comparing only records from the sameuser_id
- All records that started at least within 15 days of when any other record was started, closed or dead should be counted as group.
- Of each group, the end should be calculated as either the first record closed or all records have a value for dead and we take the greatest date of the dead column.
- If a record did not start within 15 days of the start or end of another group, then it begins a new grouping.
Tentatively, I believe my data should look like this:
user_id | started | end ------------------------------------------------------ 238846 | 2015-01-27 15:14:50 | 2015-02-02 14:14:13 259140 | 2015-03-23 14:32:16 | 2015-03-24 15:57:32 242650 | 2015-04-16 16:19:08 | 2015-04-16 16:21:06 242650 | 2015-05-21 13:09:06 | NULL 238846 | 2015-05-15 13:17:31 | NULL
Can anyone provide some guidance on how to build a query to meet these conditions?
Here is a link to the DDL and DML statements for the data presented in this question.
Alternatively, we could skip rules #2 and #4 and more simply state that only records that overlap each other should be included. The more important rule is that in a given set, if there is a closed date then that becomes the end of the set and not the greatest dead date.
Best Answer
Due to the lack of clarity in the question, I came up with four different solutions. The solutions differ on:
Please note this is done in SQL Server, not MySQL. Other than some very minor syntax changes, it should work the same.
Common setup and sample data for all four methods
1. CASCADING - USING CLOSED RECORD solution
This is the solution I believe the asker is looking for & matches his results.
2. NON-CASCADING - USING CLOSED RECORD solution
Start calculated by first closed date when available, then by earliest start date.
3. NON-CASCADING - USING EARLIEST DATE solution
Start calculated by earliest date only.
4. CASCADING - USING EARLIEST DATE solution
Start calculated by earliest date only.