Sql-server – How to group records by a two-value column ignoring small gaps

gaps-and-islandssql serversql-server-2008

I have a subquery with a column named code that only has 1 and 0 values and an ID column because I query a lot of codes for each ID. I want to group by said column (and by ID) and by ocurrence but want to ignore groups with less than 3 records, so that bigger groups get merged.

I use ROW_NUMBER() - ROW_NUMBER() over (partition ...) to generate a group number which I use to group records by, but sometimes the records get mixed up caused by a coincidence in the sequence of the row number and the partitioned row number. This is an example

code | row_number | row_number_partition_by_code | group_number
------------------------------------------------------------------
  0  |     1      |             1                |     0
  0  |     2      |             2                |     0
  1  |     3      |             1                |     2
  1  |     4      |             2                |     2
  0  |     5      |             3                |     2
  1  |     6      |             3                |     3
  0  |     7      |             4                |     3
  0  |     8      |             5                |     3
  0  |     9      |             6                |     3
  0  |     10     |             7                |     3
  1  |     11     |             4                |     7
  0  |     12     |             1                |     0

As you can see, some groups contain both codes!

After this, I do

GROUP BY group_number 
HAVING COUNT(1) > 2

to eliminate small groups, and generate another group number to merge adjacent groups.

How can I group these values homogeneously?

More Details

All records are ordered by ID and by date (the later in descending order) in every ROW_NUMBER().

The final result I expect is the time in minutes of the latest code group for each ID, and merging groups that are separated by less than 3 minutes.

I'm more interested in the groups with the value 1, so if the latest code group is a 0 ( considering it is not a couple minutes long ) then ignore that ID completely.

Usually, there is one record per minute. Sometimes it can take hours without a record for any ID

Best Answer

I've set up next example according to your data and your explanations:

All records are ordered by ID and by date (the later in descending order)

CREATE TABLE tbl (id int, code int, dt datetime);

INSERT INTO tbl VALUES
  (1, 0, '20180112 10:10:40'),
  (1, 0, '20180112 10:09:10'),
  (1, 1, '20180112 10:08:15'),
  (1, 1, '20180112 10:06:00'),
  (1, 0, '20180112 10:05:00'),
  (1, 1, '20180112 10:04:55'),
  (1, 0, '20180112 10:03:40'),
  (1, 0, '20180112 10:03:30'),
  (1, 0, '20180112 10:03:20'),
  (1, 0, '20180112 10:02:10'),
  (1, 1, '20180112 10:01:30'),
  (1, 0, '20180112 10:01:15');

IMHO this is more about GROUPING AND WINDOW than GAPS AND ISLANDS. You can set the groups in this way:

SELECT id, code, dt,
       SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
FROM   (SELECT id, code, dt,
               IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
        FROM   tbl) x;
id | code | dt                  | grp
-: | ---: | :------------------ | --:
 1 |    0 | 12/01/2018 10:10:40 |   1
 1 |    0 | 12/01/2018 10:09:10 |   1
 1 |    1 | 12/01/2018 10:08:15 |   2
 1 |    1 | 12/01/2018 10:06:00 |   2
 1 |    0 | 12/01/2018 10:05:00 |   3
 1 |    1 | 12/01/2018 10:04:55 |   4
 1 |    0 | 12/01/2018 10:03:40 |   5
 1 |    0 | 12/01/2018 10:03:30 |   5
 1 |    0 | 12/01/2018 10:03:20 |   5
 1 |    0 | 12/01/2018 10:02:10 |   5
 1 |    1 | 12/01/2018 10:01:30 |   6
 1 |    0 | 12/01/2018 10:01:15 |   7

Once you have the groups you can calculate time difference of each group:

SELECT id, code, grp, MAX(dt) maxDT, MIN(dt) minDT,
       DATEDIFF(minute, MIN(dt), MAX(dt)) as minutes
FROM (
      SELECT id, code, dt,
             SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
      FROM   (SELECT id, code, dt,
                     IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
              FROM   tbl) x
      ) y
GROUP BY id, code, grp
ORDER BY id, grp;
id | code | grp | maxDT               | minDT               | minutes
-: | ---: | --: | :------------------ | :------------------ | ------:
 1 |    0 |   1 | 12/01/2018 10:10:40 | 12/01/2018 10:09:10 |       1
 1 |    1 |   2 | 12/01/2018 10:08:15 | 12/01/2018 10:06:00 |       2
 1 |    0 |   3 | 12/01/2018 10:05:00 | 12/01/2018 10:05:00 |       0
 1 |    1 |   4 | 12/01/2018 10:04:55 | 12/01/2018 10:04:55 |       0
 1 |    0 |   5 | 12/01/2018 10:03:40 | 12/01/2018 10:02:10 |       1
 1 |    1 |   6 | 12/01/2018 10:01:30 | 12/01/2018 10:01:30 |       0
 1 |    0 |   7 | 12/01/2018 10:01:15 | 12/01/2018 10:01:15 |       0

Then if I understand your comment correctly:

I'm more interested in the groups with the value 1, so if the latest code group is a 0 ( considering it is not a couple minutes long ) then ignore that ID completely.

Rows can be filtered and get last row using:

SELECT TOP 1 
       id, code, grp, MAX(dt) maxDT, MIN(dt) minDT,
       DATEDIFF(minute, MIN(dt), MAX(dt)) as minutes
FROM (
      SELECT id, code, dt,
             SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
      FROM   (SELECT id, code, dt,
                     IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
              FROM   tbl) x
      ) y
GROUP BY id, code, grp
HAVING   (code = 0 AND DATEDIFF(minute, MIN(dt), MAX(dt)) >= 2)
         OR code = 1
ORDER BY id, grp DESC;
id | code | grp | maxDT               | minDT               | minutes
-: | ---: | --: | :------------------ | :------------------ | ------:
 1 |    1 |   6 | 12/01/2018 10:01:30 | 12/01/2018 10:01:30 |       0

dbfiddle here