Sql-server – How to group records by a two-value column ignoring small gaps

gaps-and-islandssql serversql-server-2008

I have a subquery with a column named code that only has 1 and 0 values and an ID column because I query a lot of codes for each ID. I want to group by said column (and by ID) and by ocurrence but want to ignore groups with less than 3 records, so that bigger groups get merged.

I use ROW_NUMBER() - ROW_NUMBER() over (partition ...) to generate a group number which I use to group records by, but sometimes the records get mixed up caused by a coincidence in the sequence of the row number and the partitioned row number. This is an example

code | row_number | row_number_partition_by_code | group_number
------------------------------------------------------------------
  0  |     1      |             1                |     0
  0  |     2      |             2                |     0
  1  |     3      |             1                |     2
  1  |     4      |             2                |     2
  0  |     5      |             3                |     2
  1  |     6      |             3                |     3
  0  |     7      |             4                |     3
  0  |     8      |             5                |     3
  0  |     9      |             6                |     3
  0  |     10     |             7                |     3
  1  |     11     |             4                |     7
  0  |     12     |             1                |     0

As you can see, some groups contain both codes!

After this, I do

GROUP BY group_number 
HAVING COUNT(1) > 2

to eliminate small groups, and generate another group number to merge adjacent groups.

How can I group these values homogeneously?

More Details

All records are ordered by ID and by date (the later in descending order) in every ROW_NUMBER().

The final result I expect is the time in minutes of the latest code group for each ID, and merging groups that are separated by less than 3 minutes.

I'm more interested in the groups with the value 1, so if the latest code group is a 0 ( considering it is not a couple minutes long ) then ignore that ID completely.

Usually, there is one record per minute. Sometimes it can take hours without a record for any ID

Best Answer

I've set up next example according to your data and your explanations:

All records are ordered by ID and by date (the later in descending order)

CREATE TABLE tbl (id int, code int, dt datetime);

INSERT INTO tbl VALUES
  (1, 0, '20180112 10:10:40'),
  (1, 0, '20180112 10:09:10'),
  (1, 1, '20180112 10:08:15'),
  (1, 1, '20180112 10:06:00'),
  (1, 0, '20180112 10:05:00'),
  (1, 1, '20180112 10:04:55'),
  (1, 0, '20180112 10:03:40'),
  (1, 0, '20180112 10:03:30'),
  (1, 0, '20180112 10:03:20'),
  (1, 0, '20180112 10:02:10'),
  (1, 1, '20180112 10:01:30'),
  (1, 0, '20180112 10:01:15');

IMHO this is more about GROUPING AND WINDOW than GAPS AND ISLANDS. You can set the groups in this way:

SELECT id, code, dt,
       SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
FROM   (SELECT id, code, dt,
               IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
        FROM   tbl) x;

id | code | dt                  | grp
-: | ---: | :------------------ | --:
 1 |    0 | 12/01/2018 10:10:40 |   1
 1 |    0 | 12/01/2018 10:09:10 |   1
 1 |    1 | 12/01/2018 10:08:15 |   2
 1 |    1 | 12/01/2018 10:06:00 |   2
 1 |    0 | 12/01/2018 10:05:00 |   3
 1 |    1 | 12/01/2018 10:04:55 |   4
 1 |    0 | 12/01/2018 10:03:40 |   5
 1 |    0 | 12/01/2018 10:03:30 |   5
 1 |    0 | 12/01/2018 10:03:20 |   5
 1 |    0 | 12/01/2018 10:02:10 |   5
 1 |    1 | 12/01/2018 10:01:30 |   6
 1 |    0 | 12/01/2018 10:01:15 |   7

Once you have the groups you can calculate time difference of each group:

SELECT id, code, grp, MAX(dt) maxDT, MIN(dt) minDT,
       DATEDIFF(minute, MIN(dt), MAX(dt)) as minutes
FROM (
      SELECT id, code, dt,
             SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
      FROM   (SELECT id, code, dt,
                     IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
              FROM   tbl) x
      ) y
GROUP BY id, code, grp
ORDER BY id, grp;

id | code | grp | maxDT               | minDT               | minutes
-: | ---: | --: | :------------------ | :------------------ | ------:
 1 |    0 |   1 | 12/01/2018 10:10:40 | 12/01/2018 10:09:10 |       1
 1 |    1 |   2 | 12/01/2018 10:08:15 | 12/01/2018 10:06:00 |       2
 1 |    0 |   3 | 12/01/2018 10:05:00 | 12/01/2018 10:05:00 |       0
 1 |    1 |   4 | 12/01/2018 10:04:55 | 12/01/2018 10:04:55 |       0
 1 |    0 |   5 | 12/01/2018 10:03:40 | 12/01/2018 10:02:10 |       1
 1 |    1 |   6 | 12/01/2018 10:01:30 | 12/01/2018 10:01:30 |       0
 1 |    0 |   7 | 12/01/2018 10:01:15 | 12/01/2018 10:01:15 |       0

Then if I understand your comment correctly:

I'm more interested in the groups with the value 1, so if the latest code group is a 0 ( considering it is not a couple minutes long ) then ignore that ID completely.

Rows can be filtered and get last row using:

SELECT TOP 1 
       id, code, grp, MAX(dt) maxDT, MIN(dt) minDT,
       DATEDIFF(minute, MIN(dt), MAX(dt)) as minutes
FROM (
      SELECT id, code, dt,
             SUM(rst) OVER (PARTITION BY id ORDER BY id, dt DESC) grp
      FROM   (SELECT id, code, dt,
                     IIF(COALESCE(LAG(code) OVER (PARTITION BY id ORDER BY id, dt DESC), -1) <> code, 1, 0) rst
              FROM   tbl) x
      ) y
GROUP BY id, code, grp
HAVING   (code = 0 AND DATEDIFF(minute, MIN(dt), MAX(dt)) >= 2)
         OR code = 1
ORDER BY id, grp DESC;

id | code | grp | maxDT               | minDT               | minutes
-: | ---: | --: | :------------------ | :------------------ | ------:
 1 |    1 |   6 | 12/01/2018 10:01:30 | 12/01/2018 10:01:30 |       0

dbfiddle here

Related Solutions

Sql-server – How to write windowing query which sums a column to create discrete buckets

I am not sure what type of performance you are looking for, but if CLR or external app is not an option, a cursor is all that is left. On my aged laptop I get through 1,000,000 rows in about 100 seconds using the following solution. The nice thing about it is that it scales linearly, so I would be looking at a little about 20 minutes to run through the entire thing. With a decent server you will be faster, but not an order of magnitude, so it would still take several minutes to complete this. If this is a one off process, you probably can afford the slowness. If you need to run this as a report or similar regularly, you might want to store the values in the same table un update them as new rows get added, e.g. in a trigger.

Anyway, here is the code:

IF OBJECT_ID('dbo.MyTable') IS NOT NULL DROP TABLE dbo.MyTable;

CREATE TABLE dbo.MyTable(
 Id INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,
 v NUMERIC(5,3) DEFAULT ABS(CHECKSUM(NEWID())%100)/100.0
);


MERGE dbo.MyTable T
USING (SELECT TOP(1000000) 1 X FROM sys.system_internals_partition_columns A,sys.system_internals_partition_columns B,sys.system_internals_partition_columns C,sys.system_internals_partition_columns D)X
ON(1=0)
WHEN NOT MATCHED THEN
INSERT DEFAULT VALUES;

--SELECT * FROM dbo.MyTable

DECLARE @st DATETIME2 = SYSUTCDATETIME();
DECLARE cur CURSOR FAST_FORWARD FOR
  SELECT Id,v FROM dbo.MyTable
  ORDER BY Id;

DECLARE @id INT;
DECLARE @v NUMERIC(5,3);
DECLARE @running_total NUMERIC(6,3) = 0;
DECLARE @bucket INT = 1;

CREATE TABLE #t(
 id INT PRIMARY KEY CLUSTERED,
 v NUMERIC(5,3),
 bucket INT,
 running_total NUMERIC(6,3)
);

OPEN cur;
WHILE(1=1)
BEGIN
  FETCH NEXT FROM cur INTO @id,@v;
  IF(@@FETCH_STATUS <> 0) BREAK;
  IF(@running_total + @v > 1)
  BEGIN
    SET @running_total = 0;
    SET @bucket += 1;
  END;
  SET @running_total += @v;
  INSERT INTO #t(id,v,bucket,running_total)
  VALUES(@id,@v,@bucket, @running_total);
END;
CLOSE cur;
DEALLOCATE cur;
SELECT DATEDIFF(SECOND,@st,SYSUTCDATETIME());
SELECT * FROM #t;

GO 
DROP TABLE #t;

It drops and recreates the table MyTable, fills it with 1000000 rows and then goes to work.

The cursor copies each row into a temp table while running the calculations. At the end the select returns the calculated results. You might be a little faster if you don't copy the data around but do an in-place update instead.

If you have an option to upgrade to SQL 2012 you can look at the new window-spool supported moving window aggregates, that should give you better performance.

On a side note, if you have an assembly installed with permission_set=safe, you can do more bad stuff to a server with standard T-SQL than with the assembly, so I would keep working on removing that barrier - You have a good use case here where CLR really would help you.

Sql-server – How to merge similar records with different validity dates

If this is a table of back-to-back ranges only, your case can be treated as a classic "gaps and islands" problem, where you just need to isolate islands of consecutive ranges and then "condense" them by taking the minimum [from] and the maximum [to] per island.

There is an established method of solving this using two ROW_NUMBER calls:

WITH islands AS
(
  SELECT
    id,
    data,
    [from],
    [to],
    island = ROW_NUMBER() OVER (PARTITION BY id       ORDER BY [from])
           - ROW_NUMBER() OVER (PARTITION BY id, data ORDER BY [from])
  FROM
    #mergeTest
)
SELECT
  id,
  data,
  [from] = MIN([from]),
  [to]   = MAX([to])
FROM
  islands
GROUP BY
  id,
  data,
  island
;

This query will work in as low version as SQL Server 2005.

Best Answer

Related Solutions

Sql-server – How to write windowing query which sums a column to create discrete buckets

Sql-server – How to merge similar records with different validity dates

Related Question