Sql-server – Gaps and Islands – Finding the Nearest Island

gaps-and-islandssql serversql-server-2012sql-server-2016

I am working with the following scenario where I have temporal data that falls into islands and gaps. Every once in a while I need to associate an event that falls within an existing gap to its nearest island based on the time of the event.

To demonstrate, let's say I've got the following data defining my time periods:

This data is contiguous except for a gap which exists between IDs 2 and 7, for the time period 2017-07-26 00:03:00 through 2017-07-26 00:07:00.

In order to identify the nearest island, I am currently splitting the gap into two periods as follows:

If I have an event that falls within this gap, the GapWindowStart/End times will determine which island I need to associate the event with. So, for instance, if I have an event that occurs at 2017-07-26 00:03:20, I would associate that event with ID 2 and conversely if I had an event occur at 2017-07-26 00:05:35 I would associated that event with ID 7.

The most efficient way I've been able to code my approach, thus far, is to assemble the gaps using Itzik Ben-Gan's 3rd solution from SQL Server MVP Deep Dives book via the ROW_NUMBER window function and then split the gaps per a CROSS APPLY statement which acts like a simple UNPIVOT operation.

Here is the db<>fiddle plan of the approach I'm using to assemble the nearest island set.

With the nearest islands identified, I use an event's event time to identify the nearest island to associate said event with. Because these islands are volatile throughout the day, I cannot make a static master table, but instead have to rely on constructing everything at run-time when the events are encountered.

Here is a db<>fiddle plan showing what NearestIsland value should be used against a random event time.

Are there any better ways to figure out the nearest island for a given event that would normally fall into a gap? For instance, is there a more efficient method to identify the gaps or a more efficient way to identify the nearest island? Am I even going about this in the best logical way? There's nothing critical about this question, but I'm always trying to figure out if there's a "better" approach to things and I think this problem lends itself to some creativity so I'd love to see other performant options.

The current environment I'm working on is SQL 2012, but we'll be migrating to a SQL 2016 environment shortly, so I'm open to pretty much anything.

Code underlying the second db<>fiddle link is as follows:

-- Creation of Test Data
CREATE TABLE #tmp
(
      ID            INT PRIMARY KEY CLUSTERED
    , WindowStart   DATETIME2
    , WindowEnd     DATETIME2
)

-- Create contiguous data set
INSERT INTO #tmp
SELECT    ID
        , DATEADD(HOUR, ID, CAST('0001-01-01' AS DATETIME2))
        , DATEADD(HOUR, ID + 1, CAST('0001-01-01' AS DATETIME2))
FROM
(
    SELECT TOP (1500000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID
    --SELECT TOP (87591200) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID  -- Swap line with above for larger dataset
    FROM master.sys.configurations t1
    CROSS JOIN master.sys.configurations t2
    CROSS JOIN master.sys.configurations t3
    CROSS JOIN master.sys.configurations t4
    CROSS JOIN master.sys.configurations t5
) x


--DELETE 1000000 random records to create random gaps
DELETE FROM #tmp
WHERE ID IN (
    SELECT TOP 1000000 ID
    --SELECT TOP 77591200 ID -- Swap line with above for larger dataset
    FROM #tmp
    ORDER BY NEWID()
)


-- Create RandomEvent Times
CREATE TABLE #tmpEvent
(
    EventTime DATETIME2
)

INSERT INTO #tmpEvent
SELECT DATEADD(SECOND, X.RandomNum, Y.minWindowEnd) AS EventDate
FROM (VALUES (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))) AS X(RandomNum)
    CROSS JOIN (SELECT MIN(WindowEnd) AS minWindowEnd FROM #tmp) AS Y


SET STATISTICS XML ON
SET STATISTICS IO ON

--Desired Output Format - Best Execution I've found so far
;WITH rankIslands AS (
    SELECT    ID
            , WindowStart
            , WindowEnd
            , ROW_NUMBER() OVER (ORDER BY WindowStart) AS rnk
    FROM    #tmp
), rankGapsJoined AS (
    SELECT    t1.ID AS NearestIslandID_Lower
            , t1.WindowEnd AS GapStart_Lower
            , DATEADD(MINUTE, (DATEDIFF(MINUTE, t1.WindowEnd, t2.WindowStart) / 2), t1.WindowEnd) AS GapEnd_Lower
            , t2.ID AS NearestIslandID_Higher
            , DATEADD(MINUTE, -1 * (DATEDIFF(MINUTE, t1.WindowEnd, t2.WindowStart) / 2), t2.WindowStart) AS GapStart_Higher
            , t2.WindowStart AS GapEnd_Higher
    FROM rankIslands t1 INNER JOIN rankIslands t2
        ON t1.rnk + 1 = t2.rnk
            AND t1.WindowEnd <> t2.WindowStart
), NearestIsland AS (
    SELECT  xa.*
    FROM    rankGapsJoined t1
            CROSS APPLY ( VALUES (t1.NearestIslandID_Lower, t1.GapStart_Lower, t1.GapEnd_Lower)
                                ,(t1.NearestIslandID_Higher, t1.GapStart_Higher, t1.GapEnd_Higher) ) AS xa (NearestIslandId, GapStart, GapEnd)
)
-- Only return records that fall into the Gaps
SELECT e.EventTime, ni.*
FROM    #tmpEvent e INNER JOIN NearestIsland ni
                ON e.EventTime > ni.GapStart
                AND e.EventTime <= ni.GapEnd

SET STATISTICS XML OFF
SET STATISTICS IO OFF


DROP TABLE #tmp
DROP TABLE #tmpEvent

Questions: (@MaxVernon)

Is the desired outcome a table containing the gaps?
Or are you attempting to assign incoming rows to the nearest neighbor?
Or are you looking to reproduce the exact output you show in your example?

Answer:

In short, Yes, Yes and No. The desired outcome is to identify any (other/more) efficient way to identify the nearest island for an event time that would normally fall within a gap. I tried to expand the question to show what a desirable final outcome would be.

Best Answer

There are a lot of different questions here. When it comes to generating the full result set (the mapping of times to IDs), what you have is the way that I would do it, although I'd add a nonclustered index on WindowStart that includes WindowEnd. SQL Server can scan through the covering index, find the next ID and WindowStart values using LEAD() (or the dual ROW_NUMBER() approach if you prefer), and add two rows using the midway point between times if the next WindowStart doesn't match the current WindowEnd.

I prepped the same data that you did for your "big" data set, but in a different way so it would finish faster on my machine:

CREATE TABLE tmp_181900
(
      ID            INT PRIMARY KEY CLUSTERED
    , WindowStart   DATETIME2
    , WindowEnd     DATETIME2
);

-- Create contiguous data set
INSERT INTO tmp_181900 WITH (TABLOCK)
SELECT    ID
        , DATEADD(HOUR, ID, CAST('0001-01-01' AS DATETIME2))
        , DATEADD(HOUR, ID + 1, CAST('0001-01-01' AS DATETIME2))
FROM
(
    SELECT TOP (87591200) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID  -- Swap line with above for larger dataset
    FROM master.sys.configurations t1
    CROSS JOIN master.sys.configurations t2
    CROSS JOIN master.sys.configurations t3
    CROSS JOIN master.sys.configurations t4
    CROSS JOIN master.sys.configurations t5
) x;

CREATE TABLE tmp
(
      ID            INT PRIMARY KEY CLUSTERED
    , WindowStart   DATETIME2
    , WindowEnd     DATETIME2
);

-- TABLESAMPLE would be faster, but I assume that you can't randomly sample at the page level
INSERT INTO tmp WITH (TABLOCK)
SELECT *
FROM tmp_181900
WHERE RIGHT(BINARY_CHECKSUM(ID, NEWID()), 3) < 115; -- keep 11.5% of rows

DROP TABLE tmp_181900;

The following code implements the algorithm that I described:

SELECT t2.*
FROM
(
    SELECT 
      ID
    , WindowStart
    , WindowEnd
    , LEAD(ID) OVER (ORDER BY WindowStart) Next_Id
    , LEAD(WindowStart) OVER (ORDER BY WindowStart) Next_WindowStart
    FROM tmp
) t
CROSS APPLY (
    SELECT DATEADD(MINUTE, 0.5 * DATEDIFF(MINUTE, WindowEnd, Next_WindowStart), WindowEnd)
) ca (midpoint_time)
CROSS APPLY (
    SELECT ID, WindowEnd, ca.midpoint_time
    UNION ALL
    SELECT Next_ID, ca.midpoint_time, Next_WindowStart
) t2 (NearestIslandId, GapStart, GapEnd)
WHERE t.WindowStart <> t.Next_WindowStart
    AND t2.GapStart <> t2.GapEnd;

This has a nice, clean plan without sorting that performs similarly to what you have:

If the requirement is actually to find the nearest island for a small subset of rows, such as ten in your example, then it's possible to write much more efficient code using the index. The idea here is to find the previous and the next row from the table for each row in tmpEvent and to do a little math to find the closest one. If there are N rows in tmpEvent then this code will do at most 2 * N index seeks. It's so fast that STATISTICS TIME can't detect anything:

(10 row(s) affected)

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.

Here's the code that I used, which I think matches your logic pretty well. I commented each piece:

SELECT e.EventTime
, CASE WHEN ca2.use_previous = 1 THEN previous_event.ID ELSE later_event.ID END NEAREST_ID
, CASE WHEN ca2.use_previous = 1 THEN previous_event.WindowStart ELSE later_event.WindowStart END NEAREST_WindowStart
, CASE WHEN ca2.use_previous = 1 THEN previous_event.WindowEnd ELSE later_event.WindowEnd END NEAREST_WindowEnd
FROM tmpEvent e
OUTER APPLY ( -- find the previous island, including exact matches
    SELECT TOP 1 t.ID, t.WindowStart, t.WindowEnd
    FROM tmp t
    WHERE t.WindowStart < e.EventTime
    ORDER BY t.WindowStart DESC
) previous_event 
OUTER APPLY ( -- find the next island
    SELECT TOP 1 t.ID, t.WindowStart, t.WindowEnd
    FROM tmp t
    WHERE previous_event.WindowEnd < e.EventTime -- only do this seek if not an exact match
    AND t.WindowStart >= e.EventTime
    ORDER BY t.WindowStart ASC
) later_event
CROSS APPLY ( -- calculate differences between times so we can reuse them
    SELECT DATEDIFF_BIG(SECOND, previous_event.WindowEnd, e.EventTime) DIFF_S_TO_PREVIOUS
    , DATEDIFF_BIG(SECOND, e.EventTime, later_event.WindowStart) DIFF_S_TO_NEXT
) ca
CROSS APPLY ( -- figure out if the previous event is the closest
    SELECT CASE WHEN 
        ca.DIFF_S_TO_PREVIOUS <= 0 -- the event matches exactly
        OR ca.DIFF_S_TO_NEXT IS NULL -- no ending event
        OR ca.DIFF_S_TO_PREVIOUS < ca.DIFF_S_TO_NEXT -- previous is closer than later
    THEN 1
    ELSE 0
    END
) ca2 (use_previous);

Here's the result set, which will be different for you because we're generating random data:

And here's the query plan:

As another test I put 10k rows in the tmpEvent table and even returned them to the client. On my system this was fine, but of course you could see different performance:

(10000 row(s) affected)

Table 'tmp'. Scan count 18864, logical reads 60419, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'tmpEvent'. Scan count 1, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 47 ms, elapsed time = 131 ms.

Related Solutions

Sql-server – Gaps and islands: client solution vs T-SQL query

And a 1 second solution...

;WITH cteSource(StartedAt, FinishedAt)
AS (
    SELECT      s.StartedAt,
            e.FinishedAt
    FROM        (
                SELECT  StartedAt,
                    ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
                FROM    dbo.Tasks
            ) AS s
    INNER JOIN  (
                SELECT  FinishedAt,
                    ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
                FROM    dbo.Tasks
            ) AS e ON e.rn = s.rn
    WHERE       s.StartedAt > e.FinishedAt

    UNION ALL

    SELECT  MIN(StartedAt),
        MAX(FinishedAt)
    FROM    dbo.Tasks
), cteGrouped(theTime, grp)
AS (
    SELECT  u.theTime,
        (ROW_NUMBER() OVER (ORDER BY u.theTime) - 1) / 2
    FROM    cteSource AS s
    UNPIVOT (
            theTime
            FOR theColumn IN (s.StartedAt, s.FinishedAt)
        ) AS u
)
SELECT      MIN(theTime),
        MAX(theTime)
FROM        cteGrouped
GROUP BY    grp
ORDER BY    grp

Mysql – identify gaps and islands

There's probably something more elegant than this, but since you are doing a kind of pivot (transforming from columns to rows, i.e. metadata to data) it is never going to be pretty. Assuming a table like:

create table t
( ts timestamp not null
, EngOilP_sd smallint not null
, CompOilLVL_sd smallint not null
) engine = innodb;

insert into t (ts, EngOilP_sd, CompOilLVL_sd)
values 
 ('2015-06-24 20:28:07',0 ,0 )
,('2015-06-24 20:30:20',1 ,0 )
,('2015-06-24 20:36:47',1 ,0 )
,('2015-06-24 20:41:11',1 ,0 )
,('2015-06-24 20:43:29',1 ,0 )
,('2015-06-24 20:45:42',0 ,0 )
,('2015-06-24 20:47:51',0 ,0 )
,('2015-06-24 20:49:59',0 ,1 )
,('2015-06-24 20:52:01',0 ,1 )
,('2015-06-24 20:54:17',0 ,0 );

Since we are going to reference the "pivot" several times we create a view for it:

create view t_pivot as 
    select 'EngOilP_sd' as sd, ts, EngOilP_sd as val from t 
    union all 
    select 'CompOilLVL_sd', ts, CompOilLVL_sd from t;

The start time (using @Pauls comment as the definition) is the earliest time where each attribute is 1:

select sd, min(case val when 1 then ts end) as start
from t_pivot     
group by sd

The stop time is the earliest time after that where each attribute is 0:

select x.sd, x.start, min(case y.val when 0 then y.ts end) as stop
from (
    select sd, min(case val when 1 then ts end) as start
    from t_pivot     
    group by sd 
) as x 
join t_pivot as y
    on x.sd = y.sd
   and y.ts > x.start 
group by x.sd, x.start

Adding another level of nesting makes it convenient to determine the duration:

select sd, timediff(stop, start) as duration, start, stop 
from ( 
    select x.sd, x.start, min(case y.val when 0 then y.ts end) as stop
    from (
        select sd, min(case val when 1 then ts end) as start
        from t_pivot     
        group by sd 
    ) as x 
    join t_pivot as y
        on x.sd = y.sd
       and y.ts > x.start 
    group by x.sd, x.start 
) as z 
order by sd desc;

+---------------+----------+---------------------+---------------------+
| sd            | duration | start               | stop                |
+---------------+----------+---------------------+---------------------+
| EngOilP_sd    | 00:15:22 | 2015-06-24 20:30:20 | 2015-06-24 20:45:42 |
| CompOilLVL_sd | 00:04:18 | 2015-06-24 20:49:59 | 2015-06-24 20:54:17 |
+---------------+----------+---------------------+---------------------+
2 rows in set (0.00 sec)

Best Answer

Related Solutions

Sql-server – Gaps and islands: client solution vs T-SQL query

Mysql – identify gaps and islands

Related Question