Sql-server – Gaps and Islands – Finding the Nearest Island

gaps-and-islandssql serversql-server-2012sql-server-2016

I am working with the following scenario where I have temporal data that falls into islands and gaps. Every once in a while I need to associate an event that falls within an existing gap to its nearest island based on the time of the event.

To demonstrate, let's say I've got the following data defining my time periods:

enter image description here

This data is contiguous except for a gap which exists between IDs 2 and 7, for the time period 2017-07-26 00:03:00 through 2017-07-26 00:07:00.

In order to identify the nearest island, I am currently splitting the gap into two periods as follows:

enter image description here

If I have an event that falls within this gap, the GapWindowStart/End times will determine which island I need to associate the event with. So, for instance, if I have an event that occurs at 2017-07-26 00:03:20, I would associate that event with ID 2 and conversely if I had an event occur at 2017-07-26 00:05:35 I would associated that event with ID 7.

The most efficient way I've been able to code my approach, thus far, is to assemble the gaps using Itzik Ben-Gan's 3rd solution from SQL Server MVP Deep Dives book via the ROW_NUMBER window function and then split the gaps per a CROSS APPLY statement which acts like a simple UNPIVOT operation.

Here is the db<>fiddle plan of the approach I'm using to assemble the nearest island set.

With the nearest islands identified, I use an event's event time to identify the nearest island to associate said event with. Because these islands are volatile throughout the day, I cannot make a static master table, but instead have to rely on constructing everything at run-time when the events are encountered.

Here is a db<>fiddle plan showing what NearestIsland value should be used against a random event time.

Are there any better ways to figure out the nearest island for a given event that would normally fall into a gap? For instance, is there a more efficient method to identify the gaps or a more efficient way to identify the nearest island? Am I even going about this in the best logical way? There's nothing critical about this question, but I'm always trying to figure out if there's a "better" approach to things and I think this problem lends itself to some creativity so I'd love to see other performant options.

The current environment I'm working on is SQL 2012, but we'll be migrating to a SQL 2016 environment shortly, so I'm open to pretty much anything.

Code underlying the second db<>fiddle link is as follows:

-- Creation of Test Data
CREATE TABLE #tmp
(
      ID            INT PRIMARY KEY CLUSTERED
    , WindowStart   DATETIME2
    , WindowEnd     DATETIME2
)

-- Create contiguous data set
INSERT INTO #tmp
SELECT    ID
        , DATEADD(HOUR, ID, CAST('0001-01-01' AS DATETIME2))
        , DATEADD(HOUR, ID + 1, CAST('0001-01-01' AS DATETIME2))
FROM
(
    SELECT TOP (1500000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID
    --SELECT TOP (87591200) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID  -- Swap line with above for larger dataset
    FROM master.sys.configurations t1
    CROSS JOIN master.sys.configurations t2
    CROSS JOIN master.sys.configurations t3
    CROSS JOIN master.sys.configurations t4
    CROSS JOIN master.sys.configurations t5
) x


--DELETE 1000000 random records to create random gaps
DELETE FROM #tmp
WHERE ID IN (
    SELECT TOP 1000000 ID
    --SELECT TOP 77591200 ID -- Swap line with above for larger dataset
    FROM #tmp
    ORDER BY NEWID()
)


-- Create RandomEvent Times
CREATE TABLE #tmpEvent
(
    EventTime DATETIME2
)

INSERT INTO #tmpEvent
SELECT DATEADD(SECOND, X.RandomNum, Y.minWindowEnd) AS EventDate
FROM (VALUES (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))
           , (ABS(CHECKSUM(NEWID())))) AS X(RandomNum)
    CROSS JOIN (SELECT MIN(WindowEnd) AS minWindowEnd FROM #tmp) AS Y


SET STATISTICS XML ON
SET STATISTICS IO ON

--Desired Output Format - Best Execution I've found so far
;WITH rankIslands AS (
    SELECT    ID
            , WindowStart
            , WindowEnd
            , ROW_NUMBER() OVER (ORDER BY WindowStart) AS rnk
    FROM    #tmp
), rankGapsJoined AS (
    SELECT    t1.ID AS NearestIslandID_Lower
            , t1.WindowEnd AS GapStart_Lower
            , DATEADD(MINUTE, (DATEDIFF(MINUTE, t1.WindowEnd, t2.WindowStart) / 2), t1.WindowEnd) AS GapEnd_Lower
            , t2.ID AS NearestIslandID_Higher
            , DATEADD(MINUTE, -1 * (DATEDIFF(MINUTE, t1.WindowEnd, t2.WindowStart) / 2), t2.WindowStart) AS GapStart_Higher
            , t2.WindowStart AS GapEnd_Higher
    FROM rankIslands t1 INNER JOIN rankIslands t2
        ON t1.rnk + 1 = t2.rnk
            AND t1.WindowEnd <> t2.WindowStart
), NearestIsland AS (
    SELECT  xa.*
    FROM    rankGapsJoined t1
            CROSS APPLY ( VALUES (t1.NearestIslandID_Lower, t1.GapStart_Lower, t1.GapEnd_Lower)
                                ,(t1.NearestIslandID_Higher, t1.GapStart_Higher, t1.GapEnd_Higher) ) AS xa (NearestIslandId, GapStart, GapEnd)
)
-- Only return records that fall into the Gaps
SELECT e.EventTime, ni.*
FROM    #tmpEvent e INNER JOIN NearestIsland ni
                ON e.EventTime > ni.GapStart
                AND e.EventTime <= ni.GapEnd

SET STATISTICS XML OFF
SET STATISTICS IO OFF


DROP TABLE #tmp
DROP TABLE #tmpEvent

Questions: (@MaxVernon)

  • Is the desired outcome a table containing the gaps?

  • Or are you attempting to assign incoming rows to the nearest neighbor?

  • Or are you looking to reproduce the exact output you show in your example?

Answer:

In short, Yes, Yes and No. The desired outcome is to identify any (other/more) efficient way to identify the nearest island for an event time that would normally fall within a gap. I tried to expand the question to show what a desirable final outcome would be.

Best Answer

There are a lot of different questions here. When it comes to generating the full result set (the mapping of times to IDs), what you have is the way that I would do it, although I'd add a nonclustered index on WindowStart that includes WindowEnd. SQL Server can scan through the covering index, find the next ID and WindowStart values using LEAD() (or the dual ROW_NUMBER() approach if you prefer), and add two rows using the midway point between times if the next WindowStart doesn't match the current WindowEnd.

I prepped the same data that you did for your "big" data set, but in a different way so it would finish faster on my machine:

CREATE TABLE tmp_181900
(
      ID            INT PRIMARY KEY CLUSTERED
    , WindowStart   DATETIME2
    , WindowEnd     DATETIME2
);

-- Create contiguous data set
INSERT INTO tmp_181900 WITH (TABLOCK)
SELECT    ID
        , DATEADD(HOUR, ID, CAST('0001-01-01' AS DATETIME2))
        , DATEADD(HOUR, ID + 1, CAST('0001-01-01' AS DATETIME2))
FROM
(
    SELECT TOP (87591200) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ID  -- Swap line with above for larger dataset
    FROM master.sys.configurations t1
    CROSS JOIN master.sys.configurations t2
    CROSS JOIN master.sys.configurations t3
    CROSS JOIN master.sys.configurations t4
    CROSS JOIN master.sys.configurations t5
) x;

CREATE TABLE tmp
(
      ID            INT PRIMARY KEY CLUSTERED
    , WindowStart   DATETIME2
    , WindowEnd     DATETIME2
);

-- TABLESAMPLE would be faster, but I assume that you can't randomly sample at the page level
INSERT INTO tmp WITH (TABLOCK)
SELECT *
FROM tmp_181900
WHERE RIGHT(BINARY_CHECKSUM(ID, NEWID()), 3) < 115; -- keep 11.5% of rows

DROP TABLE tmp_181900;

The following code implements the algorithm that I described:

SELECT t2.*
FROM
(
    SELECT 
      ID
    , WindowStart
    , WindowEnd
    , LEAD(ID) OVER (ORDER BY WindowStart) Next_Id
    , LEAD(WindowStart) OVER (ORDER BY WindowStart) Next_WindowStart
    FROM tmp
) t
CROSS APPLY (
    SELECT DATEADD(MINUTE, 0.5 * DATEDIFF(MINUTE, WindowEnd, Next_WindowStart), WindowEnd)
) ca (midpoint_time)
CROSS APPLY (
    SELECT ID, WindowEnd, ca.midpoint_time
    UNION ALL
    SELECT Next_ID, ca.midpoint_time, Next_WindowStart
) t2 (NearestIslandId, GapStart, GapEnd)
WHERE t.WindowStart <> t.Next_WindowStart
    AND t2.GapStart <> t2.GapEnd;

This has a nice, clean plan without sorting that performs similarly to what you have:

DOP 1 plan for all rows

If the requirement is actually to find the nearest island for a small subset of rows, such as ten in your example, then it's possible to write much more efficient code using the index. The idea here is to find the previous and the next row from the table for each row in tmpEvent and to do a little math to find the closest one. If there are N rows in tmpEvent then this code will do at most 2 * N index seeks. It's so fast that STATISTICS TIME can't detect anything:

(10 row(s) affected)

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.

Here's the code that I used, which I think matches your logic pretty well. I commented each piece:

SELECT e.EventTime
, CASE WHEN ca2.use_previous = 1 THEN previous_event.ID ELSE later_event.ID END NEAREST_ID
, CASE WHEN ca2.use_previous = 1 THEN previous_event.WindowStart ELSE later_event.WindowStart END NEAREST_WindowStart
, CASE WHEN ca2.use_previous = 1 THEN previous_event.WindowEnd ELSE later_event.WindowEnd END NEAREST_WindowEnd
FROM tmpEvent e
OUTER APPLY ( -- find the previous island, including exact matches
    SELECT TOP 1 t.ID, t.WindowStart, t.WindowEnd
    FROM tmp t
    WHERE t.WindowStart < e.EventTime
    ORDER BY t.WindowStart DESC
) previous_event 
OUTER APPLY ( -- find the next island
    SELECT TOP 1 t.ID, t.WindowStart, t.WindowEnd
    FROM tmp t
    WHERE previous_event.WindowEnd < e.EventTime -- only do this seek if not an exact match
    AND t.WindowStart >= e.EventTime
    ORDER BY t.WindowStart ASC
) later_event
CROSS APPLY ( -- calculate differences between times so we can reuse them
    SELECT DATEDIFF_BIG(SECOND, previous_event.WindowEnd, e.EventTime) DIFF_S_TO_PREVIOUS
    , DATEDIFF_BIG(SECOND, e.EventTime, later_event.WindowStart) DIFF_S_TO_NEXT
) ca
CROSS APPLY ( -- figure out if the previous event is the closest
    SELECT CASE WHEN 
        ca.DIFF_S_TO_PREVIOUS <= 0 -- the event matches exactly
        OR ca.DIFF_S_TO_NEXT IS NULL -- no ending event
        OR ca.DIFF_S_TO_PREVIOUS < ca.DIFF_S_TO_NEXT -- previous is closer than later
    THEN 1
    ELSE 0
    END
) ca2 (use_previous);

Here's the result set, which will be different for you because we're generating random data:

enter image description here

And here's the query plan:

enter image description here

As another test I put 10k rows in the tmpEvent table and even returned them to the client. On my system this was fine, but of course you could see different performance:

(10000 row(s) affected)

Table 'tmp'. Scan count 18864, logical reads 60419, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'tmpEvent'. Scan count 1, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:

CPU time = 47 ms, elapsed time = 131 ms.