Sql-server – Gaps and islands with GPS breadcrumb data

gaps-and-islandssql serversql-server-2017

I've got a gaps and islands (GAI) problem that I can't seem to crack to save my life. Most of the GAI problems I've seen involve data in the form of:

[Id] [StartDateTime] [EndDateTime]

But mine in this case comes from GPS breadcrumbs/data points, so we don't have a start and end, just a timestamp.

What I want to do is the following:

Given a dataset that contains a truck number, a location, and a datetime, if a row shares a truck, and location and is within 1 hour of another row, then join together.

Trucks can run 24/7 (team drivers).

My dummy data would look like this (the duplicate rows are intentional as they are a feature of the real data):

DECLARE @TruckLoc TABLE
(
    TruckId VARCHAR(50),
    LocationName VARCHAR(12),
    LocDateTime DATETIME
)
;

INSERT INTO @TruckLoc (TruckId, LocationName, LocDateTime)
VALUES
('MonsterTruck', 'Home', '2019-10-22 10:27:40.000'),
('MonsterTruck', 'Home', '2019-10-22 10:27:40.000'),
('MonsterTruck', 'Home', '2019-10-22 10:33:10.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 12:43:00.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 13:13:00.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 13:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 16:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 16:49:00.000'),
('MonsterTruck', 'Home', '2019-10-22 17:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 20:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 20:56:00.000')

The output that I'm attempting to get is the following:

Monster Truck | Home | 2019-10-22 12:57:40.000 | 2019-10-22 11:22:40.000
Monster Truck | Gas Station | 2019-10-22 12:43:00.000 | 2019-10-22 13:43:00.000
Monster Truck | Home | 2019-10-22 16:43:00.000 | 2019-10-22 17:43:00.000
Monster Truck | Home | 2019-10-22 20:43:00.000 | 2019-10-22 20:56:00.000

What I've tried:

I've tried everything. In my real statement, I have them ranked by truck and datetime. But that's the furthest success I've had. I've gone up to try to use LAG/JOIN functions and CTEs but the crux of the problem is that I don't need to look at the previous record, but I need to look at all the previous records that satisfy a condition concerning other previous records i.e each record is within an hour of the preceding one.

Any help on this is greatly appreciated.

Best Answer

This was a tricky one. I've used PostgreSQL for the fiddle - you should be able to "translate" into SQL Server dialect - with a bit of modification, it'll work on any server that has Common Table Expressions.

I'll outline the steps. Full code as well as data are on the fiddle as well as at the end of this answer. There are a few "fossils" - i.e. extra fields that helped in the problem solving process that I haven't deleted, but they should be easy to clear up.

The first thing to do (a bit like relationships) is to establish boundaries.

Find the upper limits of the 1 hour differences between tranches of measurements - start by using the LEAD() function:

SELECT 
  truck_id, 
  location_name, 
  loc_date_time,
  LEAD (loc_date_time, 1) OVER (PARTITION BY truck_id, location_name ORDER BY loc_date_time) AS the_lead
FROM truck_loc
ORDER BY loc_date_time;

Result (for brevity, only 3 records are shown):

truck_id      location_name loc_date_time           the_lead
MonsterTruck       Home 2019-10-22 10:27:40 2019-10-22 10:27:40
MonsterTruck       Home 2019-10-22 10:27:40 2019-10-22 10:33:10
MonsterTruck       Home 2019-10-22 10:33:10 2019-10-22 11:22:40

Then, find the gaps which are greater than or equal to 1 hour:

SELECT t1.truck_id, t1.location_name, t1.loc_date_time
FROM
(
  SELECT 
    truck_id, 
    location_name, 
    loc_date_time,
    LEAD (loc_date_time, 1) OVER (PARTITION BY truck_id ORDER BY loc_date_time) AS the_lead
  FROM truck_loc
) AS t1
WHERE t1.the_lead >= t1.loc_date_time + INTERVAL '1 HOUR'
OR t1.the_lead IS NULL
ORDER BY t1.loc_date_time;

The OR t1.the_lead IS NULL test is there because the last LEAD() of a series like this is always NULL.

Result:

truck_id        location_name   loc_date_time
MonsterTruck             Home       2019-10-22 11:22:40
MonsterTruck      Gas Station       2019-10-22 13:43:00
MonsterTruck             Home       2019-10-22 17:43:00
MonsterTruck             Home       2019-10-22 20:56:00

Inspection shows that this is correct.

Then, you basically do the same again for LAG() - the code is in the fiddle (extra fields are "fossils")!

substr  location_name   loc_date_time           t           tlag            diff
Mon          Home   2019-10-22 10:27:40 1571740060      
Mon   Gas Station   2019-10-22 12:43:00 1571748180      
Mon          Home   2019-10-22 16:43:00 1571762580  1571743360  19220
Mon          Home   2019-10-22 20:43:00 1571776980  1571766180  10800

So, now you form 2 CTEs (Common Table Expressions) from the two queries above and run the following SQL:

SELECT 
  SUBSTR(c1.truck_id, 1, 3) AS tid_1, SUBSTR(c2.truck_id, 1, 3) AS tid_2, 
  SUBSTR(c1.location_name, 1, 3) as loc_1, SUBSTR(c2.location_name, 1, 3) AS loc_2,
  c2.loc_date_time AS ldt_1, c1.loc_date_time AS ldt_2,
  (EXTRACT(EPOCH FROM c1.loc_date_time) - EXTRACT(EPOCH FROM c2.loc_date_time)) AS diff
FROM cte1 c1
JOIN cte2 c2
  ON c1.truck_id = c2.truck_id AND
     c1.location_name = c2.location_name
WHERE ABS(EXTRACT(EPOCH FROM c1.loc_date_time) - EXTRACT(EPOCH FROM c2.loc_date_time)) <= 3600
ORDER BY c2.loc_date_time;

Result:

tid_1   tid_2   loc_1   loc_2    ldt_1                   ldt_2                  diff
  Mon     Mon     Hom     Hom    2019-10-22 10:27:40     2019-10-22 11:22:40    3300
  Mon     Mon     Gas     Gas   2019-10-22 12:43:00     2019-10-22 13:43:00 3600
  Mon     Mon     Hom     Hom   2019-10-22 16:43:00     2019-10-22 17:43:00 3600
  Mon     Mon     Hom     Hom   2019-10-22 20:43:00     2019-10-22 20:56:00 780

Which is the desired result! Full SQL below. It requires a bit of tidying up, but I'll leave that to you - I've left the extraneous stuff in so that it will (should!) be easier to follow my thought processes as the solution evolved.

In this thread, even though I didn't answer the question asked, this might prove helpful in getting the hang of LAG() and LEAD(). Also you could take a look at the accepted answer in that same thread.

=============== Full SQL and DDL and DML for data ================

SQL:

WITH cte1 AS
(
  SELECT t1.truck_id, t1.location_name, t1.loc_date_time
  FROM
  (
    SELECT 
      truck_id, 
      location_name, 
      loc_date_time,
      LEAD (loc_date_time, 1) OVER (PARTITION BY truck_id ORDER BY loc_date_time) AS the_lead
    FROM truck_loc
  ) AS t1
  WHERE t1.the_lead >= t1.loc_date_time + INTERVAL '1 HOUR' -- https://stackoverflow.com/a/13828231/470530
  OR t1.the_lead IS NULL
  ORDER BY t1.loc_date_time
),
cte2 AS
(
  SELECT t1.truck_id, t1.location_name, t1.loc_date_time,
         EXTRACT(EPOCH FROM t1.loc_date_time) AS t, EXTRACT(EPOCH FROM t1.the_lag) AS tlag,
         EXTRACT(EPOCH FROM t1.loc_date_time) - EXTRACT(EPOCH FROM t1.the_lag) AS diff
  FROM
  (
    SELECT 
      truck_id, 
      location_name, 
      loc_date_time,
      LAG (loc_date_time, 1) OVER (PARTITION BY truck_id, location_name ORDER BY loc_date_time) AS the_lag
    FROM truck_loc
  ) AS t1
  WHERE t1.the_lag <= t1.loc_date_time - INTERVAL '1 HOUR' -- https://stackoverflow.com/a/13828231/470530
  OR t1.the_lag IS NULL
  ORDER BY t1.loc_date_time
)
SELECT 
  SUBSTR(c1.truck_id, 1, 3) AS tid_1, SUBSTR(c2.truck_id, 1, 3) AS tid_2, 
  SUBSTR(c1.location_name, 1, 3) as loc_1, SUBSTR(c2.location_name, 1, 3) AS loc_2,
  c2.loc_date_time AS ldt_1, c1.loc_date_time AS ldt_2,
  (EXTRACT(EPOCH FROM c1.loc_date_time) - EXTRACT(EPOCH FROM c2.loc_date_time)) AS diff
FROM cte1 c1
JOIN cte2 c2
  ON c1.truck_id = c2.truck_id AND
     c1.location_name = c2.location_name
WHERE ABS(EXTRACT(EPOCH FROM c1.loc_date_time) - EXTRACT(EPOCH FROM c2.loc_date_time)) <= 3600
ORDER BY c2.loc_date_time;

DDL:

CREATE TABLE truck_loc
(
    truck_id VARCHAR(50),
    location_name VARCHAR(12),
    loc_date_time TIMESTAMP
);

DML:

INSERT INTO truck_loc (truck_id, location_name, loc_date_time)
VALUES
('MonsterTruck', 'Home', '2019-10-22 10:27:40.000'),
('MonsterTruck', 'Home', '2019-10-22 10:27:40.000'),
('MonsterTruck', 'Home', '2019-10-22 10:33:10.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 12:43:00.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 13:13:00.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 13:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 16:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 16:49:00.000'),
('MonsterTruck', 'Home', '2019-10-22 17:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 20:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 20:56:00.000');