I've got a gaps and islands (GAI) problem that I can't seem to crack to save my life. Most of the GAI problems I've seen involve data in the form of:
[Id] [StartDateTime] [EndDateTime]
But mine in this case comes from GPS breadcrumbs/data points, so we don't have a start and end, just a timestamp.
What I want to do is the following:
Given a dataset that contains a truck number, a location, and a datetime, if a row shares a truck, and location and is within 1 hour of another row, then join together.
Trucks can run 24/7 (team drivers).
My dummy data would look like this (the duplicate rows are intentional as they are a feature of the real data):
DECLARE @TruckLoc TABLE
(
TruckId VARCHAR(50),
LocationName VARCHAR(12),
LocDateTime DATETIME
)
;
INSERT INTO @TruckLoc (TruckId, LocationName, LocDateTime)
VALUES
('MonsterTruck', 'Home', '2019-10-22 10:27:40.000'),
('MonsterTruck', 'Home', '2019-10-22 10:27:40.000'),
('MonsterTruck', 'Home', '2019-10-22 10:33:10.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Home', '2019-10-22 11:22:40.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 12:43:00.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 13:13:00.000'),
('MonsterTruck', 'Gas Station', '2019-10-22 13:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 16:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 16:49:00.000'),
('MonsterTruck', 'Home', '2019-10-22 17:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 20:43:00.000'),
('MonsterTruck', 'Home', '2019-10-22 20:56:00.000')
The output that I'm attempting to get is the following:
Monster Truck | Home | 2019-10-22 12:57:40.000 | 2019-10-22 11:22:40.000
Monster Truck | Gas Station | 2019-10-22 12:43:00.000 | 2019-10-22 13:43:00.000
Monster Truck | Home | 2019-10-22 16:43:00.000 | 2019-10-22 17:43:00.000
Monster Truck | Home | 2019-10-22 20:43:00.000 | 2019-10-22 20:56:00.000
What I've tried:
I've tried everything. In my real statement, I have them ranked by truck and datetime. But that's the furthest success I've had. I've gone up to try to use LAG/JOIN
functions and CTEs but the crux of the problem is that I don't need to look at the previous record, but I need to look at all the previous records that satisfy a condition concerning other previous records i.e each record is within an hour of the preceding one.
Any help on this is greatly appreciated.
Best Answer
This was a tricky one. I've used PostgreSQL for the fiddle - you should be able to "translate" into SQL Server dialect - with a bit of modification, it'll work on any server that has Common Table Expressions.
I'll outline the steps. Full code as well as data are on the fiddle as well as at the end of this answer. There are a few "fossils" - i.e. extra fields that helped in the problem solving process that I haven't deleted, but they should be easy to clear up.
The first thing to do (a bit like relationships) is to establish boundaries.
Find the upper limits of the 1 hour differences between tranches of measurements - start by using the
LEAD()
function:Result (for brevity, only 3 records are shown):
Then, find the gaps which are greater than or equal to 1 hour:
The
OR t1.the_lead IS NULL
test is there because the lastLEAD()
of a series like this is alwaysNULL
.Result:
Inspection shows that this is correct.
Then, you basically do the same again for
LAG()
- the code is in the fiddle (extra fields are "fossils")!So, now you form 2 CTEs (Common Table Expressions) from the two queries above and run the following SQL:
Result:
Which is the desired result! Full SQL below. It requires a bit of tidying up, but I'll leave that to you - I've left the extraneous stuff in so that it will (should!) be easier to follow my thought processes as the solution evolved.
In this thread, even though I didn't answer the question asked, this might prove helpful in getting the hang of
LAG()
andLEAD()
. Also you could take a look at the accepted answer in that same thread.=============== Full SQL and DDL and DML for data ================
SQL:
DDL:
DML: