PostgreSQL – Gaps and Islands Across Fields With Start and End Timestamps

I am working with GPS data that I want to group by location (coordinates taken to six decimal places), generating a from and an until time for each group.

For simplicity, the coordinates in this example are pre-formatted string values.

I am using PostgreSQL.

CREATE TABLE locations
(
    location_id serial PRIMARY KEY,
    latitude VARCHAR (10),
    longitude VARCHAR (10),
    generated_at TIMESTAMP
);
INSERT INTO locations (latitude, longitude, generated_at)
VALUES
('51.123456', '-1.123456', '2020-04-27 17:00:00.000'),
('51.123456', '-1.123456', '2020-04-27 17:05:00.000'),
('51.654321', '-1.654321', '2020-04-27 17:10:00.000'),
('51.654321', '-1.654321', '2020-04-27 17:15:00.000'),
('51.654321', '-1.654321', '2020-04-27 17:20:00.000'),
('51.444555', '-1.444555', '2020-04-27 17:25:00.000'),
('51.123456', '-1.123456', '2020-04-27 17:30:00.000'),
('51.123456', '-1.123456', '2020-04-27 17:35:00.000'),
('51.123456', '-1.123456', '2020-04-27 17:40:00.000'),
('51.444555', '-1.444555', '2020-04-27 17:45:00.000'),
('51.444555', '-1.444555', '2020-04-27 17:50:00.000'),
('51.654321', '-1.654321', '2020-04-27 17:55:00.000');

What I would like to get back from the query is:

'51.123456' | '-1.123456' | '2020-04-27 17:00:00.000' | '2020-04-27 17:10:00.000'
'51.654321' | '-1.654321' | '2020-04-27 17:10:00.000' | '2020-04-27 17:25:00.000'
'51.444555' | '-1.444555' | '2020-04-27 17:25:00.000' | '2020-04-27 17:30:00.000'
'51.123456' | '-1.123456' | '2020-04-27 17:30:00.000' | '2020-04-27 17:45:00.000'
'51.444555' | '-1.444555' | '2020-04-27 17:45:00.000' | '2020-04-27 17:55:00.000'
'51.654321' | '-1.654321' | '2020-04-27 17:55:00.000' | 

The idea is to group by 'location', where a new location starts whenever the coordinates change from the previously reported ones. If the coordinates match a location visited earlier, that still counts as a new location.

The third column is the timestamp at which the location was first reported, and the fourth is the timestamp at which the next location was first reported.

I can get part-way there with this:

SELECT
    latitude,
    longitude,
    generated_at
FROM (
    SELECT
        l.*,
        LAG(latitude) OVER w AS lagLatitude,
        LAG(longitude) OVER w AS lagLongitude
    FROM locations l
    WINDOW w AS (ORDER BY generated_at)
) x
WHERE latitude IS DISTINCT FROM lagLatitude
   OR longitude IS DISTINCT FROM lagLongitude;

That gives me the first three columns I need, but not the last one, where the timestamp comes from the next record.

Any help at this point would be wonderful. I am going round in circles: I think it can be done, but I have no idea how.

Best Answer

Inevitably I worked this out after posting and taking a bit of a break. I've also created a dbFiddle for it.

The first step was to tidy up my partial solution so that it uses a common table expression (CTE):

WITH base_cte AS (
    SELECT 
        latitude,
        longitude,
        generated_at AS arrived_at,
        LAG (latitude) OVER w AS lagLatitude,
        LAG (longitude) OVER w AS lagLongitude
    FROM locations
    WINDOW w AS (ORDER BY generated_at)
)
SELECT
    latitude,
    longitude,
    arrived_at
FROM base_cte
WHERE latitude IS DISTINCT FROM lagLatitude
   OR longitude IS DISTINCT FROM lagLongitude;

It then became obvious that I could turn that query into a second CTE and apply LEAD to its output, so that each row picks up the arrival time of the next location:

WITH base_cte AS (
    -- Pair every reading with the coordinates of the reading before it.
    SELECT
        latitude,
        longitude,
        generated_at AS arrived_at,
        LAG(latitude) OVER w AS lagLatitude,
        LAG(longitude) OVER w AS lagLongitude
    FROM locations
    WINDOW w AS (ORDER BY generated_at)
),
grouped_cte AS (
    -- Keep only the first reading at each new location.
    SELECT
        latitude,
        longitude,
        arrived_at
    FROM base_cte
    WHERE latitude IS DISTINCT FROM lagLatitude
       OR longitude IS DISTINCT FROM lagLongitude
)
SELECT
    latitude,
    longitude,
    arrived_at,
    -- The next location's arrival time is this location's departure time.
    LEAD(arrived_at, 1) OVER (ORDER BY arrived_at) AS left_at
FROM grouped_cte;

This outputs:

latitude    longitude   arrived_at              left_at
51.123456   -1.123456   2020-04-27 17:00:00     2020-04-27 17:10:00
51.654321   -1.654321   2020-04-27 17:10:00     2020-04-27 17:25:00
51.444555   -1.444555   2020-04-27 17:25:00     2020-04-27 17:30:00
51.123456   -1.123456   2020-04-27 17:30:00     2020-04-27 17:45:00
51.444555   -1.444555   2020-04-27 17:45:00     2020-04-27 17:55:00
51.654321   -1.654321   2020-04-27 17:55:00
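
If the last reading at each location is also needed, not just the first, one possible variation is the running-sum form of gaps and islands: flag each change of coordinates, sum the flags to number the islands, then aggregate per island. This is only a sketch against the locations table from the question. The names is_new_location, island_id, last_seen_at and readings are my own, and breaking timestamp ties with location_id assumes that the serial column reflects the order in which readings arrived.

```sql
WITH flagged AS (
    -- Mark a reading with 1 when its coordinates differ from the previous reading.
    -- The very first reading is compared against NULL, so it is flagged as well.
    SELECT
        location_id,
        latitude,
        longitude,
        generated_at,
        CASE
            WHEN latitude IS DISTINCT FROM LAG(latitude) OVER w
              OR longitude IS DISTINCT FROM LAG(longitude) OVER w
            THEN 1
            ELSE 0
        END AS is_new_location
    FROM locations
    -- Assumption: location_id is a sensible tie-breaker for equal timestamps.
    WINDOW w AS (ORDER BY generated_at, location_id)
),
islands AS (
    -- A running total of the flags gives every consecutive run its own number.
    SELECT
        latitude,
        longitude,
        generated_at,
        SUM(is_new_location) OVER (ORDER BY generated_at, location_id) AS island_id
    FROM flagged
)
SELECT
    latitude,
    longitude,
    MIN(generated_at) AS arrived_at,
    MAX(generated_at) AS last_seen_at,
    -- Arrival time of the following island; NULL for the final one.
    LEAD(MIN(generated_at)) OVER (ORDER BY MIN(generated_at)) AS left_at,
    COUNT(*) AS readings
FROM islands
GROUP BY island_id, latitude, longitude
ORDER BY arrived_at;
```

On the sample data this should return the same six rows and the same arrived_at and left_at values as above, with two extra columns. The first row, for example, gets last_seen_at 2020-04-27 17:05:00 and readings 2. The LAG/LEAD version is shorter when only the boundaries matter. This form is worth the extra CTE only when per-island aggregates are wanted.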