Postgresql – select nearby rows within a timestamp

postgresqlselect

Edit: added create/insert sql at the bottom

I need to extract data from a table (in postgres), where I want to look at nearby records within a timespan

The table looks like this:

 objectid |    keyword     |    date_created     |       ip        
----------+----------------+---------------------+-----------------
        1 | guard          | 2014-09-03 12:29:00 | 123.123.123.123
        2 | dog            | 2014-09-03 12:30:00 | 123.123.123.123
        3 | police         | 2014-09-03 12:40:00 | 123.123.123.123
        4 | swan           | 2014-09-03 12:45:00 | 123.123.123.123
        5 | goose          | 2014-09-03 12:46:00 | 123.123.123.123
        6 | golf           | 2014-09-04 03:15:00 | 123.123.123.123
        7 | police         | 2014-11-03 12:45:00 | 11.11.11.11
        8 | police hunt    | 2014-11-03 12:46:00 | 11.11.11.11
        9 | police officer | 2014-11-03 12:47:00 | 11.11.11.11
       10 | police dog     | 2014-11-03 12:48:00 | 11.11.11.11
       11 | house          | 2014-11-03 12:58:00 | 11.11.11.11
       12 | police hunt    | 2013-02-03 03:01:00 | 11.11.11.11
       13 | police         | 2013-02-03 03:05:00 | 11.11.11.11
       14 | police office  | 2013-02-03 03:07:00 | 11.11.11.11
       15 | wolf           | 2013-02-03 06:28:00 | 11.11.11.11
       16 | fox            | 2013-02-03 06:48:00 | 11.11.11.11
(16 rows)

I would like to extract nearby records for specific keywords,
e.g. I want to extract the one nearest records for the keyword "police", if the nearest record is < 30 min from the datestamp of police and has the same ip,
so above that would return

    2 | dog            | 2014-09-03 12:30:00 | 123.123.123.123
    3 | police         | 2014-09-03 12:40:00 | 123.123.123.123
    4 | swan           | 2014-09-03 12:45:00 | 123.123.123.123
    7 | police         | 2014-11-03 12:45:00 | 11.11.11.11
    8 | police hunt    | 2014-11-03 12:46:00 | 11.11.11.11
   12 | police hunt    | 2013-02-03 03:01:00 | 11.11.11.11
   13 | police         | 2013-02-03 03:05:00 | 11.11.11.11
   14 | police office  | 2013-02-03 03:07:00 | 11.11.11.11

I am really just interested in the keywords for now, if it makes a difference, so for "police" I be happy with a list:
dog, swan, police hunt, police office

Also, all records (not just the one closest) within the 30 min timespan could be interesting, so i would like to look at that as well, e.g.
for "police" that would return

    1 | guard          | 2014-09-03 12:29:00 | 123.123.123.123
    2 | dog            | 2014-09-03 12:30:00 | 123.123.123.123
    3 | police         | 2014-09-03 12:40:00 | 123.123.123.123
    4 | swan           | 2014-09-03 12:45:00 | 123.123.123.123
    5 | goose          | 2014-09-03 12:46:00 | 123.123.123.123
    7 | police         | 2014-11-03 12:45:00 | 11.11.11.11
    8 | police hunt    | 2014-11-03 12:46:00 | 11.11.11.11
    9 | police officer | 2014-11-03 12:47:00 | 11.11.11.11
   10 | police dog     | 2014-11-03 12:48:00 | 11.11.11.11
   11 | house          | 2014-11-03 12:58:00 | 11.11.11.11
   12 | police hunt    | 2013-02-03 03:01:00 | 11.11.11.11
   13 | police         | 2013-02-03 03:05:00 | 11.11.11.11
   14 | police office  | 2013-02-03 03:07:00 | 11.11.11.11

I have a huge list (10000) of keywords i want to this for, although if that is tricky in sql I could create a separate sql script for each of them.

I am really stumped how to start on this

Create/insert:


drop table tmp_data;
CREATE TABLE tmp_data (
  objectid SERIAL PRIMARY KEY,
  keyword varchar(80) NOT NULL,
  date_created TIMESTAMP without time zone default current_timestamp,
  ip varchar(80) NOT NULL
);

insert into tmp_data (keyword, date_created, ip) values ('guard','2014-09-03 12:29', '123.123.123.123');
insert into tmp_data (keyword, date_created, ip) values ('dog','2014-09-03 12:30', '123.123.123.123');
insert into tmp_data (keyword, date_created, ip) values ('police','2014-09-03 12:40', '123.123.123.123');
insert into tmp_data (keyword, date_created, ip) values ('swan','2014-09-03 12:45', '123.123.123.123');
insert into tmp_data (keyword, date_created, ip) values ('goose','2014-09-03 12:46', '123.123.123.123');
insert into tmp_data (keyword, date_created, ip) values ('golf','2014-09-04 03:15', '123.123.123.123');

insert into tmp_data (keyword, date_created, ip) values ('police','2014-11-03 12:45', '11.11.11.11');
insert into tmp_data (keyword, date_created, ip) values ('police hunt','2014-11-03 12:46', '11.11.11.11');
insert into tmp_data (keyword, date_created, ip) values ('police officer','2014-11-03 12:47', '11.11.11.11');
insert into tmp_data (keyword, date_created, ip) values ('police dog','2014-11-03 12:48', '11.11.11.11');
insert into tmp_data (keyword, date_created, ip) values ('house','2014-11-03 12:58', '11.11.11.11');

insert into tmp_data (keyword, date_created, ip) values ('police hunt','2013-02-03 3:01', '11.11.11.11');
insert into tmp_data (keyword, date_created, ip) values ('police','2013-02-03 3:05', '11.11.11.11');
insert into tmp_data (keyword, date_created, ip) values ('police office','2013-02-03 3:07', '11.11.11.11');
insert into tmp_data (keyword, date_created, ip) values ('wolf','2013-02-03 06:28', '11.11.11.11');
insert into tmp_data (keyword, date_created, ip) values ('fox','2013-02-03 06:48', '11.11.11.11');

Best Answer

For adjacent records:

WITH cte AS ( SELECT objectid, keyword, date_created, ip,
                     LAG(keyword) OVER (PARTITION BY ip ORDER BY date_created) lag_kw,
                     LEAD(keyword) OVER (PARTITION BY ip ORDER BY date_created) lead_kw,
                     LAG(date_created) OVER (PARTITION BY ip ORDER BY date_created) lag_ts,
                     LEAD(date_created) OVER (PARTITION BY ip ORDER BY date_created) lead_ts
              FROM tmp_data )
SELECT DISTINCT objectid, keyword, date_created, ip
FROM cte
WHERE keyword = 'police'
   OR ( lag_kw = 'police' AND lag_ts > date_created - INTERVAL '30 MINUTE')
   OR ( lead_kw = 'police' AND lead_ts < date_created + INTERVAL '30 MINUTE')
ORDER BY objectid;

For all records:

WITH cte AS ( SELECT date_created, ip
              FROM tmp_data
              WHERE keyword = 'police')
SELECT tmp_data.*
FROM tmp_data
JOIN cte ON tmp_data.ip = cte.ip
        AND tmp_data.date_created BETWEEN cte.date_created - INTERVAL '30 MINUTE'
                                      AND cte.date_created + INTERVAL '30 MINUTE'
ORDER BY objectid;

fiddle