PostgreSQL – Finding Missing Timestamps Grouped by Key


I have a table that has a timestamp, some data, and an identifying key for the data source:

create table test_data (
    id serial primary key,
    key text,
    timestamp timestamp with time zone    
);
INSERT INTO test_data
    (key, timestamp)
VALUES
    ('Source_A', '2018-03-15 01:07:06.603029+00'),
    ('Source_B', '2018-03-15 10:00:01.603029+00'),
    ('Source_A', '2018-03-15 11:05:06.603029+00'),
    ('Source_B', '2018-03-15 15:09:06.603029+00'),
    ('Source_B', '2018-03-15 16:09:06.603029+00');

I want to find the number of missing hours in the data grouped by each data source. I've got this code that works for a single group:

SELECT 
COUNT(hours)-1 AS missing_hours, 
'Source_A' AS key
FROM GENERATE_SERIES('2018-03-15', '2018-03-16', INTERVAL '1 hour') AS hours
  WHERE hours NOT IN 
  ( SELECT TO_TIMESTAMP(FLOOR((EXTRACT('epoch' FROM timestamp) / 3600 )) * 3600) AS time_bit 
   FROM test_data
   WHERE key = 'Source_A'
   GROUP BY time_bit)

Running this gives me:

missing_hours,  key
22,             Source_A

I'm struggling to figure out how I can group by key and then get the number of missing hours for all data sources:

missing_hours,  key
22,             Source_A
21,             Source_B

Any ideas? This will run on monthly partitioned tables with ~50 million rows each, so I don't want it to be too expensive. The single-key query runs in about 2 seconds.

Best Answer

One approach would be to count the distinct hours present for each key and subtract that from the total hours in the given period.

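WITH period AS (
  -- generate_series includes both endpoints (25 values for one day),
  -- so subtract 1 to match the COUNT(hours)-1 used in the question
  SELECT COUNT(*) - 1 AS total_hours
  FROM GENERATE_SERIES('2018-03-15', '2018-03-16', INTERVAL '1 hour') gs
),
key_counts AS (
  -- number of distinct hours with at least one row, per key
  SELECT key, COUNT(*) AS hours
  FROM (
    SELECT DISTINCT key, date_trunc('hour', timestamp) AS hour_bucket
    FROM test_data
    --apply period limit here
  ) kq
  GROUP BY key
)

SELECT key, total_hours - hours AS missing_hours
FROM period
CROSS JOIN key_counts;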
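
One way the "apply period limit here" placeholder could be filled in is sketched below. This is only an illustration under assumptions: the CTE name bounds and the columns period_start and period_end are made up for the example, and the period is hard-coded to the sample day. It uses COUNT(DISTINCT ...) in place of the nested DISTINCT subquery, which is a variation on the answer, not the answer's exact method.

```sql
-- Sketch: bounds, period_start and period_end are hypothetical names.
WITH bounds AS (
  SELECT TIMESTAMPTZ '2018-03-15 00:00:00+00' AS period_start,
         TIMESTAMPTZ '2018-03-16 00:00:00+00' AS period_end
)
SELECT t.key,
       -- whole hours in the period minus distinct hours that have data
       (EXTRACT(epoch FROM (b.period_end - b.period_start)) / 3600)::int
         - COUNT(DISTINCT date_trunc('hour', t.timestamp)) AS missing_hours
FROM test_data t
CROSS JOIN bounds b
WHERE t.timestamp >= b.period_start   -- half-open range: start inclusive,
  AND t.timestamp <  b.period_end     -- end exclusive
GROUP BY t.key, b.period_start, b.period_end
ORDER BY t.key;
```

For the sample rows in the question this should return 22 for Source_A and 21 for Source_B. A plain range predicate on the timestamp column like this is the kind of condition that lets the planner skip partitions outside the period, assuming the tables are partitioned on that column. Note that a key with no rows at all in the period will not appear in the result, and that date_trunc on a timestamp with time zone truncates according to the session's TimeZone setting.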

Related Solutions

MySQL – Finding Contiguous Ranges in Grouped Data

The answer with the variables is going to be more efficient but here is an answer with pure SQL:

select 
    a.id_user, 
    a.id_ringtype, 
    a.number      as min,
    min(b.number) as max
from 
    rings as a 
  join rings as b 
    on  a.id_user = b.id_user 
    and a.id_ringtype = b.id_ringtype 
    and a.number <= b.number 
where not exists 
      ( select 1 
        from rings as c 
        where c.id_user = a.id_user 
          and c.id_ringtype = a.id_ringtype 
          and c.number = a.number - 1
      )
  and not exists 
      ( select 1 
        from rings as d 
        where d.id_user = b.id_user 
          and d.id_ringtype = b.id_ringtype 
          and d.number = b.number + 1
      ) 
group by 
    a.id_user, 
    a.id_ringtype, 
    a.number ;

Efficiency will depend on many factors (mainly the distribution of the data), but an index on (id_user, id_ringtype, number) is essential for this query.
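
A minimal sketch of such an index in MySQL follows. The index name idx_rings_user_type_number is a placeholder chosen for this example; only the table and column names come from the query above.

```sql
-- Index name is hypothetical; the column order follows the answer's suggestion.
CREATE INDEX idx_rings_user_type_number
    ON rings (id_user, id_ringtype, number);
```

With the two equality columns first and number last, the self-join and both NOT EXISTS lookups (number - 1 and number + 1 within the same user and ring type) can be answered from the index.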

MySQL – Finding Gaps in Date Ranges with Overlapping Dates

If overlaps are only partial (i.e., a range may partially overlap another, but no range is a subset of another range), I think the following query will do what you want:

SELECT t1.office_type_id, t1.state_id, t1.district_id, t1.office_class,
        t1.term_end, MIN(t2.term_begin) next_begin
FROM terms t1 JOIN terms t2  
  ON t1.office_type_id=t2.office_type_id AND 
     t1.state_id=t2.state_id AND 
     (t1.district_id=t2.district_id OR 
       (t1.district_id IS NULL AND t2.district_id IS NULL)) AND 
     t1.office_class=t2.office_class 
WHERE t1.term_begin < t2.term_begin 
GROUP BY t1.office_type_id, t1.state_id, t1.district_id, t1.office_class, 
         t1.term_end 
HAVING t1.term_end < next_begin - INTERVAL 1 DAY;

If ranges may fully overlap, I suggest creating a view that removes such subranges:

CREATE VIEW terms1 AS
SELECT DISTINCT * FROM terms t3 
WHERE (office_type_id, state_id, office_class) NOT IN
     (SELECT office_type_id, state_id, office_class FROM terms t4 
      WHERE ((t4.term_begin < t3.term_begin AND t3.term_end <=t4.term_end) OR 
             (t4.term_begin = t3.term_begin AND t3.term_end < t4.term_end)) 
        AND (t3.district_id = t4.district_id OR 
             (t3.district_id IS NULL and t4.district_id IS NULL)));

Then you can use this view instead of the table in the query above.

(And life would have been a bit simpler if you had picked a value other than NULL for the default district_id.)
