PostgreSQL – Finding Missing Timestamps Grouped by Key

Tags: gaps-and-islands, group by, performance, postgresql, query-performance

I have a table that has a timestamp, some data, and an identifying key for the data source:

create table test_data (
    id serial primary key,
    key text,
    timestamp timestamp with time zone    
);
INSERT INTO test_data
    (key, timestamp)
VALUES
    ('Source_A', '2018-03-15 01:07:06.603029+00'),
    ('Source_B', '2018-03-15 10:00:01.603029+00'),
    ('Source_A', '2018-03-15 11:05:06.603029+00'),
    ('Source_B', '2018-03-15 15:09:06.603029+00'),
    ('Source_B', '2018-03-15 16:09:06.603029+00');

I want to find the number of missing hours in the data grouped by each data source. I've got this code that works for a single group:

SELECT
    COUNT(hours) - 1 AS missing_hours,
    'Source_A' AS key
FROM GENERATE_SERIES('2018-03-15', '2018-03-16', INTERVAL '1 hour') AS hours
WHERE hours NOT IN (
    SELECT TO_TIMESTAMP(FLOOR((EXTRACT('epoch' FROM timestamp) / 3600)) * 3600) AS time_bit
    FROM test_data
    WHERE key = 'Source_A'
    GROUP BY time_bit
);

Running this gives me:

missing_hours,  key
22,             Source_A

I'm struggling to figure out how I can group by key and then get the number of missing hours for all data sources:

missing_hours,  key
22,             Source_A
21,             Source_B 

Any ideas? This will be running on monthly partitioned tables with ~50 million rows each, so I don't want it to be too expensive. The single-key query runs in about 2 seconds.

Best Answer

One approach would be to count the distinct hours present for each key and subtract that from the total number of hours in the given period.

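WITH period AS (
  -- generate_series includes both endpoints (25 values for one day),
  -- so subtract 1 to get 24 hourly buckets, as COUNT(hours)-1 does in the question
  SELECT COUNT(*) - 1 AS total_hours
  FROM GENERATE_SERIES('2018-03-15', '2018-03-16', INTERVAL '1 hour') gs
),
key_counts AS (
  -- number of distinct hourly buckets that contain data, per key
  SELECT key, COUNT(*) AS hours
  FROM (
    SELECT DISTINCT key, date_trunc('hour', timestamp)
    FROM test_data
    --apply period limit here
  ) kq
  GROUP BY key
)

SELECT key, total_hours - hours AS missing_hours
FROM
  period,
  key_counts;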
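
For the sample data, the period limit that is only hinted at in the comment could be filled in roughly like this. It is a sketch, not a tested production query: the `bounds` CTE and its columns `period_start` and `period_end` are names made up for illustration, and a half-open range is assumed so that one day yields exactly 24 hourly buckets.

```sql
WITH bounds AS (
  -- hypothetical helper: the period to check, as a half-open range [start, end)
  SELECT TIMESTAMPTZ '2018-03-15 00:00+00' AS period_start,
         TIMESTAMPTZ '2018-03-16 00:00+00' AS period_end
),
key_counts AS (
  -- distinct hourly buckets with at least one row, per key, inside the period
  SELECT t.key,
         COUNT(DISTINCT date_trunc('hour', t.timestamp)) AS hours
  FROM test_data t
  CROSS JOIN bounds b
  WHERE t.timestamp >= b.period_start
    AND t.timestamp <  b.period_end
  GROUP BY t.key
)
SELECT k.key,
       -- total hours in the period minus the hours that have data
       (EXTRACT(epoch FROM (b.period_end - b.period_start)) / 3600)::int
         - k.hours AS missing_hours
FROM key_counts k
CROSS JOIN bounds b
ORDER BY k.key;
```

With the five sample rows this should return 22 for Source_A and 21 for Source_B. Note that `date_trunc` on a `timestamp with time zone` truncates in the session time zone, and that a key with no rows at all in the period does not appear in the result.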

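
An alternative sketch, closer to the query in the question, pairs every key with every hourly bucket and counts the buckets that have no matching row. The CTE names `hour_buckets` and `all_keys` are hypothetical, and the series deliberately ends at 23:00 so that it produces 24 bucket start times and needs no `- 1` correction. How it performs on 50 million rows has not been measured here; an index on `(key, timestamp)` would probably help the `NOT EXISTS` probe, but that is an assumption.

```sql
WITH hour_buckets AS (
  -- 24 bucket start times: 00:00 through 23:00 on 2018-03-15 (UTC)
  SELECT gs AS hour_start
  FROM generate_series(TIMESTAMPTZ '2018-03-15 00:00+00',
                       TIMESTAMPTZ '2018-03-15 23:00+00',
                       INTERVAL '1 hour') AS gs
),
all_keys AS (
  SELECT DISTINCT key FROM test_data
)
SELECT k.key, COUNT(*) AS missing_hours
FROM all_keys k
CROSS JOIN hour_buckets h
WHERE NOT EXISTS (
  -- keep the (key, hour) pair only if no row falls inside that hour
  SELECT 1
  FROM test_data t
  WHERE t.key = k.key
    AND t.timestamp >= h.hour_start
    AND t.timestamp <  h.hour_start + INTERVAL '1 hour'
)
GROUP BY k.key
ORDER BY k.key;
```

For the sample data this should also give 22 for Source_A and 21 for Source_B. Dropping the `GROUP BY` and selecting `k.key, h.hour_start` instead lists the missing hours themselves. A key with data in every hour produces no row at all rather than a row with 0.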