PostgreSQL – How to Calculate Time Delta Between Two Related Rows

postgresql

I have a table with a lot of entries in pairs, related to one another by a field name, each entry has a timestamp timestamp_received. I'd like to get the delta of the timestamp per name

table:

|id|name|timestamp_received|
|1|A|2016-09-21 11:11:00|
|2|A|2016-09-21 11:11:10|
|3|B|2016-09-21 11:11:20|
|4|B|2016-09-21 12:12:21|

desired output:

|name|delta|
|A|00:00:10|
|B|01:01:01|

What I have tried so far:

SELECT name, timestamp_received - LAG(timestamp_received, 1) OVER (PARTITION BY name ORDER BY id ASC) delta
FROM message;

Which does not produce the expected output

SELECT fpm_left.name, age(fpm_right.timestamp_received, fpm_left.timestamp_received) 
FROM
  (SELECT * FROM message WHERE id IN (SELECT min(id) AS id FROM message GROUP BY name)) fpm_left
JOIN (SELECT * FROM message WHERE IN in (SELECT max(id) AS id FROM message GROUP BY name)) fpm_right
    ON fpm_left.name = fpm_right.name AND fpm_left.id != fpm_right.id;

Which produces the expected output, but seems a little innefficient

Can this be done more elegantly?

Best Answer

You can use the solution with the window function, you just need to filter out the "other" row. This can e.g. be done using Postgres' distinct on ()

SELECT distinct on (name) 
       name, 
       timestamp_received - LAG(timestamp_received, 1) OVER (PARTITION BY name ORDER BY id ASC) as delta 
FROM message
order by name, timestamp_received desc;

This will always pick exactly one row per name (the one with the latest timestamp_received), even if there are more then two rows per name.

Get qualifying rows only

One way ..

WITH x AS (
   SELECT *
         ,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
   FROM   tbl
   )
, y AS (
   SELECT *
   FROM   x
   WHERE  x.rn = 1
   AND    date_out >= '2012-10-01'
   AND    date_out <  '2012-11-01'
   )
, z AS (
   SELECT x.*
   FROM   x
   JOIN   y USING (reference)
   WHERE  x.rn = 2
   AND    x.centreid = 1
   )
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM   y
JOIN   z USING (reference)

UNION  ALL
SELECT taskid, reference, centreid, date_out
FROM   z
ORDER  BY reference, date_out;

Another way:

WITH x AS (
   SELECT *
         ,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
   FROM   tbl
   )
,y AS (
    SELECT x.*, y.taskid AS taskid2, y.centreid AS centreid2, y.date_out AS date_out2
    FROM   x
    JOIN   x y USING (reference)
    WHERE  x.rn = 1
    AND    x.date_out >= '2012-10-01'
    AND    x.date_out <  '2012-11-01'
    AND    y.rn = 2
    AND    y.centreid = 1
    )
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM   y

UNION  ALL
SELECT y.taskid2, y.reference, y.centreid2, y.date_out2
FROM   y
ORDER  BY reference, date_out;

I'd expect the second one to be faster. Depends on your data distribution. Test with EXPLAIN ANALYZE.

Get all rows for qualifying references

WITH x AS (
   SELECT *
         ,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
   FROM   tbl
   )
,y AS (
    SELECT reference
    FROM   x
    JOIN   x y USING (reference)
    WHERE  x.rn = 1
    AND    x.date_out >= '2012-10-01'
    AND    x.date_out <  '2012-11-01'
    AND    y.rn = 2
    AND    y.centreid = 1
    )
SELECT *
FROM   tbl
JOIN   y USING (reference)
ORDER  BY reference, date_out;

-> sqlfiddle

Answer to follow-up in comment

Separate groups if more than 30 days between entries.

WITH a AS (
   SELECT *
         ,lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) AS last_date
         ,CASE WHEN date_out > 
                   (lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) - 30)
               THEN 0 ELSE 1
          END AS step
   FROM   tbl
   )
,b AS (
   SELECT *
         ,sum(step) OVER (PARTITION BY reference ORDER BY date_out DESC) AS grp
   FROM a
   )
 ,c AS (
   SELECT *
         ,row_number() OVER (PARTITION BY reference, grp ORDER BY date_out DESC) AS rn
   FROM b
   )
,d AS (
   SELECT reference, grp
   FROM   c
   JOIN   c d USING (reference, grp)
   WHERE  c.rn = 1
   AND    c.date_out >= '2012-10-01'
   AND    c.date_out <  '2012-11-01'
   AND    d.rn = 2
   AND    d.centreid = 1
   )
SELECT b.taskid, b.reference, b.centreid, b.date_out
FROM   b
JOIN   d USING (reference, grp)
ORDER  BY reference, date_out

-> sqlfiddle

But while pure SQL is a beaut .. I would solve this procedurally in a plpgsql function. Very similar to this recent answer on SO. Would be faster, because it can be done in a single table scan.

PostgreSQL – How to Find Unmatched Rows Between Two Tables Dynamically

I derive from your example that you only want to match tables with the same structure (compatible row types).

Base query

First of all, your base query is needlessly complex and possibly incorrect. Consider instead:

(TABLE a EXCEPT ALL TABLE b)  -- ALL keeps duplicate rows and is faster
UNION ALL
(TABLE b EXCEPT ALL TABLE a);

I doubt you want to fold duplicates. Typically, tables do not hold complete duplicates to begin with, and omitting the ALL key word would instruct Postgres to try and fold duplicates in every step, which is just a waste of time. Details:

Select rows which are not present in other table

Even if completely duplicate rows are possible, it might be best to return all of them to avoid misleading answers. If you want to fold duplicates in the result anyway, a single UNION does the job:

(TABLE a EXCEPT ALL TABLE b)
UNION
(TABLE b EXCEPT ALL TABLE a);

However, a query with NATURAL FULL OUTER JOIN is more efficient, doing almost the same:

SELECT *
FROM   a NATURAL FULL OUTER JOIN b
WHERE  a IS NULL OR b IS NULL;

There are two subtle, exotic corner cases:

This returns rows that are all NULL to begin with. It would be an odd table design that allows such rows, though.
This does not return duplicate rows that are not matched with the same number of duplicate rows in the other table: a single match is enough to eliminate all - which is subtly different from folding duplicates in the result! Have a look at the fiddle below. Again, it would be an odd table design that allows completely duplicate rows.

Function

What you have so far does not work for several reasons.

To return actual table rows dynamically (and not just a count or a text representation), you need to use a polymorphic type.

Since the second table is bound to have a compatible row type (as per my assumption) it's enough to hand in just the table name for that.

CREATE OR REPLACE FUNCTION f_tbl_diff(_tbl1_type ANYELEMENT, _tbl2 text)
  RETURNS SETOF ANYELEMENT AS
$func$
BEGIN
   RETURN QUERY EXECUTE format('
      SELECT *
      FROM   %1$s NATURAL FULL OUTER JOIN %2$I
      WHERE  %1$s IS NULL OR %2$I IS NULL'
    , pg_typeof(_tbl1_type), _tbl2);
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM f_tbl_diff(NULL::a, 'b');

Note the special syntax for the first argument! We hand in an actual row defining the return type, not just a table name. Related answer on SO with ample details (scroll to the last chapter):

Refactor a PL/pgSQL function to return the output of various SELECT queries

This returns all rows that have no exact, complete match in the other table

SQL Fiddle also demonstrating exotic corner cases.

About the dynamic SQL:

SQL injection in Postgres functions vs prepared queries