PostgreSQL – Window Function to Merge Rows Meeting a Condition

Tags: gaps-and-islands, postgresql, window-functions

I have the following schema

DROP TABLE IF EXISTS messages;
DROP TABLE IF EXISTS chats;

CREATE TABLE chats (
  client integer NOT NULL,
  provider integer NOT NULL,
  PRIMARY KEY (client, provider)
);

CREATE TABLE messages (
  id serial PRIMARY KEY,
  client integer NOT NULL,
  provider integer NOT NULL,
  sender integer NOT NULL CHECK(sender = client OR sender = provider),
  created_at timestamp with time zone NOT NULL,
  FOREIGN KEY (client, provider) REFERENCES chats(client, provider) ON UPDATE CASCADE ON DELETE CASCADE
);

INSERT INTO chats VALUES (1,2), (3,2), (1,3);
INSERT INTO messages (client, provider, sender, created_at) VALUES
  (1, 2, 1, '2017-06-13 17:00:00+0'),
  (1, 2, 1, '2017-06-13 17:00:10+0'),
  (1, 2, 1, '2017-06-13 17:01:00+0'),
  (1, 2, 2, '2017-06-13 17:10:00+0'),
  (1, 2, 2, '2017-06-13 17:10:10+0'),
  (1, 2, 2, '2017-06-13 17:10:20+0'),
  (1, 2, 1, '2017-06-13 17:11:00+0'),
  (1, 2, 2, '2017-06-13 17:11:10+0');

INSERT INTO messages (client, provider, sender, created_at) VALUES
  (3, 2, 2, '2017-06-13 17:05:00+0'),
  (3, 2, 2, '2017-06-13 17:05:10+0'),
  (3, 2, 2, '2017-06-13 17:05:20+0'),
  (3, 2, 3, '2017-06-13 17:12:00+0'),
  (3, 2, 2, '2017-06-13 17:12:10+0'),
  (3, 2, 2, '2017-06-13 17:12:15+0'),
  (3, 2, 3, '2017-06-13 17:12:30+0'),
  (3, 2, 3, '2017-06-13 17:14:00+0');

INSERT INTO messages (client, provider, sender, created_at) VALUES
  (1, 3, 1, '2017-06-13 17:05:00+0'),
  (1, 3, 1, '2017-06-13 17:05:10+0');

What I want is to calculate the average time a provider takes to answer a client, with the following rules:

  • compare the time from the first message of a group of client messages to the provider's next message
  • if the client messages and never receives an answer, take the time as CURRENT_TIMESTAMP - {the client's first message}

These are obvious (?)

  • if the provider messages first, don't consider it
  • if the client messages last, don't consider this time

I thought of starting by grouping a run of messages from the same sender into a single entity. By this I mean:

Turning this:

postgres=# table messages order by client, provider, created_at;
 id | client | provider | sender |       created_at
----+--------+----------+--------+------------------------
  1 |      1 |        2 |      1 | 2017-06-13 13:00:00-04
  2 |      1 |        2 |      1 | 2017-06-13 13:00:10-04
  3 |      1 |        2 |      1 | 2017-06-13 13:01:00-04
--
  4 |      1 |        2 |      2 | 2017-06-13 13:10:00-04
  5 |      1 |        2 |      2 | 2017-06-13 13:10:10-04
  6 |      1 |        2 |      2 | 2017-06-13 13:10:20-04
--
  7 |      1 |        2 |      1 | 2017-06-13 13:11:00-04
--
  8 |      1 |        2 |      2 | 2017-06-13 13:11:10-04
--
 17 |      1 |        3 |      1 | 2017-06-13 13:05:00-04
 18 |      1 |        3 |      1 | 2017-06-13 13:05:10-04
--
  9 |      3 |        2 |      2 | 2017-06-13 13:05:00-04
 10 |      3 |        2 |      2 | 2017-06-13 13:05:10-04
 11 |      3 |        2 |      2 | 2017-06-13 13:05:20-04
--
 12 |      3 |        2 |      3 | 2017-06-13 13:12:00-04
--
 13 |      3 |        2 |      2 | 2017-06-13 13:12:10-04
 14 |      3 |        2 |      2 | 2017-06-13 13:12:15-04
--
 15 |      3 |        2 |      3 | 2017-06-13 13:12:30-04
 16 |      3 |        2 |      3 | 2017-06-13 13:14:00-04
(18 rows)

Into something like:

 client | provider | sender |    first_created_at
--------+----------+--------+------------------------
      1 |        2 |      1 | 2017-06-13 13:00:00-04
      1 |        2 |      2 | 2017-06-13 13:10:00-04
      1 |        2 |      1 | 2017-06-13 13:11:00-04
      1 |        2 |      2 | 2017-06-13 13:11:10-04
      1 |        3 |      1 | 2017-06-13 13:05:00-04
      3 |        2 |      2 | 2017-06-13 13:05:00-04
      3 |        2 |      3 | 2017-06-13 13:12:00-04
      3 |        2 |      2 | 2017-06-13 13:12:10-04
      3 |        2 |      3 | 2017-06-13 13:12:30-04

That is, separating the messages into groups of consecutive messages from the same sender and keeping the first created_at of each group.

After that I think the rest of the query should be relatively easy. I think this is done with window functions, but I'm not sure how to get it working.

After getting this, the idea would be to subtract each first_created_at of the provider from the previous first_created_at (which would be the client's), and average these intervals for each provider.
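If it helps, that subtraction step could be sketched with lead() over the desired intermediate result. Here the grouped rows for the chat (client 1, provider 2) are hard-coded as literal VALUES so the sketch runs standalone; the never-answered fallback is left out:

```sql
-- The desired grouped rows, hard-coded for illustration.
WITH grouped(client, provider, sender, first_created_at) AS (
  VALUES
    (1, 2, 1, timestamptz '2017-06-13 17:00:00+0'),
    (1, 2, 2, timestamptz '2017-06-13 17:10:00+0'),
    (1, 2, 1, timestamptz '2017-06-13 17:11:00+0'),
    (1, 2, 2, timestamptz '2017-06-13 17:11:10+0')
), paired AS (
  SELECT client, provider, sender, first_created_at,
         lead(sender)           OVER w AS next_sender,
         lead(first_created_at) OVER w AS next_created_at
  FROM grouped
  WINDOW w AS (PARTITION BY client, provider ORDER BY first_created_at)
)
SELECT provider,
       avg(next_created_at - first_created_at) AS avg_response
FROM paired
WHERE sender = client          -- the group was started by the client...
  AND next_sender = provider   -- ...and the next group is the provider's answer
GROUP BY provider;
```

For these four rows the paired intervals are 00:10:00 and 00:00:10, matching the intervals worked out in the examples below.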


EXAMPLE RESULTS

For the chat client = 1, provider = 2

postgres=# SELECT * FROM messages WHERE (client, provider) = (1,2) ORDER BY created_at;
 id | client | provider | sender |       created_at
----+--------+----------+--------+------------------------
  1 |      1 |        2 |      1 | 2017-06-13 13:00:00-04
  2 |      1 |        2 |      1 | 2017-06-13 13:00:10-04
  3 |      1 |        2 |      1 | 2017-06-13 13:01:00-04
--
  4 |      1 |        2 |      2 | 2017-06-13 13:10:00-04
  5 |      1 |        2 |      2 | 2017-06-13 13:10:10-04
  6 |      1 |        2 |      2 | 2017-06-13 13:10:20-04
--
  7 |      1 |        2 |      1 | 2017-06-13 13:11:00-04
--
  8 |      1 |        2 |      2 | 2017-06-13 13:11:10-04
(8 rows)
  • The group 1 (where sender = client) gives created_at = 2017-06-13 13:00:00-04
  • The group 2 (where sender = provider) gives created_at = 2017-06-13 13:10:00-04
  • The group 3 (where sender = client) gives created_at = 2017-06-13 13:11:00-04
  • The group 4 (where sender = provider) gives created_at = 2017-06-13 13:11:10-04

That gives us the intervals:

  • group2.created_at - group1.created_at = 00:10:00
  • group4.created_at - group3.created_at = 00:00:10

Then, for the chat client = 3, provider = 2

postgres=# SELECT * FROM messages WHERE (client, provider) = (3,2) ORDER BY created_at;
 id | client | provider | sender |       created_at
----+--------+----------+--------+------------------------
  9 |      3 |        2 |      2 | 2017-06-13 13:05:00-04
 10 |      3 |        2 |      2 | 2017-06-13 13:05:10-04
 11 |      3 |        2 |      2 | 2017-06-13 13:05:20-04
--
 12 |      3 |        2 |      3 | 2017-06-13 13:12:00-04
--
 13 |      3 |        2 |      2 | 2017-06-13 13:12:10-04
 14 |      3 |        2 |      2 | 2017-06-13 13:12:15-04
--
 15 |      3 |        2 |      3 | 2017-06-13 13:12:30-04
 16 |      3 |        2 |      3 | 2017-06-13 13:14:00-04
(8 rows)
  • The group 1 (where sender = provider) gives created_at = 2017-06-13 13:05:00-04
  • The group 2 (where sender = client) gives created_at = 2017-06-13 13:12:00-04
  • The group 3 (where sender = provider) gives created_at = 2017-06-13 13:12:10-04
  • The group 4 (where sender = client) gives created_at = 2017-06-13 13:12:30-04

That gives us the intervals:

  • We don't subtract group 1 because we have no previous client group to use
  • group3.created_at - group2.created_at = 00:00:10
  • We don't subtract group 4 because we have no following provider group to use

With these two conversations we can average the three intervals for provider 2: 00:03:26.666667.

Finally, for the chat client = 1, provider = 3

postgres=# SELECT * FROM messages WHERE (client, provider) = (1,3) ORDER BY created_at;
 id | client | provider | sender |       created_at
----+--------+----------+--------+------------------------
 17 |      1 |        3 |      1 | 2017-06-13 13:05:00-04
 18 |      1 |        3 |      1 | 2017-06-13 13:05:10-04
(2 rows)
  • The group 1 (where sender = client) gives created_at = 2017-06-13 13:05:00-04

That gives us the intervals:

  • The provider never answered, so we use CURRENT_TIMESTAMP - group1.created_at = 01:00:00 (this value obviously varies over time)
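That fallback is just a COALESCE away: when the provider's reply timestamp is NULL, substitute CURRENT_TIMESTAMP. A minimal standalone illustration, where the NULL stands in for a missing reply:

```sql
-- No reply exists, so the NULL falls through to CURRENT_TIMESTAMP.
SELECT COALESCE(NULL::timestamptz, CURRENT_TIMESTAMP)
       - timestamptz '2017-06-13 17:05:00+0' AS response_time;
```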

Best Answer

I'm still kind of confused about what you want, but I think you want this:

SELECT client, provider, sender, min(created_at) AS first_created_at
FROM (
  SELECT count(CASE WHEN is_reset THEN 1 END) OVER (ORDER BY created_at) AS grp,
    client,
    provider,
    sender,
    created_at
  FROM (
    SELECT
      (client,provider,sender) <> lag( (client,provider,sender) ) OVER (ORDER BY created_at) AS is_reset,
      client,
      provider,
      sender,
      created_at
    FROM messages
  ) AS t1
) AS t2
GROUP BY grp, client, provider, sender
HAVING count(*) > 1;

Breaking that down, we first compute an is_reset flag using PostgreSQL's row-comparison feature. We could have written this in other ways.

SELECT
  (client,provider,sender) <> lag( (client,provider,sender) ) OVER (ORDER BY created_at) AS is_reset,
  client,
  provider,
  sender,
  created_at
FROM messages

 is_reset | client | provider | sender |       created_at       
----------+--------+----------+--------+------------------------
          |      1 |        2 |      1 | 2017-06-13 12:00:00-05
 f        |      1 |        2 |      1 | 2017-06-13 12:00:10-05
 f        |      1 |        2 |      1 | 2017-06-13 12:01:00-05
 t        |      3 |        2 |      2 | 2017-06-13 12:05:00-05
 t        |      1 |        3 |      1 | 2017-06-13 12:05:00-05
 f        |      1 |        3 |      1 | 2017-06-13 12:05:10-05
 t        |      3 |        2 |      2 | 2017-06-13 12:05:10-05
 f        |      3 |        2 |      2 | 2017-06-13 12:05:20-05
 t        |      1 |        2 |      2 | 2017-06-13 12:10:00-05
 f        |      1 |        2 |      2 | 2017-06-13 12:10:10-05
 f        |      1 |        2 |      2 | 2017-06-13 12:10:20-05
 t        |      1 |        2 |      1 | 2017-06-13 12:11:00-05
 t        |      1 |        2 |      2 | 2017-06-13 12:11:10-05
 t        |      3 |        2 |      3 | 2017-06-13 12:12:00-05
 t        |      3 |        2 |      2 | 2017-06-13 12:12:10-05
 f        |      3 |        2 |      2 | 2017-06-13 12:12:15-05
 t        |      3 |        2 |      3 | 2017-06-13 12:12:30-05
 f        |      3 |        2 |      3 | 2017-06-13 12:14:00-05
(18 rows)
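For instance, one of those other ways is to compare the columns individually. Since all three columns are NOT NULL, this behaves the same as the row comparison (including the NULL on the first row):

```sql
SELECT
  client      <> lag(client)   OVER w
  OR provider <> lag(provider) OVER w
  OR sender   <> lag(sender)   OVER w AS is_reset,
  client,
  provider,
  sender,
  created_at
FROM messages
WINDOW w AS (ORDER BY created_at);
```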

Then we use a running count() of the resets to number the groups.

SELECT count(CASE WHEN is_reset THEN 1 END) OVER (ORDER BY created_at) AS grp,
  client,
  provider,
  sender,
  created_at
FROM (
  SELECT
    (client,provider,sender) <> lag( (client,provider,sender) ) OVER (ORDER BY created_at) AS is_reset,
    client,
    provider,
    sender,
    created_at
  FROM messages
) AS t1;
 grp | client | provider | sender |       created_at       
-----+--------+----------+--------+------------------------
   0 |      1 |        2 |      1 | 2017-06-13 12:00:00-05
   0 |      1 |        2 |      1 | 2017-06-13 12:00:10-05
   0 |      1 |        2 |      1 | 2017-06-13 12:01:00-05
   2 |      3 |        2 |      2 | 2017-06-13 12:05:00-05
   2 |      1 |        3 |      1 | 2017-06-13 12:05:00-05
   3 |      1 |        3 |      1 | 2017-06-13 12:05:10-05
   3 |      3 |        2 |      2 | 2017-06-13 12:05:10-05
   3 |      3 |        2 |      2 | 2017-06-13 12:05:20-05
   4 |      1 |        2 |      2 | 2017-06-13 12:10:00-05
   4 |      1 |        2 |      2 | 2017-06-13 12:10:10-05
   4 |      1 |        2 |      2 | 2017-06-13 12:10:20-05
   5 |      1 |        2 |      1 | 2017-06-13 12:11:00-05
   6 |      1 |        2 |      2 | 2017-06-13 12:11:10-05
   7 |      3 |        2 |      3 | 2017-06-13 12:12:00-05
   8 |      3 |        2 |      2 | 2017-06-13 12:12:10-05
   8 |      3 |        2 |      2 | 2017-06-13 12:12:15-05
   9 |      3 |        2 |      3 | 2017-06-13 12:12:30-05
   9 |      3 |        2 |      3 | 2017-06-13 12:14:00-05
(18 rows)
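As an aside, on PostgreSQL 9.4+ the count(CASE ...) idiom can also be written with the aggregate FILTER clause, which reads a bit more directly; this is a drop-in replacement for the query above:

```sql
SELECT count(*) FILTER (WHERE is_reset) OVER (ORDER BY created_at) AS grp,
  client,
  provider,
  sender,
  created_at
FROM (
  SELECT
    (client,provider,sender) <> lag( (client,provider,sender) ) OVER (ORDER BY created_at) AS is_reset,
    client,
    provider,
    sender,
    created_at
  FROM messages
) AS t1;
```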

And now we GROUP BY grp and make sure there are at least two messages (with HAVING), which I assume takes care of your "compare the time from the first message of a group of messages of the client to the next message of the provider".
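From there, under my reading of your rules, the remaining steps could be sketched like this: drop the HAVING (a one-message group still marks a conversational turn), pair each client group with the following group via lead(), fall back to CURRENT_TIMESTAMP when the provider never wrote at all, and average per provider. Treat this as a sketch, not a definitive implementation:

```sql
WITH grouped AS (
  SELECT client, provider, sender, min(created_at) AS first_created_at
  FROM (
    SELECT count(CASE WHEN is_reset THEN 1 END) OVER (ORDER BY created_at) AS grp,
      client, provider, sender, created_at
    FROM (
      SELECT
        (client,provider,sender) <> lag( (client,provider,sender) ) OVER (ORDER BY created_at) AS is_reset,
        client, provider, sender, created_at
      FROM messages
    ) AS t1
  ) AS t2
  GROUP BY grp, client, provider, sender
), paired AS (
  SELECT client, provider, sender, first_created_at,
         lead(sender)           OVER w AS next_sender,
         lead(first_created_at) OVER w AS next_created_at,
         -- TRUE if the provider sent anything at all in this chat
         bool_or(sender = provider) OVER (PARTITION BY client, provider) AS provider_ever_wrote
  FROM grouped
  WINDOW w AS (PARTITION BY client, provider ORDER BY first_created_at)
)
SELECT provider,
       avg(COALESCE(next_created_at,
                    CASE WHEN NOT provider_ever_wrote THEN CURRENT_TIMESTAMP END)
           - first_created_at) AS avg_response
FROM paired
WHERE sender = client                 -- only turns the client started...
  AND (next_sender = provider         -- ...that the provider answered,
       OR next_sender IS NULL)        -- ...or that were never answered
GROUP BY provider;
```

A client's unanswered last turn in a chat where the provider did reply earlier yields a NULL interval, which avg() ignores, matching your "don't consider this time" rule. One caveat: the global ORDER BY created_at in the lag()/count() windows can split a sender's run when messages from another chat interleave with identical timestamps, so you may want PARTITION BY client, provider in those windows as well.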

I'm not sure about your other criteria or exactly what you want. You did a great job showing data; perhaps you should continue the trend and show what you want to exclude from your data. Also, please, in the future, draw out the desired result from that data if you can.