Postgresql – Calculate difference of days between dates in different rows, based on consecutive value of a column, for each customer_id

gaps-and-islandspostgresql

I have a table like this one, that shows payments from each client, for different dates:

|  stream_datetime | customer_id | order_status | rn |
|:----------------:|:-----------:|:------------:|:--:|
| 04/06/20 11:19AM |      1      |   completed  |  1 |
| 05/06/20 10:54AM |      1      |   completed  |  2 |
| 06/06/20 10:59AM |      1      |   completed  |  3 |
| 08/06/20 09:27AM |      2      |    failed    |  1 |
| 09/06/20 11:02AM |      2      |    failed    |  2 |
| 01/11/20 05:59PM |      3      |   completed  |  1 |
| 02/11/20 05:59PM |      3      |   completed  |  2 |
| 03/11/20 10:01AM |      3      |   cancelled  |  3 |
| 04/11/20 09:20AM |      3      |   completed  |  4 |
| 05/11/20 10:25AM |      3      |   completed  |  5 |
| 01/13/20 03:29PM |      4      |   completed  |  1 |
| 02/13/20 03:29PM |      4      |   completed  |  2 |
| 03/13/20 03:29PM |      4      |   cancelled  |  3 |
| 04/13/20 03:29PM |      4      |   completed  |  4 |
| 05/13/20 03:29PM |      4      |   completed  |  5 |
| 06/13/20 03:29PM |      4      |   completed  |  6 |
| 07/13/20 03:29PM |      4      |   completed  |  7 |
| 08/13/20 03:29PM |      4      |   cancelled  |  8 |
| 06/20/20 03:29PM |      5      |    failed    |  1 |
| 07/20/20 03:29PM |      5      |   completed  |  2 |
| 08/20/20 03:29PM |      5      |   completed  |  3 |
| 09/20/20 03:29PM |      5      |    failed    |  4 |
| 10/20/20 03:29PM |      5      |   completed  |  5 |

I want to calculate the difference of days a client canceled his plan.

The challenge here is the fact that the client can cancel more than one time, so client 4 must
be counted as a churned client two times, but client 3 would be counted as a churned client
only one time.

I want to consider only clients that has an order_status = completed followed by (not necessarily in the next month) an order_status = cancelled.

I want too to create a column called purchase_day that keeps the day of the payment.

Obs.: the column rn means row number for the specific client.


EDITED:
Sorry. I made some mistakes and writing the question.

Maybe the order_status = cancelled come first than order_status = completed. This is due a mistake in the business, but it can happen. If this happen, so we cannot consider as a churned client.

So this is my expected result (now this is ok):

|   purchase_day   | customer_id | lifetime|
|:----------------:|:-----------:|:-------:|
| 01/11/20 05:59PM |      3      |   60    |
| 01/13/20 03:29PM |      4      |   60    |
| 04/13/20 03:29PM |      4      |   122   |

As you can see:

  • client 1 never cancelled (so he doesn't need to be in the result)
  • client 2 never cancelled (so he doesn't need to be in the result)
  • client 3 cancelled one time (his lifetime is equal 60 days)
  • client 4 cancelled two times (one lifetime of 60 days and another lifetime of 122 days)
  • client 5 never cancelled (so he doesn't need to be in the result)

Best Answer

Okay, had to do this in three steps:

  1. To mark which records we want to keep (first record for each customer, records where the status changed)
  2. To get the prior row's stream_datetime and calculate the the difference between
  3. Select only the rows where the order_status was 'cancelled'.
SELECT
  customer_id
 ,purchase_day
 ,lifetime
FROM
(
  SELECT
    customer_id
   ,order_status
   ,LAG(stream_datetime) OVER (PARTITION BY customer_id ORDER BY stream_datetime) AS purchase_day
   ,CAST(stream_datetime AS DATE) - LAG(CAST(stream_datetime AS DATE)) OVER (PARTITION BY customer_id ORDER BY stream_datetime) AS lifetime
  FROM
    (
      SELECT
        customer_id
       ,stream_datetime
       ,order_status
       ,CASE
          WHEN customer_id <> LAG(customer_id,1,-1) OVER (PARTITION BY customer_id ORDER BY stream_datetime) THEN 'Y'
          WHEN order_status <> LAG(order_status,1,'') OVER (PARTITION BY customer_id ORDER BY stream_datetime) THEN 'Y'
          ELSE 'N'
        END AS change_ind
      FROM
        Payment
      WHERE
        order_status IN ('completed','cancelled')  
    ) payment_summary
  WHERE
    change_ind = 'Y'
) payment_change
WHERE
  order_status = 'cancelled'
    AND purchase_day IS NOT NULL

Results in:

customer_id     purchase_day            lifetime
3               2020-01-11T17:59:00Z    60
4               2020-01-13T15:29:00Z    60
4               2020-04-13T15:29:00Z    122

You can test with this Fiddle.

I'd recommend running the subqueries by themselves to get a feel for how each step works and so you can be confident in the results.

Your query worked pretty well!. But I forgot to tell something: maybe the order_status = cancelled come first than order_status = completed. This is due a mistake in the business, but it can happen. If this happen, so we cannot consider as a churned client.

Ok, this is easy enough to fix given how things are set up. If the cancel has been entered erroneously, purchase_day will be NULL (we already ignore instances with two consecutive statuses), so we can just change the final WHERE to:

WHERE
  order_status = 'cancelled'
    AND purchase_day IS NOT NULL

Updated fiddle for you to test.