Postgresql – Concatenate string based on a condition

postgresqlwindow functions

I want to assign a value to a new column based on a condition for other rows associated with a user_id.

A user can see the same step with different devices at different times. However, only the first step must be taken into consideration.

e.g. step2 was seen by both mobile and desktop, but was first seen by mobile at 2021-03-16 14:03:16.
Likewise step4 was seen by both desktop and tablet, but first desktop.

To sum up, each step first seen by

step1 = mobile
step2 = mobile
step3 = mobile
step4 = desktop

Since the only device change happened at step4, then assign desktop > mobile to all records for that user_id.

How can I get the first step per user and device as in the expected result below?

Sample data:

+---------+-------+---------------------+---------+
| user_id | step  |     created_at      | device  |
+---------+-------+---------------------+---------+
| user1   | step1 | 2021-03-16 14:03:16 | mobile  |
| user1   | step2 | 2021-03-16 14:04:07 | mobile  |
| user1   | step2 | 2021-03-16 14:03:47 | desktop |
| user1   | step3 | 2021-03-16 14:03:55 | mobile  |
| user1   | step3 | 2021-03-16 14:04:00 | mobile  |
| user1   | step1 | 2021-03-16 14:04:02 | desktop |
| user1   | step2 | 2021-03-16 14:03:16 | mobile  |
| user1   | step3 | 2021-03-16 14:04:07 | mobile  |
| user1   | step4 | 2021-03-16 14:04:08 | desktop |
| user1   | step4 | 2021-03-16 14:04:09 | tablet  |
+---------+-------+---------------------+---------+

The expected result:

+---------+-------+---------------------+---------+---------------------+
| user_id | step  |     created_at      | device  | device_concatenated |
+---------+-------+---------------------+---------+---------------------+
| user1   | step1 | 2021-03-16 14:03:16 | mobile  | mobile > desktop    |
| user1   | step2 | 2021-03-16 14:03:16 | mobile  | mobile > desktop    |
| user1   | step3 | 2021-03-16 14:03:55 | mobile  | mobile > desktop    |
| user1   | step4 | 2021-03-16 14:04:08 | desktop | mobile > desktop    |
+---------+-------+---------------------+---------+---------------------+

Best Answer

Given the data as is (as of 11:38 UTC, 2021-04-08), this might help (see PostgreSQL fiddle here):

CREATE TABLE tab
(
  user_id TEXT NOT NULL,
  step    TEXT NOT NULL,
  created_at TIMESTAMP NOT NULL,
  device  TEXT NOT NULL
);

Populate it (slightly rearranged from original data to put it in order - user_id, step & created_at - makes it easier to figure out what's going on):

INSERT INTO tab 
VALUES

( 'user1', 'step1', '2021-03-16 14:03:16', 'mobile'),
( 'user1', 'step1', '2021-03-16 14:04:02', 'desktop'),


( 'user1', 'step2', '2021-03-16 14:03:16', 'mobile'),
( 'user1', 'step2', '2021-03-16 14:03:47', 'desktop'),
( 'user1', 'step2', '2021-03-16 14:04:07', 'mobile'),


( 'user1', 'step3', '2021-03-16 14:03:55', 'mobile'),
( 'user1', 'step3', '2021-03-16 14:04:00', 'mobile'),
( 'user1', 'step3', '2021-03-16 14:04:07', 'mobile'),

( 'user1', 'step4', '2021-03-16 14:04:08', 'desktop'),
( 'user1', 'step4', '2021-03-16 14:04:09', 'tablet');

I'll show the steps that led me to the solution - might be helpful (remove surplus field as appropriate for your requirements):

SELECT 
  user_id, step, created_at, device, 
  FIRST_VALUE(device)     OVER (PARTITION BY user_id, step
                            ORDER BY     user_id, step, created_at),
  FIRST_VALUE(created_at) OVER (PARTITION BY user_id, step
                            ORDER BY     user_id, step, created_at),
  LEAD(device)            OVER (PARTITION BY user_id, step
                            ORDER BY     user_id, step, created_at),
  CONCAT(
  (FIRST_VALUE(device)     OVER (PARTITION BY user_id, step
                            ORDER BY     user_id, step, created_at)),
  ' > ',
  (LEAD(device)            OVER (PARTITION BY user_id, step
                            ORDER BY     user_id, step, created_at)))
  AS device_concatenated,
  ROW_NUMBER() OVER (PARTITION BY user_id, step
                            ORDER BY     user_id, step, created_at) AS rn
FROM tab
ORDER BY user_id, step, created_at;

Result (better viewed on fiddle):

user_id step    created_at  device  first_value first_value lead    device_concatenated rn
user1   step1   2021-03-16 14:03:16 mobile  mobile  2021-03-16 14:03:16 desktop mobile > desktop    1
user1   step1   2021-03-16 14:04:02 desktop mobile  2021-03-16 14:03:16     mobile >    2
user1   step2   2021-03-16 14:03:16 mobile  mobile  2021-03-16 14:03:16 desktop mobile > desktop    1
user1   step2   2021-03-16 14:03:47 desktop mobile  2021-03-16 14:03:16 mobile  mobile > mobile 2
user1   step2   2021-03-16 14:04:07 mobile  mobile  2021-03-16 14:03:16     mobile >    3
user1   step3   2021-03-16 14:03:55 mobile  mobile  2021-03-16 14:03:55 mobile  mobile > mobile 1
user1   step3   2021-03-16 14:04:00 mobile  mobile  2021-03-16 14:03:55 mobile  mobile > mobile 2
user1   step3   2021-03-16 14:04:07 mobile  mobile  2021-03-16 14:03:55     mobile >    3
user1   step4   2021-03-16 14:04:08 desktop desktop 2021-03-16 14:04:08 tablet  desktop > tablet    1
user1   step4   2021-03-16 14:04:09 tablet  desktop 2021-03-16 14:04:08     desktop >   2

Then, we use the above result as a sub-query as follows:

SELECT user_id, step, created_ts, device, dc AS device_concatenated
FROM
(
  SELECT 
    user_id, step, created_at, device, 
    FIRST_VALUE(device)     OVER (PARTITION BY user_id, step
                                  ORDER BY     user_id, step, created_at),
    FIRST_VALUE(created_at) OVER (PARTITION BY user_id, step
                                  ORDER BY     user_id, step, created_at) AS created_ts,
    LEAD(device)            OVER (PARTITION BY user_id, step
                                  ORDER BY     user_id, step, created_at),
    CONCAT(
    (FIRST_VALUE(device)   OVER (PARTITION BY user_id, step
                                 ORDER BY     user_id, step, created_at)),
    ' > ',
    (LEAD(device)            OVER (PARTITION BY user_id, step
                            ORDER BY     user_id, step, created_at)))
  AS dc,
  ROW_NUMBER() OVER (PARTITION BY user_id, step
                            ORDER BY     user_id, step, created_at) AS rn
  FROM tab
  ORDER BY user_id, step, created_at
) AS t
WHERE rn = 1
ORDER BY user_id, step, device, created_ts;

Result:

user_id   step  created_ts             device   device_concatenated
  user1  step1  2021-03-16 14:03:16    mobile       mobile > desktop
  user1  step2  2021-03-16 14:03:16    mobile       mobile > desktop
  user1  step3  2021-03-16 14:03:55    mobile        mobile > mobile
  user1  step4  2021-03-16 14:04:08   desktop       desktop > tablet

Which is pretty close to the desired result. To obtain the desired result, we need to know how/why for step3, mobile > desktop is valid and mobile > mobile is not? Same goes for step4 - why is desktop > tablet not correct? There isn't even a single entry for mobile for step4... On receiving an explanation, I will edit as appropriate!