PostgreSQL – List Chronological Pageviews Without Adjacent Duplicates

gaps-and-islandspostgresqlpostgresql-9.4

I've got a log of pageviews. Multiple pageviews with the same path may occur next to each other if say the user refreshes the page.

To see the user's journey through the site I'd like to select pageviews for that user but eliminating duplicate adjacent pageviews for clarity.

Example simplified table (time is really a timestamp):

CREATE TABLE pageviews(user_id, time, path)
AS VALUES
  ( 1, 1, '/home' ),
  ( 2, 2, '/home' ),
  ( 1, 3, '/profile' ),
  ( 1, 4, '/profile' ),
  ( 2, 5, '/dashboard' ),
  ( 1, 6, '/home' ),
  ( 2, 7, '/profile' ),
  ( 1, 8, '/profile' );

Now assume there's going to be a WHERE user_id = 1 clause and ORDER BY time ASC so we have:

( 1, 1, '/home' ),
( 1, 3, '/profile' ),
( 1, 4, '/profile' ),
( 1, 6, '/home' ),
( 1, 8, '/profile' );

But what I'd like is the duplicate adjacent /profile to be ignored like so (a count of the "grouped" paths would be nice too):

| user_id | time | path     | count |
| ------- | ---- | -------- | ----- |
| 1       | 1    | /home    | 1     |
| 1       | 4    | /profile | 2     |
| 1       | 6    | /home    | 1     |
| 1       | 8    | /profile | 1     |

A simple GROUP BY wouldn't do it as /profile is still in there twice, I only want to "group" adjacent equal values of path.

Also I'm not sure whether the first or last of the equal values should be returned, I don't really mind so whichever makes sense / makes for the simplest query. I'll probably flip it to be DESC in real use but this is how I've written the tables.

Reference I found this post that does basically what I want (included below for permanence). So I'm interested to hear 2 improvements:

  • Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.
  • Can a count of the "grouped" values be added easily (possibly using a window function)?

So any elegant query that can achieve both of those things would be ⭐️ I couldn't find any other articles or questions about this but I feel it must be a common thing to do so any other links welcome.

SELECT w1.day, w1.rainy
FROM (SELECT
    w2.day, w2.rainy,
    lead(w2.rainy) OVER (ORDER BY w2.day DESC) as prev_rainy
    FROM weather w2
    ORDER BY w2.day DESC
) as w1
WHERE w1.rainy IS DISTINCT FROM w1.prev_rainy
ORDER BY w1.day DESC;

Best Answer

SELECT user_id, max(time) AS time, path, count(*)
FROM (
  SELECT *,
    count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
  FROM (
    SELECT *,
      CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
    FROM pageviews
  ) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;

First what you need to do is create something you can group by, for this we compute is_resets. For more information about why it's in a CASE, see this question In what case is a count(x or null) needed in Gaps and Islands?

SELECT *,
  CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
FROM pageviews
 user_id | time |    path    | is_reset 
---------+------+------------+----------
       1 |    1 | /home      |         
       1 |    3 | /profile   |        1
       1 |    4 | /profile   |         
       1 |    6 | /home      |        1
       1 |    8 | /profile   |        1
       2 |    2 | /home      |         
       2 |    5 | /dashboard |        1
       2 |    7 | /profile   |        1
(8 rows)

Then we count them to create groups.

SELECT *,
  count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
FROM (
  SELECT *,
    CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
  FROM pageviews
) AS t1
 user_id | time |    path    | is_reset | grp 
---------+------+------------+----------+-----
       1 |    1 | /home      |          |   0
       1 |    3 | /profile   |        1 |   1
       1 |    4 | /profile   |          |   1
       1 |    6 | /home      |        1 |   2
       1 |    8 | /profile   |        1 |   3
       2 |    2 | /home      |          |   0
       2 |    5 | /dashboard |        1 |   1
       2 |    7 | /profile   |        1 |   2
(8 rows)

Now we group by the user_id, and grp. We select the max(time) because you didn't specify which time to choose in your question.

SELECT user_id, max(time) AS time, path, count(*)
FROM (
  SELECT *,
    count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
  FROM (
    SELECT *,
      CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
    FROM pageviews
  ) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;

 user_id | time |    path    | count 
---------+------+------------+-------
       1 |    1 | /home      |     1
       1 |    4 | /profile   |     2
       1 |    6 | /home      |     1
       1 |    8 | /profile   |     1
       2 |    2 | /home      |     1
       2 |    5 | /dashboard |     1
       2 |    7 | /profile   |     1
(7 rows)

Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.

This problem can also be solved with dense_rank, as noted. You may want to see my answer to this question for a good write up on how to do that.