PostgreSQL – List Chronological Pageviews Without Adjacent Duplicates

Tags: gaps-and-islands, postgresql, postgresql-9.4

I've got a log of pageviews. Multiple pageviews with the same path may occur next to each other if, say, the user refreshes the page.

To see a user's journey through the site, I'd like to select that user's pageviews while eliminating adjacent duplicates for clarity.

Example simplified table (time is really a timestamp):

CREATE TABLE pageviews(user_id, time, path)
AS VALUES
  ( 1, 1, '/home' ),
  ( 2, 2, '/home' ),
  ( 1, 3, '/profile' ),
  ( 1, 4, '/profile' ),
  ( 2, 5, '/dashboard' ),
  ( 1, 6, '/home' ),
  ( 2, 7, '/profile' ),
  ( 1, 8, '/profile' );

Now assume there's going to be a WHERE user_id = 1 clause and ORDER BY time ASC so we have:

( 1, 1, '/home' ),
( 1, 3, '/profile' ),
( 1, 4, '/profile' ),
( 1, 6, '/home' ),
( 1, 8, '/profile' );

But what I'd like is the duplicate adjacent /profile to be ignored like so (a count of the "grouped" paths would be nice too):

| user_id | time | path     | count |
| ------- | ---- | -------- | ----- |
| 1       | 1    | /home    | 1     |
| 1       | 4    | /profile | 2     |
| 1       | 6    | /home    | 1     |
| 1       | 8    | /profile | 1     |

A simple GROUP BY path wouldn't do it, because /profile should still appear twice in the result: I only want to "group" adjacent equal values of path.

Also, I'm not sure whether the first or last of the equal values should be returned. I don't really mind, so whichever makes sense or makes for the simplest query. I'll probably flip it to DESC in real use, but this is how I've written the tables.

Reference: I found this post that does basically what I want (its query is included below for permanence). I'm interested in two improvements:

1. Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate the subquery, but I couldn't figure out how to do that.
2. Can a count of the "grouped" values be added easily (possibly using a window function)?

So any elegant query that achieves both of those things would be ⭐️. I couldn't find any other articles or questions about this, but I feel it must be a common thing to do, so any other links are welcome.

SELECT w1.day, w1.rainy
FROM (SELECT
    w2.day, w2.rainy,
    lead(w2.rainy) OVER (ORDER BY w2.day DESC) as prev_rainy
    FROM weather w2
    ORDER BY w2.day DESC
) as w1
WHERE w1.rainy IS DISTINCT FROM w1.prev_rainy
ORDER BY w1.day DESC;

Best Answer

SELECT user_id, max(time) AS time, path, count(*)
FROM (
  SELECT *,
    count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
  FROM (
    SELECT *,
      CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
    FROM pageviews
  ) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;

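The first thing you need is something to group by, so we compute is_reset: a flag that is 1 on every row whose path differs from the previous row's path for the same user, and NULL otherwise. It is written as a CASE without an ELSE so that non-matching rows yield NULL, because count(expression) only counts non-NULL values. For more on this, see the question In what case is a count(x or null) needed in Gaps and Islands?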

SELECT *,
  CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
FROM pageviews;

 user_id | time |    path    | is_reset 
---------+------+------------+----------
       1 |    1 | /home      |         
       1 |    3 | /profile   |        1
       1 |    4 | /profile   |         
       1 |    6 | /home      |        1
       1 |    8 | /profile   |        1
       2 |    2 | /home      |         
       2 |    5 | /dashboard |        1
       2 |    7 | /profile   |        1
(8 rows)
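To make the NULL handling concrete: lag() has no previous row for a user's first pageview, so the comparison is NULL and the CASE yields NULL rather than 1. Below is a rough Python model of that step for a single user. The function name `reset_flags` is made up for illustration, `None` stands in for SQL NULL, and the sketch assumes the paths are already in time order and are never NULL themselves.

```python
def reset_flags(paths):
    """Model of CASE WHEN path <> lag(path) THEN 1 END for one user's rows.

    `paths` must already be in time order. None stands in for SQL NULL.
    """
    flags = []
    previous = None  # lag() yields NULL on the first row
    for path in paths:
        if previous is not None and path != previous:
            flags.append(1)      # path changed: a new island starts here
        else:
            flags.append(None)   # first row or a repeat: no reset
        previous = path
    return flags


user_1 = ["/home", "/profile", "/profile", "/home", "/profile"]
print(reset_flags(user_1))  # [None, 1, None, 1, 1]
```

The printed flags line up with the is_reset column for user_id = 1 in the table above.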

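Then we take a running count of those flags. Every row in the same run of equal paths ends up with the same count, which we use as the group number grp.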

SELECT *,
  count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
FROM (
  SELECT *,
    CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
  FROM pageviews
) AS t1;

 user_id | time |    path    | is_reset | grp 
---------+------+------------+----------+-----
       1 |    1 | /home      |          |   0
       1 |    3 | /profile   |        1 |   1
       1 |    4 | /profile   |          |   1
       1 |    6 | /home      |        1 |   2
       1 |    8 | /profile   |        1 |   3
       2 |    2 | /home      |          |   0
       2 |    5 | /dashboard |        1 |   1
       2 |    7 | /profile   |        1 |   2
(8 rows)
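The running count can be modelled the same way. This is a minimal sketch for one user's flags, with `running_groups` being a hypothetical name. It assumes each user's time values are distinct: with ORDER BY and no explicit frame, the window covers all rows up to and including the current row's peers, so tied timestamps would be counted together.

```python
from itertools import accumulate


def running_groups(flags):
    """Model of count(is_reset) OVER (ORDER BY time): non-NULL flags seen so far."""
    return list(accumulate(0 if flag is None else 1 for flag in flags))


flags = [None, 1, None, 1, 1]  # is_reset values for user_id = 1
print(running_groups(flags))   # [0, 1, 1, 2, 3]
```

The two /profile rows at time 3 and 4 share group 1 because the second one contributes nothing to the count.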

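Now we group by user_id and grp. path is constant within each group, so adding it to the GROUP BY only serves to let us select it. We select max(time), the last pageview of each run, because the question leaves the choice open and this matches the time shown in the desired output; use min(time) if you want the first one instead.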

SELECT user_id, max(time) AS time, path, count(*)
FROM (
  SELECT *,
    count(is_reset) OVER (PARTITION BY user_id ORDER BY time) AS grp
  FROM (
    SELECT *,
      CASE WHEN path <> lag(path) OVER (PARTITION BY user_id ORDER BY time) THEN 1 END AS is_reset
    FROM pageviews
  ) AS t1
) AS t2
GROUP BY user_id, grp, path
ORDER BY user_id, grp;

 user_id | time |    path    | count 
---------+------+------------+-------
       1 |    1 | /home      |     1
       1 |    4 | /profile   |     2
       1 |    6 | /home      |     1
       1 |    8 | /profile   |     1
       2 |    2 | /home      |     1
       2 |    5 | /dashboard |     1
       2 |    7 | /profile   |     1
(7 rows)
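As a cross-check outside the database, the same collapsing can be sketched in Python with itertools.groupby, which only merges neighbouring items. The helper name `collapse_adjacent` is an assumption for illustration; it returns both the first and the last time of each run, since the question leaves open which one to keep.

```python
from itertools import groupby
from operator import itemgetter

# Same rows as the example table: (user_id, time, path)
pageviews = [
    (1, 1, "/home"), (2, 2, "/home"), (1, 3, "/profile"), (1, 4, "/profile"),
    (2, 5, "/dashboard"), (1, 6, "/home"), (2, 7, "/profile"), (1, 8, "/profile"),
]


def collapse_adjacent(rows):
    """Collapse runs of equal path per user into (user_id, first, last, path, count)."""
    ordered = sorted(rows, key=itemgetter(0, 1))  # by user_id, then time
    result = []
    # groupby starts a new group whenever (user_id, path) changes between
    # neighbouring rows, which is exactly the "adjacent duplicates" rule.
    for (user_id, path), run in groupby(ordered, key=itemgetter(0, 2)):
        times = [time for _, time, _ in run]
        result.append((user_id, times[0], times[-1], path, len(times)))
    return result


for row in collapse_adjacent(pageviews):
    print(row)
```

For user_id = 1 the /profile run comes out as (1, 3, 4, '/profile', 2): first time 3, last time 4, count 2, which agrees with the max(time) row in the result above.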

> Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.

This problem can also be solved with dense_rank, as noted. For a write-up on how to do that, see my answer to this question:

Solving “Gaps and Islands” with row_number() and dense_rank()?
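The linked answer is not reproduced here. As a rough idea of what a ranking-based rewrite can look like, one widely used form subtracts two row_number() sequences: the row's position among all of the user's pageviews minus its position among that user's pageviews of the same path. The difference stays constant within a run of adjacent equal paths, so it can be used as grp with only one subquery. This is a sketch of that general technique with row_number() rather than dense_rank(), not the linked answer's query. It is run through Python's sqlite3 module purely so that it is runnable without a PostgreSQL server (SQLite 3.25 or later is assumed for window functions), and the alias last_time is made up here.

```python
import sqlite3

# SQLite stands in for PostgreSQL; the SELECT uses only syntax common to both.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (user_id INTEGER, time INTEGER, path TEXT)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?, ?)",
    [
        (1, 1, "/home"), (2, 2, "/home"), (1, 3, "/profile"), (1, 4, "/profile"),
        (2, 5, "/dashboard"), (1, 6, "/home"), (2, 7, "/profile"), (1, 8, "/profile"),
    ],
)

QUERY = """
SELECT user_id, max(time) AS last_time, path, count(*) AS count
FROM (
  SELECT user_id, time, path,
    row_number() OVER (PARTITION BY user_id ORDER BY time)
    - row_number() OVER (PARTITION BY user_id, path ORDER BY time) AS grp
  FROM pageviews
) AS t
GROUP BY user_id, path, grp
ORDER BY user_id, last_time
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Note that grp is only meaningful together with path here (two different paths can share the same difference), which is why both appear in the GROUP BY.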

Related Solutions

PostgreSQL – How to Find Consecutive Free Numbers Using Window Functions

This is a gaps-and-islands problem. Assuming there are no gaps or duplicates in the same id_set set:

WITH partitioned AS (
  SELECT
    *,
    number - ROW_NUMBER() OVER (PARTITION BY id_set ORDER BY number) AS grp
  FROM atable
  WHERE status = 'FREE'
),
counted AS (
  SELECT
    *,
    COUNT(*) OVER (PARTITION BY id_set, grp) AS cnt
  FROM partitioned
)
SELECT
  id_set,
  number
FROM counted
WHERE cnt >= 3
;

Here's a SQL Fiddle demo* link for this query: http://sqlfiddle.com/#!1/a2633/1.
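The grouping trick rests on one observation: when the free numbers of a set are listed in ascending order, subtracting each number's position from its value gives the same result for every member of a consecutive run. A small Python sketch of that idea for a single id_set follows. The function name `consecutive_runs` and the sample numbers are hypothetical, and, as in the answer, duplicates are assumed not to occur.

```python
from itertools import groupby


def consecutive_runs(numbers, min_len=3):
    """Return runs of consecutive integers that have at least `min_len` members."""
    ordered = sorted(numbers)
    runs = []
    # value - position is constant within a run of consecutive integers,
    # mirroring number - ROW_NUMBER() in the query.
    positioned = enumerate(ordered, start=1)
    for _, run in groupby(positioned, key=lambda pair: pair[1] - pair[0]):
        members = [number for _, number in run]
        if len(members) >= min_len:  # same role as WHERE cnt >= 3
            runs.append(members)
    return runs


free = [1, 2, 3, 7, 9, 10, 11, 12, 20]  # made-up FREE numbers for one id_set
print(consecutive_runs(free))  # [[1, 2, 3], [9, 10, 11, 12]]
```

The positions must follow the order of the numbers themselves for the difference to be meaningful, which is the part the sort handles in this sketch.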

UPDATE

To return only one set, you could add in one more round of ranking:

WITH partitioned AS (
  SELECT
    *,
    number - ROW_NUMBER() OVER (PARTITION BY id_set ORDER BY number) AS grp
  FROM atable
  WHERE status = 'FREE'
),
counted AS (
  SELECT
    *,
    COUNT(*) OVER (PARTITION BY id_set, grp) AS cnt
  FROM partitioned
),
ranked AS (
  SELECT
    *,
    RANK() OVER (ORDER BY id_set, grp) AS rnk
  FROM counted
  WHERE cnt >= 3
)
SELECT
  id_set,
  number
FROM ranked
WHERE rnk = 1
;

Here's a demo for this one too: http://sqlfiddle.com/#!1/a2633/2.

If you ever need to make it one set per id_set, change the RANK() call like this:

RANK() OVER (PARTITION BY id_set ORDER BY grp) AS rnk

Additionally, you could make the query return the smallest matching set (i.e. first try to return the first set of exactly three consecutive numbers if it exists, otherwise four, five etc.), like this:

RANK() OVER (ORDER BY cnt, id_set, grp) AS rnk

or like this (one per id_set):

RANK() OVER (PARTITION BY id_set ORDER BY cnt, grp) AS rnk

* The SQL Fiddle demos linked in this answer use the 9.1.8 instance as the 9.2.1 one doesn't appear to be working at the moment.

PostgreSQL – Show Only Duplicate Values

That can be done with a simple SELECT statement.

SELECT count(*) AS count_, favorite_color_
FROM person_ 
GROUP BY favorite_color_
HAVING count(*) > 1
ORDER BY count_ DESC
;

count(*) is an aggregate function. Here it returns the number of rows in each group formed by the GROUP BY clause, that is, how many rows share each favorite_color_ value.

To omit the values without any duplicates, add the HAVING clause. HAVING is similar to WHERE, but it filters the grouped rows after aggregation, whereas WHERE filters individual rows before they are grouped.

To sort the results with the most frequent values at the top, use the ORDER BY clause. It references the column alias count_ that we created in the first line with AS.
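The group, filter, sort sequence can be mirrored in a few lines of Python, which may help when checking what the query should return. The color values below are made up for illustration.

```python
from collections import Counter

# Hypothetical favorite_color_ values, one per person_ row
colors = ["blue", "red", "blue", "green", "red", "blue"]

counts = Counter(colors)  # GROUP BY favorite_color_ with count(*)
duplicates = [
    (count, color)
    for color, count in counts.most_common()  # ORDER BY count_ DESC
    if count > 1                              # HAVING count(*) > 1
]
print(duplicates)  # [(3, 'blue'), (2, 'red')]
```

green occurs only once, so it is dropped by the count > 1 condition, just as HAVING drops groups of one row.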
