I've got a log of pageviews. Multiple pageviews with the same path may occur next to each other if say the user refreshes the page.
To see the user's journey through the site I'd like to select pageviews for that user but eliminating duplicate adjacent pageviews for clarity.
Example simplified table (time is really a timestamp):
CREATE TABLE pageviews(user_id, time, path)
AS VALUES
( 1, 1, '/home' ),
( 2, 2, '/home' ),
( 1, 3, '/profile' ),
( 1, 4, '/profile' ),
( 2, 5, '/dashboard' ),
( 1, 6, '/home' ),
( 2, 7, '/profile' ),
( 1, 8, '/profile' );
Now assume there's going to be a WHERE user_id = 1
clause and ORDER BY time ASC
so we have:
( 1, 1, '/home' ),
( 1, 3, '/profile' ),
( 1, 4, '/profile' ),
( 1, 6, '/home' ),
( 1, 8, '/profile' );
But what I'd like is the duplicate adjacent /profile
to be ignored like so (a count of the "grouped" paths would be nice too):
| user_id | time | path | count | | ------- | ---- | -------- | ----- | | 1 | 1 | /home | 1 | | 1 | 4 | /profile | 2 | | 1 | 6 | /home | 1 | | 1 | 8 | /profile | 1 |
A simple GROUP BY
wouldn't do it as /profile
is still in there twice, I only want to "group" adjacent equal values of path
.
Also I'm not sure whether the first or last of the equal values should be returned, I don't really mind so whichever makes sense / makes for the simplest query. I'll probably flip it to be DESC
in real use but this is how I've written the tables.
Reference I found this post that does basically what I want (included below for permanence). So I'm interested to hear 2 improvements:
- Can that query be simplified (for PG 9.4)? One comment mentions DENSE_RANK and WITHIN GROUP to eliminate subquery but I couldn't figure out how to do that.
- Can a count of the "grouped" values be added easily (possibly using a window function)?
So any elegant query that can achieve both of those things would be ⭐️ I couldn't find any other articles or questions about this but I feel it must be a common thing to do so any other links welcome.
SELECT w1.day, w1.rainy
FROM (SELECT
w2.day, w2.rainy,
lead(w2.rainy) OVER (ORDER BY w2.day DESC) as prev_rainy
FROM weather w2
ORDER BY w2.day DESC
) as w1
WHERE w1.rainy IS DISTINCT FROM w1.prev_rainy
ORDER BY w1.day DESC;
Best Answer
First what you need to do is create something you can group by, for this we compute
is_resets
. For more information about why it's in aCASE
, see this question In what case is a count(x or null) needed in Gaps and Islands?Then we count them to create groups.
Now we group by the user_id, and grp. We select the
max(time)
because you didn't specify which time to choose in your question.This problem can also be solved with dense_rank, as noted. You may want to see my answer to this question for a good write up on how to do that.