Postgresql – How to get aggregate data from a dynamic number of related rows in adjacent table

aggregatepostgresqlpostgresql-9.4

EDIT: Unknowing of the rule that prohibits cross-posting, I also asked this on Stackoverflow and chose an answer over there. Since there's another (fully working) answer in this thread though, I won't delete it. But for the solution I chose, check out this thread – https://stackoverflow.com/questions/52024244/how-to-get-aggregate-data-from-a-dynamic-number-of-related-rows-in-adjacent-tabl

I have a table of matches played, roughly looking like this:

player_id | match_id | result | opponent_rank
----------------------------------------------
82        | 2847     |   w    |   42
82        | 3733     |   w    |  185
82        | 4348     |   l    |   10
82        | 5237     |   w    |  732
82        | 5363     |   w    |   83
82        | 7274     |   w    |    6
51        | 2347     |   w    |   39
51        | 3746     |   w    |  394
51        | 5037     |   l    |   90
...       | ...      |  ...   |  ...

To get all the winning streaks (not just top streak by any player), I use this query:

SELECT player.tag, s.streak, match.date, s.player_id, s.match_id FROM (
    SELECT streaks.streak, streaks.player_id, streaks.match_id FROM (
        SELECT w1.player_id, max(w1.match_id) AS match_id, count(*) AS streak FROM (
            SELECT w2.player_id, w2.match_id, w2.win, w2.date, sum(w2.grp) OVER w AS grp FROM (
                SELECT m.player_id, m.match_id, m.win, m.date, (m.win = false AND LAG(m.win, 1, true) OVER w = true)::integer AS grp FROM matches_m AS m
                WHERE matches_m.opponent_position<'100'
                    WINDOW w AS (PARTITION BY m.player_id ORDER BY m.date, m.match_id)
                    ) AS w2
                    WINDOW w AS (PARTITION BY w2.player_id ORDER BY w2.date, w2.match_id)
                ) AS w1
            WHERE w1.win = true
            GROUP BY w1.player_id, w1.grp
            ORDER BY w1.player_id DESC, count(*) DESC
        ) AS streaks
    ORDER BY streaks.streak DESC
    LIMIT 100
    ) AS s
LEFT JOIN player ON player.id = s.player_id
LEFT JOIN match ON match.id = s.match_id

And the result looks like this (note that this is not a fixed table/view, as the query above can be extended by certain parameters such as nationality, date range, ranking of players, etc):

player_id | match_id | streak
-------------------------------
82        | 3733     |  2
82        | 7274     |  3
51        | 3746     |  2
...       | ...      |  ...

What I want to add now is a bunch of aggregate data to provide details about the winning streaks. For starters, I'd like to know the average rank of the opponents during each those streaks. Other data are the duration of the streak in time, first and last date, opponent name who ended the streak or if it's still ongoing, and so on. I've tried various things – CTE, some elaborate joins, unions, or adding them in as lag functions in the existing code. But I'm completely stuck how to solve this.

As is obvious from the code, my SQL skills are very basic, so please excuse any mistakes or inefficient statements. Also new to DBA so let me know if my question can be phrased better. For complete context, I'm using Postgres 9.4 on Debian, the matches_m table is a materialized view with 550k lines (query takes 2.5s right now). The data comes from http://aligulac.com/about/db/, I just mirror it to create the aforementioned view.

Best Answer

For a beginner in sql you sure picked some fairly advanced concept to start with ;-) Your question seems to boil down to what is known as island and gaps problems. I find it easiest to handle these kinds of problems by creating a group for consecutive events. One trick to accomplish this is to use two enumerations and calculate the difference:

row_number() over ( partition by player_id
                    order by match_id )

row_number() over ( partition by player_id, result
                    order by match_id )

if the difference between first and second grp changes it means that a change in result occurred. Since you will probably use this grp for several things I put it in a CTE:

with t as (
  select player_id, match_id, result, opponent_rank
       , row_number() over ( partition by player_id
                             order by match_id )
         -
         row_number() over ( partition by player_id, result
                             order by match_id ) as grp
  from matches
)
select player_id, match_id, result, opponent_rank, grp
     , count(1) over (partition by player_id, grp
                      order by match_id) as streak
     , avg(opponent_rank) over (partition by player_id, grp) as avg_rnk                 
from t
where result = 'w' and player_id = 82
order by player_id, match_id;

I believe this should give you a start, so I'll stop there. Welcome to the forum btw.