Postgresql – How to get aggregate data from a dynamic number of related rows in adjacent table

Tags: aggregate, postgresql, postgresql-9.4

EDIT: Unaware of the rule against cross-posting, I also asked this on Stack Overflow and accepted an answer there. Since there is another (fully working) answer in this thread, I won't delete it. For the solution I chose, see this thread – https://stackoverflow.com/questions/52024244/how-to-get-aggregate-data-from-a-dynamic-number-of-related-rows-in-adjacent-tabl

I have a table of matches played, roughly looking like this:

player_id | match_id | result | opponent_rank
----------------------------------------------
82        | 2847     |   w    |   42
82        | 3733     |   w    |  185
82        | 4348     |   l    |   10
82        | 5237     |   w    |  732
82        | 5363     |   w    |   83
82        | 7274     |   w    |    6
51        | 2347     |   w    |   39
51        | 3746     |   w    |  394
51        | 5037     |   l    |   90
...       | ...      |  ...   |  ...

To get all the winning streaks (not just top streak by any player), I use this query:

SELECT player.tag, s.streak, match.date, s.player_id, s.match_id FROM (
    SELECT streaks.streak, streaks.player_id, streaks.match_id FROM (
        SELECT w1.player_id, max(w1.match_id) AS match_id, count(*) AS streak FROM (
            SELECT w2.player_id, w2.match_id, w2.win, w2.date, sum(w2.grp) OVER w AS grp FROM (
                SELECT m.player_id, m.match_id, m.win, m.date, (m.win = false AND LAG(m.win, 1, true) OVER w = true)::integer AS grp FROM matches_m AS m
                WHERE m.opponent_position < 100
                    WINDOW w AS (PARTITION BY m.player_id ORDER BY m.date, m.match_id)
                    ) AS w2
                    WINDOW w AS (PARTITION BY w2.player_id ORDER BY w2.date, w2.match_id)
                ) AS w1
            WHERE w1.win = true
            GROUP BY w1.player_id, w1.grp
            ORDER BY w1.player_id DESC, count(*) DESC
        ) AS streaks
    ORDER BY streaks.streak DESC
    LIMIT 100
    ) AS s
LEFT JOIN player ON player.id = s.player_id
LEFT JOIN match ON match.id = s.match_id

And the result looks like this (note that this is not a fixed table/view, as the query above can be extended by certain parameters such as nationality, date range, ranking of players, etc):

player_id | match_id | streak
-------------------------------
82        | 3733     |  2
82        | 7274     |  3
51        | 3746     |  2
...       | ...      |  ...

What I want to add now is a bunch of aggregate data to provide details about the winning streaks. For starters, I'd like to know the average rank of the opponents during each of those streaks. Other data are the duration of the streak in time, first and last date, the name of the opponent who ended the streak or whether it's still ongoing, and so on. I've tried various things – CTEs, some elaborate joins, unions, or adding them in as lag functions in the existing code. But I'm completely stuck on how to solve this.

As is obvious from the code, my SQL skills are very basic, so please excuse any mistakes or inefficient statements. Also new to DBA so let me know if my question can be phrased better. For complete context, I'm using Postgres 9.4 on Debian, the matches_m table is a materialized view with 550k lines (query takes 2.5s right now). The data comes from http://aligulac.com/about/db/, I just mirror it to create the aforementioned view.

Best Answer

For a beginner in SQL you sure picked a fairly advanced concept to start with ;-) Your question boils down to what is known as a gaps-and-islands problem. I find it easiest to handle these kinds of problems by creating a group for consecutive events. One trick to accomplish this is to use two enumerations and calculate the difference:

row_number() over ( partition by player_id
                    order by match_id )

row_number() over ( partition by player_id, result
                    order by match_id )

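If the difference between the first and the second enumeration changes, a change in result occurred; rows of one player with the same result and the same difference (grp) belong to one consecutive run. Since you will probably use this grp for several things, I put it in a CTE: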

with t as (
  select player_id, match_id, result, opponent_rank
       , row_number() over ( partition by player_id
                             order by match_id )
         -
         row_number() over ( partition by player_id, result
                             order by match_id ) as grp
  from matches
)
select player_id, match_id, result, opponent_rank, grp
     , count(1) over (partition by player_id, grp
                      order by match_id) as streak
     , avg(opponent_rank) over (partition by player_id, grp) as avg_rnk                 
from t
where result = 'w' and player_id = 82
order by player_id, match_id;
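To make the grouping concrete, here is a minimal sketch that applies the same row_number() difference to the sample rows from the question and then collapses each run of wins into one row with its length and average opponent rank. It is wrapped in Python's sqlite3 module only so that it is self-contained; that harness is an assumption of this sketch and needs SQLite 3.25 or later for window functions. The SQL uses nothing SQLite-specific and should carry over to PostgreSQL 9.4. The column name last_match_id is introduced here for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (player_id INT, match_id INT, result TEXT, opponent_rank INT)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?)",
    [
        (82, 2847, "w", 42), (82, 3733, "w", 185), (82, 4348, "l", 10),
        (82, 5237, "w", 732), (82, 5363, "w", 83), (82, 7274, "w", 6),
        (51, 2347, "w", 39), (51, 3746, "w", 394), (51, 5037, "l", 90),
    ],
)

STREAKS_SQL = """
WITH t AS (
  SELECT player_id, match_id, result, opponent_rank
       , row_number() OVER (PARTITION BY player_id ORDER BY match_id)
       - row_number() OVER (PARTITION BY player_id, result ORDER BY match_id) AS grp
  FROM matches
)
SELECT player_id
     , max(match_id)      AS last_match_id  -- match that closes the streak
     , count(*)           AS streak         -- number of consecutive wins
     , avg(opponent_rank) AS avg_rnk        -- average opponent rank in the streak
FROM   t
WHERE  result = 'w'          -- grp is only unique per (player_id, result)
GROUP  BY player_id, grp
ORDER  BY player_id DESC, last_match_id
"""

streaks = conn.execute(STREAKS_SQL).fetchall()
for row in streaks:
    print(row)
```

The first three columns reproduce the player_id, match_id and streak values listed in the question; further per-streak details such as first and last date can be added as more aggregates in the same GROUP BY.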

I believe this should give you a start, so I'll stop there. Welcome to the forum btw.

Window function?

A window function (count(*) over ()) does not seem to be what you want, since you don't have unaggregated rows.
You could add to the inner subquery:

count(*) OVER ()

.. to get the count of distinct landing_path_id, which is one other possible number that might be of interest. But that doesn't seem to be what you meant by "the total number of rows from that records select".
Or you could add to the inner subquery:

sum(count(*)) OVER ()

.. to get the total count with every landing_path_id redundantly, but that would seem pointless. Just mentioning that to demonstrate it's possible to run a window function over the result of an aggregate function in a single pass. Details for that:

Updated question

Your result, just without total_count in the records subquery. Now accounting for the LIMIT in the inner SELECT. Even though a maximum of 10 distinct landing_path_id is selected, all qualifying landing_path_id are counted.
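The point about LIMIT can be checked on a few made-up rows. The sketch below is an illustration under assumptions, not the asker's data: the rows are invented, the ORDER BY is added only to make the output deterministic, and Python's sqlite3 module (SQLite 3.25 or later) is used just to keep it self-contained. It shows that count(*) OVER () is evaluated over all groups before LIMIT trims the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE report_la (profile_id INT, landing_path_id INT, entrances INT)"
)
# Invented rows: three landing paths for the profile of interest, one unrelated row.
conn.executemany(
    "INSERT INTO report_la VALUES (?, ?, ?)",
    [
        (3777614, 1, 5), (3777614, 1, 7),
        (3777614, 2, 3),
        (3777614, 3, 1), (3777614, 3, 2),
        (42, 9, 100),
    ],
)

PAGE_SQL = """
SELECT landing_path_id
     , sum(entrances)   AS entrances
     , count(*) OVER () AS total_count  -- counts groups, computed before LIMIT
FROM   report_la
WHERE  profile_id = 3777614
GROUP  BY landing_path_id
ORDER  BY landing_path_id
LIMIT  2
"""

page = conn.execute(PAGE_SQL).fetchall()
for row in page:
    print(row)
```

Only two rows come back, yet total_count reports all three qualifying landing_path_id values on each of them.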

To get both in one scan and reuse count and sum separately I introduce a CTE:

WITH cte AS (
  SELECT sum(entrances) AS entrances
       , count(*) over () AS total_count
  FROM   report_la
  WHERE  profile_id = 3777614
  GROUP  BY landing_path_id
  LIMIT  10
  )
SELECT row_to_json(selected_records)::text AS data
FROM  (   
   SELECT (SELECT total_count FROM cte LIMIT 1) AS total_count
        , array_to_json(array_agg(row_to_json(records))) AS data
   FROM  (SELECT entrances FROM cte) records
   ) selected_records;

If you don't care about the attribute name, you can have that cheaper with a subquery:

SELECT row_to_json(selected_records)::text AS data
FROM  (   
   SELECT min(total_count) AS total_count
        , array_to_json(array_agg(row_to_json(ROW(entrances)))) AS data
   FROM (
      SELECT sum(entrances) AS entrances
           , count(*) over () AS total_count  -- shouldn't show up in result
      FROM   report_la
      WHERE  profile_id = 3777614
      GROUP  BY landing_path_id
      LIMIT  10
      ) records
   ) selected_records;

You get the default attribute name f1 instead of entrances, since the ROW expression does not preserve the column name.

If you need a certain attribute name, you could cast the row to a registered type. (Ab-)using a TEMP TABLE to register my row type for the session:

CREATE TEMP TABLE rec1 (entrances bigint);

...
        , array_to_json(array_agg(row_to_json(ROW(entrances)::rec1))) AS data
...

This would be a bit faster than the CTE. Or, more verbose but without cast:

...
        , array_to_json(array_agg(row_to_json(
                   (SELECT x FROM (SELECT records.entrances) x)))) AS data
...

Detailed explanation in this related answer:

Select columns inside json_agg


PostgreSQL – Sort by Number of Related Rows in Referencing Table

Exclude users without emails

This assumes we only want users that actually have emails; users without emails are ignored. The reason I went with this assumption at first is that all your queries do that already:

LEFT JOIN emails on users.id = emails.user_id
WHERE emails.email LIKE 'a' || '%%'

By adding a WHERE condition on emails.email you effectively convert your LEFT JOIN to a plain [INNER] JOIN and exclude users without emails. Detailed explanation:

Query with LEFT JOIN not returning rows for count of 0
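A small sketch of that effect, using invented users and emails (the names and addresses are hypothetical) and Python's sqlite3 module purely as a self-contained harness. The same two statements should behave identically in PostgreSQL: the filter in WHERE drops users without a matching email, while the same filter in the join condition keeps them with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emails (id INTEGER PRIMARY KEY, user_id INT, email TEXT);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cid');
INSERT INTO emails (user_id, email) VALUES
  (1, 'a1@x.test'), (1, 'a2@x.test'),  -- ann: two matching emails
  (2, 'b1@x.test');                    -- bob: one non-matching email, cid: none
""")

# Filter in WHERE: NULL-extended rows fail the test, so this acts like an INNER JOIN.
WHERE_SQL = """
SELECT u.name, count(e.id) AS n_emails
FROM   users u
LEFT   JOIN emails e ON e.user_id = u.id
WHERE  e.email LIKE 'a%'
GROUP  BY u.id, u.name
ORDER  BY u.id
"""

# Filter in the join condition: every user survives, unmatched ones count 0.
ON_SQL = """
SELECT u.name, count(e.id) AS n_emails
FROM   users u
LEFT   JOIN emails e ON e.user_id = u.id
                    AND e.email LIKE 'a%'
GROUP  BY u.id, u.name
ORDER  BY u.id
"""

where_rows = conn.execute(WHERE_SQL).fetchall()
on_rows = conn.execute(ON_SQL).fetchall()
print(where_rows)
print(on_rows)
```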

2nd query rewritten

Your 2nd query does not work as advertised: results are not "descending by number of emails". You have to nest the result of count() in another CTE or subquery and run dense_rank() on it. You cannot nest window functions in the same query level.

SELECT u.name, e2.*
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails          
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3
ORDER  BY rnk;

This should be fastest if the predicate is selective enough (selects only a small fraction of all emails). Two window functions with rows sorted differently have their price, too.

A major point is to run the subquery on emails only - which is possible if the preliminary assumption holds.
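As a worked illustration of the nesting, the sketch below counts matching emails per user in an inner subquery and ranks users in the next level, ordering by n_emails DESC so that the user with the most matching emails gets rank 1. The data is invented, the column set is trimmed for brevity, and Python's sqlite3 module (SQLite 3.25 or later) is assumed only as a self-contained harness; the SQL is meant to be portable to PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE emails (id INTEGER PRIMARY KEY, user_id INT, email TEXT);
INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cid');
INSERT INTO emails (user_id, email) VALUES
  (1, 'a1@x.test'), (1, 'a2@x.test'),
  (2, 'a3@x.test'), (2, 'a4@x.test'), (2, 'a5@x.test'), (2, 'b1@x.test'),
  (3, 'a6@x.test');
""")

RANKED_SQL = """
SELECT u.name, e2.e_mail, e2.n_emails, e2.rnk
FROM  (
   -- level 2: rank users by the count computed one level below
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      -- level 1: attach the per-user count to every matching email
      SELECT user_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails
      FROM   emails
      WHERE  email LIKE 'a%'
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  e2.rnk < 3            -- first two users, with all their matching emails
ORDER  BY e2.rnk, e2.e_mail
"""

ranked = conn.execute(RANKED_SQL).fetchall()
for row in ranked:
    print(row)
```

Because user_id is part of the dense_rank() ordering, every user gets a distinct rank, so rnk < 3 returns exactly two users regardless of ties in n_emails.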

3rd query improved

If, on the other hand, the predicate WHERE e.email LIKE 'a' || '%' is not very selective, your 3rd query is probably faster, even if it reads from the table twice - but the second time only desired rows. Also improved:

SELECT e.user_id, u.name,
       e.id AS e_id, e.email AS e_mail, sq.n_emails
FROM  (
   SELECT user_id, count(*) AS n_emails
   FROM   emails
   WHERE  email LIKE 'a' || '%'
   GROUP  BY user_id
   ORDER  BY count(*) DESC, user_id  -- break ties
   LIMIT  2  OFFSET 0
   ) sq
JOIN   emails e USING (user_id)
JOIN   users  u ON u.id = e.user_id
WHERE  e.email LIKE 'a' || '%'
ORDER  BY sq.n_emails DESC;

Include users without emails

You can include the users table in the inner query again, similar to what you had before. But you have to pull the filter on email into the join condition!

SELECT e2.*
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT u.id AS user_id, u.name, e.id AS e_id
           , count(e.user_id) OVER (PARTITION BY u.id) AS n_emails          
      FROM   users u
      LEFT   JOIN emails e ON e.user_id = u.id
                          AND e.email LIKE 'a' || '%'  -- !!!
      ) e1
   ) e2
WHERE  rnk < 3
ORDER  BY rnk;

Which will be a bit more expensive.

Since you retrieve users with the most emails first, users without emails will rarely be in the result. To optimize performance, you could use a UNION ALL with LIMIT:

(  -- parentheses required
SELECT u.name, e2.user_id, e2.e_id, e2.e_mail, e2.n_emails
FROM  (
   SELECT *, dense_rank() OVER (ORDER BY n_emails DESC, user_id) AS rnk
   FROM  (
      SELECT user_id, id AS e_id, email AS e_mail
           , count(*) OVER (PARTITION BY user_id) AS n_emails          
      FROM   emails
      WHERE  email LIKE 'a' || '%'  -- one % is enough
      ) e1
   ) e2
JOIN   users u ON u.id = e2.user_id
WHERE  rnk < 3      -- adapt to paging!
ORDER  BY rnk
)
UNION ALL
(    
SELECT u.name, u.id AS user_id, NULL AS e_id, NULL AS e_mail, 0 AS n_emails
FROM   users       u
LEFT   JOIN emails e ON e.user_id = u.id
                    AND e.email LIKE 'a' || '%'
WHERE  e.user_id IS NULL
)
OFFSET 0      -- adapt to paging!
LIMIT  2      -- adapt to paging!

Detailed explanation:

Optimize a query on two big tables

`MATERIALIZED VIEW`

I would consider materializing the result for two reasons:

Subsequent queries are much faster.
You don't have to operate on a moving target. You speak of paging, and if users get new emails between pages, your whole sort order may be moot.

Build an MV from the 2nd query without LIMIT (REFRESH MATERIALIZED VIEW), then return the first page etc. When to refresh the MV again is a matter of policy.

Related Solutions

PostgreSQL – Return Total Number of Rows and Selected Aggregated Data