PostgreSQL Gaps and Islands – Show Current Row Win Streak

gaps-and-islands postgresql

I need to show the running win/loss streak per row in a query, so given the table below, the query should return the "expected" column. I've tried some approaches with window functions, but without success.

create table matches (player text, dt date, is_winner boolean, expected integer);
insert into matches values
('A', '2019-01-01', TRUE, 0),
('A', '2019-01-03', TRUE, 1),
('A', '2019-01-04', TRUE, 2),
('A', '2019-01-09', FALSE, 0),
('A', '2019-01-10', FALSE, -1),
('A', '2019-01-15', TRUE, 0);

player  dt          is_winner   expected
A       2019-01-01  true        0
A       2019-01-03  true        1
A       2019-01-04  true        2
A       2019-01-09  false       0
A       2019-01-10  false       -1
A       2019-01-15  true        0

The logic is:

1. Resets to 0 when winning after a loss, or losing after a win.
2. Increments after a win, but not if it's case 1.
3. Decrements after a loss, but not if it's case 1.

Any insights on how to tackle this are welcome. My last resort would be a function with a loop called by every row.

Best Answer

I've done this in stages using CTEs so that you can see how it's done as the queries progress. Each CTE adds a column in the output in order to show you progress.

It's pretty much self-documenting with the CTE names, to be honest.

with lags as (
  select player,
         dt,
         is_winner,
         lag(is_winner) OVER (partition by player ORDER BY dt ASC) as prev_is_winner,
         expected
  from matches
),
     group_changes as (
       select lags.*,
              case
                when prev_is_winner <> is_winner or prev_is_winner is null
                  then 1
                else 0
                end as is_new_group
       from lags
     ),
     groups_numbered as (
       select *,
              sum(is_new_group)
                  over (partition by player order by dt, is_winner desc) as streak_group
       from group_changes
     ),
     expected_in_groups as (
       select groups_numbered.*,
              row_number()
                  over (partition by player,streak_group
                    order by dt asc, streak_group asc) - 1 as expected_unsigned
       from groups_numbered
     )
select expected_in_groups.*,
       case
         when is_winner then expected_unsigned
         else -expected_unsigned
         end as actual
from expected_in_groups
order by player asc, dt asc;

DB Fiddle Link (I added an extra row, the loss on 2019-01-11, to check that a longer losing streak keeps counting down)

Basically:

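lags CTE: use LAG() to get the previous result relative to the current row.
group_changes CTE: flag each row whose result differs from the previous row (or that has no previous row), i.e. each row that starts a new streak, whether win or loss.
groups_numbered CTE: give each streak a number with a running sum of those flags.
expected_in_groups CTE: number the rows within each streak, starting at 0.
Final select: negate the numbers of the loss streaks.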

+--------+------------+-----------+----------------+----------+--------------+--------------+-------------------+--------+
| player |     dt     | is_winner | prev_is_winner | expected | is_new_group | streak_group | expected_unsigned | actual |
+--------+------------+-----------+----------------+----------+--------------+--------------+-------------------+--------+
| A      | 2019-01-01 | t         |                |        0 |            1 |            1 |                 0 |      0 |
| A      | 2019-01-03 | t         | t              |        1 |            0 |            1 |                 1 |      1 |
| A      | 2019-01-04 | t         | t              |        2 |            0 |            1 |                 2 |      2 |
| A      | 2019-01-09 | f         | t              |        0 |            1 |            2 |                 0 |      0 |
| A      | 2019-01-10 | f         | f              |       -1 |            0 |            2 |                 1 |     -1 |
| A      | 2019-01-11 | f         | f              |       -2 |            0 |            2 |                 2 |     -2 |
| A      | 2019-01-15 | t         | f              |        0 |            1 |            3 |                 0 |      0 |
+--------+------------+-----------+----------------+----------+--------------+--------------+-------------------+--------+
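
As a more compact variant of the same idea, the difference of two row_number() calls can serve as the streak identifier, which avoids the lag() step. This is only a sketch, not part of the answer above. It assumes the matches table from the question with at most one match per player and date; grp and streak are names chosen here for illustration.

```sql
-- Within one streak, both counters advance by 1 per row, so their
-- difference (grp) stays constant. It changes whenever the result flips.
SELECT player, dt, is_winner,
       (row_number() OVER (PARTITION BY player, is_winner, grp
                           ORDER BY dt) - 1)
       * CASE WHEN is_winner THEN 1 ELSE -1 END AS streak
FROM  (
   SELECT player, dt, is_winner,
          row_number() OVER (PARTITION BY player ORDER BY dt)
        - row_number() OVER (PARTITION BY player, is_winner ORDER BY dt) AS grp
   FROM   matches
   ) sub
ORDER  BY player, dt;
```

grp is only unique in combination with is_winner, which is why the outer window partitions by both. For the six rows in the question this yields 0, 1, 2, 0, -1, 0.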

Schema

The translated schema could look like this:

CREATE TABLE log (
  id serial PRIMARY KEY
, dst_port int
, src_ip inet
, dst_ip inet
);
CREATE INDEX ON log (dst_port);
CREATE INDEX ON log (src_ip);

I moved dst_port int to the 2nd position to optimize alignment / padding. See:

Configuring PostgreSQL for read performance

Now we can use standard window functions (not possible in MySQL before version 8.0).

Step 1: Fold groups of consecutive `dst_ip` for same (`dst_port`)

One special difficulty: the aggregate functions min() and max() are not yet implemented for inet in Postgres 9.4. Both are in the upcoming Postgres 9.5!

So I substituted with DISTINCT ON in the first step:

Select first row in each GROUP BY group?

SELECT DISTINCT ON (dst_port, ip_grp)
       dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM  (
   SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                        ORDER BY dst_ip) AS ip_grp
   FROM   log
   ORDER  BY dst_port, dst_ip
   ) sub
ORDER  BY dst_port, ip_grp, dst_ip;

Result as desired - with a count of rows (could be upper IP as well).

You can subtract an integer from, or add one to, a value of type inet. Subtracting row_number() gives all consecutive rows the same ip_grp. The value of ip_grp itself is irrelevant; what matters is the fact that it is the same for consecutive IPs within a partition (dst_port).

Then we can GROUP BY ... - or in this special case DISTINCT ON dst_port, ip_grp. I use another window function to get the count ip_ct in the same step: count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct.
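
On Postgres 9.5 or later, where min() and max() accept inet, the first step could presumably be written with a plain GROUP BY instead of DISTINCT ON. A sketch under that assumption, using the same log table and column names:

```sql
-- Assumes Postgres 9.5+ (min() for inet). One output row per group of
-- consecutive dst_ip values within the same dst_port.
SELECT min(dst_ip) AS dst_ip, count(*) AS ip_ct, dst_port
FROM  (
   SELECT dst_ip, dst_port,
          dst_ip - row_number() OVER (PARTITION BY dst_port
                                      ORDER BY dst_ip) AS ip_grp
   FROM   log
   ) sub
GROUP  BY dst_port, ip_grp
ORDER  BY dst_port, min(dst_ip);
```

Here count(*) is an ordinary aggregate rather than a window function, because GROUP BY already collapses each group to one row.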

Note that consecutive IPs can cross byte boundaries (see my comment to question).
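
To make the inet arithmetic concrete, here is a small self-contained sketch with made-up addresses (they are illustrative values, not data from the question). It shows that adding 1 carries into the next byte and that three consecutive addresses spanning a byte boundary still end up in the same group.

```sql
-- Integer arithmetic on inet carries across byte boundaries.
SELECT inet '10.0.0.255' + 1 AS next_ip;   -- 10.0.1.0

-- Consecutive addresses share one ip_grp; the gap before 10.0.1.5
-- starts a new one.
SELECT ip,
       ip - row_number() OVER (ORDER BY ip) AS ip_grp
FROM  (VALUES ('10.0.0.254'::inet)
            , ('10.0.0.255'::inet)
            , ('10.0.1.0'::inet)
            , ('10.0.1.5'::inet)) AS t(ip)
ORDER  BY ip;
-- ip_grp: 10.0.0.253, 10.0.0.253, 10.0.0.253, 10.0.1.1
```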

Detailed explanation for this technique:

Select longest continuous sequence

Step 2: Fold groups of consecutive `dst_port` for same `(dst_ip, ip_ct)`

SELECT dst_ip, ip_ct, min(dst_port) AS dst_port, count(*) AS port_ct
FROM  (
   SELECT *, dst_port - row_number() OVER (PARTITION BY dst_ip, ip_ct
                                           ORDER BY dst_port) AS port_grp
   FROM  (
      SELECT DISTINCT ON (dst_port, ip_grp)
             dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
      FROM  (
         SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                              ORDER BY dst_ip) AS ip_grp
         FROM   log
         ORDER  BY dst_port, dst_ip
         ) sub1
      ORDER  BY dst_port, ip_grp, dst_ip
      ) sub2
   ) sub3
GROUP  BY 1, 2, port_grp
ORDER  BY 1, 3, 2;

Basically, repeat the same logic as in the first step, applied to the result of the first step.
But now you have to group by ip_ct as well. And this time you can use the simpler min(dst_port), since the port number is a plain integer.
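
Stripped of the nesting, the port-folding step is the same trick on plain integers. A minimal sketch with made-up port numbers (first_port and the VALUES list are illustrative only):

```sql
-- Consecutive ports minus their row number give a constant port_grp,
-- so GROUP BY port_grp folds each run into one row.
SELECT min(port) AS first_port, count(*) AS port_ct
FROM  (
   SELECT port,
          port - row_number() OVER (ORDER BY port) AS port_grp
   FROM  (VALUES (80), (81), (82), (443), (8080), (8081)) AS t(port)
   ) sub
GROUP  BY port_grp
ORDER  BY first_port;
-- (80, 3), (443, 1), (8080, 2)
```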

SQL Fiddle demonstrating all.

Related Solutions

T-sql – Using Row_Number to find consecutive row count

MySQL – How to Group by Maximum Consecutive Row
