The general name for this type of query is "gaps and islands". One approach is below. If the source data can contain duplicates, you might need dense_rank rather than row_number.
WITH DATA(C) AS
(
    SELECT 724 UNION ALL
    SELECT 727 UNION ALL
    SELECT 728 UNION ALL
    SELECT 729 UNION ALL
    SELECT 735 UNION ALL
    SELECT 737 UNION ALL
    SELECT 743 UNION ALL
    SELECT 744 UNION ALL
    SELECT 747 UNION ALL
    SELECT 749
), T1 AS
(
    SELECT C,
           C - ROW_NUMBER() OVER (ORDER BY C) AS Grp
    FROM   DATA
)
SELECT C,
       ROW_NUMBER() OVER (PARTITION BY Grp ORDER BY C) AS Consecutive
FROM   T1
Returns
C           Consecutive
----------- --------------------
724         1
727         1
728         2
729         3
735         1
737         1
743         1
744         2
747         1
749         1
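To make the arithmetic behind the query easier to follow, here is a minimal Python sketch of the same idea. It is not part of the original answer: the helper names island_keys and number_consecutive are made up for illustration, and the dense flag is only an approximation of what switching to DENSE_RANK would do for duplicate values.

```python
from itertools import groupby


def island_keys(values, dense=False):
    """Pair each value with its group key: value minus its rank in sort order."""
    ordered = sorted(values)
    if dense:
        # DENSE_RANK analogue: equal values share one rank, so duplicates
        # stay in the same island.
        rank_of = {v: r for r, v in enumerate(sorted(set(ordered)), start=1)}
        return [(v, v - rank_of[v]) for v in ordered]
    # ROW_NUMBER analogue: every row gets its own 1-based number.
    return [(v, v - rn) for rn, v in enumerate(ordered, start=1)]


def number_consecutive(values):
    """Return (value, position within its run of consecutive values)."""
    result = []
    for _, run in groupby(island_keys(values), key=lambda pair: pair[1]):
        for position, (value, _) in enumerate(run, start=1):
            result.append((value, position))
    return result


data = [724, 727, 728, 729, 735, 737, 743, 744, 747, 749]
rows = number_consecutive(data)
```

For the sample data, rows pairs each value with the same Consecutive number as in the result listed above. With duplicates such as [1, 1, 2, 3], the row-number keys differ between the two 1s, while the dense variant gives every row the same key.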
In Postgres (tested with v9.3) you can use the dedicated inet data type to store IPv4 addresses in only 7 bytes (or IPv6 in 19 bytes), with automatic integrity checks, dedicated functions, type casts etc.
Schema
The translated schema could look like this:
CREATE TABLE log (
id serial PRIMARY KEY
, dst_port int
, src_ip inet
, dst_ip inet
);
CREATE INDEX ON log (dst_port);
CREATE INDEX ON log (src_ip);
I moved dst_port int to the 2nd position to optimize alignment / padding.
Now we can use standard window functions (not available in MySQL before version 8.0).
Step 1: Fold groups of consecutive dst_ip for same (dst_port)
One special difficulty: The aggregate functions min() / max() are not yet implemented for inet in Postgres 9.4. Both are in the upcoming Postgres 9.5!
So I substituted with DISTINCT ON in the first step:
SELECT DISTINCT ON (dst_port, ip_grp)
       dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM  (
   SELECT dst_ip, dst_port
        , dst_ip - row_number() OVER (PARTITION BY dst_port
                                      ORDER BY dst_ip) AS ip_grp
   FROM   log
   ORDER  BY dst_port, dst_ip
   ) sub
ORDER  BY dst_port, ip_grp, dst_ip;
Result as desired - with a count of rows (could be upper IP as well).
You can subtract/add an integer from/to the inet type. By subtracting the row_number(), all consecutive rows get the same ip_grp - the value of ip_grp is irrelevant, just the fact that it's the same per partition (dst_port).
Then we can GROUP BY ... - or in this special case DISTINCT ON (dst_port, ip_grp). I use another window function to get the count ip_ct in the same step: count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct.
Note that consecutive IPs can cross byte boundaries (see my comment to question).
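As a rough illustration of why subtracting the row number works across byte boundaries, here is a small Python sketch using the standard ipaddress module, which also allows integer arithmetic on addresses. The sample addresses are invented for this example; it mimics the Postgres expression rather than reproducing it.

```python
from ipaddress import IPv4Address

# Hypothetical sample: three consecutive addresses spanning the
# 192.168.0.x / 192.168.1.x boundary, plus one isolated address.
ips = sorted(
    IPv4Address(text)
    for text in ["192.168.0.254", "192.168.0.255", "192.168.1.0", "192.168.1.5"]
)

# Analogue of dst_ip - row_number(): address minus its 1-based position.
keys = [ip - rn for rn, ip in enumerate(ips, start=1)]

for ip, key in zip(ips, keys):
    print(ip, "->", key)
```

The first three addresses map to the same key even though the third octet changes between them, while the isolated address gets a different key.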
Detailed explanation for this technique:
Step 2: Fold groups of consecutive dst_port for same (dst_ip, ip_ct)
SELECT dst_ip, ip_ct, min(dst_port) AS dst_port, count(*) AS port_ct
FROM  (
   SELECT *, dst_port - row_number() OVER (PARTITION BY dst_ip, ip_ct
                                           ORDER BY dst_port) AS port_grp
   FROM  (
      SELECT DISTINCT ON (dst_port, ip_grp)
             dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
      FROM  (
         SELECT dst_ip, dst_port
              , dst_ip - row_number() OVER (PARTITION BY dst_port
                                            ORDER BY dst_ip) AS ip_grp
         FROM   log
         ORDER  BY dst_port, dst_ip
         ) sub1
      ORDER  BY dst_port, ip_grp, dst_ip
      ) sub2
   ) sub3
GROUP  BY 1, 2, port_grp
ORDER  BY 1, 3, 2;
Basically, repeat the same logic as in the first step, applied to the result of the first step.
But now you have to group on ip_ct additionally. And this time, you can use the simpler min(dst_port), since the port number is a plain integer.
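The two folding steps can also be sketched outside the database. The following Python version is only an illustration under stated assumptions: the helpers fold_runs and fold_log and the sample rows are hypothetical, each (dst_ip, dst_port) pair is assumed to occur once, and grouping is done with dictionaries instead of window functions.

```python
from collections import defaultdict
from ipaddress import IPv4Address
from itertools import groupby


def fold_runs(values):
    """Yield (first_value, run_length) for each run of consecutive values."""
    ordered = sorted(values)
    # Value minus its 1-based position is constant within a consecutive run.
    keyed = [(v - rn, v) for rn, v in enumerate(ordered, start=1)]
    for _, run in groupby(keyed, key=lambda pair: pair[0]):
        members = [v for _, v in run]
        yield members[0], len(members)


def fold_log(rows):
    # Step 1: per dst_port, fold consecutive dst_ip into (first_ip, ip_ct).
    ips_by_port = defaultdict(list)
    for dst_ip, dst_port in rows:
        ips_by_port[dst_port].append(IPv4Address(dst_ip))
    step1 = [
        (first_ip, ip_ct, port)
        for port, ips in ips_by_port.items()
        for first_ip, ip_ct in fold_runs(ips)
    ]

    # Step 2: per (first_ip, ip_ct), fold consecutive ports into (min_port, port_ct).
    ports_by_range = defaultdict(list)
    for first_ip, ip_ct, port in step1:
        ports_by_range[(first_ip, ip_ct)].append(port)
    step2 = sorted(
        (first_ip, ip_ct, first_port, port_ct)
        for (first_ip, ip_ct), ports in ports_by_range.items()
        for first_port, port_ct in fold_runs(ports)
    )
    return [(str(ip), ip_ct, port, port_ct) for ip, ip_ct, port, port_ct in step2]


# Hypothetical log rows: (dst_ip, dst_port)
sample = [
    ("10.0.0.1", 80), ("10.0.0.2", 80), ("10.0.0.3", 80),
    ("10.0.0.1", 81), ("10.0.0.2", 81), ("10.0.0.3", 81),
    ("10.0.0.9", 81),
    ("10.0.0.1", 443),
]
folded = fold_log(sample)
```

Each tuple in folded is (first dst_ip, ip_ct, first dst_port, port_ct). In the sample, the three-address range seen on ports 80 and 81 collapses into a single row, because step 2 only merges ports whose step 1 result has the same starting IP and the same ip_ct.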
SQL Fiddle demonstrating all.
Best Answer
I've done this in stages using CTEs so that you can see how it's done as the queries progress. Each CTE adds a column in the output in order to show you progress.
It's pretty much self-documenting with the CTE names, to be honest.
DB Fiddle Link (I added an extra row just to make sure it was working at a certain point)
Basically:
lags CTE: use LAG() to get the previous result relative to the current row.
group_changes CTE: Detect whether the previous streak, whether win or loss, has ended.
groups_numbered CTE: Give each streak a number.
expected_in_groups CTE: Number the rows in the group.
select: negate the loss streaks.
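The query itself is not reproduced here, so the following Python sketch only mirrors the stages described above. It assumes results are encoded as "W" and "L" and are already in play order; the function name signed_streaks and that encoding are illustrative choices, not taken from the answer.

```python
from itertools import accumulate


def signed_streaks(results):
    """Return the running streak length per row, negative for losses."""
    # lags: previous result relative to the current row (None for the first row).
    lags = [None] + results[:-1]
    # group_changes: 1 when the previous streak has ended, else 0.
    changes = [int(prev != cur) for prev, cur in zip(lags, results)]
    # groups_numbered: a running sum of the change flags numbers each streak.
    groups = list(accumulate(changes))
    # Number the rows within each streak, then negate the loss streaks.
    streaks = []
    position = 0
    for i, (group, cur) in enumerate(zip(groups, results)):
        position = position + 1 if i > 0 and group == groups[i - 1] else 1
        streaks.append(position if cur == "W" else -position)
    return streaks


streaks = signed_streaks(["W", "W", "L", "L", "L", "W"])
```

Here a two-game winning streak counts up 1, 2, the following three losses count -1, -2, -3, and the final win starts again at 1.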