DECLARE @t1 TABLE (T1_key int, T1_Data char(1), T1_ValidUntil datetime)
DECLARE @t2 TABLE (T2_key int, T2_Data char(1), T2_ValidUntil datetime)
INSERT @t1 VALUES (1, 'A', '2000-01-01')
INSERT @t1 VALUES (1, 'B', '2000-06-30')
INSERT @t1 VALUES (2, 'C', '2005-05-31')
INSERT @t1 VALUES (3, 'D', '2004-12-31')
INSERT @t1 VALUES (3, 'E', '2007-04-30')
INSERT @t1 VALUES (3, 'F', '2008-01-31')
INSERT @t2 VALUES (1, 'R', '2002-03-31')
INSERT @t2 VALUES (2, 'S', '2001-06-30')
INSERT @t2 VALUES (2, 'T', '2003-02-28')
INSERT @t2 VALUES (2, 'U', '2005-05-31')
INSERT @t2 VALUES (3, 'V', '2006-09-30')
INSERT @t2 VALUES (3, 'W', '2007-06-30')
SELECT
T1.*, T2x.*
FROM
@t1 T1
CROSS APPLY
(SELECT TOP 1*
FROM @t2
WHERE T1_key = T2_key AND T2_ValidUntil >= T1_ValidUntil
ORDER BY T2_ValidUntil
) T2x
UNION
SELECT
T1x.*, T2.*
FROM
@t2 T2
CROSS APPLY
(SELECT TOP 1*
FROM @t1
WHERE T1_key = T2_key AND T1_ValidUntil >= T2_ValidUntil
ORDER BY T1_ValidUntil
) T1x
In Postgres (tested with v9.3) you can use the dedicated inet
data type, to store IPv4 addresses with only 7 bytes (or IPv6 with 19 bytes) and with automatic integrity checks and dedicated functions and type casts etc.
Schema
The translated schema could look like this:
CREATE TABLE log (
id serial PRIMARY KEY
, dst_port int
, src_ip inet
, dst_ip inet
);
CREATE INDEX ON log (dst_port);
CREATE INDEX ON log (src_ip);
I moved to dst_port int
to the 2nd position to optimize alignment / padding:
Now we can use standard window functions (not possible in MySQL).
Step 1: Fold groups of consecutive dst_ip
for same (dst_port
)
One special difficulty: The aggregate function min()
/ max()
are not yet implemented for inet
in Postgres 9.4. Both are in the upcoming Postgres 9.5!
So I substituted with DISTINCT ON
in the first step:
SELECT DISTINCT ON (dst_port, ip_grp)
dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM (
SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
ORDER BY dst_ip) AS ip_grp
FROM log
ORDER BY dst_port, dst_ip
) sub
ORDER BY dst_port, ip_grp, dst_ip;
Result as desired - with a count of rows (could be upper IP as well).
You can subtract/add integer
from/to the inet
type. By subtracting the row_number()
all consecutive rows get the same grp
- the value of grp
is irrelevant, just the fast that it's the same per partition (dst_port
).
Then we can GROUP BY ...
- or in this special case DISTINCT ON dst_port, ip_grp
. I use another window function to get the count ip_ct
in the same step: count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct
.
Note that consecutive IPs can cross byte boundaries (see my comment to question).
Detailed explanation for this technique:
Step 2: Fold groups of consecutive dst_port
for same (dst_ip, ip_ct)
SELECT dst_ip, ip_ct, min(dst_port) AS dst_port, count(*) AS port_ct
FROM (
SELECT *, dst_port - row_number() OVER (PARTITION BY dst_ip, ip_ct
ORDER BY dst_port) AS port_grp
FROM (
SELECT DISTINCT ON (dst_port, ip_grp)
dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM (
SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
ORDER BY dst_ip) AS ip_grp
FROM log
ORDER BY dst_port, dst_ip
) sub1
ORDER BY dst_port, ip_grp, dst_ip
) sub2
) sub3
GROUP BY 1, 2, port_grp
ORDER BY 1, 3, 2;
Basically, repeat the same logic like in the first step, applied to the result of the first step.
But now you have to group on ip_ct
additionally. And this time, you can use the simpler min(dst_port)
, since the port number is a plain integer
.
SQL Fiddle demonstrating all.
Best Answer
If this is a table of back-to-back ranges only, your case can be treated as a classic "gaps and islands" problem, where you just need to isolate islands of consecutive ranges and then "condense" them by taking the minimum
[from]
and the maximum[to]
per island.There is an established method of solving this using two ROW_NUMBER calls:
This query will work in as low version as SQL Server 2005.