How to Merge Similar Records with Different Validity Dates in SQL Server

gaps-and-islandssql serversql-server-2008-r2

The table I am working on has three components:

An ID column (primary key in another table)
Some data columns
Date valid from/to columns.

Values:

ID   Data From        To  
1    a    2015-01-01  2015-01-05
1    a    2015-01-06  2015-01-10
1    b    2015-01-11  2015-01-15
1    a    2015-01-16  2015-01-20
2    c    2015-01-01  2015-01-05
2    c    2015-01-06  2015-01-10

The table is updated by taking "snapshots" of another data source at some intervals and assigning validity dates to records. The problem is that these snapshots create duplicate entries for records (with different validity dates) that were not changed at all during that interval.

I want to reduce the size of the table by looking for rows with consecutive dates and by merging them and assigning them a single validity period. For example:

ID   Data From        To  
1    a    2015-01-01  2015-01-10
1    b    2015-01-11  2015-01-15
1    a    2015-01-16  2015-01-20
2    c    2015-01-01  2015-01-10

The logic I currently have is:

Select and sort all rows by ID, data fields, and 'valid from' fields (so they are in groups of consecutive rows).
Use a cursor to compare adjacent rows for similarity.
If they are same, then merge rows and change validity period to include both rows.

I understand that cursors are very inefficient (I have a large dataset), so I am looking for other approaches.

Best Answer

If this is a table of back-to-back ranges only, your case can be treated as a classic "gaps and islands" problem, where you just need to isolate islands of consecutive ranges and then "condense" them by taking the minimum [from] and the maximum [to] per island.

There is an established method of solving this using two ROW_NUMBER calls:

WITH islands AS
(
  SELECT
    id,
    data,
    [from],
    [to],
    island = ROW_NUMBER() OVER (PARTITION BY id       ORDER BY [from])
           - ROW_NUMBER() OVER (PARTITION BY id, data ORDER BY [from])
  FROM
    #mergeTest
)
SELECT
  id,
  data,
  [from] = MIN([from]),
  [to]   = MAX([to])
FROM
  islands
GROUP BY
  id,
  data,
  island
;

This query will work in as low version as SQL Server 2005.

Related Solutions

Sql-server – SQL Server: join two tables with “valid-to” dates

DECLARE @t1 TABLE (T1_key int, T1_Data char(1), T1_ValidUntil datetime)
DECLARE @t2 TABLE (T2_key int, T2_Data char(1), T2_ValidUntil datetime)

INSERT @t1 VALUES (1, 'A', '2000-01-01')
INSERT @t1 VALUES (1, 'B', '2000-06-30')
INSERT @t1 VALUES (2, 'C', '2005-05-31')
INSERT @t1 VALUES (3, 'D', '2004-12-31')
INSERT @t1 VALUES (3, 'E', '2007-04-30')
INSERT @t1 VALUES (3, 'F', '2008-01-31')

INSERT @t2 VALUES (1, 'R', '2002-03-31')
INSERT @t2 VALUES (2, 'S', '2001-06-30')
INSERT @t2 VALUES (2, 'T', '2003-02-28')
INSERT @t2 VALUES (2, 'U', '2005-05-31')
INSERT @t2 VALUES (3, 'V', '2006-09-30')
INSERT @t2 VALUES (3, 'W', '2007-06-30')

SELECT
   T1.*, T2x.*
FROM
   @t1 T1
   CROSS APPLY
   (SELECT TOP 1*
   FROM @t2 
   WHERE T1_key = T2_key AND T2_ValidUntil >= T1_ValidUntil
   ORDER BY T2_ValidUntil
   ) T2x
UNION
SELECT
   T1x.*, T2.*
FROM
   @t2 T2
   CROSS APPLY
   (SELECT TOP 1*
   FROM @t1
   WHERE T1_key = T2_key AND T1_ValidUntil >= T2_ValidUntil
   ORDER BY T1_ValidUntil
   ) T1x

MySQL – How to Group by Maximum Consecutive Row

In Postgres (tested with v9.3) you can use the dedicated inet data type, to store IPv4 addresses with only 7 bytes (or IPv6 with 19 bytes) and with automatic integrity checks and dedicated functions and type casts etc.

Schema

The translated schema could look like this:

CREATE TABLE log (
  id serial PRIMARY KEY
, dst_port int
, src_ip inet
, dst_ip inet
);
CREATE INDEX ON log (dst_port);
CREATE INDEX ON log (src_ip);

I moved to dst_port int to the 2nd position to optimize alignment / padding:

Configuring PostgreSQL for read performance

Now we can use standard window functions (not possible in MySQL).

Step 1: Fold groups of consecutive `dst_ip` for same (`dst_port`)

One special difficulty: The aggregate function min() / max() are not yet implemented for inet in Postgres 9.4. Both are in the upcoming Postgres 9.5!

So I substituted with DISTINCT ON in the first step:

Select first row in each GROUP BY group?

SELECT DISTINCT ON (dst_port, ip_grp)
       dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
FROM  (
   SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                        ORDER BY dst_ip) AS ip_grp
   FROM   log
   ORDER  BY dst_port, dst_ip
   ) sub
ORDER  BY dst_port, ip_grp, dst_ip;

Result as desired - with a count of rows (could be upper IP as well).

You can subtract/add integer from/to the inet type. By subtracting the row_number() all consecutive rows get the same grp - the value of grp is irrelevant, just the fast that it's the same per partition (dst_port).

Then we can GROUP BY ... - or in this special case DISTINCT ON dst_port, ip_grp. I use another window function to get the count ip_ct in the same step: count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct.

Note that consecutive IPs can cross byte boundaries (see my comment to question).

Detailed explanation for this technique:

Select longest continuous sequence

Step 2: Fold groups of consecutive `dst_port` for same `(dst_ip, ip_ct)`

SELECT dst_ip, ip_ct, min(dst_port) AS dst_port, count(*) AS port_ct
FROM  (
   SELECT *, dst_port - row_number() OVER (PARTITION BY dst_ip, ip_ct
                                           ORDER BY dst_port) AS port_grp
   FROM  (
      SELECT DISTINCT ON (dst_port, ip_grp)
             dst_ip, count(*) OVER (PARTITION BY dst_port, ip_grp) AS ip_ct, dst_port
      FROM  (
         SELECT dst_ip, dst_port, dst_ip - row_number() OVER (PARTITION BY dst_port
                                                              ORDER BY dst_ip) AS ip_grp
         FROM   log
         ORDER  BY dst_port, dst_ip
         ) sub1
      ORDER  BY dst_port, ip_grp, dst_ip
      ) sub2
   ) sub3
GROUP  BY 1, 2, port_grp
ORDER  BY 1, 3, 2;

Basically, repeat the same logic like in the first step, applied to the result of the first step.
But now you have to group on ip_ct additionally. And this time, you can use the simpler min(dst_port), since the port number is a plain integer.

SQL Fiddle demonstrating all.