How to Merge Similar Records with Different Validity Dates in SQL Server

gaps-and-islandssql serversql-server-2008-r2

The table I am working on has three components:

  1. An ID column (primary key in another table)
  2. Some data columns
  3. Date valid from/to columns.

Values:

ID   Data From        To  
1    a    2015-01-01  2015-01-05
1    a    2015-01-06  2015-01-10
1    b    2015-01-11  2015-01-15
1    a    2015-01-16  2015-01-20
2    c    2015-01-01  2015-01-05
2    c    2015-01-06  2015-01-10

The table is updated by taking "snapshots" of another data source at some intervals and assigning validity dates to records. The problem is that these snapshots create duplicate entries for records (with different validity dates) that were not changed at all during that interval.

I want to reduce the size of the table by looking for rows with consecutive dates and by merging them and assigning them a single validity period. For example:

ID   Data From        To  
1    a    2015-01-01  2015-01-10
1    b    2015-01-11  2015-01-15
1    a    2015-01-16  2015-01-20
2    c    2015-01-01  2015-01-10

The logic I currently have is:

  1. Select and sort all rows by ID, data fields, and 'valid from' fields (so they are in groups of consecutive rows).
  2. Use a cursor to compare adjacent rows for similarity.
  3. If they are same, then merge rows and change validity period to include both rows.

I understand that cursors are very inefficient (I have a large dataset), so I am looking for other approaches.

Best Answer

If this is a table of back-to-back ranges only, your case can be treated as a classic "gaps and islands" problem, where you just need to isolate islands of consecutive ranges and then "condense" them by taking the minimum [from] and the maximum [to] per island.

There is an established method of solving this using two ROW_NUMBER calls:

WITH islands AS
(
  SELECT
    id,
    data,
    [from],
    [to],
    island = ROW_NUMBER() OVER (PARTITION BY id       ORDER BY [from])
           - ROW_NUMBER() OVER (PARTITION BY id, data ORDER BY [from])
  FROM
    #mergeTest
)
SELECT
  id,
  data,
  [from] = MIN([from]),
  [to]   = MAX([to])
FROM
  islands
GROUP BY
  id,
  data,
  island
;

This query will work in as low version as SQL Server 2005.