Sql-server – How to aggregate pairs of id values into groups

aggregategraphsql serversql-server-2012

I have a table with two id columns id1 and id2. Each row shows that 2 id values have been linked together. My actual data has both id columns as varchar(20) but I have used integers in the example below to illustrate the problem.

id1 id2
--- ---
1    2
1    3
3    4
3    5
6    7

I want to group together ids so that all linked ids are grouped together even if there is not a direct link between 2 id values.

id group_id
-- ----
1   1
2   1
3   1
4   1
5   1
6   2
7   2   

I would like to achieve this in sql without using cursors if possible.

All advice gratefully received.

Update: I thought it would be useful to add a bit more background on the data this represents. The data represents records for businesses sourced from different data sources

The main table looks like this

id business_name postal_address business_classification x y group_id

The main table can contain duplicates / similar records which need to be associated. The table with id1 and id2 columns is a result of de-duplication using fuzzy string matching (levenshtein, word matching etc) and spatial proximity matching. I think that this table is actually an undirected graph containing many disconnected sub graphs e.g. group_id 1 and 2.I want to join all ids in the same network and update the group_id in the main table.

I think that the question is now how to identify all disconnected sub graphs on a graph using SQL server Find All Disconnected Sub Graphs – Java

Best Answer

Well this worked for the given sample data but not the cycles scenario / may not work for a more complex data set. Thought I would post anyway as a starter for 10:

DECLARE @t TABLE ( id1 INT, id2 INT )

INSERT INTO @t VALUES
    ( 1, 2 ),
    ( 1, 3 ),
    ( 3, 4 ),
    ( 3, 5 ),
    ( 6, 7 )

    --( 1, 2 ),
    --( 2, 3 ),
    --( 3, 1 )

;WITH cte AS
(
SELECT ROW_NUMBER() OVER( ORDER BY id1 ) group_id, id1 id 
FROM @t a
    CROSS APPLY ( SELECT TOP 1 id2 FROM @t c WHERE a.id1 = c.id1 ) d 
WHERE NOT EXISTS ( SELECT * FROM @t b WHERE a.id1 = b.id2 )
  AND a.id2 = d.id2
UNION ALL
SELECT group_id, t.id2
FROM cte c
    INNER JOIN @t t ON c.id = t.id1
)
SELECT *
FROM cte
ORDER BY 1, 2