Sql-server – How to aggregate pairs of id values into groups

aggregategraphsql serversql-server-2012

I have a table with two id columns id1 and id2. Each row shows that 2 id values have been linked together. My actual data has both id columns as varchar(20) but I have used integers in the example below to illustrate the problem.

I want to group together ids so that all linked ids are grouped together even if there is not a direct link between 2 id values.

id group_id
-- ----
1   1
2   1
3   1
4   1
5   1
6   2
7   2

I would like to achieve this in sql without using cursors if possible.

All advice gratefully received.

Update: I thought it would be useful to add a bit more background on the data this represents. The data represents records for businesses sourced from different data sources

The main table looks like this

id business_name postal_address business_classification x y group_id

The main table can contain duplicates / similar records which need to be associated. The table with id1 and id2 columns is a result of de-duplication using fuzzy string matching (levenshtein, word matching etc) and spatial proximity matching. I think that this table is actually an undirected graph containing many disconnected sub graphs e.g. group_id 1 and 2.I want to join all ids in the same network and update the group_id in the main table.

I think that the question is now how to identify all disconnected sub graphs on a graph using SQL server Find All Disconnected Sub Graphs – Java

Best Answer

Well this worked for the given sample data but not the cycles scenario / may not work for a more complex data set. Thought I would post anyway as a starter for 10:

DECLARE @t TABLE ( id1 INT, id2 INT )

INSERT INTO @t VALUES
    ( 1, 2 ),
    ( 1, 3 ),
    ( 3, 4 ),
    ( 3, 5 ),
    ( 6, 7 )

    --( 1, 2 ),
    --( 2, 3 ),
    --( 3, 1 )

;WITH cte AS
(
SELECT ROW_NUMBER() OVER( ORDER BY id1 ) group_id, id1 id 
FROM @t a
    CROSS APPLY ( SELECT TOP 1 id2 FROM @t c WHERE a.id1 = c.id1 ) d 
WHERE NOT EXISTS ( SELECT * FROM @t b WHERE a.id1 = b.id2 )
  AND a.id2 = d.id2
UNION ALL
SELECT group_id, t.id2
FROM cte c
    INNER JOIN @t t ON c.id = t.id1
)
SELECT *
FROM cte
ORDER BY 1, 2

Related Solutions

Sql-server – Grouping Data Using Joined Tables with Certain Distinct Values

If it is safe to assume that a single order can have only one distinct StoreID, you could resolve your issue by generating the set of distinct OrderID, StoreID pairs from orderLines and join that set instead of the table itself. That way you will not need to use DISTINCT with aggregation:

SELECT 
  `ol`.`storeID`,
  SUM(`o`.`grossValue`) AS 'Total',
  SUM(`o`.`paymentValue`) AS 'paymentTotal'
FROM
  `orders` AS `o`
LEFT JOIN
  (SELECT DISTINCT `OrderID`, `StoreID` FROM `orderLines`) AS `ol`
  ON `o`.`orderID` = `ol`.`orderID`
WHERE
  (`o`.`orderDate` BETWEEN '2015-07-07 00:00:00' AND '2015-07-07 23:59:59')
GROUP
  BY `ol`.`storeID`
;

And, as has already been mentioned, you should probably use INNER JOIN instead of the LEFT JOIN, unless you have orders that do not have order lines but for some reason do have a gross value and a payment (a little strange, but may be there is a reason for that), and you want them included in the results. (They will be represented as a single row with a null Store ID.)

Sql-server – Help with FOR XML EXPLICIT query

You could carve off the data you are interested in into a temp table and work with it there. I've used FOR XML PATH in this example as it's easier to work with:

USE tempdb
GO

IF OBJECT_ID('[dbo].[AUDIT]') IS NOT NULL DROP TABLE [dbo].[AUDIT]
GO

CREATE TABLE [dbo].[AUDIT](

    [UNIQUE_ID] [int], --IDENTITY(1,1) NOT NULL,
    [TABLE_NAME] [nvarchar](50) NOT NULL,
    [TABLE_UNIQUE_ID] [int] NOT NULL,
    [TABLE_USER_ID] [int] NOT NULL,
    [XML] [nvarchar](MAX) NULL,
    [DESCRIPTION] [nvarchar](MAX) NULL,
    [TIMESTAMP] [datetime] NOT NULL,
    [EXPLANATION] [nvarchar](MAX) NULL,
    [DATE_INDEX] [datetime] NOT NULL CONSTRAINT [DF_SYS_ADUIT_DATE_INDEX] DEFAULT ('01/01/1900')

PRIMARY KEY CLUSTERED 
(
    [UNIQUE_ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO



INSERT INTO [dbo].[AUDIT] ( UNIQUE_ID, TABLE_NAME, TABLE_UNIQUE_ID, TABLE_USER_ID, XML, DESCRIPTION, TIMESTAMP, EXPLANATION, DATE_INDEX )
VALUES
    ( 848098, 'MAIN', 500187, 317, '<Record><Data>Original record data and other xml goes here, not important right now</Data></Record>', 'Record was created', '8 Aug 2011 9:45', NULL, '1 Jan 1900' ),
    ( 849901, 'MAIN', 500187, 317, '<Record/>', 'Record locked', '8 Aug 2011 10:01', NULL, '1 Jan 1900' ),
    ( 854269, 'MAIN', 500187, 317, '<Record/>', 'Record unlocked', '8 Aug 2011 10:51', NULL, '3 Aug 2011' ),
    ( 911382, 'MAIN', 500187, 317, '<Record><Data>Change record data and other xml goes here, not important right now</Data></Record>', 'Record changes:', '8 Aug 2011 10:52', NULL, '8 Aug 2011' ),
    ( 927795, 'MAIN', 500187, 317, '<Record/>', 'Record locked', '15 Aug 2011 13:11', NULL, '1 Jan 1900' ),
    ( 931013, 'MAIN', 500187, 317, '<Record><Data>Current record data and other xml goes here, not important right now</Data></Record>', 'Record was exported', '15 Aug 2011 14:13', NULL, '1 Jan 1900' )
GO


IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp

DECLARE @from DATETIME = '8 Aug 2011 00:00', @through DATETIME = '8 Aug 2011 23:59:59:999', @currentRecord int

-- Get all records for that id and date range into a temp table
SELECT
    UNIQUE_ID, 
    TABLE_USER_ID, 
    TABLE_UNIQUE_ID,
    [TIMESTAMP],
    DATE_INDEX,
    [XML],
    [DESCRIPTION]
INTO #tmp
FROM dbo.[AUDIT]
WHERE DATE_INDEX >= @from 
AND DATE_INDEX <= @through
--AND presumably some other criteria to limit the records ...
AND TABLE_UNIQUE_ID = 500187
AND TABLE_USER_ID = 317

-- Add unique index and RowNum column
ALTER TABLE #tmp ADD PRIMARY KEY ( UNIQUE_ID )
ALTER TABLE #tmp ADD RowNum INT

-- Get any associated records (date criteria not used), which don't already exist
INSERT INTO #tmp ( UNIQUE_ID, TABLE_USER_ID, TABLE_UNIQUE_ID, TIMESTAMP, DATE_INDEX, XML, DESCRIPTION )
SELECT
    UNIQUE_ID, 
    TABLE_USER_ID, 
    TABLE_UNIQUE_ID,
    [TIMESTAMP],
    DATE_INDEX,
    [XML],
    [DESCRIPTION]
FROM dbo.[AUDIT] a
WHERE EXISTS ( SELECT * FROM #tmp t WHERE a.TABLE_UNIQUE_ID = t.TABLE_UNIQUE_ID AND a.TABLE_USER_ID = t.TABLE_USER_ID )
  AND NOT EXISTS ( SELECT * FROM #tmp u WHERE a.UNIQUE_ID = u.UNIQUE_ID )


-- Add the row number
;WITH cte AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY TABLE_USER_ID, TABLE_UNIQUE_ID ORDER BY UNIQUE_ID ) x, *  
FROM #tmp
)
UPDATE cte
SET RowNum = x


-- Have a look at the data before we create the xml
SELECT 't' s, * FROM #tmp
GO


SELECT
    (
    SELECT
        UNIQUE_ID AS AUDITID,
        TABLE_USER_ID AS [UID],
        TABLE_UNIQUE_ID AS UNIQUEID,
        CAST( [XML] AS XML ) AS "CURRENTDATA/*"
    FROM #tmp
    WHERE RowNum = ( SELECT MAX(RowNum) FROM #tmp )
    FOR XML PATH(''), TYPE
    ),
    (
    SELECT
        [TIMESTAMP] AS [TIMESTAMP],
        [DESCRIPTION] AS [DESCRIPTION],
        '' AS EXPLANATION
    FROM #tmp
    WHERE RowNum > 1
      --AND RowNum = ( SELECT MAX(RowNum) FROM #tmp )   -- ??
    FOR XML PATH('HISTORY'), TYPE
    ),
    (
    SELECT
        CAST( [XML] AS XML ) AS "DATA/*"
    FROM #tmp
    WHERE RowNum = 1
    FOR XML PATH('ORIGINAL'), TYPE
    )

FOR XML PATH('RECORD'), TYPE
GO

Best Answer

Related Solutions

Sql-server – Grouping Data Using Joined Tables with Certain Distinct Values

Sql-server – Help with FOR XML EXPLICIT query

Related Question