Sql-server – Merge two records in the same table, keeping foreign key relationships intact for both

mergesql serversql-server-2008-r2t-sql

I have a table, called SITES, that has three columns, lets say it looks like this:

ID  Name        Path
1   Google      http://www.google.com/
2   Microsoft   http://www.microsoft.com/

I also have a related table called Logs that looks like this:

ID  SiteID   LogData
1   1        --data--
2   1        --more--
3   2        --other--
...

The SITES table gets populated both by users of the system and also occasionally some batch processes. From the batch process, I don't always have the 'name' available, so a record is created that looks like this in SITES (when searching by Path yields no result), because the batch process is mostly interested in creating data in the Log table.

ID  Name        Path
99              http://arstechnica.com/

The problem situation arises like this:

1) User creates new SITE, but doesn't enter the Path (doesn't know it, or in some cases it doesn't yet exist) so the SITES table looks like this:

ID  Name        Path
1   Google      http://www.google.com/
2   Microsoft   http://www.microsoft.com/
3   Yahoo

2) The batch process comes along and needs to add a log for http://yahoo.com, searches SITES by Path and does not find it, so it makes its own and the result is this:

ID  Name        Path
1   Google      http://www.google.com/
2   Microsoft   http://www.microsoft.com/
3   Yahoo
4               http://yahoo.com

My question is, how can I merge Record 3 and record 4 while preserving the referential integrity of both records? Lets say for the sake of argument, that each record has several related records in both the Log table, as well as some other tables.

I understand that it will be a manual process of identifying records that need to be updated, so let any solution assume that I have audited the list and found all of the "duplicate" records.

Best Answer

This is a very straight-forward data cleanup task. Your solution is good (I'd recommend using a transaction construct as below, just to be safe) -- I've taken the same approach except I used set-based operations so it can be scaled up and automated once you come up with a better duplicate detection algorithm than the rudimentary one I provided.

/* Test setup */
CREATE TABLE [dbo].[Sites]
(
    Id int NOT NULL PRIMARY KEY,
    Name nvarchar(50) NULL,
    Path nvarchar(100) NULL
);

CREATE TABLE [dbo].[Logs]
(
    Id int NOT NULL PRIMARY KEY,
    SiteId int NOT NULL FOREIGN KEY REFERENCES [dbo].[Sites](Id),
    LogData nvarchar(MAX) NULL
);

INSERT INTO [dbo].[Sites](Id, Name, Path)
    VALUES
        (1, N'Google', N'http://www.google.com'),
        (2, N'Microsoft', N'http://www.microsoft.com'),
        (3, N'Yahoo', NULL),
        (4, NULL, N'http://www.yahoo.com');

INSERT INTO [dbo].[Logs](Id, SiteId)
    VALUES (1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 4);


/* Identify potential duplicates (note: very rough) */
SELECT
    s1.Id AS SourceId, s1.Name AS SourceName, s1.Path AS SourcePath,
    s2.Id AS DuplId, s2.Name AS DuplName, s2.Path AS DuplPath
    FROM [dbo].[Sites] s1
    INNER JOIN [dbo].[Sites] s2 ON
        (s2.Path LIKE (N'%' + s1.Name + N'%')) AND
        (s2.Id > s1.Id);


/* Merge duplicates */
DECLARE @duplicates table
(
    SourceId int NOT NULL,
    TargetId int NOT NULL,
    PRIMARY KEY (SourceId, TargetId),
    CHECK (SourceId != TargetId)
);

INSERT INTO @duplicates(SourceId, TargetId)
    VALUES ; /* Edit me! */


SET XACT_ABORT ON;

BEGIN TRANSACTION;

    UPDATE l
        SET SiteId = d.TargetId
        FROM @duplicates d
        INNER JOIN [dbo].[Logs] l ON l.SiteId = d.SourceId;

    UPDATE st
        SET
            Name = COALESCE(ss.Name, st.Name),
            Path = COALESCE(ss.Path, st.Path)
        FROM @duplicates d
        INNER JOIN [dbo].[Sites] ss ON ss.Id = d.SourceId
        INNER JOIN [dbo].[Sites] st ON st.Id = d.TargetId;

    DELETE s
        FROM @duplicates d
        INNER JOIN [dbo].[Sites] s ON s.Id = d.SourceId;

COMMIT TRANSACTION;


/* Validate the site attributes were merged correctly */
SELECT
    s.*
    FROM @duplicates d
    INNER JOIN [dbo].[Sites] s ON s.Id = d.TargetId;