Sql-server – Merge two records in the same table, keeping foreign key relationships intact for both

mergesql serversql-server-2008-r2t-sql

I have a table, called SITES, that has three columns, lets say it looks like this:

ID  Name        Path
1   Google      http://www.google.com/
2   Microsoft   http://www.microsoft.com/

I also have a related table called Logs that looks like this:

ID  SiteID   LogData
1   1        --data--
2   1        --more--
3   2        --other--
...

The SITES table gets populated both by users of the system and also occasionally some batch processes. From the batch process, I don't always have the 'name' available, so a record is created that looks like this in SITES (when searching by Path yields no result), because the batch process is mostly interested in creating data in the Log table.

ID  Name        Path
99              http://arstechnica.com/

The problem situation arises like this:

1) User creates new SITE, but doesn't enter the Path (doesn't know it, or in some cases it doesn't yet exist) so the SITES table looks like this:

ID  Name        Path
1   Google      http://www.google.com/
2   Microsoft   http://www.microsoft.com/
3   Yahoo

2) The batch process comes along and needs to add a log for http://yahoo.com, searches SITES by Path and does not find it, so it makes its own and the result is this:

ID  Name        Path
1   Google      http://www.google.com/
2   Microsoft   http://www.microsoft.com/
3   Yahoo
4               http://yahoo.com

My question is, how can I merge Record 3 and record 4 while preserving the referential integrity of both records? Lets say for the sake of argument, that each record has several related records in both the Log table, as well as some other tables.

I understand that it will be a manual process of identifying records that need to be updated, so let any solution assume that I have audited the list and found all of the "duplicate" records.

Best Answer

This is a very straight-forward data cleanup task. Your solution is good (I'd recommend using a transaction construct as below, just to be safe) -- I've taken the same approach except I used set-based operations so it can be scaled up and automated once you come up with a better duplicate detection algorithm than the rudimentary one I provided.

/* Test setup */
CREATE TABLE [dbo].[Sites]
(
    Id int NOT NULL PRIMARY KEY,
    Name nvarchar(50) NULL,
    Path nvarchar(100) NULL
);

CREATE TABLE [dbo].[Logs]
(
    Id int NOT NULL PRIMARY KEY,
    SiteId int NOT NULL FOREIGN KEY REFERENCES [dbo].[Sites](Id),
    LogData nvarchar(MAX) NULL
);

INSERT INTO [dbo].[Sites](Id, Name, Path)
    VALUES
        (1, N'Google', N'http://www.google.com'),
        (2, N'Microsoft', N'http://www.microsoft.com'),
        (3, N'Yahoo', NULL),
        (4, NULL, N'http://www.yahoo.com');

INSERT INTO [dbo].[Logs](Id, SiteId)
    VALUES (1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 4);


/* Identify potential duplicates (note: very rough) */
SELECT
    s1.Id AS SourceId, s1.Name AS SourceName, s1.Path AS SourcePath,
    s2.Id AS DuplId, s2.Name AS DuplName, s2.Path AS DuplPath
    FROM [dbo].[Sites] s1
    INNER JOIN [dbo].[Sites] s2 ON
        (s2.Path LIKE (N'%' + s1.Name + N'%')) AND
        (s2.Id > s1.Id);


/* Merge duplicates */
DECLARE @duplicates table
(
    SourceId int NOT NULL,
    TargetId int NOT NULL,
    PRIMARY KEY (SourceId, TargetId),
    CHECK (SourceId != TargetId)
);

INSERT INTO @duplicates(SourceId, TargetId)
    VALUES ; /* Edit me! */


SET XACT_ABORT ON;

BEGIN TRANSACTION;

    UPDATE l
        SET SiteId = d.TargetId
        FROM @duplicates d
        INNER JOIN [dbo].[Logs] l ON l.SiteId = d.SourceId;

    UPDATE st
        SET
            Name = COALESCE(ss.Name, st.Name),
            Path = COALESCE(ss.Path, st.Path)
        FROM @duplicates d
        INNER JOIN [dbo].[Sites] ss ON ss.Id = d.SourceId
        INNER JOIN [dbo].[Sites] st ON st.Id = d.TargetId;

    DELETE s
        FROM @duplicates d
        INNER JOIN [dbo].[Sites] s ON s.Id = d.SourceId;

COMMIT TRANSACTION;


/* Validate the site attributes were merged correctly */
SELECT
    s.*
    FROM @duplicates d
    INNER JOIN [dbo].[Sites] s ON s.Id = d.TargetId;

Related Solutions

SQL Server Restore – Automating Database Restoration

Assuming the row in the config table is identified by WHERE DB = @DB:

UPDATE msdb.dbo.db_restconfig 
SET FilGroups_Move_To_FullQuery = STUFF((SELECT ',MOVE N''' + LogicalName 
   +  '''TO N''' + LEFT(physicalname, CHARINDEX('.', REVERSE(physicalname)) - 1)  
   + @path_name 
   + REVERSE(LEFT(REVERSE(physicalname),CHARINDEX('\', REVERSE(physicalname), 1) - 1))  
   + ''''
 FROM msdb.dbo.restorelist_files
 FOR XML PATH, TYPE).value(N'.[1]',N'nvarchar(max)'),1,1,'')
WHERE DB = @DB;

SQL Server – LEFT OUTER JOIN, Grouping, Summing, and Finding 0 or NULL

I think you may want something like the following. Note that there was an error in the SQL Fiddle, so the first query corrects the data so that the desired results are returned!

-- Correct an error in the data
-- The sub-phrases "hub" and "cap" were mapped to the original phrase "map" rather than "hub cap"
UPDATE tbl_search
SET original = 13 /* "hub cap" */
WHERE id IN (14, 15) /* "hub" and "cap" */

-- All phrases that had no results and also had no results for any of the sub-phrases
SELECT s1.phrase, COUNT(*) AS searchAttempts
FROM tbl_search s1
LEFT OUTER JOIN (
    SELECT DISTINCT original 
    FROM tbl_search
    WHERE results > 0
        AND original IS NOT NULL
) subPhrasesWithMatch
    ON subPhrasesWithMatch.original = s1.id
WHERE s1.original IS NULL /* Only original searches */
    AND s1.results = 0 /* Only searches with no results */
    AND subPhrasesWithMatch.original IS NULL /* We didn't match the join to sub-phrases that returned results */
GROUP BY s1.phrase
ORDER BY searchAttempts DESC
--phrase    searchAttempts
--foo bar   3
--map       2
--foo       1

Best Answer

Related Solutions

SQL Server Restore – Automating Database Restoration

SQL Server – LEFT OUTER JOIN, Grouping, Summing, and Finding 0 or NULL

Related Question