I have a table, called SITES, that has three columns, lets say it looks like this:
ID Name Path
1 Google http://www.google.com/
2 Microsoft http://www.microsoft.com/
I also have a related table called Logs that looks like this:
ID SiteID LogData
1 1 --data--
2 1 --more--
3 2 --other--
...
The SITES table gets populated both by users of the system and also occasionally some batch processes. From the batch process, I don't always have the 'name' available, so a record is created that looks like this in SITES (when searching by Path yields no result), because the batch process is mostly interested in creating data in the Log table.
ID Name Path
99 http://arstechnica.com/
The problem situation arises like this:
1) User creates new SITE, but doesn't enter the Path (doesn't know it, or in some cases it doesn't yet exist) so the SITES table looks like this:
ID Name Path
1 Google http://www.google.com/
2 Microsoft http://www.microsoft.com/
3 Yahoo
2) The batch process comes along and needs to add a log for http://yahoo.com, searches SITES by Path and does not find it, so it makes its own and the result is this:
ID Name Path
1 Google http://www.google.com/
2 Microsoft http://www.microsoft.com/
3 Yahoo
4 http://yahoo.com
My question is, how can I merge Record 3 and record 4 while preserving the referential integrity of both records? Lets say for the sake of argument, that each record has several related records in both the Log table, as well as some other tables.
I understand that it will be a manual process of identifying records that need to be updated, so let any solution assume that I have audited the list and found all of the "duplicate" records.
Best Answer
This is a very straight-forward data cleanup task. Your solution is good (I'd recommend using a transaction construct as below, just to be safe) -- I've taken the same approach except I used set-based operations so it can be scaled up and automated once you come up with a better duplicate detection algorithm than the rudimentary one I provided.