The IDENTITY
key in your FINAL_COUNTRIES
table is just fine for a PK. You've created a "synthetic key." I would also add a UQ on { OriginalCountryID, OfficeID }, to speed joins from staged data.
Another option would be to have a Countries
table in your warehouse with only unique values (e.g., "USA" just once) and then a CountriesMap
table relating OfficeID and CountryID to the ID in the warehouse table. This would help performance, as when you get to your fact tables you'll want to use a single INT
to relate to the Countries table, rather than an INT
and a VARCHAR()
- fact tables can get big fast, and it's important to keep them as narrow as you can.
This is a very straight-forward data cleanup task. Your solution is good (I'd recommend using a transaction construct as below, just to be safe) -- I've taken the same approach except I used set-based operations so it can be scaled up and automated once you come up with a better duplicate detection algorithm than the rudimentary one I provided.
/* Test setup */
CREATE TABLE [dbo].[Sites]
(
Id int NOT NULL PRIMARY KEY,
Name nvarchar(50) NULL,
Path nvarchar(100) NULL
);
CREATE TABLE [dbo].[Logs]
(
Id int NOT NULL PRIMARY KEY,
SiteId int NOT NULL FOREIGN KEY REFERENCES [dbo].[Sites](Id),
LogData nvarchar(MAX) NULL
);
INSERT INTO [dbo].[Sites](Id, Name, Path)
VALUES
(1, N'Google', N'http://www.google.com'),
(2, N'Microsoft', N'http://www.microsoft.com'),
(3, N'Yahoo', NULL),
(4, NULL, N'http://www.yahoo.com');
INSERT INTO [dbo].[Logs](Id, SiteId)
VALUES (1, 1), (2, 1), (3, 2), (4, 3), (5, 3), (6, 4);
/* Identify potential duplicates (note: very rough) */
SELECT
s1.Id AS SourceId, s1.Name AS SourceName, s1.Path AS SourcePath,
s2.Id AS DuplId, s2.Name AS DuplName, s2.Path AS DuplPath
FROM [dbo].[Sites] s1
INNER JOIN [dbo].[Sites] s2 ON
(s2.Path LIKE (N'%' + s1.Name + N'%')) AND
(s2.Id > s1.Id);
/* Merge duplicates */
DECLARE @duplicates table
(
SourceId int NOT NULL,
TargetId int NOT NULL,
PRIMARY KEY (SourceId, TargetId),
CHECK (SourceId != TargetId)
);
INSERT INTO @duplicates(SourceId, TargetId)
VALUES ; /* Edit me! */
SET XACT_ABORT ON;
BEGIN TRANSACTION;
UPDATE l
SET SiteId = d.TargetId
FROM @duplicates d
INNER JOIN [dbo].[Logs] l ON l.SiteId = d.SourceId;
UPDATE st
SET
Name = COALESCE(ss.Name, st.Name),
Path = COALESCE(ss.Path, st.Path)
FROM @duplicates d
INNER JOIN [dbo].[Sites] ss ON ss.Id = d.SourceId
INNER JOIN [dbo].[Sites] st ON st.Id = d.TargetId;
DELETE s
FROM @duplicates d
INNER JOIN [dbo].[Sites] s ON s.Id = d.SourceId;
COMMIT TRANSACTION;
/* Validate the site attributes were merged correctly */
SELECT
s.*
FROM @duplicates d
INNER JOIN [dbo].[Sites] s ON s.Id = d.TargetId;
Best Answer
Result:
If you know your strings can't contain characters like
<
,>
and&
then it should be slightly more efficient to leave out theTYPE).value
bit: