SQL Server – Automating Foreign Keys Update

Tags: duplication, sql server, sql server 2014

I'm on SQL Server 2014 and I need to remove duplicates from a table. The problem is: if a duplicate is referenced from other tables, those references should be updated to point to the chosen non-duplicate entity.

E.g., say I have a Person table:

+------+-------+-------+------------------------+
|  id  | fname | lname | mail                   | 
+------+-------+-------+------------------------+
| 1111 | John  | Smith | john.smith@example.com |
| 2222 | J.    | Smith | john.smith@example.com |
| 3333 | john  | smith | john.smith@example.com |
| 4444 | Smith | John  | john.smith@example.com |
+------+-------+-------+------------------------+

I need to delete 2222, 3333, 4444 and replace references around the database to these with a reference to 1111. After this operation, a unique index on mail will be created.

My approach would be to gather all the distinct mail values, build an id => [ ids ] map (e.g. 1111 => [ 2222, 3333, 4444 ]), and then use a scripting language like Perl or PHP to update every table that may reference the duplicates, setting the correct id.

Since there are thousands of users and hundreds of tables with relations to them, I wonder if an operation like this could be done directly in SQL Server, with something like:

DELETE FROM [Person] WHERE [id] IN (2222, 3333, 4444) UPDATE REFERENCES WITH 1111

Best Answer

I think this will do it.
I changed this answer because I may have misread the question.
Look at the edit history if you are looking for something different.

As for the foreign keys:

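-- Rank rows within each mail group; rn = 1 (lowest id) is the row to keep.
-- Replace [Person], fk and fkID with your own table and column names.
with cte as 
(  select id, mail 
        , row_number() over (partition by mail order by id) as rn
   from [Person] 
) 
update fk1  
   set fk1.fkID = cte1.id 
  from fk as fk1 
  join cte as cte2          -- the duplicate that is currently referenced
    on cte2.id = fk1.fkID
   and cte2.rn > 1 
  join cte as cte1          -- the surviving row with the same mail
    on cte1.mail = cte2.mail 
   and cte1.rn = 1;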

Run the statement above once for each additional table that has a foreign key to the Person table.
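With hundreds of referencing tables, writing each update by hand is tedious. The following is a minimal sketch, not part of the original answer, that materializes the duplicate-to-survivor mapping once and then uses the catalog views to generate one UPDATE statement per foreign key column. It assumes the parent table is dbo.Person with key column id, that every reference is a declared single-column foreign key, and that #PersonMap is a free temp table name. Rows with a NULL mail are left alone.

```sql
-- Assumption: parent table is dbo.Person(id, mail); #PersonMap is a hypothetical name.
-- 1. Map every duplicate id to the lowest id that shares its mail.
SELECT d.id AS dup_id, k.keep_id
INTO #PersonMap
FROM dbo.Person AS d
JOIN (SELECT mail, MIN(id) AS keep_id
      FROM dbo.Person
      GROUP BY mail) AS k
  ON k.mail = d.mail
WHERE d.id <> k.keep_id;

-- 2. Generate (but do not run) one UPDATE per foreign key column
--    that references dbo.Person(id).
SELECT 'UPDATE r SET r.' + QUOTENAME(pc.name) + ' = m.keep_id FROM '
     + QUOTENAME(OBJECT_SCHEMA_NAME(fkc.parent_object_id)) + '.'
     + QUOTENAME(OBJECT_NAME(fkc.parent_object_id))
     + ' AS r JOIN #PersonMap AS m ON m.dup_id = r.'
     + QUOTENAME(pc.name) + ';' AS update_stmt
FROM sys.foreign_key_columns AS fkc
JOIN sys.columns AS pc              -- referencing (child) column
  ON pc.object_id = fkc.parent_object_id
 AND pc.column_id = fkc.parent_column_id
JOIN sys.columns AS rc              -- referenced (parent) column
  ON rc.object_id = fkc.referenced_object_id
 AND rc.column_id = fkc.referenced_column_id
WHERE fkc.referenced_object_id = OBJECT_ID(N'dbo.Person')
  AND rc.name = N'id';
```

Review the generated statements before running them in the same session (so #PersonMap is still visible), ideally inside a transaction. References that are not declared as foreign keys will not be found this way, and a referencing table with a unique constraint on the foreign key column may need its own conflicting rows merged first.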

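-- Once no foreign key points at a duplicate, delete every row except rn = 1.
-- "delete *" from a derived table is not valid T-SQL; delete through a CTE instead.
with cte as 
(  select id, mail 
        , row_number() over (partition by mail order by id) as rn
   from [Person] 
) 
delete from cte 
where rn > 1;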
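The question mentions creating a unique index on mail afterwards. A small sketch of that last step, assuming the table is dbo.Person and using a hypothetical index name, could look like this:

```sql
-- Sanity check: this should return no rows once the duplicates are gone.
SELECT mail, COUNT(*) AS cnt
FROM dbo.Person
GROUP BY mail
HAVING COUNT(*) > 1;

-- UX_Person_mail is a hypothetical name.
-- A unique index allows at most one NULL mail; if several rows have no mail,
-- a filtered index (add: WHERE mail IS NOT NULL) is one alternative.
CREATE UNIQUE INDEX UX_Person_mail ON dbo.Person (mail);
```

Running the foreign key updates, the delete, and the index creation inside one transaction keeps the database consistent if any step fails.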

Related Solutions

Sql-server – Replace cursor with set-based approach

I would probably just do this the brute-force way, and add indexes to support these joins where they don't already exist. There is not much to gain from treating new customers and old customers differently once you've inserted all the customers that don't already exist:

INSERT dbo.Customer(fname, lname, address, city, state, zip, email)
    SELECT fname, lname, address, city, state, zip, email
     FROM dbo.job AS j
     WHERE job_no = @job_no
     AND NOT EXISTS
     (
        SELECT 1 FROM dbo.Customer
        WHERE fname = j.fname
        AND lname = j.lname
        AND (address = j.address OR email = j.email)
     );

INSERT INTO dbo.personal_code (customer_id, mailing_id, personal_code, email)
SELECT c.customer_id, j.mailing_id, j.personal_code, c.email
  FROM dbo.Customer AS c
  INNER JOIN dbo.job AS j
  ON c.fname = j.fname AND c.lname = j.lname 
  AND (c.address = j.address OR c.email = j.email)
  WHERE j.job_no = @job_no;

INSERT dbo.personal_code_extra(personal_code_id, extra)
SELECT pc.personal_code_id, j.extra
  FROM dbo.personal_code AS pc
  INNER JOIN dbo.Customer AS c
  ON pc.customer_id = c.customer_id
  INNER JOIN dbo.job AS j
  ON c.fname = j.fname AND c.lname = j.lname 
  AND (c.address = j.address OR c.email = j.email)
  WHERE j.job_no = @job_no;
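The supporting indexes mentioned above might look like the following sketch. The index names are hypothetical, and the right key columns depend on the actual data types and sizes, which the original does not show.

```sql
-- Hypothetical index names; skip any index that already exists in equivalent form.

-- Supports the WHERE job_no = @job_no filter on dbo.job.
CREATE INDEX IX_job_job_no ON dbo.job (job_no);

-- Supports the name match between dbo.Customer and dbo.job.
CREATE INDEX IX_Customer_lname_fname ON dbo.Customer (lname, fname);

-- Supports the join from dbo.personal_code back to dbo.Customer.
CREATE INDEX IX_personal_code_customer_id ON dbo.personal_code (customer_id);
```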

Sql-server – How to increase sql server performance? denormalization (making cascade delete off) vs making duplicate data

How big is this database? How many rows are in each table? Etc?

I would say that normalized data is the default state to aim for. It gives you a leaner database: rows are shorter, and indexes may be used more effectively.

One of the major accelerators of performance is memory. If you can get your 5 tables to remain cached in memory, that will be a performance accelerator for your queries since you will avoid much of the disk I/O overhead.

You identify that you are joining with IDs (which are usually integers), so your indexes may be narrow and offer relatively inexpensive joins.

If you decide to denormalize, your tables will be bigger because they are carrying more redundant data on every row. This causes a need for more memory to keep the data in cache and will require even more I/O when the cache is insufficient to buffer the data. (And your backups are bigger.)

In addition, you have taken on the task of denormalizing and of maintaining the denormalized data. This is an extra load on the programmers and on the server as well: it consumes memory, I/O, and CPU.

But sometimes denormalization is the best choice. Data Warehouses, for example, are largely denormalized data. Also, you may find that in your system the benefits of denormalization may exceed the cost.

Still, you are asking a forum for an answer.

Your best answer would come from building a normalized test case and seeing how it performs. Even though you may not have a lot of 'real data', you should generate a fairly large data set, in the millions of rows, to test with.

You can try to find a tool that does it for you (RedGate has one for example, but it is not free) or generate the data yourself so that you control the complexity. There are online sources of states, cities, et cetera, and you can make up call centers, generate landline numbers and so forth.
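If you generate the data yourself, a common approach is to cross join a system view with itself to get a large row source. This is only a sketch: the #TestCalls table, its columns, and the number of call centers are made up for illustration.

```sql
-- Hypothetical test table: one million calls spread over 50 call centers.
CREATE TABLE #TestCalls
(
    call_id        int         NOT NULL PRIMARY KEY,
    call_center_id int         NOT NULL,
    phone          varchar(12) NOT NULL
);

INSERT #TestCalls (call_id, call_center_id, phone)
SELECT TOP (1000000)
       n,
       n % 50 + 1,                                           -- call center 1..50
       '555-' + RIGHT('0000000' + CAST(n AS varchar(7)), 7)  -- fake landline number
FROM (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_objects AS a
      CROSS JOIN sys.all_objects AS b) AS nums
ORDER BY n;
```

Real data is rarely this uniform, so skewing the distribution (for example, giving a few call centers most of the calls) makes the test more representative.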

Then try it.

If you do not like the performance, then create a denormalized table to test. And put some effort into writing the code to maintain the denormalization, since that will become integral to your process.
