SQL Server – How to Exclude All Matching Rows

duplicationsql server

I have a table that has a unique ID, name, email, and some other fields. I want to match this table against another table on email, but sometimes emails are shared. The unique ID is not shared among tables, but I will input the ID into the other table when I have identified them. Name is not unique.

I'd like to build two rounds of joins. The first round I would like to match on email and append the ID, and the second round would match on ID and update fields as necessary.

Is there a way to query into the table and grab only those that have unique emails, while ignoring those that share emails? I could then append the ID and re-query those that don't have IDs and join on the name. Would that even work?

Would this be a good use of a partition? I don't understand those well.

Best Answer

Is there a way to query into the table and grab only those that have unique emails, while ignoring those that share emails?

I can think of a couple ways to do this:

1) If I restate this, I might say "Get all of the email addresses that have only 1 entry in the table, and exclude email addresses having more than one entry in the table. Use GROUP BY...HAVING to identify the non-duplicated email addresses, grouping by the address, and literally looking for a COUNT of 1 (or > 1):

--UNIQUE EMAILS
SELECT EmailAddress 
FROM TableName
GROUP BY EmailAddress
HAVING COUNT(*) = 1;

---NON-UNIQUE EMAILS
SELECT EmailAddress 
FROM TableName
GROUP BY EmailAddress
HAVING COUNT(*) > 1;

This would literally give you the list of email addresses, which you would then need to do stuff with.

2) If I restate your question another way, "Get all of the emails where there doesn't exist another row with the same email address. Use 'EXISTS' and 'NOT EXISTS' and a correlated subquery to identify rows with or without duplicates.

--UNIQUE EMAILS
SELECT t1.EmailId, t1.EmailAddress
FROM TableName t1
WHERE NOT EXISTS (SELECT 1 FROM TableName t2 
                 WHERE t1.EMailId <> t2.EmailId AND t1.EmailAddress = t2.EmailAddress);

--NON-UNIQUE EMAILS
SELECT t1.EmailId, t1.EmailAddress
FROM TableName t1
WHERE EXISTS (SELECT 1 FROM TableName t2 
                 WHERE t1.EMailId <> t2.EmailId AND t1.EmailAddress = t2.EmailAddress);

Personally, I'd go with the second approach, as I think the logic is easier to follow, and I suspect it would perform better (though, that's just a guess).