Sql-server – SQL – Finding Duplicates Across Columns

duplicationsql servertoad

I have a table that has 4 e-mail fields. A single contact can have anywhere from 1-4 e-mail addresses on file.

I'm having trouble writing a query that displays all duplicates across columns. For my example, I'll only discuss 2 email fields.

Example:

  • John Smith, Personal Email fake@fakejohn.com Work fake@fakejohn.com
  • Jane Smith, Personal Email jane@fake678.com Work NULL
  • Jane Smith, Personal Email NULL Work jane@fake678.com

In the above example, I can't find a way to say "I have 2 Jane Smiths that are duplicates" because they have the same email, but in different columns while also excluding John Smith, because while his email appears twice, I don't want to self-reference, as this is not a duplicate of itself.

Iterations of my attempt to find a cross-column duplication check have been unsuccessful but include the following:

Select First, Last, Personal
From Contacts C
where C.Personal = C.Work
  and c.ID <> c.ID -- my attempt to prevent self reference

This doesn't work, I receive 0 results, but should have thousands.

Without the c.ID <> c.ID bit, I receive too many results, as I retrieve results for non-duplicate records that simply have the same email address in both fields.

I've also added "Group By" clauses based on First & Last name and filter by greater than 1, but when checking our database, I'm still finding that I have contacts with the same name and no e-mail addresses in common with each other but have the same scenario as John Smith above.

Is there any way to achieve this? I haven't found any queries to get me started.

Best Answer

Formally you may try to use something like

WITH cte1 AS (SELECT name, email1 email FROM table
              UNION ALL
              SELECT name, email2 FROM table
              UNION ALL
              SELECT name, email3 FROM table
              UNION ALL
              SELECT name, email4 FROM table),
     cte2 AS (SELECT email
              FROM cte1
              GROUP BY email
              HAVING COUNT(DISTINCT name) = 1)
SELECT *
FROM table
LEFT JOIN cte2 ON cte2.email IN (table.email1, table.email2, table.email3, table.email4)
WHERE cte2.email IS NULL
UNION ALL
SELECT table.name, MAX(table.email1) email1, MAX(table.email2) email2, 
                   MAX(table.email3) email3, MAX(table.email4) email4
FROM table
JOIN cte2 ON cte2.email IN (table.email1, table.email2, table.email3, table.email4)
GROUP BY table.name

But I don't know what you must do with the records like

Name   email1           email2            email3           email4
John   john@gmail.com   johnX@gmail.com   -                -   
John   -                Xjohn@gmail.com   john@gmail.com   -