Sql-server – SQL – Finding Duplicates Across Columns

duplicationsql servertoad

I have a table that has 4 e-mail fields. A single contact can have anywhere from 1-4 e-mail addresses on file.

I'm having trouble writing a query that displays all duplicates across columns. For my example, I'll only discuss 2 email fields.

Example:

John Smith, Personal Email fake@fakejohn.com Work fake@fakejohn.com
Jane Smith, Personal Email jane@fake678.com Work NULL
Jane Smith, Personal Email NULL Work jane@fake678.com

In the above example, I can't find a way to say "I have 2 Jane Smiths that are duplicates" because they have the same email, but in different columns while also excluding John Smith, because while his email appears twice, I don't want to self-reference, as this is not a duplicate of itself.

Iterations of my attempt to find a cross-column duplication check have been unsuccessful but include the following:

Select First, Last, Personal
From Contacts C
where C.Personal = C.Work
  and c.ID <> c.ID -- my attempt to prevent self reference

This doesn't work, I receive 0 results, but should have thousands.

Without the c.ID <> c.ID bit, I receive too many results, as I retrieve results for non-duplicate records that simply have the same email address in both fields.

I've also added "Group By" clauses based on First & Last name and filter by greater than 1, but when checking our database, I'm still finding that I have contacts with the same name and no e-mail addresses in common with each other but have the same scenario as John Smith above.

Is there any way to achieve this? I haven't found any queries to get me started.

Best Answer

Formally you may try to use something like

WITH cte1 AS (SELECT name, email1 email FROM table
              UNION ALL
              SELECT name, email2 FROM table
              UNION ALL
              SELECT name, email3 FROM table
              UNION ALL
              SELECT name, email4 FROM table),
     cte2 AS (SELECT email
              FROM cte1
              GROUP BY email
              HAVING COUNT(DISTINCT name) = 1)
SELECT *
FROM table
LEFT JOIN cte2 ON cte2.email IN (table.email1, table.email2, table.email3, table.email4)
WHERE cte2.email IS NULL
UNION ALL
SELECT table.name, MAX(table.email1) email1, MAX(table.email2) email2, 
                   MAX(table.email3) email3, MAX(table.email4) email4
FROM table
JOIN cte2 ON cte2.email IN (table.email1, table.email2, table.email3, table.email4)
GROUP BY table.name

But I don't know what you must do with the records like

Name   email1           email2            email3           email4
John   john@gmail.com   johnX@gmail.com   -                -   
John   -                Xjohn@gmail.com   john@gmail.com   -

Related Solutions

Sql-server – Delete duplicates with specific columns as identifier

--=======================================================
-- delete the duplicate records from table @t
-- keeping a single unit of each
-- marcelo miorelli 24-nov-2014
--=======================================================


--=======================================================
--create a table variable and insert records in it
-- just for this example
--=======================================================
declare @t table ([num] int, com varbinary(512) )

insert into @t select 1, convert(varbinary(512), '6778981' )
insert into @t select 1, convert(varbinary(512), '6778981' )
insert into @t select 1, convert(varbinary(512), '6778982' )
insert into @t select 2, convert(varbinary(512), '6778982' )
insert into @t select 2, convert(varbinary(512), '6778982' )
insert into @t select 3, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778982' )
insert into @t select 4, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778983' )
insert into @t select 5, convert(varbinary(512), '6778984' )
insert into @t select 6, convert(varbinary(512), '6778984' )
insert into @t select 6, convert(varbinary(512), '6778984' )
insert into @t select 6, convert(varbinary(512), '6778985' )
insert into @t select 7, convert(varbinary(512), '6778985' )
insert into @t select 7, convert(varbinary(512), '6778985' )
insert into @t select 8, convert(varbinary(512), '6778985' )
insert into @t select 8, convert(varbinary(512), '6778985' )
insert into @t select 9, convert(varbinary(512), '6778985' )
insert into @t select 9, convert(varbinary(512), '6778986' )
insert into @t select 9, convert(varbinary(512), '6778986' )
insert into @t select 9, convert(varbinary(512), '6778986' )
insert into @t select 9, convert(varbinary(512), '6778986' )

--=======================================================
-- this is going to show the table before the deletes
--=======================================================
select * from @t


--=======================================================
-- this deletes the duplicates - leaving only 1 of each
--=======================================================
; with cte as (
    select *
        , row_number() over (partition by [num],com  order by [num],com ) as Picker
    from @t
    )
delete cte 
where Picker > 1


--=======================================================
-- this shows the table after the deletion - without the deletes
--=======================================================
select * from @t

SQLite3 – remove duplicates by multiple columns

Supposing you have no foreign keys referencing that table, you could do something like

create table some_table as
select min(id), name, age, x
from t
group by name, age, x

then you can drop the old table, rename the new table so that it has the same name as the old one before, and create indexes and other things you need on that table.

Best Answer

Related Solutions

Sql-server – Delete duplicates with specific columns as identifier

SQLite3 – remove duplicates by multiple columns

Related Question