Sql-server – Find Duplicate Customers

query-refactorsql server

Okay… I have a table that has customers:

-- Individual Table
* ID (Internal Unique ID)
* IndividualID (External Unique Individual Identifier)
* Last Name
* First Name
* Birth Date
* SSN
* ...

The issue is that sometimes a person gets multiple Individual ID's. Say the person doesn't provide a SSN for one Encounter, the last name changes, typo in birthday, etc. So you end up with the same person in the individual table multiple times:

1, Frost, Jack, 1/1/2000, 000-00-0008
2, Frost, Jack, 1/1/2000, 000-00-0003
3, Doe,   Jane, 1/1/2000, 000-00-0005
4, Doe,   Janet, 1/1/2000, 000-00-0005
5, Frost, Janet, 1/1/2000, 000-00-0005

Those are just some examples. The basic idea is that I need to find individuals that are potential matches, so that the right person can merge the individuals into a single account.

The particular query I'm currently on is on SS2008-SP1, but I have other queries on SS2005 through SS2012. Is there any way I can improve this?

Initially I had a single select statement (instead of 2 temp tables, 5 inserts and a select statement), but the "This or This or This or…" took many minutes and this takes ~10 seconds. Population of Customers is ~144k (Select count(*) from Data)

Current I'm using a simple attempt to try and match four parts: Last Name, First Name, DOB, SSN. If 3 or 4 of them match on different individuals, the need to be inspected closer to determine if they really are the same person.

IF object_id('tempdb..#DATA') IS NOT NULL
    DROP TABLE #DATA;
GO

CREATE TABLE #DATA (
      EXTID VARCHAR(30) NOT NULL
    , LNAME VARCHAR(30) NULL
    , FNAME VARCHAR(30) NULL
    , SSN VARCHAR(11) NULL
    , DOB VARCHAR(8) NULL
    )
GO

INSERT INTO #DATA
SELECT 
      EXTID = D1.EXTERNALID
    , LNAME = D1.LASTNAME
    , FNAME = D1.FIRSTNAME
    , SSN = CASE WHEN D1.SSN = '000-00-0000' THEN NULL ELSE D1.SSN END
    , DOB = convert(VARCHAR, D1.DOB, 112)
FROM Data D1
WHERE Type = 1 and STATUS = 1
GO

SELECT D1.*, [Splitter] = 'MATCH', D2.* 
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
    AND (   D1.LNAME = D2.LNAME
        AND D1.FNAME = D2.FNAME
        AND D1.SSN   = D2.SSN
        AND D1.DOB   = D2.DOB)
        UNION
SELECT D1.*, 'LName', D2.* 
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID 
    AND (   D1.LNAME <> D2.LNAME
        AND D1.FNAME = D2.FNAME
        AND D1.SSN   = D2.SSN
        AND D1.DOB   = D2.DOB)
        UNION
SELECT D1.*, 'FName', D2.* 
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
    AND (   D1.LNAME = D2.LNAME
        AND D1.FNAME <> D2.FNAME
        AND D1.SSN   = D2.SSN
        AND D1.DOB   = D2.DOB)
        UNION
SELECT D1.*, 'SSN  ', D2.* 
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
    AND (   D1.LNAME = D2.LNAME
        AND D1.FNAME = D2.FNAME
        AND D1.SSN   <> D2.SSN
        AND D1.DOB   = D2.DOB)
        UNION
SELECT D1.*, 'DOB  ', D2.* 
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
    AND (   D1.LNAME = D2.LNAME
        AND D1.FNAME = D2.FNAME
        AND D1.SSN   = D2.SSN
        AND D1.DOB   <> D2.DOB);

Edit to add Distinct Counts:

LName   FName   SSN DOB Count
36737   14539   115073  34284   144044

Edit: Cleaned up a bit to get rid of second temp table. Poking around the Estimated Execution plan, the above query – broken into 5 parts – uses hash map inner joins and takes about 10 seconds. My initial query, and other variations seem to use loop joins and is still chugging along at 10+ minutes.

Best Answer

To find rows where at least three out of four of those columns match you can use.

SELECT  D1.ID, D2.ID
FROM   DATA D1
       JOIN DATA D2
         ON D1.SSN = D2.SSN
            AND D1.ID > D2.ID
            AND 2 <= CASE
                       WHEN D1.FNAME = D2.FNAME THEN 1
                       ELSE 0
                     END +
                     CASE
                       WHEN D1.LNAME = D2.LNAME THEN 1
                       ELSE 0
                     END  +
                     CASE
                       WHEN D1.DOB = D2.DOB THEN 1
                       ELSE 0
                     END
UNION ALL
SELECT D1.ID, D2.ID
FROM   DATA D1
       JOIN DATA D2
         ON D1.DOB = D2.DOB
         AND  D1.FNAME = D2.FNAME
         AND  D1.LNAME = D2.LNAME
         AND D1.SSN <> D2.SSN
         AND D1.ID > D2.ID

The top branch gets all rows where the SSN are the same and at least 2 out of the three other columns are the same. The join on SSN is likely to be pretty selective in itself.

That just leaves one other possible three column combination left which is dealt with by the second branch.

Both branches of the UNION ALL have an equi join so it should perform better than a join with some complicated OR condition.