Okay… I have a table that has customers:
-- Individual Table
* ID (Internal Unique ID)
* IndividualID (External Unique Individual Identifier)
* Last Name
* First Name
* Birth Date
* SSN
* ...
The issue is that sometimes a person gets multiple Individual ID's. Say the person doesn't provide a SSN for one Encounter, the last name changes, typo in birthday, etc. So you end up with the same person in the individual table multiple times:
1, Frost, Jack, 1/1/2000, 000-00-0008
2, Frost, Jack, 1/1/2000, 000-00-0003
3, Doe, Jane, 1/1/2000, 000-00-0005
4, Doe, Janet, 1/1/2000, 000-00-0005
5, Frost, Janet, 1/1/2000, 000-00-0005
Those are just some examples. The basic idea is that I need to find individuals that are potential matches, so that the right person can merge the individuals into a single account.
The particular query I'm currently on is on SS2008-SP1, but I have other queries on SS2005 through SS2012. Is there any way I can improve this?
Initially I had a single select statement (instead of 2 temp tables, 5 inserts and a select statement), but the "This or This or This or…" took many minutes and this takes ~10 seconds. Population of Customers is ~144k (Select count(*) from Data
)
Current I'm using a simple attempt to try and match four parts: Last Name, First Name, DOB, SSN. If 3 or 4 of them match on different individuals, the need to be inspected closer to determine if they really are the same person.
IF object_id('tempdb..#DATA') IS NOT NULL
DROP TABLE #DATA;
GO
CREATE TABLE #DATA (
EXTID VARCHAR(30) NOT NULL
, LNAME VARCHAR(30) NULL
, FNAME VARCHAR(30) NULL
, SSN VARCHAR(11) NULL
, DOB VARCHAR(8) NULL
)
GO
INSERT INTO #DATA
SELECT
EXTID = D1.EXTERNALID
, LNAME = D1.LASTNAME
, FNAME = D1.FIRSTNAME
, SSN = CASE WHEN D1.SSN = '000-00-0000' THEN NULL ELSE D1.SSN END
, DOB = convert(VARCHAR, D1.DOB, 112)
FROM Data D1
WHERE Type = 1 and STATUS = 1
GO
SELECT D1.*, [Splitter] = 'MATCH', D2.*
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
AND ( D1.LNAME = D2.LNAME
AND D1.FNAME = D2.FNAME
AND D1.SSN = D2.SSN
AND D1.DOB = D2.DOB)
UNION
SELECT D1.*, 'LName', D2.*
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
AND ( D1.LNAME <> D2.LNAME
AND D1.FNAME = D2.FNAME
AND D1.SSN = D2.SSN
AND D1.DOB = D2.DOB)
UNION
SELECT D1.*, 'FName', D2.*
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
AND ( D1.LNAME = D2.LNAME
AND D1.FNAME <> D2.FNAME
AND D1.SSN = D2.SSN
AND D1.DOB = D2.DOB)
UNION
SELECT D1.*, 'SSN ', D2.*
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
AND ( D1.LNAME = D2.LNAME
AND D1.FNAME = D2.FNAME
AND D1.SSN <> D2.SSN
AND D1.DOB = D2.DOB)
UNION
SELECT D1.*, 'DOB ', D2.*
FROM #Demo D1, #Demo D2 WHERE D1.ID > D2.ID
AND ( D1.LNAME = D2.LNAME
AND D1.FNAME = D2.FNAME
AND D1.SSN = D2.SSN
AND D1.DOB <> D2.DOB);
Edit to add Distinct Counts:
LName FName SSN DOB Count
36737 14539 115073 34284 144044
Edit: Cleaned up a bit to get rid of second temp table. Poking around the Estimated Execution plan, the above query – broken into 5 parts – uses hash map inner joins and takes about 10 seconds. My initial query, and other variations seem to use loop joins and is still chugging along at 10+ minutes.
Best Answer
To find rows where at least three out of four of those columns match you can use.
The top branch gets all rows where the
SSN
are the same and at least 2 out of the three other columns are the same. The join onSSN
is likely to be pretty selective in itself.That just leaves one other possible three column combination left which is dealt with by the second branch.
Both branches of the
UNION ALL
have an equi join so it should perform better than a join with some complicatedOR
condition.