SQL Server – Find all potentially redundant records in a table using cursors and fuzzy string matching

Tags: cursors, sql-server

I have been given an interesting challenge and I believe I have a valid approach, but before I set about implementing it I want to see if there are better ways of doing this.

The Problem
We have a call center that tracks customer bookings and, if the customer does not exist in the system, creates a new customer record. It is a common occurrence that the call center will not perform an adequate search, will therefore not find the customer, and as a result will create a redundant record for the customer with variations on the name and contact information. My task is to come up with a report that reflects groupings of these redundant records.

Example:
[Image: "jCam" example showing the same customer entered multiple times with variations in name, phone, and email]

You can see that, due to the variations on each line (not considering the ID), I can't just look for exact matches on any particular column. Yet I want each of these grouped together in a report so we can monitor ongoing data-entry errors.

My Proposed Solution
I have never used a cursor, so I will need to set about learning them, but I suppose I would need to nest two cursors so that I can evaluate each record against every other record in the table. For each comparison of two line items (as a result of the nested cursors) I would compare the [First Name], [Last Name], and [Email] fields with a Levenshtein Distance Function, perform six phone number comparisons to find any exact matches there, and write the resulting line to a table that can then be used as a reference point for reporting.

Example:
[Image: "fuzzyResults" example showing the first record's comparison scores against each of the other records]

You can see the first line of the example would be compared against the remaining four lines. I can then provide the user with a front end that allows them to search for results that fall within parameters they set, specifying degrees of accuracy on any given column or combination of columns. I suspect that this is going to have a significant amount of overhead, so I would most likely run it overnight, and any reporting for the end users would be based on this derived index table.
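To make the front-end idea concrete, a parameterized query over that derived index table might look like the sketch below. The table and column names are illustrative assumptions, and the literal thresholds stand in for whatever the user sets.

SELECT ID1, ID2, DistFN, DistLN, DistEmail
FROM dbo.tbl_Cust_Dup_Index              -- hypothetical index table of pairwise scores
WHERE (DistFN <= 2 AND DistLN <= 2)      -- names within the user's edit-distance limits
   OR DistEmail = 0                      -- identical email addresses
   OR HH + HW + HM + WH + WW + WM + MH + MW + MM < 9  -- at least one exact phone match (0 = match)
ORDER BY DistFN + DistLN + DistEmail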

So, again, I am looking for critiques of my approach. I suspect this is not an uncommon problem, and a comprehensive, optimized solution should be useful to the community at large.

–EDIT–

So I wrote a sproc that does what I wanted it to:

ALTER PROCEDURE [dbo].[sp_Cust_BuildIndex]
AS
BEGIN

SET NOCOUNT ON;

DECLARE @ID1 VARCHAR(50)
DECLARE @ID2 VARCHAR(50)
-- Levenshtein distances for first name, last name, and email
DECLARE @distFN Int
DECLARE @distLN Int
DECLARE @distEmail Int
-- Phone exact-match flags: Home/Work/Mobile of record 1 vs. record 2
DECLARE @HH Int
DECLARE @HW Int
DECLARE @HM Int
DECLARE @WH Int
DECLARE @WW Int
DECLARE @WM Int
DECLARE @MH Int
DECLARE @MW Int
DECLARE @MM Int

DECLARE @FN1 VARCHAR(256)
DECLARE @FN2 VARCHAR(256)
DECLARE @LN1 VARCHAR(256)
DECLARE @LN2 VARCHAR(256)
DECLARE @Email1 VARCHAR(256)
DECLARE @Email2 VARCHAR(256)
DECLARE @Phone1 VARCHAR(20)
DECLARE @Phone2 VARCHAR(20)
DECLARE @Mobile1 VARCHAR(20)
DECLARE @Mobile2 VARCHAR(20)
DECLARE @Work1 VARCHAR(20)
DECLARE @Work2 VARCHAR(20)

-- Clear out results from the previous run
DELETE FROM dbThis.dbo.[tbl_Cust_Dup_Index]

-- Outer cursor: one pass over every active customer record
DECLARE db_ID1 CURSOR FOR  
SELECT [ID]
FROM [dbThat].[dbo].[CUSTOMER]
WHERE [Delete] = 0

OPEN db_ID1   
FETCH NEXT FROM db_ID1 INTO @ID1

WHILE @@FETCH_STATUS = 0
BEGIN

    -- Load the outer record's fields
    SELECT @FN1 = [FIRSTNAME]
          ,@LN1 = [LASTNAME]
          ,@Phone1 = [TEL_HOME]
          ,@Work1 = [TEL_WORK]
          ,@Mobile1 = [TEL_MOBIL]
          ,@Email1 = [EMAIL]
    FROM [dbThat].[dbo].[CUSTOMER]
    WHERE ID = @ID1

    -- Inner cursor: compare the current record against every other active record
    DECLARE db_ID2 CURSOR FOR  
    SELECT [ID]
    FROM [dbThat].[dbo].[CUSTOMER]
    WHERE [Delete] = 0

    OPEN db_ID2   
    FETCH NEXT FROM db_ID2 INTO @ID2   

    WHILE @@FETCH_STATUS = 0   
    BEGIN
        IF @ID1 <> @ID2
        BEGIN

          -- Load the inner record's fields
          SELECT @FN2 = [FIRSTNAME]
                ,@LN2 = [LASTNAME]
                ,@Phone2 = [TEL_HOME]
                ,@Work2 = [TEL_WORK]
                ,@Mobile2 = [TEL_MOBIL]
                ,@Email2 = [EMAIL]
          FROM [dbThat].[dbo].[CUSTOMER]
          WHERE ID = @ID2

          -- Exact-match flags for all nine phone-number pairings (0 = match, 1 = no match)
          SET @HH = dbThis.dbo.fn_Phone_Bool(@Phone1, @Phone2)
          SET @HW = dbThis.dbo.fn_Phone_Bool(@Phone1, @Work2)
          SET @HM = dbThis.dbo.fn_Phone_Bool(@Phone1, @Mobile2)
          SET @WH = dbThis.dbo.fn_Phone_Bool(@Work1, @Phone2)
          SET @WW = dbThis.dbo.fn_Phone_Bool(@Work1, @Work2)
          SET @WM = dbThis.dbo.fn_Phone_Bool(@Work1, @Mobile2)
          SET @MH = dbThis.dbo.fn_Phone_Bool(@Mobile1, @Phone2)
          SET @MW = dbThis.dbo.fn_Phone_Bool(@Mobile1, @Work2)
          SET @MM = dbThis.dbo.fn_Phone_Bool(@Mobile1, @Mobile2)

          -- Record this pair's comparison scores in the index table
          Insert Into dbThis.dbo.[tbl_Cust_Dup_Index]
          VALUES (@ID1
                ,@ID2
                ,dbThis.dbo.edit_distance(@FN1, @FN2)
                ,dbThis.dbo.edit_distance(@LN1, @LN2)
                ,dbThis.dbo.edit_distance(@Email1, @Email2)
                ,@HH
                ,@HW
                ,@HM
                ,@WH
                ,@WW
                ,@WM
                ,@MH
                ,@MW
                ,@MM) 
        END
           FETCH NEXT FROM db_ID2 INTO @ID2   
    END   

    CLOSE db_ID2   
    DEALLOCATE db_ID2

    -- Move on to the next outer record
    FETCH NEXT FROM db_ID1 INTO @ID1
END

CLOSE db_ID1   
DEALLOCATE db_ID1

END

I used the function in the above Levenshtein Distance link, and I wrote a "fn_Phone_Bool" function that returns 0 if matching and 1 if not matching. This creates the sort of index table that I was after:

[Image: "indexResults" showing the generated index table of pairwise comparison scores]
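In case it helps, here is a minimal sketch of the kind of thing fn_Phone_Bool does, assuming it just strips non-digits and compares; the normalization details are my assumption, not the actual implementation:

CREATE FUNCTION dbo.fn_Phone_Bool (@p1 VARCHAR(20), @p2 VARCHAR(20))
RETURNS INT
AS
BEGIN
    -- Keep digits only so '(555) 123-4567' matches '555.123.4567'
    DECLARE @d1 VARCHAR(20) = '', @d2 VARCHAR(20) = '', @i INT = 1
    WHILE @i <= LEN(@p1)
    BEGIN
        IF SUBSTRING(@p1, @i, 1) LIKE '[0-9]' SET @d1 = @d1 + SUBSTRING(@p1, @i, 1)
        SET @i = @i + 1
    END
    SET @i = 1
    WHILE @i <= LEN(@p2)
    BEGIN
        IF SUBSTRING(@p2, @i, 1) LIKE '[0-9]' SET @d2 = @d2 + SUBSTRING(@p2, @i, 1)
        SET @i = @i + 1
    END
    -- Empty or NULL numbers never count as a match
    IF @d1 = '' OR @d2 = '' RETURN 1
    IF @d1 = @d2 RETURN 0    -- 0 = matching
    RETURN 1                 -- 1 = not matching
END

A call like SELECT dbo.fn_Phone_Bool('555-123-4567', '(555) 123 4567') would then return 0.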

Sadly, it takes forEEEEEEEVER, which is completely unsurprising. In one hour of searching it produced 22,000 matches. At that rate, going through the entire table of ~377,000 customer records and comparing them against each other would create ~141.8 billion matches and take just over 768 years. I don't have that kind of time, so I'm going to have to get crafty with the nested cursor SELECTs. Also, there is clearly something wrong with the email edit_distances, because there is no way they are all exact matches.

My thought is that the first SELECT grabs all the IDs, and the second SELECT uses LIKE in the WHERE clause to isolate the second set to a much smaller cluster of candidates. If I can limit the comparisons to a few candidates per record, then I can run this report weekly or whatnot, along the lines of the sketch below.
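As an illustration of that idea, the inner cursor declaration could become something like the following; the blocking predicate (a shared three-letter last-name prefix or a shared phone number) is just an assumption for the sketch:

    DECLARE db_ID2 CURSOR FOR  
    SELECT [ID]
    FROM [dbThat].[dbo].[CUSTOMER]
    WHERE [Delete] = 0
      AND [ID] <> @ID1
      AND ([LASTNAME] LIKE LEFT(@LN1, 3) + '%'           -- shared last-name prefix
           OR [TEL_HOME] IN (@Phone1, @Work1, @Mobile1)  -- or a shared phone number
           OR [TEL_WORK] IN (@Phone1, @Work1, @Mobile1)
           OR [TEL_MOBIL] IN (@Phone1, @Work1, @Mobile1))

Each record then gets compared against a handful of candidates instead of all ~377,000, and the IF @ID1 <> @ID2 check inside the loop becomes unnecessary.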

As usual, suggestions welcome!

Best Answer

I know it's a bit "version 1" but Data Quality Services (DQS) is supposed to be able to do this kind of thing. I took your data and did a simple walkthrough (roughly following here), built some simple domains, trained the Knowledge Base to convert 'Cambell' to 'Campbell', and got some OK results.

[Image: DQS project example showing the cleansed values]

I then exported the results to a table (or Excel file or .csv) where they could be used to update / de-duplicate the original values. The output includes a 'confidence' score based on fuzzy matching against the domains. It might be worth a look before you roll your own version of this. The Knowledge Base can become more useful over time, e.g. once you've trained it to convert 'Cambell' to 'Campbell' you won't have to do so again.
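For reference, if you export with the 'Data and Cleansing Info' option, the output table gets per-domain _Source, _Output, and _Confidence columns, so low-confidence corrections can be pulled out for manual review. The table and domain names below are from my walkthrough and are only illustrative:

SELECT [LastName_Source]
      ,[LastName_Output]
      ,[LastName_Confidence]
FROM dbo.DQS_Customer_Cleansing               -- hypothetical export table name
WHERE [LastName_Source] <> [LastName_Output]  -- rows DQS actually corrected
  AND [LastName_Confidence] < 0.8             -- low-confidence corrections to review
ORDER BY [LastName_Confidence]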

A DQS Data Matching project would also be appropriate, e.g. here.