Sql-server – On-Demand Fuzzy Lookup

sql serversql-server-2008ssis

We've got a customers table (Who doesn't?), containing many records that are, from a business perspective, duplicates. I've been able to create an SSIS package to perform fuzzy grouping, and report on potential duplicates.

Now, suppose I want to do this kind of analysis just as somebody is entering a new customer. The idea would be to perform a fuzzy lookup on customer name (and possibly some other basic info like postal code), and show potential duplicates prior to proceeding to the customer creation form.

The obvious problem here is that the fuzzy grouping and lookup components are part of SSIS. If I wanted to run those on-demand, I'd have to do something insane like putting the search terms in a staging table, running the SSIS package, waiting for it to complete, and fetching the results from an output table. It would be slow, painful, and have severe concurrency problems.

So, the other idea was to use full-text indexing. In experimenting with it, it looks like it won't be suitable. It can't catch subtle misspellings of customer names, or names that differ in "Company" vs. "Corporation" vs. "Co.", or "Anderson" vs. "Andersen", and other such variations.

Is there something that will allow for the flexibility of fuzzy grouping/matching from T-SQL? I can tell a fuzzy lookup to save the tokens, but it looks like I would still have to reimplement most of the matching algorithm to make use of them.

Best Answer

In the past I built a "fuzzy-search" in a .Net CLR function. This function gets called the same way a user-defined function gets called.

For example,

select id, name
from customers
where dbo.CompareStrings("newCustomerName", customers.name) > .8

would only return customers with a name that was 80% similar to the input name.

The % match is based on the number of changes needed to convert one value to another, not the number of characters that are different. We use it to compare addresses, and found this was more effective because of the numerous street abbreviations that are used.

Here's the code I used to compare strings. I did this so long ago that I can't remember how to deploy it, although a quick search will show you many articles on how to create SQL CLR functions

' Checks two strings against each other and returns a decimal between 0 (doesn't match at all) and 1 (100% match)
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function CompareStrings(ByVal input1 As SqlChars, ByVal input2 As SqlChars) _
As <SqlFacet(Precision:=10, Scale:=4)> SqlDecimal

    If IsNothing(input1) And IsNothing(input2) Then
        Return New SqlDecimal(1.0)
    ElseIf IsNothing(input1) Or IsNothing(input2) Then
        Return New SqlDecimal(0.0)
    End If

    Dim s1 As String = New String(input1.Value)
    Dim s2 As String = New String(input2.Value)

    If s1.Length = 0 Or s2.Length = 0 Then
        Return New SqlDecimal(1.0)
    Else
        Dim re As New Regex("[^A-Za-z0-9 ]", RegexOptions.IgnorePatternWhitespace Xor RegexOptions.Singleline)
        s1 = re.Replace(s1, "( )\1*", "$1")
        s2 = re.Replace(s2, "( )\1*", "$1")
        s1 = UCase(re.Replace(s1, ""))
        s2 = UCase(re.Replace(s2.ToString, ""))

        Dim dif As Integer = GetStringSimilarity(s1, s2)
        Dim max As Integer = s1.Length
        If s2.Length > max Then max = s2.Length

        Return New SqlDecimal(1.0 - (dif / max))
    End If
End Function

' Compares two strings using the relationship in patterns of letters
Private Shared Function GetStringSimilarity(ByVal s1 As String, ByVal s2 As String) As Integer
    Dim n As Integer = s1.Length
    Dim m As Integer = s2.Length
    Dim distance(n + 1, m + 1) As Integer

    Dim cost As Integer = 0
    If n = 0 Then Return m
    If m = 0 Then Return n

    For i As Integer = 0 To n
        distance(i, 0) = i
    Next
    For j As Integer = 0 To m
        distance(0, j) = j
    Next

    For i As Integer = 1 To n
        For j As Integer = 1 To m
            If Mid(s2, j, 1) = Mid(s1, i, 1) Then cost = 0 Else cost = 1
            distance(i, j) = Min3(distance(i - 1, j) + 1, distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost)
        Next
    Next
    Return distance(n, m)
End Function

' Returns the min of 3 values
Private Shared Function Min3(ByVal x As Integer, ByVal y As Integer, ByVal z As Integer) As Integer
    Dim min As Integer = x
    If y < min Then min = y
    If z < min Then min = z
    Return min
End Function