In the past I built a "fuzzy-search" in a .Net CLR function. This function gets called the same way a user-defined function gets called.
For example,
select id, name
from customers
where dbo.CompareStrings("newCustomerName", customers.name) > .8
would only return customers with a name that was 80% similar to the input name.
The % match is based on the number of changes needed to convert one value to another, not the number of characters that are different. We use it to compare addresses, and found this was more effective because of the numerous street abbreviations that are used.
Here's the code I used to compare strings. I did this so long ago that I can't remember how to deploy it, although a quick search will show you many articles on how to create SQL CLR functions
' Checks two strings against each other and returns a decimal between 0 (doesn't match at all) and 1 (100% match)
<Microsoft.SqlServer.Server.SqlFunction()> _
Public Shared Function CompareStrings(ByVal input1 As SqlChars, ByVal input2 As SqlChars) _
As <SqlFacet(Precision:=10, Scale:=4)> SqlDecimal
If IsNothing(input1) And IsNothing(input2) Then
Return New SqlDecimal(1.0)
ElseIf IsNothing(input1) Or IsNothing(input2) Then
Return New SqlDecimal(0.0)
End If
Dim s1 As String = New String(input1.Value)
Dim s2 As String = New String(input2.Value)
If s1.Length = 0 Or s2.Length = 0 Then
Return New SqlDecimal(1.0)
Else
Dim re As New Regex("[^A-Za-z0-9 ]", RegexOptions.IgnorePatternWhitespace Xor RegexOptions.Singleline)
s1 = re.Replace(s1, "( )\1*", "$1")
s2 = re.Replace(s2, "( )\1*", "$1")
s1 = UCase(re.Replace(s1, ""))
s2 = UCase(re.Replace(s2.ToString, ""))
Dim dif As Integer = GetStringSimilarity(s1, s2)
Dim max As Integer = s1.Length
If s2.Length > max Then max = s2.Length
Return New SqlDecimal(1.0 - (dif / max))
End If
End Function
' Compares two strings using the relationship in patterns of letters
Private Shared Function GetStringSimilarity(ByVal s1 As String, ByVal s2 As String) As Integer
Dim n As Integer = s1.Length
Dim m As Integer = s2.Length
Dim distance(n + 1, m + 1) As Integer
Dim cost As Integer = 0
If n = 0 Then Return m
If m = 0 Then Return n
For i As Integer = 0 To n
distance(i, 0) = i
Next
For j As Integer = 0 To m
distance(0, j) = j
Next
For i As Integer = 1 To n
For j As Integer = 1 To m
If Mid(s2, j, 1) = Mid(s1, i, 1) Then cost = 0 Else cost = 1
distance(i, j) = Min3(distance(i - 1, j) + 1, distance(i, j - 1) + 1, distance(i - 1, j - 1) + cost)
Next
Next
Return distance(n, m)
End Function
' Returns the min of 3 values
Private Shared Function Min3(ByVal x As Integer, ByVal y As Integer, ByVal z As Integer) As Integer
Dim min As Integer = x
If y < min Then min = y
If z < min Then min = z
Return min
End Function
I've seen this behavior as well. Mine was due to server memory, which I'll explain in the guidelines. This was pesky because it was intermittent.
With partial cache, the cache starts off empty then queries until it finds a match. If you have multiple matches, the first one wins. With full cache, if you have multiple matches, I'm not sure which would win. Probably the first in the cache order.
Partial cache has an option for miss cache, which will remember which records don't have matches and won't query them again. This will be a problem if you are inserting into the table that you are doing a lookup on. Also with full cache, if your source contains duplicates, the second won't get a match after the first is inserted which will be a problem if you want to suppress all but the first.
Here are a few guidelines that I try to follow when using lookups:
- The lookup will never swap cache to disk. If it runs out of memory, the task will fail. If this is your issue, get more ram or try partial.
- Use integers whenever possible. If the field is string but can be cast to int, do it. This remove case and white space issues all together
- Trim strings. White space will give you fits.
- If it is a string, upper case both sides before comparing (lookups are case sensitive and will fail if the cases don't match).
- Null never equals Null. If the column is nullable, replace both sides with "UNKNOWN" if string and a irrational number like -999 if it is integer.
- If the column is empty, replace both sides with the text "EMPTY"
These are a pain in the butt, but it's better than getting a call at 3am because a tier 1 customers ETL failed.
Best Answer
Something similar happened to my BIDS after I installed Visual Studio '15 on my dev machine. There are registry overwrites when a newer version of BIDS/SSDT/Business Intelligence Templates are installed causing failures in older version. I fixed this by setting up a vm on my local that I only use BIDS on. Why Can't SQL Server and Visual Studio Get Along?