I have a Microsoft SQL Server 2012 database table with around 7 million crowd-sourced records, primarily containing a string name value with some related details. For nearly every record it seems there are a dozen similar typo records and I am trying to do some fuzzy matching to identify record groups such as "Apple", "Aple", "Apples", "Spple", etc. These names can also contain multiple words with spaces between them.
I've come up with a solution that uses an edit-distance scalar function, which returns the number of keystrokes required to transform string1 into string2, and uses that function to join the table to itself. As you can imagine, this doesn't perform well, since it has to execute the function millions of times to evaluate the join.
So I put it in a cursor so that only one string1 is evaluated at a time. This at least gets results coming out, but after running for weeks it has only made it through 150,000 records. With 7 million to evaluate, I don't have the kind of time my method is going to take.
I put full-text indexes on the string names, but couldn't find a way to use the full-text predicates without a static value to search for.
Any ideas how I could do something like the following in a way that wouldn't take months to run?
SELECT t1.name, t2.name
FROM names AS t1
INNER JOIN names AS t2
ON EditDistance(t1.name,t2.name) = 1
AND t1.id != t2.id
I've tried soundex, but since the names can contain spaces and multiple words per value I get too many false positives to use it reliably.
Best Answer
Having dealt with this issue, the most effective and performant way I've found is to create a CLR function that calculates the Levenshtein distance. You will be able to mark the assembly as SAFE (if you're at all concerned about security), and it runs much quicker than SOUNDEX() or any built-in SQL Server functions.
Here's the code to set up the assembly and function in the database, as well as a basic version of the Levenshtein Distance algorithm implemented in C# from https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#C#
C#:
T-SQL:
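A minimal sketch of the database-side setup, assuming the C# code has already been compiled into a DLL. The assembly name `FuzzyMatching`, the file path, the class name `UserDefinedFunctions`, and the method name `Levenshtein` are all hypothetical placeholders; substitute whatever your build actually produces. The final query is one possible way to call the function from the original join, not a tuned solution.

```sql
-- CLR integration is disabled by default, so enable it first.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
GO

-- Register the compiled DLL. The name and path are hypothetical.
CREATE ASSEMBLY FuzzyMatching
FROM 'C:\clr\FuzzyMatching.dll'
WITH PERMISSION_SET = SAFE;
GO

-- Expose the CLR method as a scalar function.
-- Class and method names are assumptions about how the C# was written.
CREATE FUNCTION dbo.EditDistance (@s1 NVARCHAR(4000), @s2 NVARCHAR(4000))
RETURNS INT
AS EXTERNAL NAME FuzzyMatching.[UserDefinedFunctions].[Levenshtein];
GO

-- Same join as in the question, with two cheap filters added:
--   * t1.id < t2.id returns each pair once instead of twice.
--   * Two strings at edit distance 1 differ in length by at most 1.
--     LEN ignores trailing spaces, so this assumes names are trimmed.
SELECT t1.name, t2.name
FROM names AS t1
INNER JOIN names AS t2
    ON t1.id < t2.id
   AND ABS(LEN(t1.name) - LEN(t2.name)) <= 1
   AND dbo.EditDistance(t1.name, t2.name) = 1;
```

The length filter does not change the result set when names are trimmed, but the optimizer does not guarantee it will be evaluated before the function call, so check the actual plan on a small sample before running it against all 7 million rows.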