We've got a customers table (Who doesn't?), containing many records that are, from a business perspective, duplicates. I've been able to create an SSIS package to perform fuzzy grouping, and report on potential duplicates.
Now, suppose I want to do this kind of analysis just as somebody is entering a new customer. The idea would be to perform a fuzzy lookup on customer name (and possibly some other basic info like postal code), and show potential duplicates prior to proceeding to the customer creation form.
The obvious problem here is that the fuzzy grouping and lookup components are part of SSIS. If I wanted to run those on-demand, I'd have to do something insane like putting the search terms in a staging table, running the SSIS package, waiting for it to complete, and fetching the results from an output table. It would be slow, painful, and have severe concurrency problems.
So, the other idea was to use full-text indexing. In experimenting with it, it looks like it won't be suitable. It can't catch subtle misspellings of customer names, or names that differ in "Company" vs. "Corporation" vs. "Co.", or "Anderson" vs. "Andersen", and other such variations.
Is there something that will allow for the flexibility of fuzzy grouping/matching from T-SQL? I can tell a fuzzy lookup to save the tokens, but it looks like I would still have to reimplement most of the matching algorithm to make use of them.
Best Answer
In the past I built a "fuzzy-search" in a .Net CLR function. This function gets called the same way a user-defined function gets called.
For example,
would only return customers with a name that was 80% similar to the input name.
The % match is based on the number of changes needed to convert one value to another, not the number of characters that are different. We use it to compare addresses, and found this was more effective because of the numerous street abbreviations that are used.
Here's the code I used to compare strings. I did this so long ago that I can't remember how to deploy it, although a quick search will show you many articles on how to create SQL CLR functions