Handling Duplicate Records Across Different Databases

database-design, normalization, sql-server-2005

I have a table holding 'People' data (about 70K records) alongside an 'Address' table, so every person in the 'People' table carries one AddressID.

The idea was to centralize the data in this single 'Population' database. The problems started once duplicate records began to appear, the result of a poorly executed import process pulling data from different sources; this left the table with as many as 7 records for a single person.
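Duplicates of this kind can usually be surfaced with a grouping query. A minimal sketch; PersonID is implied by the question, but FirstName, LastName and DateOfBirth are assumed column names:

    -- Find People rows that share the same identifying fields.
    -- (FirstName, LastName, DateOfBirth are assumed column names.)
    SELECT  FirstName,
            LastName,
            DateOfBirth,
            COUNT(*)      AS Copies,
            MIN(PersonID) AS KeepCandidate  -- lowest ID as a survivor candidate
    FROM    dbo.People
    GROUP BY FirstName, LastName, DateOfBirth
    HAVING  COUNT(*) > 1
    ORDER BY Copies DESC;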

Additionally, the Address table started collecting 'all sorts': say I live at "24 Wickam Heights", you can find this address stored in any of the following ways:

  • 24 Wickam
  • 24 Wickam St
  • 24 Wickam Street
  • 24 Whikam H.
  • 24 Wikam Str.

In some cases there are more than 20 different versions of the same street…
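For a first rough grouping of such variants without outside tooling, SQL Server's built-in SOUNDEX function can cluster street names that sound alike. A sketch, assuming an Address table with a StreetLine column (the column name is made up); it is crude — SOUNDEX only looks at the first few consonant sounds, so it both over- and under-groups — but it produces a starting list to eyeball:

    -- Group addresses whose street part sounds alike.
    -- The leading house number is stripped so SOUNDEX sees only the street name.
    SELECT  SOUNDEX(LTRIM(SUBSTRING(StreetLine,
                    PATINDEX('%[^0-9 ]%', StreetLine), 100))) AS SoundsLike,
            COUNT(*)                                          AS Variants,
            MIN(StreetLine)                                   AS Example
    FROM    dbo.Address
    GROUP BY SOUNDEX(LTRIM(SUBSTRING(StreetLine,
                    PATINDEX('%[^0-9 ]%', StreetLine), 100)))
    HAVING  COUNT(*) > 1;

This would catch spelling variants like "Wickam"/"Whikam"/"Wikam", though not every abbreviation difference.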

The most beautiful part is that the data in this database is referenced by at least 5 other databases on the same server, which makes every change a very risky process.

So I'm wondering: what steps can be taken to get rid of the duplicates? And what alternatives are there to stop the Address table from accumulating this much inconsistency?

Or maybe I should simply ask: is there any salvation for such a big mess? It really is a nightmare.

Best Answer

There are some good data cleansing products out there, and one in particular is top notch and affordable: DataMatch by Data Ladder, a fuzzy matching and address standardization/parsing tool widely used in business that would work well for this situation. They offer a complimentary trial for new users.
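If buying a tool is not an option, the "standardization" part can at least be approximated in plain T-SQL with a synonym table of suffix variants. A crude sketch — the table and column names are invented for illustration, and this is nothing like what a real address parser does:

    -- A tiny synonym table mapping suffix variants to a canonical form.
    CREATE TABLE dbo.SuffixSynonym
    (
        Variant   VARCHAR(20) NOT NULL PRIMARY KEY,
        Canonical VARCHAR(20) NOT NULL
    );

    INSERT INTO dbo.SuffixSynonym (Variant, Canonical)
    SELECT 'St',   'Street'  UNION ALL
    SELECT 'St.',  'Street'  UNION ALL
    SELECT 'Str.', 'Street'  UNION ALL
    SELECT 'H.',   'Heights' UNION ALL
    SELECT 'Hts',  'Heights';

    -- Rewrite trailing suffix variants to the canonical form.
    UPDATE  a
    SET     a.StreetLine = LEFT(a.StreetLine,
                                LEN(a.StreetLine) - LEN(s.Variant)) + s.Canonical
    FROM    dbo.Address AS a
    JOIN    dbo.SuffixSynonym AS s
            ON a.StreetLine LIKE '% ' + s.Variant;

That only fixes abbreviations, not misspellings, which is exactly where the fuzzy matching of a dedicated tool earns its keep.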

In fact, an independent evaluation compared the software against the major tools from IBM and SAS: a study at the Curtin University Centre for Data Linkage in Australia simulated the matching of 4.4 million records and measured each vendor's accuracy (matches found versus matches available, and the number of false matches):

1.  DataMatch Enterprise, Highest Accuracy (>95%), Very Fast, Low Cost
2.  IBM QualityStage, High Accuracy (>90%), Very Fast, High Cost (>$100K)
3.  SAS DataFlux, Medium Accuracy (>85%), Fast, High Cost (>$100K)
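Whichever tool produces the match results, the merge itself still has to be applied in the database, and because five other databases reference these rows, it is safer to stage it through a mapping table than to delete in place. A minimal sketch of that pattern, with all table and column names hypothetical:

    -- Map every duplicate PersonID to the one row that survives.
    CREATE TABLE dbo.PersonMerge
    (
        DuplicateID INT NOT NULL PRIMARY KEY,  -- row to retire
        SurvivorID  INT NOT NULL               -- canonical row to keep
    );

    -- (Populate PersonMerge from the matching tool's output.)

    -- Repoint each referencing table, one at a time, in each database.
    UPDATE  o
    SET     o.PersonID = m.SurvivorID
    FROM    dbo.Orders AS o                    -- example referencing table
    JOIN    dbo.PersonMerge AS m ON m.DuplicateID = o.PersonID;

    -- Only after every referencing database has been repointed:
    DELETE  p
    FROM    dbo.People AS p
    JOIN    dbo.PersonMerge AS m ON m.DuplicateID = p.PersonID;

Keeping the PersonMerge table around afterwards also gives the other five databases a way to resolve any stale IDs they surface later.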