Sql-server – Most efficient way to compare strings in SQL Server 2014

sql servert-sql

Hope everything is going ok.

So I'm in an awful situation. I have these two tables, and I need to check that there are not duplicated people. The duplicates are shown to the user in the front end:

  • Table 1: Id, SSN, First Name, Last Name, Birth Date, Age Gender, Address (yeah, this includes everything: street, cp, city), Email, Cellphone.
  • Table 2: SSN, First Name, Last Name, Gender, State (its a tinyint) and other data that isn't relevant (a bunch of datetimes)

So, the problem is that the data from table 1 was sent to us in a csv. But this csv wasn't generated by a computer, someone just sat and typed around 4,000 records. So it's likely that there are errors in the NSS. While false positives do not matter that much (there will be an extra page where the user can double check if those two people are exactly the same), it is really important that the system doesn't throw false negatives.

Another problem is that the data from second table will be captured by external users (a bunch of people that are not part of my organization), and I have to consider that they can have typos as well (in the SSN, capitalization, perhaps event accents that the ones who captured the data on the first table didn't include). Both datasets aren't reliable. Yet, the less reliable is the first one, at least we are building the page where users will enter the data that'll fill the second table and we can place some validation to make sure they don't include extra whitespaces or things like that.

Currently, im comparing like this:

SELECT NSS, CONCAT_WS(' ', FirstName, LastName) AS Name, Email, Cellphone
 FROM dbo.Table1 T1
     JOIN dbo.Table2 T2 ON (T1.SSN=T2.SSN 
ON (TRIM(T1.FirstName) COLLATE Latin1_General_AI_CI LIKE '%'+TRIM(T2.FirstName)+'%'  COLLATE Latin1_General_AI_CI 
AND TRIM(T1.LastName) COLLATE Latin1_General_AI_CI LIKE '%'+TRIM(T2.LastName)+'%'  COLLATE Latin1_General_AI_CI))

While it gets the job done, the performance is terrible (leading wildcards). It takes up to 4 minutes to give the user the result (the users of that page are people of my organization, so in the worst case scenario I can simply tell them that the results are going to take a while whenever they want to check it, it'll be slower if they compared it manually anyway). So I'm trying to find a faster way to do it which doesn't sacrifice it's reliability. Sadly, asking for a second and cleaner csv is out of the question (lots of papers and time that we don't have).

The first table haves like 4,000 entries. For the second one, we are expecting it to have between 2,500 and 3,000 records at the end of the registration.

Thanks for your time! Take care!

Best Answer

So a couple things:

  1. You are collating both sides of the predicate comparison which can affect your query's cardinality estimate which can severely affect performance. It looks like you're doing this to ensure a case insensitive comparison. You're better off storing the data in tables or a database with that collation specified instead, so you can eliminate the need to do it in your JOIN clause.

  2. Your should insert the data into the table already trimmed to eliminate a function call from your predicates as well, that can also affect the cardinality estimates.

  3. Ensure you have an index on (LastName, FirstName). While it's unlikely to be seeked on in your query (because you're doing a wildcard contains search) even an index scan on those fields might be more performant than nothing.

*4. If you're ok with loosening your comparison logic a little bit (at the risk of the JOIN not matching as many cases) you can do a starts with comparison, and a starts with comparison on the reversed strings of the FirstName and LastName fields instead of the contains search. This will be more efficient because starts with comparisons are SARGable whereas ends width and contains searches are not SARGable, meaning the index will be seekable. If you implement your logic this way that means your predicates would match on "John" and "Johnson", or "Smith" and "Blacksmith" but would not match on "lack" and "Blacksmith".