Sql-server – Most efficient way to compare strings in SQL Server 2014

sql-server, t-sql

Hope everything is going ok.

So I'm in an awful situation. I have these two tables, and I need to check that there are no duplicated people between them. The duplicates are shown to the user in the front end:

Table 1: Id, SSN, First Name, Last Name, Birth Date, Age, Gender, Address (yeah, this includes everything: street, postal code, city), Email, Cellphone.
Table 2: SSN, First Name, Last Name, Gender, State (it's a tinyint) and other data that isn't relevant (a bunch of datetimes).

So, the problem is that the data from table 1 was sent to us in a CSV. But this CSV wasn't generated by a computer; someone just sat and typed around 4,000 records. So it's likely that there are errors in the SSN. While false positives do not matter that much (there will be an extra page where the user can double-check whether those two people are really the same), it is really important that the system doesn't produce false negatives.

Another problem is that the data for the second table will be captured by external users (a bunch of people who are not part of my organization), and I have to consider that they can make typos as well (in the SSN, in capitalization, perhaps even accents that the people who captured the data for the first table didn't include). Neither dataset is reliable. The less reliable one is the first; for the second, at least we are building the page where users will enter the data, so we can add some validation to make sure they don't include extra whitespace or things like that.

Currently, I'm comparing like this:

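-- Note: TRIM and CONCAT_WS were introduced in SQL Server 2017; on 2014 the equivalents are LTRIM(RTRIM(...)) and CONCAT.
SELECT T1.SSN, CONCAT_WS(' ', T1.FirstName, T1.LastName) AS Name, T1.Email, T1.Cellphone
 FROM dbo.Table1 T1
     JOIN dbo.Table2 T2 ON (T1.SSN = T2.SSN
-- ...or both Table2 names are contained in the Table1 names, ignoring case and accents
OR (TRIM(T1.FirstName) COLLATE Latin1_General_CI_AI LIKE '%'+TRIM(T2.FirstName)+'%' COLLATE Latin1_General_CI_AI
AND TRIM(T1.LastName) COLLATE Latin1_General_CI_AI LIKE '%'+TRIM(T2.LastName)+'%' COLLATE Latin1_General_CI_AI))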

While it gets the job done, the performance is terrible (leading wildcards). It takes up to 4 minutes to give the user the result. The users of that page are people in my organization, so in the worst case I can simply tell them that the results are going to take a while whenever they want to check; it would be slower if they compared the records manually anyway. So I'm trying to find a faster way to do it that doesn't sacrifice its reliability. Sadly, asking for a second, cleaner CSV is out of the question (lots of paperwork and time that we don't have).

The first table has about 4,000 entries. For the second one, we are expecting between 2,500 and 3,000 records at the end of the registration.

Thanks for your time! Take care!

Best Answer

So a couple of things:

1. You are collating both sides of the predicate comparison, which can affect your query's cardinality estimate and, in turn, severely affect performance. It looks like you're doing this to ensure a case-insensitive comparison. You're better off storing the data in tables or a database with that collation specified instead, so you can eliminate the need to do it in your JOIN clause.
2. You should insert the data into the table already trimmed to eliminate a function call from your predicates as well; that can also affect the cardinality estimates.
3. Ensure you have an index on (LastName, FirstName). While it's unlikely to be seeked on in your query (because you're doing a wildcard contains search), even an index scan on those fields might be more performant than nothing.
4. If you're OK with loosening your comparison logic a little bit (at the risk of the JOIN not matching as many cases), you can do a starts-with comparison, plus a starts-with comparison on the reversed strings of the FirstName and LastName fields, instead of the contains search. This will be more efficient because starts-with comparisons are SARGable, meaning the index will be seekable, whereas ends-with and contains searches are not. If you implement your logic this way, your predicates would match on "John" and "Johnson", or "Smith" and "Blacksmith", but would not match on "lack" and "Blacksmith".
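
The suggestions above can be sketched roughly as follows. This is only a sketch under assumptions: the FirstNameRev/LastNameRev computed columns and the index names are hypothetical, the name columns are assumed to be bounded-length (n)varchar columns that already share a case- and accent-insensitive collation in both tables, and whether the optimizer actually seeks the indexes should be checked against the actual execution plan.

```sql
-- One-time cleanup so the JOIN no longer needs TRIM (repeat for dbo.Table2).
UPDATE dbo.Table1
SET FirstName = LTRIM(RTRIM(FirstName)),
    LastName  = LTRIM(RTRIM(LastName));
GO  -- batch separator understood by SSMS and sqlcmd

-- Hypothetical persisted computed columns holding the reversed names,
-- so an "ends with" test becomes a "starts with" test on the reversed string.
ALTER TABLE dbo.Table1 ADD
    FirstNameRev AS REVERSE(FirstName) PERSISTED,
    LastNameRev  AS REVERSE(LastName)  PERSISTED;
GO

CREATE INDEX IX_Table1_LastName_FirstName ON dbo.Table1 (LastName, FirstName);
CREATE INDEX IX_Table1_LastNameRev_FirstNameRev ON dbo.Table1 (LastNameRev, FirstNameRev);
GO

-- "Starts with" OR "ends with" on each name, with no leading wildcard.
-- Note: %, _ and [ inside the Table2 names would still act as LIKE wildcards.
SELECT T1.SSN, CONCAT(T1.FirstName, ' ', T1.LastName) AS Name, T1.Email, T1.Cellphone
FROM dbo.Table2 AS T2
JOIN dbo.Table1 AS T1
  ON T1.SSN = T2.SSN
  OR (    (T1.FirstName LIKE T2.FirstName + '%' OR T1.FirstNameRev LIKE REVERSE(T2.FirstName) + '%')
      AND (T1.LastName  LIKE T2.LastName  + '%' OR T1.LastNameRev  LIKE REVERSE(T2.LastName)  + '%'));
```

With only a few thousand rows per table, even scans of these narrow indexes may turn out to be acceptable; the main change is that no predicate starts with a wildcard and no function or COLLATE clause wraps the indexed columns.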

Related Solutions

Sql-server – Query table A for any rows that are like any search term in table B

...
FROM
  TableA
  JOIN
  TableB ON
          TableA.FirstName LIKE '%' + TableB.SearchTerm + '%' OR 
          TableA.LastName LIKE '%' + TableB.SearchTerm + '%'
....

Now, this isn't going to be the best of queries. Consider full-text search instead (I'm not too familiar with it).

This OR could be split into a UNION.
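
A minimal sketch of that UNION rewrite, using the same TableA and TableB as the query above. The select list here is an assumption, since the original one is elided. UNION also removes rows that satisfy both branches, which corresponds to the distinct result of the OR version.

```sql
-- Branch 1: the search term appears in the first name.
SELECT A.FirstName, A.LastName, B.SearchTerm
FROM TableA AS A
JOIN TableB AS B
  ON A.FirstName LIKE '%' + B.SearchTerm + '%'
UNION
-- Branch 2: the search term appears in the last name.
SELECT A.FirstName, A.LastName, B.SearchTerm
FROM TableA AS A
JOIN TableB AS B
  ON A.LastName LIKE '%' + B.SearchTerm + '%';
```

Each branch now has a single predicate, which gives the optimizer a simpler plan to work with, but the leading wildcards still rule out index seeks, so any gain should be measured rather than assumed.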

Sql-server – the best approach to parsing a large amount of strings

400 million names is a lot. Am I in there? ;-)

My gut-level feeling says that using substring isn't going to be terribly much slower than coding up something via the CLR. I'm a SQL guy, I've done a fair amount of simple parsing in the past (2000 or 2005), and I was involved in what was going to be a very complicated parsing scheme (addresses, world-wide) written in C and called via an xproc, until we found that our prototype "native" code wasn't any faster than the same thing written with T-SQL functions.

If you want to use a language other than T-SQL, I'd suggest writing a CLR procedure in C# or VB.NET. For simple things, it's not hard to code in CLR. I went from newb to having a couple of working directory and file utilities in less than one morning. There are plenty of examples of simple CLR procedures on the net. And you won't have to learn anything (or install Visual Studio) to write it in T-SQL.

No matter what, you will have to go through the table at least once. If you export, then parse, and then put back what is not a small amount of data, that's a lot of time. Can you guarantee that your source isn't going to change in the meantime?

Here's the thing that always seems to sneak up on everyone: What happens with the parsed data? Where does it wind up? Do you intend to update the row, perhaps there are lastname and firstname columns that you don't show in your example?

If you do, and those columns are currently null or have zero-length strings in them, you may find that the update statement performance is very bad because SQL Server may have to split pages to store the lastname.

In other words, your performance problem isn't parsing, it is storing the parsed data. Often, this is worse than inserting the data into another table. Also, all of those page splits will fragment your table and cause query performance to drop, which may enrage your DBA because s/he will have to run a defrag procedure on the (large) table.

Here's one last thought: Do you really need to store the parsed data? Can you get away with a computed column that calculates the last name on the fly? Those are indexable, with certain conditions, if you need that. Another approach would be a view that exposes the columns of the table as well as your "parsed lastname column".
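
As a rough illustration of the computed-column idea: the table dbo.People, its FullName column (assumed to be nvarchar(200) holding text like 'Lastname, Firstname') and the index name are all hypothetical, since the original example isn't shown here.

```sql
-- Hypothetical computed column: the last name is derived on the fly and not stored in the row.
ALTER TABLE dbo.People ADD
    ParsedLastName AS
        CASE WHEN CHARINDEX(',', FullName) > 0
             THEN LEFT(FullName, CHARINDEX(',', FullName) - 1)
             ELSE FullName   -- no comma: treat the whole value as the last name
        END;
GO  -- batch separator understood by SSMS and sqlcmd

-- The expression is deterministic, so the column can be indexed if needed.
CREATE INDEX IX_People_ParsedLastName ON dbo.People (ParsedLastName);
```

Because the column is not persisted, adding it does not rewrite the existing rows, which avoids the page splits described above; only the optional index takes extra space.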
