SQL Server – the best approach to parsing a large number of strings

sql server

I have a table with over 400,000,000 records and I am looking for advice on how to parse it quickly.

TheNameTable
(
  NameID  int primary key,
  TheName varchar(500)
)

The names are stored like this: "FirstName, LastName" (not my table, just what I have to work with)

I need to extract a unique list of last names. My initial thought is to process the table in a series of batches (say 50,000 records at a time), using NameID to control the batch ranges. I would then use SQL's built-in string functions to break the string at the "," and keep the right half of the string.

right(TheName, charindex(',', reverse(TheName)) - 1)
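
For illustration only, a batched pass driven by NameID ranges might look something like the sketch below (the #LastNames staging table, the ltrim, and the comma guard are assumptions on my part, not part of the original question):

create table #LastNames (LastName varchar(500));

declare @StartID int, @BatchSize int, @MaxID int;
set @StartID = 0;
set @BatchSize = 50000;
select @MaxID = max(NameID) from TheNameTable;

while @StartID <= @MaxID
begin
    -- parse one NameID range at a time
    insert into #LastNames (LastName)
    select distinct ltrim(substring(TheName, charindex(',', TheName) + 1, 500))
    from TheNameTable
    where NameID > @StartID
      and NameID <= @StartID + @BatchSize
      and charindex(',', TheName) > 0;   -- skip rows with no comma

    set @StartID = @StartID + @BatchSize;
end;

Since the distinct is per batch, a final de-duplication pass over #LastNames would still be needed at the end.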

I have a feeling this will still take a LONG time.

Anyone out there have any other ideas?

Would it be worthwhile to simply export the data and process the file outside the database?

The Solution I went with:

As suggested, I created two computed columns: one for the first name, one for the last name. They are not persisted, as I have limited space.

alter table TheNameTable 
add LastName as substring(TheName, charindex(',',TheName)+1,1000)

alter table TheNameTable 
add FirstName as left(TheName,charindex(',',TheName)-1)
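
As a side note (my observation, not from the original post): the LastName expression keeps whatever space follows the comma. A variant wrapped in ltrim (shown here with a hypothetical column name) would strip it:

alter table TheNameTable 
add LastNameTrimmed as ltrim(substring(TheName, charindex(',',TheName)+1,1000))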

I asked the admins for a temporary increase in RAM, and they bumped the VM up to 32 GB.

I created a new table, with columns for both FirstName and LastName. I put a unique compound index on the columns, but specified IGNORE_DUP_KEY = ON.

I just inserted the first 1,000,000 records. It filtered out 125,000 duplicates. The whole statement took 9 seconds to run.
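
For anyone who wants to reproduce the setup, a sketch of it might look like the following (table, index, and column names are illustrative; the NameID cutoff stands in for "the first 1,000,000 records"):

create table dbo.ParsedNames
(
  FirstName varchar(500),
  LastName  varchar(500)
);

-- duplicates are silently discarded instead of raising an error
create unique index IX_ParsedNames
  on dbo.ParsedNames (LastName, FirstName)
  with (IGNORE_DUP_KEY = ON);

-- FirstName and LastName here are the computed columns added above
insert into dbo.ParsedNames (FirstName, LastName)
select FirstName, LastName
from TheNameTable
where NameID <= 1000000;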

That is the speed I was looking for!

Best Answer

400 million names is a lot. Am I in there? ;-)

My gut-level feeling is that using substring isn't going to be much slower than coding something up via the CLR. I'm a SQL guy, I've done a fair amount of simple parsing in the past (2000 or 2005), and I was involved in what was going to be a very complicated parsing scheme (addresses, world-wide) written in C and called via an xproc, until we found that our prototype "native" code wasn't any faster than the same thing written with T-SQL functions.

If you want to use a language other than T-SQL, I'd suggest writing a CLR function in C# or VB.NET. For simple things, it's not hard to code in the CLR; I went from newbie to having a couple of working directory and file utilities in less than one morning, and there are plenty of examples of simple CLR procedures on the net. On the other hand, you won't have to learn anything (or install Visual Studio) if you write it in T-SQL.

No matter what, you will have to go through the table at least once. If you export, parse, and then load back what is not a small amount of data, that's a lot of time. Can you guarantee that your source isn't going to change in the meantime?

Here's the thing that always seems to sneak up on everyone: what happens with the parsed data? Where does it wind up? Do you intend to update the rows in place? Perhaps there are LastName and FirstName columns that you don't show in your example?

If you do, and those columns are currently NULL or hold zero-length strings, you may find that the update statement performs very badly, because SQL may have to split pages to store the last names.

In other words, your performance problem isn't parsing, it is storing the parsed data. Often this is worse than inserting the data into another table. Also, all of those page splits will fragment your table and cause query performance to drop, which may enrage your DBA because s/he will have to run a defrag procedure on the (large) table.

Here's one last thought: do you really need to store the parsed data? Can you get away with a computed column that calculates the last name on the fly? Those are indexable, under certain conditions, if you need that. Another approach would be a view that exposes the columns of the table as well as your "parsed last name" column.
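
As an illustration only, the view approach might look something like this sketch (the view name, the ltrim, and the comma guard are my assumptions):

create view dbo.NamesParsed
as
select NameID,
       TheName,
       left(TheName, charindex(',', TheName) - 1) as FirstName,
       ltrim(substring(TheName, charindex(',', TheName) + 1, 500)) as LastName
from dbo.TheNameTable
where charindex(',', TheName) > 0;  -- guard against rows with no comma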