SQL Server – the best approach to parsing a large number of strings

sql server

I have a table with over 400,000,000 records and I am looking for advice on how to parse it quickly.

TheNameTable
(
  NameID  int primary key,
  TheName varchar(500)
)

The names are stored like this: "FirstName, LastName" (not my table, just what I have to work with)

I need to extract a unique list of last names. My initial thought is to process the table in a series of batches (say 50,000 records at a time), using NameID to control the batch ranges. I would then use SQL's built-in string functions to break the string at the "," and keep the right half of the string.

right(TheName, charindex(',', reverse(TheName)) - 1)
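
For illustration only, a batched pass driven by NameID ranges might look something like the sketch below (the #LastNames staging table, the ltrim, and the comma guard are assumptions on my part, not part of the original question):

create table #LastNames (LastName varchar(500));

declare @StartID int, @BatchSize int, @MaxID int;
set @StartID = 0;
set @BatchSize = 50000;
select @MaxID = max(NameID) from TheNameTable;

while @StartID <= @MaxID
begin
    -- parse one NameID range at a time
    insert into #LastNames (LastName)
    select distinct ltrim(substring(TheName, charindex(',', TheName) + 1, 500))
    from TheNameTable
    where NameID > @StartID
      and NameID <= @StartID + @BatchSize
      and charindex(',', TheName) > 0;   -- skip rows with no comma

    set @StartID = @StartID + @BatchSize;
end;

Since the distinct is per batch, a final de-duplication pass over #LastNames would still be needed at the end.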

I have a feeling this will still take a LONG time.

Anyone out there have any other ideas?

Would it be worthwhile to simply export the data and process the file outside the database?

The Solution I went with:

As suggested, I created two computed columns: one for the first name, one for the last name. They are not persisted, as I have limited space.

alter table TheNameTable 
add LastName as substring(TheName, charindex(',',TheName)+1,1000)

alter table TheNameTable 
add FirstName as left(TheName,charindex(',',TheName)-1)
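
As a side note (my observation, not from the original post): the LastName expression keeps whatever space follows the comma. A variant wrapped in ltrim (shown here with a hypothetical column name) would strip it:

alter table TheNameTable 
add LastNameTrimmed as ltrim(substring(TheName, charindex(',',TheName)+1,1000))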

I asked the admins for a temporary increase in RAM, and they bumped the VM up to 32 GB.

I created a new table, with columns for both FirstName and LastName. I put a unique compound index on the columns, but specified IGNORE_DUP_KEY = ON.

I just inserted the first 1,000,000 records. It filtered out 125,000 duplicates. The whole statement took 9 seconds to run.
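
For anyone who wants to reproduce the setup, a sketch of it might look like the following (table, index, and column names are illustrative; the NameID cutoff stands in for "the first 1,000,000 records"):

create table dbo.ParsedNames
(
  FirstName varchar(500),
  LastName  varchar(500)
);

-- duplicates are silently discarded instead of raising an error
create unique index IX_ParsedNames
  on dbo.ParsedNames (LastName, FirstName)
  with (IGNORE_DUP_KEY = ON);

-- FirstName and LastName here are the computed columns added above
insert into dbo.ParsedNames (FirstName, LastName)
select FirstName, LastName
from TheNameTable
where NameID <= 1000000;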

That is the speed I was looking for!

Best Answer

400 million names is a lot. Am I in there? ;-)

My gut-level feeling is that using substring isn't going to be much slower than coding something up via the CLR. I'm a SQL guy, I've done a fair amount of simple parsing in the past (2000 or 2005), and I was involved in what was going to be a very complicated parsing scheme (addresses, world-wide) written in C and called via an xproc, until we found that our prototype "native" code wasn't any faster than the same thing written with T-SQL functions.

If you want to use a language other than T-SQL, I'd suggest writing a CLR function in C# or VB.NET. For simple things, it's not hard to code in the CLR; I went from newbie to having a couple of working directory and file utilities in less than one morning, and there are plenty of examples of simple CLR procedures on the net. On the other hand, you won't have to learn anything (or install Visual Studio) if you write it in T-SQL.

No matter what, you will have to go through the table at least once. If you export, parse, and then load back what is not a small amount of data, that's a lot of time. Can you guarantee that your source isn't going to change in the meantime?

Here's the thing that always seems to sneak up on everyone: what happens with the parsed data? Where does it wind up? Do you intend to update the rows in place? Perhaps there are LastName and FirstName columns that you don't show in your example?

If you do, and those columns are currently NULL or hold zero-length strings, you may find that the update statement performs very badly, because SQL may have to split pages to store the last names.

In other words, your performance problem isn't parsing, it is storing the parsed data. Often this is worse than inserting the data into another table. Also, all of those page splits will fragment your table and cause query performance to drop, which may enrage your DBA because s/he will have to run a defrag procedure on the (large) table.

Here's one last thought: do you really need to store the parsed data? Can you get away with a computed column that calculates the last name on the fly? Those are indexable, under certain conditions, if you need that. Another approach would be a view that exposes the columns of the table as well as your "parsed last name" column.
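
As an illustration only, the view approach might look something like this sketch (the view name, the ltrim, and the comma guard are my assumptions):

create view dbo.NamesParsed
as
select NameID,
       TheName,
       left(TheName, charindex(',', TheName) - 1) as FirstName,
       ltrim(substring(TheName, charindex(',', TheName) + 1, 500)) as LastName
from dbo.TheNameTable
where charindex(',', TheName) > 0;  -- guard against rows with no comma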