400 million names is a lot. Am I in there? ;-)
My gut-level feeling is that using SUBSTRING isn't going to be terribly much slower than coding something up via the CLR. I'm a SQL guy; I've done a fair amount of simple parsing in the past (2000 or 2005), and I was involved in what was going to be a very complicated parsing scheme (addresses, world-wide) written in C and called via an xproc, until we found that our prototype "native" code wasn't any faster than the same thing written with T-SQL functions.
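Just to sketch what I mean by the substring route (this assumes a made-up dbo.People table with a FullName column in "First Last" form; adjust for your actual layout):
SELECT FullName,
       LTRIM(SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1, LEN(FullName))) AS LastName
FROM dbo.People
WHERE CHARINDEX(' ', FullName) > 0; -- skip single-token names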
If you want to use a language other than T-SQL, I'd suggest writing a CLR function in C# or VB.NET. For simple things, it's not hard to code for the CLR; I went from newbie to having a couple of working directory and file utilities in less than one morning, and there are plenty of examples of simple CLR procedures on the net. On the other hand, you won't have to learn anything (or install Visual Studio) to write it in T-SQL.
No matter what, you will have to go through the table at least once. If you export, parse, and then load back what is not a small amount of data, that's a lot of time. Can you guarantee that your source isn't going to change in the meantime?
Here's the thing that always seems to sneak up on everyone:
What happens with the parsed data? Where does it wind up? Do you intend to update the row in place? Perhaps there are lastname and firstname columns that you don't show in your example.
If you do, and those columns are currently NULL or hold zero-length strings, you may find that the UPDATE performs very badly, because SQL Server may have to split pages to make room for the lastname values.
In other words, your performance problem isn't parsing, it's storing the parsed data; often that's worse than inserting the data into another table. Also, all of those page splits will fragment your table and cause query performance to drop, which may enrage your DBA, because s/he will have to run a defrag procedure on the (large) table.
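If you do end up updating in place, something like a batched update (rough sketch below; FullName and LastName are assumed column names) won't avoid the page splits, but it keeps each transaction short and limits lock and log pressure:
DECLARE @rows int;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    UPDATE TOP (50000) p
    SET    LastName = LTRIM(SUBSTRING(FullName, CHARINDEX(' ', FullName) + 1, LEN(FullName)))
    FROM   dbo.People AS p
    WHERE  p.LastName IS NULL
           AND CHARINDEX(' ', p.FullName) > 0;
    SET @rows = @@ROWCOUNT; -- loop ends once no unparsed rows remain
END;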
Here's one last thought:
Do you really need to store the parsed data at all? Can you get away with a computed column that calculates the last name on the fly? Those are indexable, with certain conditions, if you need that. Another approach would be a view that exposes the columns of the table plus your "parsed lastname" column.
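A minimal sketch of the computed-column idea (table and column names are placeholders; the + ' ' just makes single-token names come out empty instead of repeating the whole name):
ALTER TABLE dbo.People
    ADD LastNameParsed AS
        (LTRIM(SUBSTRING(FullName, CHARINDEX(' ', FullName + ' ') + 1, 4000)));

-- optionally index it (works here because the expression is deterministic):
CREATE INDEX IX_People_LastNameParsed ON dbo.People (LastNameParsed);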
A solution I've used in the past (and have recommended here and on StackOverflow before) is to create two additional schemas:
CREATE SCHEMA shadow AUTHORIZATION dbo;
CREATE SCHEMA cache AUTHORIZATION dbo;
Now create a mimic of your table in the cache schema:
CREATE TABLE cache.IPLookup(...columns...);
Now when you are doing your switch operation:
TRUNCATE TABLE cache.IPLookup;
BULK INSERT cache.IPLookup FROM ...;
-- the nice thing about the above is that it doesn't really
-- matter if it takes one minute or ten - you're not messing
-- with a table that anyone is using, so you aren't going to
-- interfere with active users.
-- this is a metadata operation so extremely fast - it will wait
-- for existing locks to be released, but won't block new locks
-- for very long at all:
BEGIN TRANSACTION;
ALTER SCHEMA shadow TRANSFER dbo.IPLookup;
ALTER SCHEMA dbo TRANSFER cache.IPLookup;
COMMIT TRANSACTION;
-- now let's move the shadow table back over to
-- the cache schema so it's ready for next load:
ALTER SCHEMA cache TRANSFER shadow.IPLookup;
TRUNCATE TABLE cache.IPLookup;
-- truncate is optional - I usually keep the data
-- around for debugging, but that's probably not
-- necessary in this case.
This will be more cumbersome if you have foreign keys and other dependencies (since you may have to drop those and re-create them), and of course it completely invalidates statistics, which in turn can affect plans. But if the most important thing is getting accurate data in front of your users with minimal interruption, this can be an approach to consider.
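For what it's worth, if a foreign key does point at the table, the re-wiring around the swap looks roughly like this (constraint and column names are made up):
ALTER TABLE dbo.SomeChildTable DROP CONSTRAINT FK_SomeChildTable_IPLookup;

BEGIN TRANSACTION;
ALTER SCHEMA shadow TRANSFER dbo.IPLookup;
ALTER SCHEMA dbo TRANSFER cache.IPLookup;
COMMIT TRANSACTION;

ALTER TABLE dbo.SomeChildTable WITH CHECK
    ADD CONSTRAINT FK_SomeChildTable_IPLookup
    FOREIGN KEY (IPLookupID) REFERENCES dbo.IPLookup (IPLookupID);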
Best Answer
Your intuition is right here - it's not common to have a 600 MB SQL script as part of your SSDT project in source control.
Especially in the world of distributed version control systems, like git, you're talking about potentially numerous copies of that file, on various machines, and tracking the history of it in detail.
However, there is no "built-in" way of dealing with static data other than post-deploy scripts, as you've noticed:
SQL SSDT Database Projects: Source Control Lookup Data
One solution would be to create a one-time-use SSDT project, and all it has is your configuration data scripts. Then you can produce a dacpac file from that (which is really just a compressed folder), and "reference" that dacpac in your main project.
The advantage of this approach is that you get a compressed deploy script, but it's in a format that easily fits with your existing deployment pipeline (rather than just throwing the script file into a .zip archive).
Here's what that looks like.
I have an SSDT project with the AdventureWorks sample database schema in it. The resulting dacpac is 80 KB:
I'll create a second project, and all it contains is a Post Deploy script, which references my static config data script:
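The post-deploy script itself can be as simple as a SQLCMD include of the big data file (the file name here is just a placeholder):
/* Post-Deployment Script for the InitScripts project */
:r .\StaticConfigData.sql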
Building that project produces a dacpac file that is 34 KB.
Based on that compression ratio, which depends a lot on your data, your 600 MB could get as small as 2 MB (there is a lot of repetition in a script like this, which lends itself to good compression).
Then you can add the dacpac as a "Database Reference" in your main SSDT project:
At this point, you could discard the "InitScripts" project. Or you could store it in a separate standalone repository that fewer people will need, in case you ever need to update and regenerate the dacpac.
Note that, in the end, your build / deploy process will need to deploy both dacpac files: your "main" one every time, and the "InitScripts" one only on initial setup (or you could deploy it every time, and have the init scripts check for existing data before taking any action).
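If you go the "deploy it every time" route, that check can be a simple guard at the top of the data script (table and column names are placeholders):
IF NOT EXISTS (SELECT 1 FROM dbo.ConfigLookup)
BEGIN
    INSERT dbo.ConfigLookup (ConfigID, ConfigValue)
    VALUES (1, N'some value'); -- ...the rest of the generated rows...
END;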