Sql-server – UPDATE with JOIN on 100mm records, how to do this better? (in T-SQL)

sql serversql-server-2008t-sql

I need to update 100 million records in a single table, in effect, normalizing the table by replacing the varchar value of a column with simply an ID. (I say "replacing" but really I'm writing the ID into another column.)

What I'm trying to achieve is to normalize the dataset. The not yet normalized data has no indexing. My thought was that I would not build indexes on the raw values, waiting, instead to index the foreign keys that will be replacing the varchar values with tinyint values after the update completes.

UPDATE A
SET A.AutoClassID = B.AutoClassID
FROM AutoDataImportStaging.dbo.Automobile as A
JOIN AutoData.dbo.AutoClass as B on (A.AutoClassName = B.AutoClassName)

Background

  • using MSSQL 2008 R2 on Server 2008 R2
  • server has 8 GB RAM
  • server has one RAID10, 7200 RPM SATA (not great, I know, in production this will only read data and not write data; plus recent HD shortage made this necessary for cost)
  • server has dual quad-core Xeon CPU
  • the machine is not doing anything else (currently dedicated to dev, only this process)
  • simple logging turned on (? – but does it still log so that it can rollback?)
  • note that the query references two different DBs, for what that's worth
  • "width" of a record in table getting updated is 455 bytes

Resources During Execution

  • physical RAM is maxed out
  • disk I/O is maxed out
  • CPU is hardly doing anything (choke point is I/O)
  • run time has been 14 hours and counting!

I suspect a few things like I need an index on the raw data, even though I will be dropping the column (AutoClassName) after the normalization updates. I also wonder if I should just loop down the table one record at a time instead of the JOIN, which seemed ridiculous at the time I started this, but now it seems that would have been faster.

How should I change my methodology for my remaining normalization updates (similar to this one) more quickly?

Best Answer

You're trying to do this as a single (very large) transaction. Instead, do the update in smaller batches.

You would also benefit from:

  • A temporary index on AutoData.dbo.AutoClass.AutoClassName
  • More RAM. Lot's more RAM.