SQL Server – Updating large table worked before, now taking hours

Tags: performance, query-performance, sql-server, update

BACKGROUND: I am using a non-production SQL Server to manipulate a large data set and run a few basic calculations. The original data file I have, which is about 30M rows, is missing records (not NULL records; they simply don't exist, but for reasons beyond my judgement they must be there). So I essentially created a "shell" table for all rows and columns (6 columns total) that should be there; this one is about 80M rows.

PROBLEM: The step I need help with is overlaying the original 30M rows on top of the 80M row table. (All 30M rows have a matching record in the larger table based on the values of three columns, and the remaining three columns, currently null, need to be overwritten with the original data.)

I successfully did this process for a larger dataset (40M/100M tables) and it took about 4 hours. That's fine; it's a one-shot thing. But when I run it on this smaller data set, it runs for 10+ hours. The first time I got a timeout error. I've tried adjusting a few things and running it twice more, but I've cancelled the query after about 10 hours thinking I could try something else.

Any thoughts on why it worked the first time for a larger table, but not on this smaller one? Any thoughts on how I can make the code more efficient or clear out something in a file somewhere?

CODE:

update shelltable
set shelltable.column4 = originaltable.column4,
    shelltable.column5 = originaltable.column5,
    shelltable.column6 = originaltable.column6
from shelltable left join originaltable
  on shelltable.column1 = originaltable.column1 and
     shelltable.column2 = originaltable.column2 and
     shelltable.column3 = originaltable.column3;

Things I've tried since the unsuccessful attempts:

  • Restarting the SQL Server instance so tempdb clears
  • Restarting the PC
  • Reducing the number of columns in the SET clause to 1 (I would re-run for each of columns 4-6)
  • Updating statistics on both tables
  • Running the Database Engine Tuning Advisor, which recommended and created an index on the larger table
  • Adding 'WHERE originaltable.column4 is not null' to exclude writing on some of the rows (the adjusted statement is sketched after this list)
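For reference, here is a sketch of what the adjusted single-column attempt looks like, using the same table and column names as above (statistics refreshed first, one SET column per run, plus the NOT NULL filter):

-- Refresh statistics on both tables before the update.
UPDATE STATISTICS shelltable;
UPDATE STATISTICS originaltable;

-- Single-column variant of the update, skipping rows with no source value.
UPDATE shelltable
SET shelltable.column4 = originaltable.column4
FROM shelltable
     LEFT JOIN originaltable
       ON shelltable.column1 = originaltable.column1
      AND shelltable.column2 = originaltable.column2
      AND shelltable.column3 = originaltable.column3
WHERE originaltable.column4 IS NOT NULL;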

The latest attempt is still running after these adjustments and is currently at 9.5 hours. Thanks for your tips and insights!

Best Answer

It would help a lot if you listed the indexes you have and the query plan. If columns 4, 5, or 6 are indexed, that can slow the process down, while having a clustered index on columns 1, 2, and 3 should speed it up.
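For example, a clustered index covering the three join columns could look something like this (a sketch; the index name is made up, and it assumes the shell table has no existing clustered index):

-- Clustered index on the three columns used in the join predicate.
CREATE CLUSTERED INDEX IX_shelltable_join
    ON shelltable (column1, column2, column3);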

If columns 4, 5, or 6 are variable length (e.g. varchar), then updating them to a new value can cause a lot of page splits, which will definitely make it slower.

I think what John M meant was that if there is a reasonable number of distinct values in, for example, column1, you could first collect those distinct values into a temp table and then run the update separately for each value, because joining the tables in smaller parts is usually a lot faster.
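A minimal sketch of that batching idea, assuming column1 is an int (adjust the variable type to match your schema; the temp table and cursor names are made up):

-- Collect the distinct column1 values to drive the batches.
SELECT DISTINCT column1
INTO #keys
FROM originaltable;

DECLARE @key int;

DECLARE key_cursor CURSOR FOR SELECT column1 FROM #keys;
OPEN key_cursor;
FETCH NEXT FROM key_cursor INTO @key;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Update only the slice of rows sharing this column1 value.
    UPDATE shelltable
    SET shelltable.column4 = originaltable.column4,
        shelltable.column5 = originaltable.column5,
        shelltable.column6 = originaltable.column6
    FROM shelltable
         INNER JOIN originaltable
           ON shelltable.column1 = originaltable.column1
          AND shelltable.column2 = originaltable.column2
          AND shelltable.column3 = originaltable.column3
    WHERE shelltable.column1 = @key;

    FETCH NEXT FROM key_cursor INTO @key;
END

CLOSE key_cursor;
DEALLOCATE key_cursor;
DROP TABLE #keys;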

Of course, everything depends on several things, like the data types, the number of distinct values in columns 1-3, and so on.