Our vendor changed column widths on almost every column in the entire database. The database is around 7 TB, with 9,000+ tables. We are trying to create an index on a table that has 5.5 billion rows. Before the vendor's upgrade we could create the index in 2 hours; now it takes days. What they did was increase every VARCHAR(xx) size to VARCHAR(256), so columns that used to be VARCHAR(18), VARCHAR(75), etc. are now all VARCHAR(256).
Anyway, the primary key consists of 6 columns whose combined width was 126 characters. After the upgrade, the primary key is 1,283 characters, which violates SQL Server's 900-byte index key limit. The total combined VARCHAR width of the table went from 1,049 to 4,009.
There is no increase in data; the table doesn't take up any more "space" than it did before the column width increases. But the performance of something as simple as creating an index is now unreasonable.
Can anyone explain why it takes so much longer to create an index when the only change was increasing the size of the columns?
The index we are trying to create is nonclustered, since the PK is the clustered index. After several attempts to create the index, we gave up; I think it ran 4 or 5 days without completing.
I also tried this in a non-production environment by taking a file system snapshot and bringing the database up on a quieter server.
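To illustrate the key-size problem in isolation, here is a hypothetical, simplified repro (not the vendor's actual schema — the real key reportedly totals 1,283 characters across mixed widths):

```sql
-- Hypothetical repro of the widened primary key. Six VARCHAR(256) columns give
-- a declared key width of 6 x 256 = 1,536 bytes, above SQL Server's 900-byte
-- index key limit, so creation raises a warning and any insert whose actual
-- key exceeds 900 bytes fails.
CREATE TABLE dbo.WidenedKey (
    c1 VARCHAR(256) NOT NULL,
    c2 VARCHAR(256) NOT NULL,
    c3 VARCHAR(256) NOT NULL,
    c4 VARCHAR(256) NOT NULL,
    c5 VARCHAR(256) NOT NULL,
    c6 VARCHAR(256) NOT NULL,
    CONSTRAINT PK_WidenedKey
        PRIMARY KEY CLUSTERED (c1, c2, c3, c4, c5, c6)
);
```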
Best Answer
Remus has helpfully pointed out that the declared max length of the `VARCHAR` column impacts the estimated row size and therefore the memory grants that SQL Server provides. I tried to do a bit more research to expand on the "from this on things cascade" part of his answer. I don't have a complete or concise explanation, but here is what I found.
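One way to see this effect directly is to watch the memory grant for the `CREATE INDEX` while it runs. This is a generic sketch against the built-in DMV, not part of my repro script (the session id is a placeholder):

```sql
-- Observe the memory grant for a running session (e.g., the CREATE INDEX session).
-- requested_memory_kb is driven by the ESTIMATED row size, which assumes each
-- VARCHAR column is half-full of its declared max length.
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       used_memory_kb,
       max_used_memory_kb
FROM sys.dm_exec_query_memory_grants
WHERE session_id = 55;  -- replace 55 with the session running CREATE INDEX
```

Comparing `requested_memory_kb` between the two table versions shows the grant inflating with the declared widths even though the actual data is identical.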
Repro script
I created a full script that generates a fake data set on which index creation takes roughly 10x as long on my machine for the `VARCHAR(256)` version. The data used is exactly the same, but the first table uses the actual max lengths of 18, 75, 9, 15, 123, and 5, while all columns use a max length of 256 in the second table.
Keying the original table
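The index build itself is an ordinary nonclustered index creation; a sketch of the statement (table and column names here are illustrative, not the vendor's actual schema):

```sql
-- Hypothetical names; the point is a multi-column nonclustered index
-- over the (formerly narrow) VARCHAR columns.
CREATE NONCLUSTERED INDEX IX_BigTable_Key
ON dbo.BigTable (col1, col2, col3, col4, col5, col6)
WITH (SORT_IN_TEMPDB = ON);
```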
Here we see that the original query completes in about 20 seconds and the logical reads are equal to the table size of ~1.5 GB (195K pages at 8 KB per page).
Keying the VARCHAR(256) table
For the `VARCHAR(256)` table, we see that the elapsed time has increased dramatically. Interestingly, neither the CPU time nor the logical reads increase. This makes sense given that the table holds the exact same data, but it doesn't explain why the elapsed time is so much slower.
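The timing and logical-read numbers can be reproduced with the standard session statistics settings (the index name here is a placeholder):

```sql
-- Emit per-statement CPU/elapsed time and I/O counts to the Messages tab.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

CREATE NONCLUSTERED INDEX IX_Test ON dbo.TestTable (col1, col2);  -- hypothetical

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```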
I/O and wait stats: original
If we capture a bit more detail (using p_perfMon, a procedure that I wrote), we can see that the vast majority of the I/O is performed on the `LOG` file. We see a relatively modest amount of I/O on the actual `ROWS` file (the main data file), and the primary wait type is `LATCH_EX`, indicating in-memory page contention.
We can also see that my spinning disk is somewhere between "bad" and "shockingly bad", according to Paul Randal. :)
I/O and wait stats: VARCHAR(256)
For the `VARCHAR(256)` version, the I/O and wait stats look completely different! Here we see a huge increase in the I/O on the data file (`ROWS`), and the stall times now make Paul Randal simply say "WOW!". It's not surprising that the #1 wait type is now `IO_COMPLETION`. But why is so much I/O being generated?
Actual query plan: VARCHAR(256)
From the query plan, we can see that the `Sort` operator has a recursive spill (5 levels deep!) in the `VARCHAR(256)` version of the query. (There is no spill at all in the original version.)
Live query progress: VARCHAR(256)
We can use `sys.dm_exec_query_profiles` to view live query progress in SQL Server 2014+. In the original version, the entire `Table Scan` and `Sort` are processed without any spills (`spill_page_count` remains 0 throughout).
In the `VARCHAR(256)` version, however, we can see that page spills quickly accumulate for the `Sort` operator. Here is a snapshot of the query progress just before the query completes; the data here is aggregated across all threads.
If I dig into each thread individually, I see that 2 threads complete the sort within about 5 seconds (at ~20 seconds overall, after 15 seconds spent on the table scan). If all threads progressed at this rate, the `VARCHAR(256)` index creation would have completed in roughly the same time as with the original table. However, the remaining 6 threads progress at a much slower rate. This may be due to the way memory is allocated and the way the threads are being held up by I/O as they spill data. I don't know for sure, though.
What can you do?
There are a number of things you might consider trying:
- Use `OPTION (MAXDOP X)`, where `X` is a lower number than your current server-level setting. When I used `OPTION (MAXDOP 2)` on this specific data set on my machine, the `VARCHAR(256)` version completed in 25 seconds (compared to 3-4 minutes with 8 threads!). It's possible that the spilling behavior is exacerbated by higher parallelism.
Further reading
Paul White has a nice blog post on the internals of SQL Server sorts that may be of interest. It does talk a little bit about spilling, thread skew, and memory allocation for parallel sorts.