SQL Server – Why is it taking longer to create an index after column size increases?

database-internals sql-server

Our vendor changed column widths on almost every column in the entire database. The database is around 7 TB, with 9,000+ tables. We are trying to create an index on a table that has 5.5 billion rows. Before the vendor's upgrade we could create the index in 2 hours; now it takes days. What they have done is increase every varchar(xx) size to varchar(256), whereas most columns used to be varchar(18), varchar(75), etc.

Anyway, the primary key consists of 6 columns whose combined width was 126 characters. After the upgrade, the primary key is 1,283 characters, which violates SQL Server's 900-byte limit for an index key. The table's total combined varchar width went from 1,049 to 4,009.

There is no increase in data; the table doesn't take up any more "space" than it did before the column width increases, but creating something as simple as an index now takes an unreasonable amount of time.
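
A check along these lines, with placeholder table and column names, is one way to confirm that; it compares the declared width with the widest value actually stored:

-- Compare the declared width with the widest value actually stored.
-- Table and column names are placeholders for one of our real tables.
SELECT  COL_LENGTH('dbo.SomeVendorTable', 'SomeColumn') AS declared_bytes,
        MAX(LEN(SomeColumn))                            AS longest_actual_value
FROM    dbo.SomeVendorTable;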

Can anyone explain why it is taking so much longer to create an index when the only thing that changed was the size of the columns?

The index we are trying to create is nonclustered, since the PK is the clustered index. After several attempts to create the index, we gave up; I think it ran 4 or 5 days without completing.

I tried this in a non-production environment by taking a file-system snapshot and bringing the database up on a quieter server.

Best Answer

Remus has helpfully pointed out that the max length of the VARCHAR column impacts the estimated row size and therefore memory grants that SQL Server provides.
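
One way to see that effect directly is to watch the grant from a second session while the CREATE INDEX runs; a minimal sketch against the standard DMV:

-- Run from a separate connection while the index build is active.
-- The grant is sized from the estimated row size, which is driven by the
-- declared varchar lengths rather than the data actually stored.
SELECT  session_id,
        requested_memory_kb,
        granted_memory_kb,
        used_memory_kb,
        max_used_memory_kb
FROM    sys.dm_exec_query_memory_grants;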

I tried to do a bit more research to expand on the "from this on things cascade" part of his answer. I don't have a complete or concise explanation, but here is what I found.

Repro script

I created a full script that generates a fake data set on which index creation takes roughly 10x as long on my machine for the VARCHAR(256) version. The data used is exactly the same, but the first table uses the actual max lengths of 18, 75, 9, 15, 123, and 5, while all columns use a max length of 256 in the second table.
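
The full script generates the fake data; the essential difference between the two tables is only the declared widths, roughly along these lines (the column names s5 and s6 and the width-to-column mapping are my assumptions, since only s1 through s4 are named explicitly):

-- Sketch of the two table shapes; the full repro script also loads
-- ~1.5 GB of identical fake data into each.
CREATE TABLE dbo.testVarchar (
    s1 VARCHAR(18)  NOT NULL,
    s2 VARCHAR(75)  NOT NULL,
    s3 VARCHAR(9)   NOT NULL,
    s4 VARCHAR(15)  NOT NULL,
    s5 VARCHAR(123) NOT NULL,
    s6 VARCHAR(5)   NOT NULL
);

CREATE TABLE dbo.testVarchar256 (
    s1 VARCHAR(256) NOT NULL,
    s2 VARCHAR(256) NOT NULL,
    s3 VARCHAR(256) NOT NULL,
    s4 VARCHAR(256) NOT NULL,
    s5 VARCHAR(256) NOT NULL,
    s6 VARCHAR(256) NOT NULL
);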


Keying the original table

Here we see that the original query completes in about 20 seconds, and the logical reads match the table size of ~1.5 GB (about 195K pages at 8 KB per page).

-- CPU time = 37674 ms,  elapsed time = 19206 ms.
-- Table 'testVarchar'. Scan count 9, logical reads 194490, physical reads 0
CREATE CLUSTERED INDEX IX_testVarchar
ON dbo.testVarchar (s1, s2, s3, s4)
WITH (MAXDOP = 8) -- Same as my global MAXDOP, but just being explicit
GO
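
The CPU/elapsed and logical-read lines in the comments above and below are in the format produced by the standard statistics output; turning it on before each CREATE INDEX reproduces the same Messages-tab lines:

-- Prints "CPU time / elapsed time" and "Scan count / logical reads"
-- lines for every statement that follows.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;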


Keying the VARCHAR(256) table

For the VARCHAR(256) table, we see that the elapsed time has increased dramatically.

Interestingly, neither the CPU time nor the logical reads increase. This makes sense given that the table has the exact same data, but it doesn't explain why the elapsed time is so much longer.

-- CPU time = 33212 ms,  elapsed time = 263134 ms.
-- Table 'testVarchar256'. Scan count 9, logical reads 194491
CREATE CLUSTERED INDEX IX_testVarchar256
ON dbo.testVarchar256 (s1, s2, s3, s4)
WITH (MAXDOP = 8) -- Same as my global MAXDOP, but just being explicit
GO


I/O and wait stats: original

If we capture a bit more detail (using p_perfMon, a procedure that I wrote), we can see that the vast majority of the I/O is performed on the LOG file. We see a relatively modest amount of I/O on the actual ROWS (the main data file), and the primary wait type is LATCH_EX, indicating in-memory page contention.

We can also see that my spinning disk is somewhere between "bad" and "shockingly bad", according to Paul Randal :)

[Screenshot: I/O and wait stats captured during the original index build]
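
p_perfMon itself isn't shown here, but a rough before/after version of the same picture can be assembled from the standard DMVs; this is only a sketch, not the actual procedure:

-- Cumulative file I/O for the current database, split by file type
-- (ROWS vs LOG), plus the top waits. Snapshot before and after the
-- CREATE INDEX and diff the numbers.
SELECT  mf.type_desc               AS file_type,      -- ROWS or LOG
        SUM(vfs.num_of_reads)      AS reads,
        SUM(vfs.num_of_writes)     AS writes,
        SUM(vfs.io_stall_read_ms)  AS read_stall_ms,
        SUM(vfs.io_stall_write_ms) AS write_stall_ms
FROM    sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS vfs
JOIN    sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
GROUP BY mf.type_desc;

SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms
FROM   sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;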


I/O and wait stats: VARCHAR(256)

For the VARCHAR(256) version, the I/O and wait stats look completely different! Here we see a huge increase in the I/O on the data file (ROWS), and the stall times now make Paul Randal simply say "WOW!".

It's not surprising that the #1 wait type is now IO_COMPLETION. But why is so much I/O generated?

[Screenshot: I/O and wait stats captured during the VARCHAR(256) index build]


Actual query plan: VARCHAR(256)

From the query plan, we can see that the Sort operator has a recursive spill (5 levels deep!) in the VARCHAR(256) version of the query. (There is no spill at all in the original version.)

[Screenshot: actual query plan showing the Sort operator's level-5 spill warning]
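
If capturing an actual plan for a multi-hour index build isn't practical, the spills can also be caught with an Extended Events session on the sort_warning event; a sketch, with an arbitrary session name and file target:

-- Fires each time a sort spills to tempdb.
CREATE EVENT SESSION sort_spills ON SERVER
ADD EVENT sqlserver.sort_warning
ADD TARGET package0.event_file (SET filename = N'sort_spills');
GO
ALTER EVENT SESSION sort_spills ON SERVER STATE = START;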


Live query progress: VARCHAR(256)

We can use sys.dm_exec_query_profiles to view live query progress in SQL 2014+. In the original version, the entire Table Scan and Sort are processed without any spills (spill_page_count remains 0 throughout).

In the VARCHAR(256) version, however, we can see that page spills quickly accumulate for the Sort operator. Here is a snapshot of the query progress just before the query completes. The data here is aggregated across all threads.

[Screenshot: live query progress from sys.dm_exec_query_profiles, aggregated across all threads]
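
A query along these lines produces that kind of per-operator rollup; the session id is a placeholder, and the DMV only reports on requests that are emitting live statistics:

-- Run from a second connection; 53 stands in for the session running the
-- CREATE INDEX. Rows are summed across worker threads per plan operator.
SELECT  node_id,
        physical_operator_name,
        SUM(row_count)          AS rows_so_far,
        SUM(estimate_row_count) AS estimated_rows
FROM    sys.dm_exec_query_profiles
WHERE   session_id = 53
GROUP BY node_id, physical_operator_name
ORDER BY node_id;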

If I dig into each thread individually, I see that 2 threads complete the sort within about 5 seconds (around the 20-second mark overall, after 15 seconds spent on the table scan). If all threads progressed at this rate, the VARCHAR(256) index creation would have completed in roughly the same time as on the original table.

However, the remaining 6 threads progress at a much slower rate. This may be due to the way memory is allocated and the way the threads are held up by I/O as they spill data, but I don't know for sure.

[Screenshot: live query progress from sys.dm_exec_query_profiles, broken out per thread]


What can you do?

There are a number of things you might consider trying:

  • Work with the vendor to roll back to a previous version. If that's not possible, let the vendor know that you are not happy with this change so that they can consider reverting it in a future release.
  • When adding your index, consider specifying a MAXDOP value lower than your current server-level setting (for CREATE INDEX this is the WITH (MAXDOP = X) option used above); see the sketch after this list. When I used MAXDOP 2 on this specific data set on my machine, the VARCHAR(256) version completed in 25 seconds (compared to 3-4 minutes with 8 threads!). It's possible that the spilling behavior is exacerbated by higher parallelism.
  • If additional hardware investment is a possibility, profile the I/O (the likely bottleneck) on your system and consider using an SSD to reduce the latency of the I/O incurred by spills.
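
Against the repro table, the lower-DOP build from the second bullet looks like this; 2 happened to work well on my machine, but the best value for your system may differ:

-- Same index definition as above, but capped at 2 parallel workers.
-- This is the variant that completed in ~25 seconds on my machine.
CREATE CLUSTERED INDEX IX_testVarchar256
ON dbo.testVarchar256 (s1, s2, s3, s4)
WITH (MAXDOP = 2, DROP_EXISTING = ON); -- DROP_EXISTING rebuilds the index created earlier
GO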


Further reading

Paul White has a nice blog post on the internals of SQL Server sorts that may be of interest. It does talk a little bit about spilling, thread skew, and memory allocation for parallel sorts.