SQL Server – Not able to regulate/auto-grow tempdb size

sql-server, tempdb

We have a SQL Server 2017 (RTM-CU15) availability group (3 servers, 2 of them synchronous HA replicas) that has had performance issues for the past two months. While resolving an issue yesterday we found that the instance was struggling with tempdb, as per the logs:

A time-out occurred while waiting for buffer latch type 2, bp 000001C9A41D4A40, page 4:40616, stat 0x810b, database id: 2, allocation unit Id… Not continuing to wait

There was no mention of storage issues in the Event Viewer system logs.

Upon inspection, we found that tempdb had eight data files of 250 MB each (2 GB in total), even though they were set to autogrow with no maximum size. That seems small to me for this instance.
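For reference, the current sizes and autogrow settings can be pulled from sys.master_files with something like this (sizes are stored as 8 KB pages):

```sql
-- Current size and autogrow settings for every tempdb file.
-- size, growth and max_size are stored as 8 KB pages.
SELECT  name,
        physical_name,
        type_desc,
        size * 8 / 1024 AS current_size_mb,
        CASE WHEN is_percent_growth = 1
             THEN CAST(growth AS varchar(10)) + ' %'
             ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
        END AS growth_setting,
        CASE max_size
             WHEN -1 THEN 'unlimited'
             ELSE CAST(max_size * 8 / 1024 AS varchar(20)) + ' MB'
        END AS max_size_setting
FROM    sys.master_files
WHERE   database_id = 2;   -- tempdb
```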

Based on the size of the other databases on the instance (about 1.5 TB) and the time some of my own BI queries take (around 30 minutes), I would have expected tempdb to be well over 2 GB.

No one was monitoring page life expectancy, but I suspect the instance was not able to auto-grow its tempdb files. Once we increased the tempdb size, we no longer had performance issues.
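For completeness, increasing the files looks roughly like this; the logical file names and target sizes below are only examples and need to match your own layout:

```sql
-- Example only: grow each tempdb data file to 4 GB with a fixed 512 MB autogrow.
-- The logical names (tempdev, temp2, ...) are the defaults and may differ.
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, SIZE = 4096MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb MODIFY FILE (NAME = temp2,   SIZE = 4096MB, FILEGROWTH = 512MB);
-- ...repeat for the remaining data files.
```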

I was surprised to find that both HA nodes had the same 2 GB tempdb size (both have hosted the production workload for some time).

I haven't found this issue described online. Is it a problem with this CU, or with 2017 in general? What could have prevented tempdb from auto-growing? Should we have looked elsewhere for the performance issue?

Instant file initialization is not enabled on the instance.

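If you want to confirm that without the GUI, sys.dm_server_services exposes it on recent builds (the instant_file_initialization_enabled column should be available on SQL Server 2017, and on 2016 SP1 / 2012 SP4 and later):

```sql
-- Check whether the engine service account can use Instant File Initialization.
-- The instant_file_initialization_enabled column reports Y/N per service.
SELECT  servicename,
        service_account,
        instant_file_initialization_enabled
FROM    sys.dm_server_services
WHERE   servicename LIKE 'SQL Server (%';
```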

The AG logs don't contain anything useful, although the AG did break when the tempdb error occurred.

The tempdb database had two months to grow, but it didn't; it stayed at its initial size of 2 GB. Shouldn't it have grown on its own based on the autogrow settings on each file (100 MB increments, unlimited maximum)?
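One way to check whether the files ever actually auto-grew is to look for auto-grow events in the default trace, along these lines (the default trace rolls over quickly, so it only covers recent history, and tempdb is recreated at its configured size on every restart):

```sql
-- Data File Auto Grow (92) and Log File Auto Grow (93) events
-- recorded in the default trace.
DECLARE @tracefile nvarchar(260);

SELECT @tracefile = path
FROM   sys.traces
WHERE  is_default = 1;

SELECT t.DatabaseName,
       t.FileName,
       t.StartTime,
       t.Duration / 1000        AS duration_ms,        -- Duration is in microseconds
       t.IntegerData * 8 / 1024 AS growth_mb           -- IntegerData is 8 KB pages
FROM   sys.fn_trace_gettable(@tracefile, DEFAULT) AS t
WHERE  t.EventClass IN (92, 93)
ORDER  BY t.StartTime DESC;
```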

Best Answer

As Tibor mentioned in the comments, add your SQL Server service account to the Perform volume maintenance tasks local security policy so it can use Instant File Initialization. I don't feel that alone will solve your problem, though, as I suspect you may be running into a byproduct of thin-provisioned disk. A nice definition is as follows:

A thin-provisioned virtual disk consumes only the space that it needs initially, and grows with time according to demand.

First, are these servers VMs? If so, is your infrastructure team presenting VM disks that have been thin-provisioned?

Alternatively, is the SAN presenting thin-provisioned disk to the VMs or physical hosts (in the event they are not VMs)?

What can happen with thin-provisioned disk is that when additional space needs to be allocated and the underlying disk pool/group is over-allocated, you'll run into errors similar to the ones you're seeing, because there is no actual storage left. From the OS perspective the disk has plenty of free space to expand into, yet you get these seemingly random errors that make no sense. Databases tend to suffer the most from thin provisioning because they often request resources in large quantities and need them quickly, while thin-provisioning mechanisms are built for slow and steady rather than varied and rapid.

It is pretty easy to identify whether this is your issue, as you only need to reach out to your VM admins and your storage admins. Ask whether any of the underlying disks holding the tempdb (or any other database) files are thin-provisioned, and if so, how much actual space is available to use. I've run into scenarios where the OS showed a 1 TB drive, but in reality I was limited to significantly less space because the underlying disk pool didn't have enough actual storage to honor the extra space the OS had been told it had.
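To frame that conversation, you can at least pull what the OS reports per volume from within SQL Server, for example with sys.dm_os_volume_stats; keep in mind these are exactly the numbers a thin-provisioned setup can overstate:

```sql
-- What the OS reports for each volume hosting a database file.
-- With thin provisioning these figures can overstate the storage
-- actually backing the volume, so confirm with the SAN/VM team.
SELECT DISTINCT
       vs.volume_mount_point,
       vs.logical_volume_name,
       vs.total_bytes     / 1024 / 1024 / 1024 AS total_gb,
       vs.available_bytes / 1024 / 1024 / 1024 AS free_gb
FROM   sys.master_files AS mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) AS vs;
```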

I hate when resources are thin-provisioned (especially memory) because you're not working with all the resources you've been promised. It's basically like telling your kids they can spend $100 when you only have $80 available and hoping they're fiscally responsible. I've yet to work in an environment where thin provisioning doesn't eventually cause issues. Database servers should be sized for usage spikes rather than averages, but infrastructure admins hate seeing idle resources, so they often offer thin provisioning as a compromise.