Sql-server – Unexplained sustained space used in the TempDB log file

sql serversql-server-2017tempdb

A few days ago the amount of used space in both the tempdb data and log files started growing rapidly. After 30 minutes, the tempdb log remained around 99% space used for about 7 hours. Towards the end of the 7 hour period, I:\Databases\templog.ldf: Operating system error 112(There is not enough space on the disk.) encountered. was logged followed by about 30 instances of The transaction log for database 'tempdb' is full due to 'ACTIVE_TRANSACTION' over a 3 second period, and then suddenly the tempdb log file returned to 1% used space. This usage data is recorded in logs every 3 minutes via a SQL agent job.

I also have a SQL agent job that runs sp_WhoIsActive every 1 minute and captures the output. I don't see anything that stands out in this data that explains what was consuming or holding space in the tempdb. There are some maintenance jobs running for several minutes at a time, typically running no longer than 5 or 10 minutes each. Even if one of those jobs or another session used up tempdb log space, I would expect the space to be reclaimable after any transactions complete or at least after the session ends.

  • It's a 200 GB drive dedicated for tempdb data and log files.
  • Both the data and log files grew during this time until there was no more space on disk.
  • There's a tempdb_2.ndf file I don't recall seeing before (maybe it's always been there…). tempdb.mdf and tempdb_2.ndf are the same exact size, around 29 GBs each. templog.ldf is using 137 GBs.
  • Running Microsoft SQL Server 2017 (RTM-CU15-GDR) (KB4505225) – 14.0.3192.2 (X64)

I'm trying to understand what happened and how to prevent it again.

  • Is it possible for tempdb log space to not be reclaimable for hours even if there are no sessions running for anywhere near that long?
  • What caused tempdb log space used to suddenly free up 3 seconds after the above error messages (not enough space on disk and tempdb transaction log is full)?
  • Is sp_WhoIsActive logging not enough to capture what process was holding onto the tempdb log space, what other additional logging can I do?
  • If the tempdb log space used was actually reclaimable but a glitch in SQL Server prevented it from clearing out, is there a T-SQL command I can run to try and free up available space in the log?

Any thoughts or suggestions are very appreciated.

Best Answer

Is it possible for tempdb log space to not be reclaimable for hours even if there are no sessions running for anywhere near that long?

While it's probably possible, I would say under normal circumstances the answer is no. In the absence of long-running transactions or long-lived sessions, the tempdb log should clear pretty quickly.

What caused tempdb log space used to suddenly free up 3 seconds after the above error messages (not enough space on disk and tempdb transaction log is full)?

The most likely explanation is that there was a long-running transaction. It finally encountered this error and was killed, which allowed the log space to free up.

Is sp_WhoIsActive logging not enough to capture what process was holding onto the tempdb log space, what other additional logging can I do?

I would expect that to capture it. One thing that might have caused you to miss it is if the process had executed statements in a transaction and then just stopped doing anything (without committing, rolling back, or disconnecting).

This would be the case if someone ran a query in SSMS that used a ton of tempdb, and then just left that SSMS window open all day. These queries (that are holding resources, but not doing anything actively) won't show up in sp_WhoIsActive by default. You need to add the @show_sleeping_spids = 2 option to see those. Here's an excerpt from the built-in docs:

screenshot of show_sleeping_spids option description

If the tempdb log space used was actually reclaimable but a glitch in SQL Server prevented it from clearing out, is there a T-SQL command I can run to try and free up available space in the log?

Not that I can think of. You might try issuing a CHECKPOINT, to force the VLFs to clear:

USE [tempdb];
GO
CHECKPOINT;