Sql-server – System SPID running for hours filling up the TempDB Log

sql serversql-server-2017tempdb

I've been trying to track down the source of my tempdb log filling up every few weeks where the tempdb log space used will continue to grow until it reaches 99% and stays there for hours while the log file grows. After adding in some additional logging and email alerts, I think I identified the source of the problem, at least for this incident, not sure if the cause is the same in past incidents.

dbcc opentran in the tempdb reported this:

Oldest active transaction:
    SPID (server process ID): 47s
    UID (user ID) : -1
    Name          : sort_init
    LSN           : (41:3017:42)
    Start time    : Dec 19 2020 10:03:20:410AM
    SID           : 0x01
DBCC execution completed. If DBCC printed error messages, contact your system administrator.

The spid is 47s which I discovered is a system spid. I ran sp_WhoIsActive with @show_system_spids = 1 which brought back data on the process. After running sp_WhoIsActive over the course of an hour, tempdb_allocations ended up at 4,221,088, tempdb_current was 4,128,656, open_tran_count was 1 (throughout the entire incident) and sql_text was always null. After running out of ideas, I ended up restarting SQL Server. Several times prior to restarting SQL server, I issued a checkpoint in tempdb but that didn't help, likely since the transaction from the system process was still active.

The Start time from the dbcc opentran output is 10:03am and it was 1:00pm when I restarted SQL Server. I ran dbcc opentran a few times over the course of an hour and the same output was returned each time.

In this particular case, the server was rebooted and it restarted at 10:02:54 — 26 seconds before this long running system process started at 10:03:20. Past incidents don't appear to have happened immediately after a system reboot. I don't have many details on past incidents, but at least the last time this occurred 9 days ago, in the sp_WhoIsActive output I collected (without @show_system_spids set to 1) there were no long running processes captured which likely means it was a similar system process running / filling up the tempdb log. This 99% full tempdb log incident 9 days ago was ongoing for 7 hours.

Nothing stands out in the SQL Server log and Windows logs. I'm running Microsoft SQL Server 2017 (RTM-CU15-GDR) (KB4505225) – 14.0.3192.2 (X64).

Is there a way to determine what a system SPID is doing? Or any tips on how I can better troubleshoot this problem when it happens next time? The only clue from this incident is the Name value of sort_init in the dbcc opentran output which could be helpful but seems generic at the same time.

Best Answer

There are some newer DMVs designed specifically for troubleshooting what is happening in TEMPDB. Take a look at sys.dm_db_session_space_usage and sys.dm_db_task_space_usage. See this link for a discussion of those DMSVs