I have a problem with my MSSQL cluster. When switching to SQL Server 2016 (from 2014) an error occurs with the tempdb, which is full after a certain time for no apparent reason.
Is there a proven way to monitor or avoid this behavior?
Because every time the tempdb is full, the entire server does not work and rejects every query.
The only solution that works when the error occurs is to increase the tempdb log and temdb dev files.
Currently the tempdb size for a 700 GB database is approximately 90 GB.
Thanks for everyone who can help!
Best Answer
TempDB is a database like any other in many respects, so monitoring tools that can watch metrics like space allocated & used and raise alerts based on those metrics can be used to monitor TempDB too. As a quick proof of that run
in SSMS.
If TempDB is constantly growing like this and the space remains used (not just allocated) which it sounds like it is temporarily as you need to increase the allocation in order to move on, then it sounds like you have long-running, possibly blocked, sessions that are holding a lot of data in TempDB. You could use diagnostic tools like sp_whoisactive (run in the app database, not tempdb) at such times to list active sessions at the time you have a problem. From sp_whoisactive's documentation:
If you spot a problem session (or several problem sessions) the
sql_text
andsql_command
outputs from it run with default options and with@get_full_inner_text=1, @get_outer_command=1
may help identify the processes that are either running away in loops or getting snarled up in lock waits that you need to address, or stopping and not closing (due to an application problem, or a human leaving a connection open in a tool like SSMS).There are many things that could cause sessions to balloon TempDB this way (perhaps a bug where you have an accidental
CROSS JOIN
that is spooling to TempDB, the size of that dump getting larger each time as your data grows, to pick one example - I've seen that before where someone has "fixed" it by addingDISTINCT
instead of actually addressing the errant/missing join predicates). It may also be that rather than a few errant sessions you have a glut of concurrent ones that use TempDB a lot and rather than being blocked by locks they are simple taking a lot of time to complete due to IO contention - if that is the case then in the run-up to things stalling because TempDB becomes full you will see your drives or network being pummelled.Once the space in TempDB is released these diagnostics will be less useful so you'll need to look at other things, but from your description it sounds like the space is still allocated so it is trying to grow as it needs more.