SQL Server – TempDB contention

Tags: sql-server, sql-server-2014, tempdb

We have an active 40 GB OLTP database on SQL Server 2014 SP1. Queries become slow with IO_COMPLETION waits, Disk Queue Length rises to 900, and SQL Server stops responding. What we tried:

  1. Restarted the instance, but within a minute it started behaving the same way.

  2. After the second restart, we increased the initial size of each tempdb data file (there are 16 data files) and it started working correctly.

Note: We are using table variables for intermediate result sets. These result sets are very small.

It has happened twice in a month. Each time I manually add a little space to the data files and it starts working normally again. More interesting, the same setup (same hardware, same folder and file layout, same workload) works fine on SQL Server 2008 R2 and SQL Server 2012.

Kindly help us to find a permanent solution.

The initial size of all data files is the same, 1000 MB; the current size is 1500 MB each, and all are identical. Autogrowth is 100 MB for each file. Before this we were facing PFS and GAM page contention; we increased the file count to 16 and that problem was resolved. Both trace flags 1117 and 1118 are enabled. 24 cores across 2 NUMA nodes. All data files are on the same volume: a simple disk, no SAN.
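For reference, the current tempdb file sizes and growth settings can be confirmed with a query like the one below. This is a generic diagnostic sketch, not part of the original question; `sys.master_files` stores sizes in 8 KB pages, hence the arithmetic to get MB.

```sql
-- Sketch: list tempdb data/log files with their size and growth settings.
-- size and (non-percent) growth are stored in 8 KB pages.
SELECT name,
       type_desc,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + ' %'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth_setting,
       physical_name
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');
```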

The instance is on a physical machine. Queries with table variables and queries with hash joins most commonly generate the IO_COMPLETION waits.
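To see which sessions are accumulating these waits while the problem is happening, a live snapshot along these lines can help. This is a generic diagnostic sketch (not from the original post); `wait_resource` values beginning with `2:` point at tempdb, whose database_id is 2.

```sql
-- Sketch: currently executing requests waiting on IO or on tempdb pages.
SELECT r.session_id,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       r.wait_resource,
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type IN ('IO_COMPLETION', 'PAGEIOLATCH_SH', 'PAGEIOLATCH_EX')
   OR r.wait_resource LIKE '2:%';
```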


The detailed answer by wBob pushed us to dig deeper. How did we miss it before:

Autogrow of file 'templog' in database 'tempdb' was cancelled by user or timed out after 7704 milliseconds. Use ALTER DATABASE to set a smaller FILEGROWTH value for this file or to explicitly set a new file size.

We found this in the log whenever this type of problem occurred. We are moving tempdb to a separate, fast drive.
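As the error message suggests, one mitigation is to pre-size templog large enough that autogrow rarely fires during the workload, and to use a fixed growth increment small enough that each grow completes quickly. A sketch; the 4096 MB size and 256 MB increment here are illustrative assumptions, not recommendations for this system:

```sql
-- Sketch: pre-grow the tempdb log and set a fixed, modest growth increment.
-- Sizes are placeholders to be tuned for the actual workload and drive.
ALTER DATABASE tempdb
MODIFY FILE (NAME = templog, SIZE = 4096MB, FILEGROWTH = 256MB);
```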

Best Answer

I think you've overfragmented your tempdb and there is a mismatch between the server CPU and disk setup, but let's collect some more information:

Questions / Further information required

  • Please confirm processor name and type (I'm basically trying to establish if it's 2 x hex-core with HT ). Use System information (eg Control Panel > System and Security > System on Windows Server 2012 R2) and / or sysinternals tool CoreInfo to confirm.
  • Please confirm server maxdop ( eg EXEC sp_configure 'max degree of parallelism' ). If the CPUs are hex-core, the server maxdop should be at most 6 (as per here), or arguably lower on an OLTP system. I normally keep my tempdb files in line with my server DOP to a max of 8 but we'll come onto that.
  • Please confirm server total memory on the box and the SQL Server memory cap ( eg EXEC sp_configure 'max server memory (MB)' ).
  • Please confirm if any other services are running on the box ( eg SSIS, SSAS, SSRS, the application, iTunes etc )
  • Please confirm Instant File Initialisation is enabled for the SQL Server service account. (Ways to test it here).
  • Why is there such a massive discrepancy between the CPU (a beefy 2-node NUMA setup) versus the one disk (home PC)? Consider adding disks, striping, or SSD for tempdb (although avoid over-reacting : ).
  • Please add an actual execution plan for one of the problem queries. Anonymise with SQL Sentry Plan Explorer if you wish.
  • Hash joins with table variables in an OLTP system? This suggests a lack of indexing on the table variable, main table or both. Are you declaring your table variables like this (with no indexes)?

    DECLARE @t TABLE ( x INT )
    
  • Don't skimp on the table variable definition even though it's holding small result sets. It's always best to give the optimizer as much information as possible, so be explicit about nullability, uniqueness, and whether the index is clustered or non-clustered, eg

    DECLARE @t TABLE ( x INT PRIMARY KEY )
    DECLARE @u TABLE ( x INT PRIMARY KEY NONCLUSTERED, u INT NOT NULL UNIQUE CLUSTERED, z INT NOT NULL UNIQUE, a CHAR(1) NULL ) -- not sure why you would do this but you can
    DECLARE @v TABLE ( x INT NOT NULL, y INT NOT NULL, PRIMARY KEY ( x, y ) )   -- multi-column primary key
    
  • Posting the execution plan will help diagnose this.

  • Check for code preventing table variable caching as per here, here. I think dynamic SQL and procedures executed WITH RECOMPILE are the only ones that affect table variables.

    DECLARE @u TABLE ( x INT )
    
    INSERT @u
    EXEC('DECLARE @t TABLE ( x INT ); INSERT INTO @t VALUES ( 1 ); SELECT x FROM @t;' )
    
    SELECT *
    FROM @u
    
  • Check SQL Server Log ( Object Explorer > Management > SQL Server Logs ) for messages, eg IO warnings.

  • Check Windows Event Viewer
  • There have been a number of builds released since SP1. Review the CU fixes put in since SP1. It's possible there are bugs in SP1 fixed in subsequent CUs, eg FIX: Sort operator spills to tempdb in SQL Server 2012 or SQL Server 2014 when estimated number of rows and row size are correct https://support.microsoft.com/en-us/kb/3088480
  • Establish this is your cause before applying any hotfixes, although it is more important to keep up-to-date with CUs on SQL Server 2014, due to the number of new features (in-memory OLTP, clustered columnstore).
  • Finally, the need for one tempdb file per core is a myth, and looking at your disk setup my guess is tempdb is overly fragmented. I have a nagging feeling you have one disk head, one tempdb filegroup, and many files.
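As a quick way to tell whether allocation-page contention (rather than plain disk latency) is still in play, PAGELATCH waits on tempdb allocation pages can be sampled while the workload runs. A generic sketch, not from the original answer; the `resource_description` format is db:file:page, and PFS pages recur every 8,088 pages while GAM/SGAM pages recur every 511,232 pages:

```sql
-- Sketch: in-memory latch waits on tempdb pages (database_id = 2).
-- PAGELATCH (not PAGEIOLATCH) on PFS/GAM/SGAM pages indicates
-- allocation contention rather than slow disk.
SELECT session_id, wait_type, wait_duration_ms, resource_description
FROM sys.dm_os_waiting_tasks
WHERE wait_type LIKE 'PAGELATCH%'
  AND resource_description LIKE '2:%';
```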

However, forget what we think we know: create a test rig which reproduces your problem, and experiment with reducing the number of tempdb files. Start at 1, 2, 4, 6, etc., and gather the information to make an evidence-based decision. This is the harder bit, as your problem seems intermittent and you may not be able to mess with your tempdb setup, but that's how I would approach it.
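If the experiment points to fewer files, surplus tempdb data files can be drained and dropped roughly as below. The logical name `tempdev16` is hypothetical; substitute your own, and note a file can only be removed when empty, so draining it first (or restarting the instance) is usually required.

```sql
-- Sketch: drain one extra tempdb data file, then remove it.
-- 'tempdev16' is a made-up logical file name for illustration.
USE tempdb;
DBCC SHRINKFILE (tempdev16, EMPTYFILE);
ALTER DATABASE tempdb REMOVE FILE tempdev16;
```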

Good luck. Let us know how you get on.