We have an active 40 GB OLTP database on SQL Server 2014 SP1. Queries are slow with IO_COMPLETION waits, Disk Queue Length rises to 900, and SQL Server stops responding. What we tried:
- Restarted the instance; within a minute it started behaving the same way.
- After the second restart, we changed the initial size of each tempdb data file (there are 16 data files) and it started working correctly.
Note: We are using table variables for intermediate result sets. These result sets are very small.
It has happened twice in a month. Each time I manually add a little space to the data files and it starts working normally again. More interestingly, the same setup (same hardware, same folder and file layout, same workload) works fine on SQL Server 2008 R2 and SQL Server 2012.
Kindly help us to find a permanent solution.
The initial size of all data files is the same, 1000 MB; the current size is 1500 MB each. All are identical, with autogrowth of 100 MB per file. Before this we were facing PFS and GAM page contention; we increased the file count to 16 and that problem was resolved. Both trace flags 1117 and 1118 are enabled. 24 cores on 2 NUMA nodes. All data files are on the same volume: a simple disk, no SAN.
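For reference, a minimal sketch of how the files can be checked and explicitly resized (the logical names tempdev and temp2 are assumptions; list yours from sys.database_files first):

    -- Current size and growth of the tempdb files
    -- (values are in 8 KB pages; growth is in pages when is_percent_growth = 0).
    USE tempdb;
    SELECT name, size * 8 / 1024 AS size_mb, growth * 8 / 1024 AS growth_mb
    FROM sys.database_files;

    -- Grow a data file explicitly; repeat for each of the 16 files,
    -- keeping them all identical.
    ALTER DATABASE tempdb
    MODIFY FILE ( NAME = tempdev, SIZE = 1500MB, FILEGROWTH = 100MB );
    ALTER DATABASE tempdb
    MODIFY FILE ( NAME = temp2, SIZE = 1500MB, FILEGROWTH = 100MB );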
The instance is on a physical machine. Queries with table variables and queries with hash joins most commonly generate the IO_COMPLETION waits.
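While the problem is occurring, a sketch like this (standard DMVs) can confirm which sessions and statements are accumulating those waits:

    -- Currently executing requests waiting on IO_COMPLETION, longest first.
    SELECT r.session_id, r.wait_type, r.wait_time, t.text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.wait_type = N'IO_COMPLETION'
    ORDER BY r.wait_time DESC;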
The detailed answer by wBob pushed us to search in more detail. How did we miss it before?
Autogrow of file 'templog' in database 'tempdb' was cancelled by user or timed out after 7704 milliseconds. Use ALTER DATABASE to set a smaller FILEGROWTH value for this file or to explicitly set a new file size.
We found this in the log whenever this type of problem occurred. We are moving tempdb to a separate fast drive.
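Following the message's advice, pre-sizing the log and keeping a modest fixed growth increment should stop the timeout. A sketch (templog is the default logical name of the tempdb log file; the target size is an assumption, size it for your workload):

    -- Pre-grow the tempdb log so autogrow is rarely needed, and keep
    -- each growth small enough to complete quickly on this disk.
    ALTER DATABASE tempdb
    MODIFY FILE ( NAME = templog, SIZE = 4000MB, FILEGROWTH = 64MB );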
Best Answer
I think you've overfragmented your tempdb and there is a mismatch between the server CPU and disk setup, but let's collect some more information:
Questions / Further information required
- What is the server MAXDOP setting? ( EXEC sp_configure 'max degree of parallelism' ). If the CPUs are hex-core, the server MAXDOP should be at most 6 (as per here), or arguably lower on an OLTP system. I normally keep my tempdb files in line with my server DOP to a max of 8, but we'll come onto that.
- What is the max server memory setting? ( EXEC sp_configure 'max server memory (MB)' ).
- Hash joins with table variables in an OLTP system? This suggests a lack of indexing on the table variable, the main table, or both. Are you declaring your table variables with no indexes, like the sketch below?
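For example, a heap table variable with no index information at all (a sketch; the column names are illustrative):

    DECLARE @Results TABLE
    (
        OrderID    INT,
        CustomerID INT,
        Amount     DECIMAL(18,2)
    );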
Don't skimp on the table variable definition even though it's holding small result sets. It's always best to give the optimizer as much information as possible, so be explicit about nullability, uniqueness, and whether each index is clustered or nonclustered, eg:
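A sketch of the same table variable declared explicitly (column names are illustrative; the inline INDEX syntax is available from SQL Server 2014):

    DECLARE @Results TABLE
    (
        OrderID    INT NOT NULL PRIMARY KEY CLUSTERED, -- explicit clustered PK
        CustomerID INT NOT NULL,
        Amount     DECIMAL(18,2) NULL,                 -- explicit nullability
        UNIQUE NONCLUSTERED ( CustomerID, OrderID ),   -- explicit uniqueness
        INDEX ix_Amount NONCLUSTERED ( Amount )        -- inline index, 2014+
    );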
Posting the execution plan will help diagnose this.
Check for code preventing table variable caching, as per here and here. I think dynamic SQL and procs executed WITH RECOMPILE are the only things that affect table variables.
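A sketch of the difference (the procedure names are hypothetical):

    -- Table variable metadata is cached along with this procedure's plan.
    CREATE PROCEDURE dbo.TVCached
    AS
    BEGIN
        DECLARE @t TABLE ( id INT NOT NULL PRIMARY KEY );
        INSERT @t ( id ) VALUES ( 1 );
    END;
    GO

    -- WITH RECOMPILE prevents temp object caching; the same applies when
    -- the table variable is declared inside dynamic SQL.
    CREATE PROCEDURE dbo.TVNotCached
    WITH RECOMPILE
    AS
    BEGIN
        DECLARE @t TABLE ( id INT NOT NULL PRIMARY KEY );
        INSERT @t ( id ) VALUES ( 1 );
    END;
    GO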
Check SQL Server Log ( Object Explorer > Management > SQL Server Logs ) for messages, eg IO warnings.
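You can also search the log from T-SQL; a sketch using sp_readerrorlog (parameters: log number, 1 = SQL Server log, search string):

    -- Search the current SQL Server error log for autogrow messages.
    EXEC sys.sp_readerrorlog 0, 1, N'autogrow';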
However, forget what we think we know; create a test rig which reproduces your problem and experiment with reducing the number of tempdb files: start at 1, 2, 4, 6 etc, gather the information, and make an evidence-based decision. This is the harder bit, as your problem seems intermittent and you may not be able to mess with your tempdb setup, but that's how I would approach it.
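On the test rig, per-file tempdb IO statistics give you evidence to compare configurations; a sketch using sys.dm_io_virtual_file_stats (tempdb is database_id 2):

    -- IO stall times per tempdb file; compare these as you vary the file count.
    SELECT f.name,
           s.num_of_reads,  s.io_stall_read_ms,
           s.num_of_writes, s.io_stall_write_ms
    FROM sys.dm_io_virtual_file_stats(2, NULL) AS s
    JOIN tempdb.sys.database_files AS f
        ON f.file_id = s.file_id;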
Good luck. Let us know how you get on.