Sql-server – High write latancy in temp db

sql servertempdb

I'm in a sort of "accidental dba" role at the moment. I'm attempted to monitor and improve a db. I've been using a script from a Udemy course (I could post but it might be a bit long). I've been running it for 5 mins every hour to capture samples of disk latency, I'm unsure if that's a good way to do it. But I notice quite a few massive spike in tempdb e.g. 300+ milliseconds but normally a bit lower.

The tempdb, is on the same disk as the data files. I'm going to suggest a separate drive, ideally SSD but for the moment I've split the tempdb data files into 4 files and made them fairly big so they don't need to autogrow often. Could this changes be making the spikes, unfortunately a don't have a baseline as the changes were done to try to improve some issues. Or are the tempdb issues simply performance with a not so great 3rd party app. Any recommendations would be appreciated!

Here's the sp_blitz markdown

I've obscured any database names that refer to specific companies.

Priority 1: Backup:

  • Backing Up to Same Drive Where Databases Reside – 85 backups done on drive D:\ in the last two weeks, where database files also live. This represents a serious risk if that array fails.

  • Full Recovery Mode w/o Log Backups

    • COMPCRM – The 500.00MB log file has not been backed up in the last week.

    • ClientManager – The 2.25MB log file has not been backed up in the last week.

    • CRMSelfService – The 2.31MB log file has not been backed up in the last week.

    • VOIP.SDC – The 2362.81MB log file has not been backed up in the last week.

    • Elmah – The 20.94MB log file has not been backed up in the last week.

    • hMailServer – The 2.06MB log file has not been backed up in the last week.

    • NLog – The 117.63MB log file has not been backed up in the last week.

    • ReportServer – The 6.25MB log file has not been backed up in the last week.

    • ACC_COMP_SYS_Be – The 2599.13MB log file has not been backed up in the last week.

    • ACC_COMP_SYS_DW – The 32.78MB log file has not been backed up in the last week.

    • ACC_COMPSYS_Y2013 – The 2599.13MB log file has not been backed up in the last week.

    • ACC_CRM – The 63.00MB log file has not been backed up in the last week.

    • ACCLogViewer – The 4.00MB log file has not been backed up in the last week.

Priority 1: Reliability:

  • Last good DBCC CHECKDB over 2 weeks old – Last successful CHECKDB: never.

    • COMPCRM

    • ClientManager

    • CRMSelfService

    • VOIP.SDC

    • Elmah

    • hMailServer

    • master

    • model

    • msdb

    • NLog

    • ReportServer

    • ReportServerTempDB

    • ACC_COMP_SYS – Last successful CHECKDB: 2016-10-17 13:29:27.837

    • ACC_COMP_SYS_Bs – Last successful CHECKDB: never.

    • ACC_COMP_SYS_SUPPORT

    • ACC_COMP_SYS_DW

    • ACC_COMP_SYS_Test – Last successful CHECKDB: 2016-10-17 13:29:27.837

    • ACC_COMPSYS_Snapshot_2016 – Last successful CHECKDB: never.

    • ACC_COMPSYS_TEST

    • ACC_COMPSYS_Y2013

    • ACC_COMPSYS_Y2014

    • ACC_COMPSYS_Y2015

    • ACC_CRM

    • ACC_Demo_Data

    • ACCConfiguration

    • ACCLogViewer

Priority 10: Performance:

  • Auto-Shrink Enabled

    • ACC_COMP_SYS_DW – Database [ACC_COMP_SYS_DW] has auto-shrink enabled. This setting can dramatically decrease performance.

    • ACC_Demo_Data – Database [ACC_Demo_Data] has auto-shrink enabled. This setting can dramatically decrease performance.

Priority 20: Reliability:

  • Unsupported Build of SQL Server – Version 10.50.1617.00 is no longer supported by Microsoft. You need to apply a service pack.

Priority 50: Reliability:

  • Page Verification Not Optimal

    • COMPCRM – Database [COMPCRM] has NONE for page verification. SQL Server may have a harder time recognizing and recovering from storage corruption. Consider using CHECKSUM instead.

    • ACC_COMP_SYS_DW – Database [ACC_COMP_SYS_DW] has TORN_PAGE_DETECTION for page verification. SQL Server may have a harder time recognizing and recovering from storage corruption. Consider using CHECKSUM instead.

    • ACC_Demo_Data – Database [ACC_Demo_Data] has TORN_PAGE_DETECTION for page verification. SQL Server may have a harder time recognizing and recovering from storage corruption. Consider using CHECKSUM instead.

  • Remote DAC Disabled – Remote access to the Dedicated Admin Connection (DAC) is not enabled. The DAC can make remote troubleshooting much easier when SQL Server is unresponsive.

  • Transaction Log Larger than Data File

    • VOIP.SDC – The database [VOIP.SDC] has a 2 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough.

    • ACC_COMP_SYS_Bvs – The database [ACC_COMP_SYS_Bs] has a 2 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough.

    • ACC_COMPSYS_Y2013 – The database [ACC_COMPSYS_Y2013] has a 2 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough.

Priority 100: Performance:

  • Many Plans for One Query – 590 plans are present for a single query in the plan cache – meaning we probably have parameterization issues.

Priority 150: Performance:

  • Slow Storage Writes on Drive D – Writes are averaging longer than 100ms for at least one database on this drive. For specific database file speeds, run the query from the information link.

Priority 200: Backup:

  • MSDB Backup History Not Purged msdb – Database backup history retained back to Apr 21 2011 11:46AM

Priority 200: Informational:

  • Agent Jobs Starting Simultaneously – Multiple SQL Server Agent jobs are configured to start simultaneously. For detailed schedule listings, see the query in the URL.

  • Backup Compression Default Off – Uncompressed full backups have happened recently, and backup compression is not turned on at the server level. Backup compression is included with SQL Server 2008R2 & newer, even in Standard Edition. We recommend turning backup compression on by default so that ad-hoc backups will get compressed.

Priority 200: Monitoring:

  • Agent Jobs Without Failure Emails

    • The job COMPSYS_CoreDatabaseBackup has not been set up to notify an operator if it fails.

    • The job PerfLog has not been set up to notify an operator if it fails.

    • The job Re-index ACC_COMPSYS_Live.Subplan_1 has not been set up to notify an operator if it fails.

    • The job syspolicy_purge_history has not been set up to notify an operator if it fails.

  • No Alerts for Corruption – SQL Server Agent alerts do not exist for errors 823, 824, and 825. These three errors can give you notification about early hardware failure. Enabling them can prevent you a lot of heartbreak.

  • No Alerts for Sev 19-25 – SQL Server Agent alerts do not exist for severity levels 19 through 25. These are some very severe SQL Server errors. Knowing that these are happening may let you recover from errors faster.

  • No failsafe operator configured – No failsafe operator is configured on this server. This is a good idea just in-case there are issues with the [msdb] database that prevents alerting.

  • No Operators Configured/Enabled – No SQL Server Agent operators (emails) have been configured. This is a free, easy way to get notified of corruption, job failures, or major outages even before monitoring systems pick it up.

  • Not All Alerts Configured – Not all SQL Server Agent alerts have been configured. This is a free, easy way to get notified of corruption, job failures, or major outages even before monitoring systems pick it up.

Priority 200: Non-Default Server Config:

  • Agent XPs – This sp_configure option has been changed. Its default value is 0 and it has been set to 1.

  • cost threshold for parallelism – This sp_configure option has been changed. Its default value is 5 and it has been set to 50.

  • max degree of parallelism – This sp_configure option has been changed. Its default value is 0 and it has been set to 4.

  • max server memory (MB) – This sp_configure option has been changed. Its default value is 2147483647 and it has been set to 9000.

Priority 200: Performance:

  • Old Compatibility Level

    • ACC_COMP_SYS_DW – Database ACC_COMP_SYS_DW is compatibility level 80, which may cause unwanted results when trying to run queries that have newer T-SQL features.

    • ACC_Demo_Data – Database ACC_Demo_Data is compatibility level 90, which may cause unwanted results when trying to run queries that have newer T-SQL features.

Priority 210: Non-Default Database Config:

  • ANSI NULL Default Enabled COMPCRM – This database setting is not the default.

  • Read Committed Snapshot Isolation Enabled VOIP.SDC – This database setting is not the default.

Priority 240: Wait Stats:

  • No Significant Waits Detected – This server might be just sitting around idle, or someone may have cleared wait stats recently.

Priority 250: Server Info:

  • Default Trace Contents – The default trace holds 559 hours of data between Oct 12 2016 3:28AM and Nov 4 2016 10:37AM. The default trace files are located in: D:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\Log

  • Drive C Space – 177886.00MB free on C drive

  • Drive D Space – 379091.00MB free on D drive

  • Drive F Space – 3575387.00MB free on F drive

  • Hardware – Logical processors: 4. Physical memory: 16GB.

  • Hardware – NUMA Config – Node: 0 State: ONLINE Online schedulers: 4 Offline schedulers: 0 Processor Group: 0 Memory node: 0 Memory VAS Reserved GB: 16

  • Server Last Restart – Oct 12 2016 2:27AM

  • Server Name – SVR01

  • SQL Server Last Restart – Oct 17 2016 8:14PM

  • SQL Server Service – Version: 10.50.1617.0. Patch Level: RTM. Edition: Standard Edition (64-bit). AlwaysOn Enabled: 0. AlwaysOn Mgr Status: 0

Priority 254: Rundate:

  • Captain's log: stardate something and something…

Best Answer

Rather than picking random metrics and trying to tune them, what you want to do is step back and ask SQL Server, "What have you been waiting on?"

My favorite way to do that is with the open source tool sp_Blitz (because, uh, I wrote it.) You can download it from FirstResponderKit.org, run the sp_Blitz.sql script to install it in the database of your choice.

When you run it with just plain old:

EXEC sp_Blitz @CheckServerInfo = 1

It gives you a prioritized list of issues to check out that will impact performance.

Around priority 240, you'll find a section on Wait Stats (around priority 240) that tell you what SQL Server has been waiting on. That's what I'd focus on first to figure out what metrics to examine more deeply. For example, if your biggest wait type is PAGEIOLATCH, that means you're waiting to read data pages from a data file - and the TempDB storage speed isn't really going to matter. (Don't worry, you don't have to know what those cryptic wait types mean - just click on the more-info links in the output.)

If you'd like help from the Stack folks interpreting the results, you can also run it like this:

EXEC sp_Blitz @OutputType = 'markdown'

That'll export the results in Markdown, a text format that you can copy/paste into your StackOverflow question.