Sql-server – SQL Server Disk I/O Latency During Specific Timeframe

sql serversql server 2014sql-server-2012sql-server-2016

We are experiencing high disk I/O latency in a specific timeframe in SQL Server. Every day at exactly 4:30 AM we see a significant difference in disk I/O and it lasts up to 6:30 or 9:30 AM. I already checked with the development and storage teams. They confirmed that everything is running normally. Storage team declared that there is no issue in LUN and they are blaming SQL Server.

One thing is sure–some job is triggering exactly at 4:30 AM every day, which causes the high disk I/O. But I am not sure how to find that out. I attached the screen shot of the SQL error log at 4:30 AM. All the monitoring tools and the logs are stating that the transactions are taking longer that usual due to high disk I/O. But nowhere is it mentioned what is causing it.

Ours is 3 node Always On environment. P1 – primary, P2 – Stand only, P3 – Read only.

The issue we are facing now is from P3 (Read only node).

Screen shot of the SQL server log

Best Answer

This is too long for a comment, but there won't be enough in here to provide the answer as to WHAT is causing this, just HOW to work out what IT is.

Firstly: What applications are connecting to databases on that server, and are they running batch processes against databases that have grown faster than you've kept track of, thus extending the time required to complete?

Are any backups being taken at this time? Context switching is really high, so there could be backups conflicting with batch processes running at the same time.

Here's an example of how you could capture the job/query causing your issues:

  1. Download the WhoIsActive stored proc by Adam Machanic from http://whoisactive.com/. Run it against an Administration database you might have on the server, or in the master database.
  2. Create a table that can capture the output of WhoIsActive to a table (http://whoisactive.com/docs/06_options/)
  3. Create a SQL Agent job that runs executes sp_WhoIsActive every 15 seconds (this is my preference, change to suit yours) and log the data to the table you created.
  4. Create a set of PerfMon counters (using https://www.brentozar.com/archive/2006/12/dba-101-using-perfmon-for-sql-performance-tuning/) on the SQL Server and log the data to CSV.
  5. If the server is a VM, ask your VM Admin for server stats for the given period. vRealize Operations contains a lot of useful reports for proving a point.
  6. Get some reports from your SAN Admin showing SAN activity during the problem period.