SQL Server 2012 IO Latency and Wait Types – Analysis and Solutions

sql serversql-server-2012

I'm working on a system that has a resource intenstive houskeeping job. It hits two different data files on the same LUN. One is for relational data, one for blobs. During the job, read latency on the relational data file averages over 250ms, while on the LOB data it is only 5ms. This is measured using sys.dm_io_virtual_file_stats (two samples 12 hours apart to encompass the houskeeping – during the other 12 hours, latency is 20ms).

I'm sampling wait stats using "who is active". In this I don't see any PAGEIOLATCH waits with large enough wait times to explain the 250ms. sys.dm_os_wait_stats backs this up with an average of 5ms for PAGEIOLATCH_EX and 15ms for PAGEIOLATCH_SH. What I do see is long waits for LCK_M (which is explainable, due to the nature of the housekeeping).

My question is: can the LCK_M waits contribute to the io_stall_read_ms times in the virtual file stats?

In case the details of the housekeeping matter, it is removing smallish batches of data using multiple cascading deltes, including ~1MB of LOB data per item. The system is using AlwaysOn with sync commit.

Best Answer

The waits that are seen in sys.dm_io_virtual_stats are aggregates instead of point values. If you are doing a lot of transactions in a certain period, it will show you an higher aggregate --> higher average latency. More information on how this dmv works can be found on https://sqlperformance.com/2013/10/t-sql-queries/io-latency

Virtual file stats are a great starting point when you want to understand I/O performance for a SQL Server instance. If you see I/O-related waits when looking at wait statistics, looking at sys.dm_io_virtual_file_stats is a logical next step. However, understand that the data you're viewing is an aggregate since the stats last cleared by one of the associated events (instance restart, offline of database, etc). If you see low latencies, then the I/O subsystem is keeping up with the performance load. However, if you see high latencies, it's not a foregone conclusion that storage is a problem. To really know what's going on you can start to snapshot file stats, as shown here, or you can simply use Performance Monitor to look at latency in real time. It's very easy to create a Data Collector Set in PerfMon that captures the Physical Disk counters Avg. Disk Sec/Read and Avg. Disk Sec/Read for all disks that host database files.

The LCK_M will not contribute to the IO_STALL_READ_MS as these counters are only IO related