SQL Server – Slow checkpoint and 15 second I/O warnings on flash storage

performance, performance-tuning, san, sql-server, sql-server-2012, storage

For the last couple of weeks we've been working on getting to the root cause of these I/O issues and the slow checkpoints.

At first glance it looks like a clear I/O subsystem problem, and the SAN admin was blamed for it. But we recently moved the SAN to all-flash storage, and as of today the error still pops up. I have no clue why, since every metric I check to see whether SQL Server is the culprit, wait stats or anything else, comes back normal.

It doesn't really add up. It could also very well be that something else is chewing up the disk and SQL Server is just the victim here… but I haven't been able to find out what.

The databases are in Availability Groups, and whenever these events occur we also see role changes and failovers happening, along with timeouts.

Any help in figuring this out would be highly appreciated. Let me know if any further details are needed.

Error messages below:

SQL Server has encountered 14212 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\DATA\ABC.mdf] in database [ABC] (7). The OS file handle is 0x0000000000000D64. The offset of the latest long I/O is: 0x0000641262c000

SQL Server has encountered 5347 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\MSSQL\DATA\XYZ.mdf] in database [XYZ] (7). The OS file handle is 0x0000000000000D64. The offset of the latest long I/O is: 0x0000506c060000

FlushCache: cleaned up 111476 bufs with 62224 writes in 925084 ms (avoided 19 new dirty bufs) for db 7:0
average throughput: 0.94 MB/sec, I/O saturation: 55144, context switches 98407
last target outstanding: 10240, avgWriteLatency 14171

FlushCache: cleaned up 5616 bufs with 3126 writes in 248687 ms (avoided 3626 new dirty bufs) for db 6:0
average throughput: 0.18 MB/sec, I/O saturation: 10080, context switches 20913
last target outstanding: 2, avgWriteLatency 3

Here's the virtual file stats info over a 30 minute span:

[screenshot: virtual file stats]
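(For reference, a snapshot like this can be taken by diffing sys.dm_io_virtual_file_stats across the window; the sketch below is illustrative and may not match exactly what produced the screenshot.)

-- Rough sketch: 30 minute delta of file-level I/O stats.
SELECT database_id, file_id, num_of_reads, num_of_writes,
       io_stall_read_ms, io_stall_write_ms
INTO   #iostats_start
FROM   sys.dm_io_virtual_file_stats(NULL, NULL);

WAITFOR DELAY '00:30:00';

SELECT DB_NAME(e.database_id)                    AS database_name,
       e.file_id,
       e.num_of_reads      - s.num_of_reads      AS reads,
       e.num_of_writes     - s.num_of_writes     AS writes,
       e.io_stall_read_ms  - s.io_stall_read_ms  AS read_stall_ms,
       e.io_stall_write_ms - s.io_stall_write_ms AS write_stall_ms
FROM   sys.dm_io_virtual_file_stats(NULL, NULL) AS e
JOIN   #iostats_start AS s
       ON s.database_id = e.database_id
      AND s.file_id     = e.file_id
ORDER  BY (e.num_of_reads - s.num_of_reads) + (e.num_of_writes - s.num_of_writes) DESC;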

And wait stats as well:

[screenshot: wait stats]
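(The wait stats were checked with the usual sort of aggregate query against sys.dm_os_wait_stats, roughly along these lines; the exact list of benign waits filtered out varies.)

-- Rough sketch: top waits by total wait time, ignoring common benign waits.
SELECT TOP (20)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM   sys.dm_os_wait_stats
WHERE  wait_type NOT IN (N'LAZYWRITER_SLEEP', N'SLEEP_TASK', N'CHECKPOINT_QUEUE',
                         N'LOGMGR_QUEUE', N'DIRTY_PAGE_POLL', N'XE_TIMER_EVENT',
                         N'REQUEST_FOR_DEADLOCK_SEARCH', N'BROKER_TO_FLUSH',
                         N'SP_SERVER_DIAGNOSTICS_SLEEP', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION')
ORDER  BY wait_time_ms DESC;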

Here is the note from the system architect:

We separate out high I/O intensity workloads (such as the DB) so that we only have one per host. The specs for the current host are: Dell R730, 16 cores of Xeon E5-2620 (2 sockets), 512 GB of RAM, and 2x10G interconnects for storage. No other VM on the host or cluster is experiencing these issues. Storage for VMs and workloads is on a Pure FA-x20.

General System Information:

  • SQL Server 2012 SP3 CU9 (Enterprise Edition)
  • Total RAM: 128 GB
  • Total DB size: Close to 1 TB

Best Answer

For the last couple of weeks we've been working on getting to the root cause of these I/O issues and the slow checkpoints.

Sounds good. Have you collected and dug through the minifilter and storport traces yet? If so, what did they show?

At first glance it looks like a clear I/O subsystem problem, and the SAN admin was blamed for it. But we recently moved the SAN to all-flash storage, and as of today the error still pops up. I have no clue why, since every metric I check to see whether SQL Server is the culprit, wait stats or anything else, comes back normal.

I want to go over two different areas here.

The first is that SQL Server itself doesn't actually do anything with I/O; it posts the requests to Windows using the typical Windows APIs. Whether it's ReadFile, WriteFile, or the scatter/gather (vectored) variants of those, it's all up to Windows. SQL Server keeps a list of pending I/O and checks it at various times to get the status of anything that hasn't completed. This is done using, again, the typical Windows asynchronous I/O model: the message is printed when an I/O has been pending, and not completed according to Windows, for over 15 seconds, as reported by the GetOverlappedResult Windows API when SQL Server checks the status. This means SQL Server doesn't really have a say in the matter; it's reporting what Windows returns.
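If you want to see that pending list from the engine's side while a stall is in progress, a quick query against sys.dm_io_pending_io_requests (joined to sys.dm_io_virtual_file_stats and sys.master_files to resolve the file) shows each outstanding request and how long it has been waiting. Treat this as a sketch rather than a polished script:

-- Sketch: outstanding I/O requests as SQL Server sees them.
-- io_pending = 1 means Windows has not yet completed the request;
-- io_pending_ms_ticks is how long that request has been outstanding.
SELECT DB_NAME(vfs.database_id)  AS database_name,
       mf.physical_name,
       pio.io_type,
       pio.io_pending,
       pio.io_pending_ms_ticks
FROM   sys.dm_io_pending_io_requests AS pio
JOIN   sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
       ON vfs.file_handle = pio.io_handle
JOIN   sys.master_files AS mf
       ON mf.database_id = vfs.database_id
      AND mf.file_id     = vfs.file_id
ORDER  BY pio.io_pending_ms_ticks DESC;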

The second item is that just because it's all flash and 10 Gb fiber doesn't mean that something isn't set up or configured incorrectly, that a driver, filter, or other bug isn't being hit, or that something isn't physically wrong. Just to get an idea of what sits in the path:

  1. Windows Config
  2. Windows drivers, such as multi-pathing, being set up correctly and on the latest version
  3. Filter Drivers (you know, disk devices, antivirus, backup, etc.)
  4. Hypervisors (if any)
  5. HBA drivers
  6. HBA firmware
  7. HBA configuration
  8. Physical Cabling
  9. Fiber switching
  10. I/O Group connections/SAN/Device
  11. Configuration of SAN/Device

All of that sits underneath SQL Server; SQL Server is just the one telling you about it.

The databases are in Availability Groups, and whenever these events occur we also see role changes and failovers happening, along with timeouts.

That's really good information to know, although it doesn't necessarily mean the two are directly related. Now, if it only happens when there is a failover, that would narrow down the issue much more, and it would sound to me more like the drivers et al. don't like having a whole lot of mixed I/O thrown at them, since a failover typically results in redo/undo and resyncs happening, which can mean a spike in outstanding I/O.
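If you want to check whether those failover windows line up with a burst of redo/resync work, the standard AG DMVs are enough; something roughly like this shows the send and redo queues per availability database:

-- Sketch: log send / redo queue sizes per availability database,
-- to correlate failover windows with bursts of redo activity.
SELECT ag.name                  AS ag_name,
       ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.synchronization_state_desc,
       drs.log_send_queue_size, -- KB waiting to be sent
       drs.redo_queue_size,     -- KB waiting to be redone
       drs.redo_rate            -- KB/sec currently being redone
FROM   sys.dm_hadr_database_replica_states AS drs
JOIN   sys.availability_replicas           AS ar
       ON ar.replica_id = drs.replica_id
JOIN   sys.availability_groups             AS ag
       ON ag.group_id = drs.group_id
ORDER  BY drs.redo_queue_size DESC;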

Any help in figuring this out would be highly appreciated.

Unless it's a query or set of queries pushing very high IOPs, which it doesn't sound like (the 30 minute snapshot you have shows only 737,465 I/O operations, which averages out to roughly 410 IOPs, not that high, especially on flash), looking inside of SQL Server isn't going to help with this issue, since SQL Server is just the messenger.

You'd want to collect the following, if you haven't already:

  1. Minifilter time spent. This can be done through WPR (XPerf) if you don't have anything else. This can help if the I/O is getting stalled in a filter driver.
  2. Storport trace. This will be the last stop on the way out and the first stop on the way back; any time between those two readings is time spent outside of Windows. It'll also show you the targets and where the slowness might be on the other end (though it isn't always conclusive).

If none of those are helpful in diagnosing or narrowing the scope of the issue, it may be time to open a ticket with Windows storage support and have all the data already collected, so that you can all start on the same page.