Sql-server – SQL Server High Disk Queue, Low CPU

sql server

I'm running a large-ish SQL Server Enterprise 2016 Data Warehouse where we've for some time now had issues where we are seeing the following symptoms:

  • High Disk Queue length, including 100% activity on those disks. The queues are going over 10,000 at times with very low actual throughput, under 100Mb/s compared to over 1.5Gb/s the drives are capable of.
  • High Disk Latency, up to 10 seconds per read.
  • Low CPU usuage, I suspect this is because of the High Disk queue length, and the CPU is simply waiting.

We're running a 24 Core machine, INtel Xeon 6128 (6 cores per socket, 4 CPU's total). We currently got 680 Gb allocated to the VM, with 660 Gb avaiable to SQL.

We're patched to the latest CU, and Windows Patches are more or less up to date.

We're running on a very performant SAN, and haven't seen any IO issues from the SAN point of view that are affecting things.

The only solution we currently have, is to restart the entire machine. Even restarting the service doesn't make a difference. When SQL is stopped disk IO is fine, but once it is restarted (without a server restart), the queue length and latency all go down the toilet.

My current suspicion is around the use CLR integrations and the memory utilisation/garbar colleciton on these, but I haven't been able to find a way to accurately spot this.

I'm at a complete loss of where else to start looking, but currently we're having to restart our server almost weekly due to the issue.

Best Answer

I think best approach is try to find connections, which use most IO.

Look at open connections, see who use most IO. Look at currently executing commands, what they wait. Would be seen some waits on latches. After finding who use most IO, it would be easier to make next step in investigation.

As for CLR, there are performance counters which shows quite a lot information about its behavior, so it is possible to look on it. May be there is usage of swap on the server? Like page/sec significanly above zero? It also may cause such behavior of server.