SQL Server – Lots of short-term blocking on a SQL Server 2008 R2 database

Tags: blocking, sql-server, sql-server-2008-r2, waits

Seeking expert advice on the short-term blocking we have been seeing for almost a week now.

The database is 15 TB, and Table B, where we see most of the blocking (almost 300 blocked sessions at a time), is 2 TB.

Using sp_whoisactive, I see queries getting blocked for short periods, mostly 3-4 seconds and some 9-10 seconds. Because these come and go throughout the day across 2,000-3,000 connections, they cause a lot of errors in the application.
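For reference, one common way to surface the blocking chains with sp_whoisactive is shown below (parameter names per Adam Machanic's documentation; adjust as needed):

```sql
-- Capture current blocking with sp_whoisactive. @find_block_leaders adds a
-- blocked_session_count column so the head blocker sorts to the top.
EXEC sp_whoisactive
    @find_block_leaders = 1,
    @sort_order         = '[blocked_session_count] DESC',
    @get_plans          = 1;
```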

Troubleshooting done:

  • Updated statistics on the NC index; we have just one nonclustered and
    one clustered index.
  • Checking in Sentry, we can see that the blocked statements are all
    INSERT statements against the table mentioned above.
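The statistics update mentioned above might look something like the following; the table and index names (dbo.TableB, IX_TableB_NC) are placeholders for illustration:

```sql
-- Refresh all statistics on the problem table with a full scan.
-- Object names are assumed; substitute your own.
UPDATE STATISTICS dbo.TableB WITH FULLSCAN;

-- Or target just the nonclustered index:
UPDATE STATISTICS dbo.TableB IX_TableB_NC WITH FULLSCAN;
```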

Below are some metrics:

  1. Wait stats for the past 3-4 hours, while the blocking is still ongoing:
    • PAGELATCH_EX –> 51.1 % Total waits 730 513 153.5
    • PAGELATCH_SH –> 23 % Total waits 343 538 769.7
    • THREADPOOL –> 18 % Total waits 257 222 894.4
  2. BATCHES/sec on Avg: 8612.6
  3. CPU and memory –> well under the threshold as we have 256 logical processors and 612 GB RAM
  4. CHECKPOINT pages/sec –> 1706
  5. Lazy writes –> 0
  6. Log flushes/sec 1844.4
  7. Transactions/sec 9074
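Counters like those above can also be read from the DMVs rather than PerfMon; a sketch (note that per-second counters in this DMV are cumulative, so two samples over a known interval are needed for a true rate):

```sql
-- Raw cumulative values for the counters quoted above.
SELECT [object_name], counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name IN (N'Batch Requests/sec',
                       N'Transactions/sec',
                       N'Log Flushes/sec',
                       N'Checkpoint pages/sec',
                       N'Lazy writes/sec');
```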

Results as per David's query (note: the below was run 15 minutes after we did a node failover):

 wait_type  wait_time_ms_per_sec
PAGELATCH_EX    733024.4
PAGELATCH_SH    328846.4
CPU_USED    7131.4
WRITELOG    6484.4
DBMIRROR_EVENTS_QUEUE   4973.6
DBMIRRORING_CMD 1456.6
OLEDB   990.2
TRACEWRITE  984.2
SOS_SCHEDULER_YIELD 280.6
THREADPOOL  254
ASYNC_NETWORK_IO    45.4
LATCH_EX    9.4
PAGEIOLATCH_EX  7.6
PREEMPTIVE_OS_GETPROCADDRESS    3.4
MSQL_XP 3.2
DBMIRROR_SEND   1.4
PREEMPTIVE_OS_DELETESECURITYCONTEXT 1
PAGELATCH_UP    0.4
PREEMPTIVE_OS_WAITFORSINGLEOBJECT   0.4
PREEMPTIVE_OS_AUTHORIZATIONOPS  0.2
PREEMPTIVE_OS_AUTHENTICATIONOPS 0.2
SOS_RESERVEDMEMBLOCKLIST    0.2
CMEMTHREAD  0.2
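The exact query used is not reproduced here, but a per-second wait figure like the table above can be produced by snapshotting sys.dm_os_wait_stats twice and taking the delta, for example:

```sql
-- Sketch: sample cumulative wait stats, wait 5 seconds, sample again,
-- and report the per-second delta for each wait type.
DECLARE @t1 TABLE (wait_type NVARCHAR(60), wait_time_ms BIGINT);

INSERT @t1
SELECT wait_type, wait_time_ms FROM sys.dm_os_wait_stats;

WAITFOR DELAY '00:00:05';   -- 5-second sample window

SELECT w.wait_type,
       (w.wait_time_ms - t.wait_time_ms) / 5.0 AS wait_time_ms_per_sec
FROM sys.dm_os_wait_stats AS w
JOIN @t1 AS t ON t.wait_type = w.wait_type
WHERE w.wait_time_ms > t.wait_time_ms
ORDER BY wait_time_ms_per_sec DESC;
```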

Please provide additional troubleshooting suggestions, and let me know if any further metrics are needed.

Edit: We are using MAXDOP 8, and the cost threshold for parallelism is set to the default of 5.
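These settings can be confirmed with sp_configure:

```sql
-- Show the current parallelism settings (both are advanced options).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism';
EXEC sp_configure 'cost threshold for parallelism';
```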

Update: We did not make any changes as such, but when connections dropped during the night the issue suddenly seemed to vanish, and we see no blocking at the moment. Below are the current wait stats:

OLEDB   11883.8
WRITELOG    11732.2
DBMIRROR_EVENTS_QUEUE   4969.8
CPU_USED    4093.75
DBMIRRORING_CMD 1457.2
TRACEWRITE  1080.2
PAGEIOLATCH_SH  333.4
PAGELATCH_EX    237.4
PAGELATCH_SH    31.4
ASYNC_NETWORK_IO    30.2
SOS_SCHEDULER_YIELD 6
LATCH_EX    4.4
DBMIRROR_SEND   1.6
PREEMPTIVE_OS_AUTHENTICATIONOPS 1.4
MSQL_XP 1.2
PREEMPTIVE_OS_GETPROCADDRESS    1.2
PAGELATCH_UP    0.4
PREEMPTIVE_OS_DECRYPTMESSAGE    0.4
SOS_RESERVEDMEMBLOCKLIST    0.4
PREEMPTIVE_OS_DELETESECURITYCONTEXT 0.2
PREEMPTIVE_OS_WAITFORSINGLEOBJECT   0.2
CMEMTHREAD  0.2
PREEMPTIVE_OS_AUTHORIZATIONOPS  0.2
PREEMPTIVE_OS_CRYPTACQUIRECONTEXT   0.2
PREEMPTIVE_OS_NETVALIDATEPASSWORDPOLICY 0.2

Not sure how or why it went away; checking today, we don't see that many connections, nor the blocking. But it will surely pop up again soon.

Update: It's back, with what look like the same waits.


Best Answer

This:

wait_type  wait_time_ms_per_sec
PAGELATCH_EX    733024.4
PAGELATCH_SH    328846.4
CPU_USED    7131.4
WRITELOG    6484.4

Shows that you have a latch contention problem, and only a latch contention problem.

Since

@ below was run 15 mins after we did node failover

The CPU_USED row should be accurate. Your workload is managing to use only about 7 seconds of CPU time per second (7,131 ms/sec), the equivalent of running on just 7 cores. Meanwhile you have roughly 1,060 seconds of page latch wait time per second (733,024 + 328,846 ms/sec), meaning that on average about 1,000 sessions are waiting to latch a page while only 7 are performing useful work.

You are also running very old software on a very large server. The throughput on this server is quite high, suggesting that the operation it supports is important (also consider your username). So you should strongly consider engaging Microsoft Support, or a partner with expertise in this sort of thing.

That said, the next step is to identify the source of your latch waits. Generally these are either on a hot page of a table in your database, or on a system page in TempDb. The wait_resource column in sys.dm_exec_requests should tell you which.
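A minimal sketch of that check: wait_resource for a page latch is formatted as dbid:fileid:pageid, and database_id 2 is TempDb.

```sql
-- List sessions currently waiting on page latches, with the page they
-- are latching and the statement they are running.
SELECT r.session_id,
       r.wait_type,
       r.wait_resource,        -- dbid:fileid:pageid
       r.blocking_session_id,
       t.text AS query_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.wait_type LIKE N'PAGELATCH%';
```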

If TempDb, then see Recommendations to reduce allocation contention in SQL Server tempdb.

If a table in your database, then post the DDL (including indexes) for the table and the details of the queries that are blocking.
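If it does turn out to be classic last-page insert contention on an ever-increasing clustered key, one mitigation available on SQL Server 2008 R2 is to spread the inserts across multiple pages, for example by hash-partitioning on a computed column. A hypothetical sketch only, with every object and column name (TableB, Id, CIX_TableB) assumed:

```sql
-- Hypothetical: spread inserts across 8 partitions by hashing the key,
-- so concurrent inserts latch 8 different "last" pages instead of one.
CREATE PARTITION FUNCTION pf_hash8 (TINYINT)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);

CREATE PARTITION SCHEME ps_hash8
    AS PARTITION pf_hash8 ALL TO ([PRIMARY]);

-- The partitioning column must be a persisted computed column.
ALTER TABLE dbo.TableB
    ADD HashKey AS CAST(Id % 8 AS TINYINT) PERSISTED NOT NULL;

-- Rebuild the clustered index over the scheme; on a 2 TB table this is a
-- sizeable operation, so test carefully before trying it in production.
CREATE UNIQUE CLUSTERED INDEX CIX_TableB
    ON dbo.TableB (Id, HashKey)
    WITH (DROP_EXISTING = ON)
    ON ps_hash8 (HashKey);
```

The trade-off is that range scans on the key now touch every partition, so this is only worth it when insert throughput on the hot page is the bottleneck.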

There is a whitepaper on diagnosing and resolving latch contention on SQL 2008 here: Diagnosing and Resolving Latch Contention on SQL Server