Sql-server – Mirror failure after windows updates – possible cause

mirroringsql serversql-server-2008-r2

Over a dozen SQL Server mirrors had windows updates installed this morning – 2 of which developed serious faults afterwards.

The windows updates installed where:

  • KB2939576
  • KB2957189
  • KB2957509
  • KB2957689
  • KB2957503

The 2 servers effected are running Windows Server 2008 Standard & SQL Server 2008 R2 (10.50.4000)

The fault consisted of 20% through 60% constant CPU usage on the mirror (even when the principle server was idle at 0%), constant mirror state changes (from disconnected to synchronising & vica versa).

When looking into the PID's in question we noticed the following blocked event against msdb:

WAITFOR (RECEIVE message_body FROM WMIEventProviderNotificationQueue)

The "Application" listed against this was "Microsoft Windows Operating System" & the status was "SUSPENDED", this was being blocked by a whole chain of system PID's which all seemed to be notification related (for mirror state changes).

As a hunch we tried disabling the WMI service – this wouldn't stop – its almost as if the windows updates have caused WMI to hang and its taking SQL with it – uninstalling the windows updates didn't solve the problem, nor did dropping of all SQL Server Agent Alerts that where making use of WMI events.

Rebooting the servers made no difference – CPU usage was still excessive for no reason, Mirrors where still alternating between Synchronising & Disconnected – on the rare occasion that they did sync, transactions on the principle suddenly became deadly slow (often causing timeouts in client applications) until the mirror dropped again.

We're currently formatting the mirror, re-installing the OS with R2 in an attempt to solve this asap.

Has anyone had problems with these windows updates? Is there any known issues with WMI notifications causing SQL to hang?

I should probably add that the mirror servers in question had been working perfectly before these windows updates for years, CPU usage never goes above 5% really & the mirrors had been synchronised for months – only drops where when servers where deliberately rebooted.

Edit: When we tried dropping all SQL Server Agent alerts (just in-case they where causing the problem) (They fired an email when a mirrored database state changed via WMI) we got the message below suggesting there was something wrong in the notification system & the queue was not clearing down at the normal rate.

Unable to post notification to SQLServerAgent (reason: The maximum number of pending SQLServerAgent notifications has been exceeded.

Edit 2: We also noticed lots of the following in the SQL ERRORLOG:

Failure to send an event notification instance of type 'DATABASE_MIRRORING_STATE_CHANGE' on conversation handle '{5A821186-76A4-DD11-9F30-00188B481C62}'. Error Code = '8429'. 

Edit 3: The chain of PID's that where blocking the WAITFOR command above where nearly all of type "BRKR TASK" – often flagged as Head Blocker & consuming the most CPU – SQL is doing something internally – is there any way to determine what a non user PID is doing?

Edit 4: sp_WhoIsActive shows no rows (which makes sense since the server is idle) – yet with the CPU constantly at 28% with sqlsvr.exe being the only process using CPU we've ran it with @show_system_spids = 1 instead:

Note that all CPU is being used by non-user background processes

Best Answer

We had to bring MS in on this one via a paid support call. After a week of dumps, xperf analysis, driver updates, playig with Virtual infrastructure we ruled out all the basics (or so we thought) - over a week later we found the problem - there was over 500,000 sys.server_event_notifications defined - these all looked as follows:

name    object_id   parent_class    parent_class_desc   parent_id   create_date modify_date service_name    broker_instance creator_sid principal_id
SQLWEP_E0B9EA66_1E73_45B7_B528_75166DC7A15D 337640  100 SERVER  0   2013-06-14 01:56:27.077 2013-06-14 01:56:27.077 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_6492D14E_99D2_49F2_9CD9_43F4B7DC7F63 338711  100 SERVER  0   2013-02-12 17:50:55.897 2013-02-12 17:50:55.897 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_C9871B51_F866_48A6_862E_08EC4F92DAFA 351457  100 SERVER  0   2013-03-06 02:04:18.070 2013-03-06 02:04:18.070 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_0E5D8FAF_1A9C_43FA_8EDF_A5713A8C7825 362061  100 SERVER  0   2014-02-08 04:57:17.217 2014-02-08 04:57:17.217 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_712ED252_E529_4022_9EA3_C785BAEB8147 364203  100 SERVER  0   2013-04-05 15:52:11.190 2013-04-05 15:52:11.190 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_E2D31BC2_E416_4D12_8B4A_88F4183FE021 374807  100 SERVER  0   2014-03-15 13:31:25.390 2014-03-15 13:31:25.390 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_7B389ACB_EA1F_434C_8B2A_7B93D4D0F098 376949  100 SERVER  0   2013-05-12 05:02:12.450 2013-05-12 05:02:12.450 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_4A828CDE_D72D_4277_97C0_030686F86377 378020  100 SERVER  0   2013-01-21 16:49:04.780 2013-01-21 16:49:04.780 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_4EA3C2AD_ACCE_4A66_83E9_0676D2250C34 387553  100 SERVER  0   2014-04-24 20:26:31.053 2014-04-24 20:26:31.053 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259
SQLWEP_6EA3C0FB_6EDA_4B9D_A3FE_D2B07799212C 389695  100 SERVER  0   2013-06-16 18:07:06.043 2013-06-16 18:07:06.043 SQL/Notifications/ProcessWMIEventProviderNotification/v1.0  706DF30B-6BD5-458B-85A4-F1C174263ADE    0x010100000000000512000000  259

These had all been created gradually over 5 years, what they are, where they come from, no one could say & MS couldn't find any related alerts to justify them either. Dropping 1 notification took 4 seconds so we would have been looking at 22 days to drop them all - suffice to say we ended up nuking the SQL instance & re-installing SQL from scratch.