Over a dozen SQL Server mirrors had windows updates installed this morning – 2 of which developed serious faults afterwards.
The windows updates installed where:
- KB2939576
- KB2957189
- KB2957509
- KB2957689
- KB2957503
The 2 servers effected are running Windows Server 2008 Standard & SQL Server 2008 R2 (10.50.4000)
The fault consisted of 20% through 60% constant CPU usage on the mirror (even when the principle server was idle at 0%), constant mirror state changes (from disconnected to synchronising & vica versa).
When looking into the PID's in question we noticed the following blocked event against msdb:
WAITFOR (RECEIVE message_body FROM WMIEventProviderNotificationQueue)
The "Application" listed against this was "Microsoft Windows Operating System" & the status was "SUSPENDED", this was being blocked by a whole chain of system PID's which all seemed to be notification related (for mirror state changes).
As a hunch we tried disabling the WMI service – this wouldn't stop – its almost as if the windows updates have caused WMI to hang and its taking SQL with it – uninstalling the windows updates didn't solve the problem, nor did dropping of all SQL Server Agent Alerts that where making use of WMI events.
Rebooting the servers made no difference – CPU usage was still excessive for no reason, Mirrors where still alternating between Synchronising & Disconnected – on the rare occasion that they did sync, transactions on the principle suddenly became deadly slow (often causing timeouts in client applications) until the mirror dropped again.
We're currently formatting the mirror, re-installing the OS with R2 in an attempt to solve this asap.
Has anyone had problems with these windows updates? Is there any known issues with WMI notifications causing SQL to hang?
I should probably add that the mirror servers in question had been working perfectly before these windows updates for years, CPU usage never goes above 5% really & the mirrors had been synchronised for months – only drops where when servers where deliberately rebooted.
Edit: When we tried dropping all SQL Server Agent alerts (just in-case they where causing the problem) (They fired an email when a mirrored database state changed via WMI) we got the message below suggesting there was something wrong in the notification system & the queue was not clearing down at the normal rate.
Unable to post notification to SQLServerAgent (reason: The maximum number of pending SQLServerAgent notifications has been exceeded.
Edit 2: We also noticed lots of the following in the SQL ERRORLOG:
Failure to send an event notification instance of type 'DATABASE_MIRRORING_STATE_CHANGE' on conversation handle '{5A821186-76A4-DD11-9F30-00188B481C62}'. Error Code = '8429'.
Edit 3: The chain of PID's that where blocking the WAITFOR command above where nearly all of type "BRKR TASK" – often flagged as Head Blocker & consuming the most CPU – SQL is doing something internally – is there any way to determine what a non user PID is doing?
Edit 4: sp_WhoIsActive
shows no rows (which makes sense since the server is idle) – yet with the CPU constantly at 28% with sqlsvr.exe being the only process using CPU we've ran it with @show_system_spids = 1
instead:
Best Answer
We had to bring MS in on this one via a paid support call. After a week of dumps, xperf analysis, driver updates, playig with Virtual infrastructure we ruled out all the basics (or so we thought) - over a week later we found the problem - there was over 500,000 sys.server_event_notifications defined - these all looked as follows:
These had all been created gradually over 5 years, what they are, where they come from, no one could say & MS couldn't find any related alerts to justify them either. Dropping 1 notification took 4 seconds so we would have been looking at 22 days to drop them all - suffice to say we ended up nuking the SQL instance & re-installing SQL from scratch.