SQL Server 2016 – Resolving DBCC Failures Due to AlwaysOn State Changes

availability-groups dbcc-checkdb jobs sql-server sql-server-2016

I have a job, DatabaseIntegrityCheck – USER_DATABASES, that is failing with the following error:

Date and time: 2018-10-29 02:34:13 Command: DBCC CHECKDB ([one of my
databases]) WITH NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY HResult
0x254, Level 21, State 1 Cannot continue the execution because the
session is in the kill state.

I get an alert email with the following error message:

The client was unable to reuse a session with SPID 193, which had been
reset for connection pooling. The failure ID is 46. This error may
have been caused by an earlier operation failing. Check the error logs
for failed operations immediately before this error message.

On the primary server I get this message:

The availability group database "all databases" is changing roles from
"PRIMARY" to "RESOLVING" because the mirroring session or availability
group failed over due to role synchronization. This is an
informational message only. No user action is required.

and on the secondary:

The availability group database "each database" is changing roles from
"SECONDARY" to "SECONDARY" because the mirroring session or
availability group failed over due to role synchronization. This is an
informational message only. No user action is required.

I could not find the cause of this error, so I have escalated it to our sysadmins, partly because I suspect it has something to do with the network.

Question:
Since there is nothing else about this in the error log, where else can I find information about this issue?

I have run sp_Blitz with markdown output:

EXEC sp_Blitz @OutputType = 'markdown', @CheckServerInfo = 1;

and the results are here

I have also run the following command on one of my small databases, test1, which is part of the availability group:

DBCC CHECKDB ([test1]) WITH ALL_ERRORMSGS, DATA_PURITY

That gave me the results shown here. Importantly, on this occasion it did not cause any failover or change of state in the availability group.

Update:

Yesterday I changed the schedule of the job DatabaseIntegrityCheck – USER_DATABASES from 2 am to 3 am, and it upset the availability group at 3:43 am.
It seems to be caused by the DBCC CHECKDB.

Best Answer

I have experienced this behavior before, where the AGs "blip":

  • from PRIMARY to RESOLVING and back to PRIMARY
  • from SECONDARY to SECONDARY
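As for where else to look besides the error log: these transitions are normally also captured by the built-in AlwaysOn_health Extended Events session. The query below is a rough sketch for listing them. It assumes that session is running and writing its .xel files to the default log directory, and that the event exposes the field names used here (a field that is named differently just comes back as NULL).

```sql
-- Sketch: list replica state changes recorded by the AlwaysOn_health
-- Extended Events session. Assumes the session is running and its files
-- are in the default LOG directory; adjust the file pattern otherwise.
WITH xe AS (
    SELECT CAST(event_data AS xml) AS x
    FROM sys.fn_xe_file_target_read_file(N'AlwaysOn_health*.xel', NULL, NULL, NULL)
    WHERE object_name = N'availability_replica_state_change'
)
SELECT
    x.value('(event/@timestamp)[1]', 'datetime2(3)') AS event_time_utc,
    -- Field names below are assumptions about the event payload.
    x.value('(event/data[@name="availability_group_name"]/value)[1]', 'nvarchar(128)') AS ag_name,
    x.value('(event/data[@name="previous_state"]/text)[1]', 'nvarchar(60)') AS previous_state,
    x.value('(event/data[@name="current_state"]/text)[1]', 'nvarchar(60)') AS current_state
FROM xe
ORDER BY event_time_utc DESC;
```

The timestamps are in UTC, so convert them before lining them up with the job history and the error log.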

It's my understanding that, on a system with a small number of logical processors (1-4), and with multiple AGs (or AGs with many databases), some of these "blips" are unavoidable. Especially if the server is a VM with even slightly noisy neighbors.

Availability Groups use a lot of worker threads even at baseline. You can read about the thread requirements for AGs in detail here: Thread Usage by Availability Groups
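To get a rough idea of how close the instance is to running out of worker threads, something like the following can be run while the maintenance jobs are active. This is only a sketch using standard DMVs; it shows a point-in-time snapshot, and what counts as "too close" is a judgment call.

```sql
-- Snapshot of worker thread usage versus the configured ceiling.
SELECT
    si.cpu_count,
    si.max_workers_count,                              -- ceiling for the instance
    SUM(s.current_workers_count) AS current_workers,   -- workers currently allocated
    SUM(s.active_workers_count)  AS active_workers,    -- workers assigned a task
    SUM(s.work_queue_count)      AS queued_tasks       -- tasks waiting for a worker
FROM sys.dm_os_sys_info AS si
CROSS JOIN sys.dm_os_schedulers AS s
WHERE s.status = N'VISIBLE ONLINE'
GROUP BY si.cpu_count, si.max_workers_count;

-- Cumulative THREADPOOL waits since the last restart (or stats clear).
-- A count that keeps growing suggests tasks have had to wait for a worker.
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'THREADPOOL';
```

If queued_tasks is non-zero or the THREADPOOL numbers climb during the CHECKDB window, thread pressure is a plausible contributor to the blips.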

I'd suggest that you avoid having other CPU-heavy scheduled tasks run at the same time, in order to reduce the risk of these events occurring. In this case, it looks like index maintenance was running at the same time as CHECKDB. If you have a big enough window, try to spread these out so they don't overlap.
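One way to check which jobs overlap is to query the SQL Agent job history in msdb. The sketch below lists every job whose run intersects a chosen time window. The window values are just examples taken from the failure time in the question, and msdb.dbo.agent_datetime is an undocumented helper function that ships with msdb, so treat it as an assumption that it is available on your build.

```sql
-- Example window around the failure in the question; change as needed.
DECLARE @window_start datetime = '2018-10-29T00:00:00';
DECLARE @window_end   datetime = '2018-10-29T06:00:00';

WITH runs AS (
    SELECT
        j.name AS job_name,
        msdb.dbo.agent_datetime(h.run_date, h.run_time) AS started_at,
        -- run_duration is stored as an integer in HHMMSS form.
        (h.run_duration / 10000) * 3600
          + ((h.run_duration / 100) % 100) * 60
          + (h.run_duration % 100) AS duration_seconds
    FROM msdb.dbo.sysjobhistory AS h
    JOIN msdb.dbo.sysjobs AS j
        ON j.job_id = h.job_id
    WHERE h.step_id = 0          -- step 0 is the overall job outcome row
)
SELECT
    job_name,
    started_at,
    DATEADD(SECOND, duration_seconds, started_at) AS finished_at
FROM runs
WHERE started_at < @window_end
  AND DATEADD(SECOND, duration_seconds, started_at) > @window_start
ORDER BY started_at;
```

Jobs whose started_at/finished_at ranges intersect the CHECKDB run are the candidates to move to a different slot. Note that a job that was still running, or that was killed before writing its outcome row, will not appear here.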