SQL Server – Failover error in Always On availability groups

availability-groups, failover, sql server, sql-server-2016

We have a problem with Always On availability groups in Microsoft SQL Server 2016 (SP2). When we try to fail over manually to the secondary node, the failover fails with this error:

Failed to bring availability group 'per-ag1' online. The operation timed out. Verify that the local Windows Server Failover Clustering (WSFC) node is online. Then verify that the availability group resource exists in the WSFC cluster. If the problem persists, you might need to drop the availability group and create it again. (.Net SqlClient Data Provider)

The databases then go into a Not Synchronizing state and the availability group goes into Resolving mode, so we have to reset the secondary node before the availability group returns to the primary node.

We checked the events in Failover Cluster Manager and found these errors:

Error1:

Network Name resource 'per-ag1_per-lis3' (with associated network name 'PER-LIS3') has Kerberos Authentication support enabled. Failed to add required credentials to the LSA – the associated error code is '-2146893802'.
Cluster resource 'per-ag1_per-lis3' of type 'Network Name' in clustered role 'per-ag1' failed.

Error2:

Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

Error3:

The Cluster service failed to bring clustered role 'per-ag1' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

The last one is a timeout:

Error4:

Clustered role 'per-ag1' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

Error2 mentions the Get-ClusterResource cmdlet, so I ran it in Windows PowerShell. I then checked the SQL Server logs as well, hoping to find more detail. The result of the cmdlet is:

Name                     State   OwnerGroup    ResourceType
----                     -----   ----------    ------------
Cluster IP Address       Online  Cluster Group IP Address
Cluster Name             Online  Cluster Group Network Name
File Share Witness       Online  Cluster Group File Share Witness
per-ag1                  Offline per-ag1       SQL Server Availability Group
per-ag1_[ my ip address] Online  per-ag1       IP Address
per-ag1_FSShare          Offline per-ag1       SQL Server FILESTREAM Share
per-ag1_per-lis3         Failed  per-ag1       Network Name

In the normal state, however (after I reset the secondary node and the availability group returns to the primary), every resource comes back online:

Name                 State  OwnerGroup    ResourceType
----                 -----  ----------    ------------
Cluster IP Address   Online Cluster Group IP Address
Cluster Name         Online Cluster Group Network Name
File Share Witness   Online Cluster Group File Share Witness
per-ag1              Online per-ag1       SQL Server Availability Group
per-ag1_172.16.0.230 Online per-ag1       IP Address
per-ag1_FSShare      Online per-ag1       SQL Server FILESTREAM Share
per-ag1_per-lis3     Online per-ag1       Network Name
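
To compare listings like the two above, a small script can flag the resources that are not Online. The following is only a minimal Python sketch and not part of the original troubleshooting: the function name `not_online` and the `sample` text are illustrative, and it assumes the State column holds one of Online, Offline, or Failed.

```python
import re

# One data row: resource name, a state word, then the remaining columns.
# Assumption: State is Online, Offline, or Failed. Matching is
# case-insensitive in case the listing was retyped by hand.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<state>Online|Offline|Failed)\s+(?P<rest>.+)$",
    re.IGNORECASE,
)

def not_online(listing: str) -> list[tuple[str, str]]:
    """Return (name, state) for every resource whose state is not Online."""
    problems = []
    for line in listing.splitlines():
        match = ROW.match(line.strip())
        if match is None:  # header, separator, or blank line
            continue
        state = match["state"].capitalize()
        if state != "Online":
            problems.append((match["name"], state))
    return problems

# Illustrative sample in the same layout as the Get-ClusterResource output.
sample = """\
Name                 State   OwnerGroup    ResourceType
----                 -----   ----------    ------------
Cluster IP Address   Online  Cluster Group IP Address
per-ag1              Offline per-ag1       SQL Server Availability Group
per-ag1_FSShare      Offline per-ag1       SQL Server FILESTREAM Share
per-ag1_per-lis3     Failed  per-ag1       Network Name
"""

for name, state in not_online(sample):
    print(f"{name}: {state}")
```

Run against the failed listing, this would point at the availability group, the FILESTREAM share, and the listener network name, which is the resource named in Error1.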

I also checked the “Show Dashboard” report, which shows this critical error:

The availability group is offline, and is unavailable. This issue can be caused by a failure in the server instance that hosts the primary replica or by the WSFC availability group resource going offline.

Do you have any suggestions about this error? Any help would be appreciated.
I look forward to hearing suggestions from DBAs.

This is running on Windows Server 2012 R2.

Best Answer

Try giving the VCO AD object full control on the listener AD object.

To do this, open Active Directory Users and Computers, locate the listener object, open its Properties, go to the Security tab, and verify that the virtual cluster object is listed there and has Full Control on this object.

Then retry the failover.