Always On Availability Group Failover Issue

availability-groupsdisaster recovery

I'm testing a PowerShell Disaster Recovery script and trying to fail over to an off-site AG DR (asynch) replica with this command

ALTER AVAILABILITY GROUP [MyAG] FORCE_FAILOVER_ALLOW_DATA_LOSS

On several clusters this works fine but on a couple of other ones it usually fails (but not always) with the error:

Failed to move a Windows Server Failover Clustering (WSFC) group to
the local node (Error code 5023). The WSFC service may not be running
or may not be accessible in its current state, or the specified
cluster group or node handle is invalid. For information about this
error code, see "System Error Codes" in the Windows Development
documentation.

Failed to designate the local availability replica of
availability group 'MyAG' as the primary replica. The operation
encountered SQL Server error 41018 and has been terminated. Check the
preceding error and the SQL Server error log for more details about
the error and corrective actions.

We have to force the quorum simulating the situation where the async DR replica can't communicate with the other replicas: https://docs.microsoft.com/en-us/sql/sql-server/failover-clusters/windows/force-a-wsfc-cluster-to-start-without-a-quorum?view=sql-server-2017

A comparison of the clusters that work vs. those that don't with Get-Cluster and Get-ClusterGroup doesn't reveal any major differences.

This failover has worked successfully on some clusters but fails 99% of the time on two of the clusters. It has worked on these two clusters, but rarely. One of the common issues I found from a Google search is permissions for NT Authority\System (see https://dataginger.com/2014/10/28/sql-server-failed-to-bring-availability-group-availability-group-name-online/), which I confirmed it has.

I ran a trace and see the same errors and no other clues in the trace.

If nothing turns up I may have to open a Microsoft ticket to engage their support.

Any thoughts?

Thanks for your help!

Best Answer

This error may happen if your SQL Server service is running under user who does not have the needed permission on your servers or the account is locked. try to use another account and make sure account is unlocked.

If that did not fix the issue then you need to edit your regisery (make sure you take a backup first)

  1. open start menu,write regedit then press ok
  2. navigate to HKEY_LOCAL_MACHINE\Cluster\Resources\<GUID>\Parameters
  3. Create the following registry values in the

    Parameters registry key:

    For a default instance of SQL Server: InstanceName
    Value Name: InstanceName
    Value Type: REG_SZ

    Value Data: MSSQLSERVER

    VirtualServerName

    Value Name: VirtualServerName
    Value Type: REG_SZ

    Value Data:
    For a named instance of SQL Server:

    Value Name: InstanceName
    Value Type: REG_SZ

    Value Data:
    VirtualServerName

    Value Name: VirtualServerName
    Value Type: REG_SZ
    Value Data:

4- quit registery editor