SQL Server – Failed to Create, Join, or Add Replica to Availability Group

availability-groupsclusteringfailoversql server

I have a 2 node FCI and a standalone SQL Server installation on a non FCI node. I have been automating the configuration/installation of FCIs, AGs, and DB Replicas which has worked fine up to now in all of my testing.

Today I am getting the error below when I execute:

USE [master]
GO
CREATE AVAILABILITY GROUP [AGName]
WITH (AUTOMATED_BACKUP_PREFERENCE = SECONDARY)
FOR 
REPLICA ON N'Node3\ReadOnly' WITH (ENDPOINT_URL = N'TCP://Node3-blah.blah.com:5022', FAILOVER_MODE = MANUAL, AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, SESSION_TIMEOUT = 10, BACKUP_PRIORITY = 50, PRIMARY_ROLE(ALLOW_CONNECTIONS = ALL), SECONDARY_ROLE(ALLOW_CONNECTIONS = ALL)),
    N'Primary/Primary' WITH (ENDPOINT_URL = N'TCP://primary.blah.com:5022', FAILOVER_MODE = MANUAL, AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT, SESSION_TIMEOUT = 10, BACKUP_PRIORITY = 50, PRIMARY_ROLE(ALLOW_CONNECTIONS = ALL), SECONDARY_ROLE(ALLOW_CONNECTIONS = NO));
GO

Error:

Msg 19405, Level 16, State 17, Line 3
Failed to create, join or add replica to availability group 'AGName', because node 'Node3' is a possible owner for both replica 'Node3\ReadOnly' and 'Primary/Primary'. If one replica is failover cluster instance, remove the overlapped node from its possible owners and try again.

Node 3 is not part of the FCI. It has a standalone installation of SQL Server and it is not listed as a possible owner.

If I attempt to failover to Node 3 the FCI informs me that it is not a possible owner.

I'm not sure what caused this. I did failover using the FCI between node 1 and 2 several days earlier. I did remove the listener this time to test that, as that was the last thing I was working on. Any thoughts?

I could just tear down the FCI and have the automation recreate it but I wanted to try to solve this without doing that just in case this happens in prod one day. We should be able to evict the node from the WSFC and re-add it as well but I don't want to bug our ops team which has to do that. We will be tearing down the entire cluster and letting it build from scratch before releasing it though.

Edit. This is the output from the FCI node:

select * from sys.dm_os_cluster_nodes

NodeName    status  status_description  is_current_owner
---------------------------------------------------------
SQNodeL001-LA   0   up  1
SQNodeL002-LA   0   up  0
SQLNode003-LA   0   up  0

The output from the standalone is empty. It is part of the WSFC but not FCI or AG yet.

The powershell output shows us that all 3 can be owner nodes which is odd.

ClusterObject                                               OwnerNodes
-------------                                               ----------
SQL Server (Instance)                                           {SQNodeL001-LA, SQNodeL002-LA, SQNodeL003-LA}

From the GUI it does not have node 3 selected as a preferred owner. Pardon for making minor changes and removing the names. The first node name is all in lowercase. The other 2 are in upper case. It looks like I might need to do all this in powershell to get the accurate data, haven't gotten to automating that step yet. That'll be next sprint.

Edit 2 – solved:

Thanks to Sean's PowerShell query I was able to see that it is still listed as a owner in PowerShell although it is not in the GUI. I removed it using get-clusterresource "sql server (instance)" | set-clusterownernode -Owners node1 node2 and it worked. Thanks Sean!

enter image description here

Best Answer

Msg 19405, Level 16, State 17, Line 3 Failed to create, join or add replica to availability group 'AGName', because node 'Node3' is a possible owner for both replica 'Node3\ReadOnly' and 'Primary/Primary'. If one replica is failover cluster instance, remove the overlapped node from its possible owners and try again.

This happens for two main reasons that I've witnessed.

Reason #1 - The resource/group is set to have ownership on the node in error

Sometimes (for a multitude of reasons) resources and resource groups in windows clustering won't always have the same ownership. The best way to diagnose this error is to first check to see what SQL Server (which calls windows clustering APIs) thinks the cluster nodes are:

SELECT * FROM sys.dm_os_cluster_nodes

Once we know what is in the cluster, check via Powershell to see what the cluster thinks the ownership is for the FCI:

Get-ClusterOwnerNode -Resource "SQLFCIInstanceName"

This will return the nodes that could own the cluster resource. Chances are it'll include the node name of a node that we know shouldn't really be there.

To fix this, run the following powershell command:

Get-ClusterResource -Name "SQLFCIInstanceName" | Set-ClusterOwnerNode -Owners NodeName1,NodeName2

Double check by running the first powershell command to check ownership, then try to add the replica to the AG again.

Reason #2 - Node Names + Language != Node Names

If the language used wasn't US_English there would be a good chance that the node names (when compared to each other) wouldn't necessarily compare properly. This would cause a whole bunch of other issues (and it does) with the cluster outside of the AG.

This can be tested by taking the node names, converting them to upper or lower and the comparing them against themselves. Sounds like it should always work... but some languages have special characters that don't do the UPPER and LOWER conversions well.