Sql-server – Availability Group reporting disconnected replica

availability-groupssql serversql-server-2012

I'm trying to configure availability groups in a VM environment so I can run some tests.

I think I've got the group created correctly, I can see the Always on group on both servers. However when I look at the dashboard for the group it has the following error

Availability replica disconnected

This secondary replica is not connected to the primary replica. The connected state is DISCONNECTED.

I've checked the endpoints on both servers and they look correct. There are no firewalls running and both servers can see each other. What's the best way to debug this sort of error?

Below is the TSQL I used to set all this up

Primary Server

CREATE ENDPOINT dbm_endpoint
    STATE=STARTED 
    AS TCP (LISTENER_PORT=7022) 
    FOR DATABASE_MIRRORING (ROLE=ALL)
GO

Secondary Server

CREATE ENDPOINT dbm_endpoint
    STATE=STARTED 
    AS TCP (LISTENER_PORT=5022) 
    FOR DATABASE_MIRRORING (ROLE=ALL)
GO

Primary Server

CREATE AVAILABILITY GROUP AG1
    FOR
        DATABASE TestDb
    REPLICA ON
        'SQL1' WITH
            (
                ENDPOINT_URL = 'TCP://sql1.sql.sandbox.net:7022',
                PRIMARY_ROLE ( ALLOW_CONNECTIONS = READ_WRITE),
                SECONDARY_ROLE (ALLOW_CONNECTIONS=READ_ONLY),
                AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
                FAILOVER_MODE = MANUAL
            ),
        'SQL2' WITH
            (
                ENDPOINT_URL = 'TCP://sql2.sql.sandbox.net:5022',
                PRIMARY_ROLE ( ALLOW_CONNECTIONS = READ_WRITE),
                SECONDARY_ROLE (ALLOW_CONNECTIONS=READ_ONLY),
                AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
                FAILOVER_MODE = MANUAL
            );

Secondary Server

ALTER AVAILABILITY GROUP AG1 JOIN;

Obviously I also restored the primary database to the secondary server as well.

One thought, I didn't install the SQL Agent on either server, I'm guessing this is not needed for always on availability groups?

Best Answer

This question is very old and likely no more troubleshooting will be possible, however for those of you stumbling upon this question having the same issue the following may be of some help. This isn't meant to be a complete list of all issues that could occur, but it will hit the main ones that do.

When the secondary replica is marked as disconnected by the primary, this means the primary hasn't heard or can't contact the secondary replica over the database mirroring endpoint. There are various reasons for this:

Endpoint configuration

One of the most common causes are misconfigured endpoints for the Database Mirroring role. There are various options and configurations that can be used and sometimes, especially when created by hand, those configuration options do not match. Take the OPs example:

CREATE ENDPOINT dbm_endpoint
    STATE=STARTED 
    AS TCP (LISTENER_PORT=5022) 
    FOR DATABASE_MIRRORING (ROLE=ALL)

This creates an endpoint called dbm_endpoint on tcp port 5022 for database mirroring. However it leaves out quite a bit of configuration options to which we assume are defaulted but we really do not know. This means the default options such as Windows authentication using either Kerberos or NTLM, it also doesn't specify which IP address (v4, v6, interface) is used, nor does it specify encryption options to which the default is REQUIRED and depending on the version of SQL Serve either RC4 or AES.

That is quite the assumption to make, especially when it isn't working. If you're not sure what the endpoint is set to, join sys.tcp_endpoints with sys.database_mirroring_endpoints to turn the resulting single row to individual configuration options.

There are advanced configurations, for example, that use certificates since they can't use Windows authentication. We don't know if the OP has a proper domain and if these servers are joined. Again, the assumption based off the endpoints is that they are, however we don't know for sure. When using certificates the largest issue is that the same certificate is used on multiple replicas yet the private key is not exported when copying the certificate to import to the replicas. Since the endpoints use the private/public key pair, using the same certificate will require that endpoint at some time to use the private key. Make sure the certificate is exported and imported with both keys if you're using it on multiple replicas.

When specifying the IP Address, the default is ANY which could be any of the IPv4 or IPv6 addresses. Though most places don't register or use IPv6 addresses, I do see it as an issue every once and a while. Checking to see what addresses are registered to a host in your DNS can help narrow the issue. You can also setup a network capture and filter based on the port. Every once and a while an IPv6 address is used and it's improperly setup, disabled, or firewalled on other servers but is returned via DNS.

Availability Group Replica Specification

When creating the availability group each replica will need to be specified and as such will need an endpoint url which will define the communications port (database mirroring endpoint).

Taking the OPs example:

REPLICA ON
        'SQL1' WITH
            (
                ENDPOINT_URL = 'TCP://sql1.sql.sandbox.net:7022',
                PRIMARY_ROLE ( ALLOW_CONNECTIONS = READ_WRITE),
                SECONDARY_ROLE (ALLOW_CONNECTIONS=READ_ONLY),
                AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
                FAILOVER_MODE = MANUAL
            ),

This replica, called SQL1 has a TCP endpoint on sql1.sql.sandbox.net port 7022. the address seems to be a FQDN and should be checked that the proper DNS lookups, (if using) Kerberos SPNs, and firewall rules exist or are created. If there are multiple interfaces or addresses that are returned on a lookup of the address, then it may make sense to narrow it down to a specific address in the endpoint url for testing or to narrow it down in the mirroring endpoint specification.

SQL Server Permissions

One of the last surface areas for the disconnected issues to appear is the endpoint permissions on each replica. Since each endpoint the account will need the CONNECT permission (similar if using certificates) in order to use the endpoints. This is a fairly common issue, so common that this is marked in the errorlog and gives you the fix in the error message which is to grant the connect permission to the endpoint.

Windows Clustering

Since the replica manager replies upon Windows Server Failover Clustering to be working (not counting Linux, Docker, or Read-Scale) in order to start up, do make sure the cluster service is running on each replica and each node shows an up status in the cluster (Powershell, Failover Cluster Manager, etc.). If the cluster service is not running, then the manager will stop until it comes back up - this will be logged in the errorlog and should be very evident.

Troubleshooting Methodology

When nothing else works or shows up, what do you do?

Availability groups are a distributed system, this means there are many individual parts (replicas) that are distributed among many areas (databases) that function through a common set of protocols (availability groups). Given this, the troubleshooting methodology differs from typical single instance or clustered instance troubleshooting.

Each part needs to be checked against itself and against the overall metadata available. Additionally, each part needs to have the intermediary items checked against each other intermediary (such as firewalls, network cards, load balancers, etc.).

Start by checking the errorlog on each replica as only some errors (due to where the error occurs) are only applicable when the replica is primary or secondary. After an overview of each errorlog, this should point in a direction - if it does not, then most likely the issue is under SQL Server which could mean networking, firewall, OS, etc. System and application logs can be helpful here as a next step.

Look out for NDIS drivers installed on the system as they could be blocking traffic or incorrectly working and dropping traffic. Additionally, do check the interface diagnostics for the network cards as things such as dropped/discarded packets may be an issue if the disconnected state is intermittent.

When all else fails, check every piece of metadata against the other and make sure all match and are the same. In the OPs metadata specified above, one port was on 7022 and the other on 5022. While this in of itself is not incorrect and can be used, it does make issues in other places such as firewall rules where someone can easily misread the port or assume and not read that the ports are the same.

Depending upon skill level and knowledge, a network trace from the primary side and the secondary side would be ideal. This can show the packets leaving the primary/secondary and hopefully arriving at the opposite side. If it does not, an intermediary is not configured correctly - look for firewalls, load balancers, routers, etc., that are stopping the traffic.