This question is very old and likely no more troubleshooting will be possible, however for those of you stumbling upon this question having the same issue the following may be of some help. This isn't meant to be a complete list of all issues that could occur, but it will hit the main ones that do.
When the secondary replica is marked as disconnected by the primary, this means the primary hasn't heard or can't contact the secondary replica over the database mirroring endpoint. There are various reasons for this:
Endpoint configuration
One of the most common causes are misconfigured endpoints for the Database Mirroring role. There are various options and configurations that can be used and sometimes, especially when created by hand, those configuration options do not match. Take the OPs example:
CREATE ENDPOINT dbm_endpoint
STATE=STARTED
AS TCP (LISTENER_PORT=5022)
FOR DATABASE_MIRRORING (ROLE=ALL)
This creates an endpoint called dbm_endpoint on tcp port 5022 for database mirroring. However it leaves out quite a bit of configuration options to which we assume are defaulted but we really do not know. This means the default options such as Windows
authentication using either Kerberos
or NTLM
, it also doesn't specify which IP address (v4, v6, interface) is used, nor does it specify encryption options to which the default is REQUIRED and depending on the version of SQL Serve either RC4
or AES
.
That is quite the assumption to make, especially when it isn't working. If you're not sure what the endpoint is set to, join sys.tcp_endpoints
with sys.database_mirroring_endpoints
to turn the resulting single row to individual configuration options.
There are advanced configurations, for example, that use certificates since they can't use Windows
authentication. We don't know if the OP has a proper domain and if these servers are joined. Again, the assumption based off the endpoints is that they are, however we don't know for sure. When using certificates the largest issue is that the same certificate is used on multiple replicas yet the private key is not exported when copying the certificate to import to the replicas. Since the endpoints use the private/public key pair, using the same certificate will require that endpoint at some time to use the private key. Make sure the certificate is exported and imported with both keys if you're using it on multiple replicas.
When specifying the IP Address, the default is ANY
which could be any of the IPv4 or IPv6 addresses. Though most places don't register or use IPv6 addresses, I do see it as an issue every once and a while. Checking to see what addresses are registered to a host in your DNS can help narrow the issue. You can also setup a network capture and filter based on the port. Every once and a while an IPv6 address is used and it's improperly setup, disabled, or firewalled on other servers but is returned via DNS.
Availability Group Replica Specification
When creating the availability group each replica will need to be specified and as such will need an endpoint url which will define the communications port (database mirroring endpoint).
Taking the OPs example:
REPLICA ON
'SQL1' WITH
(
ENDPOINT_URL = 'TCP://sql1.sql.sandbox.net:7022',
PRIMARY_ROLE ( ALLOW_CONNECTIONS = READ_WRITE),
SECONDARY_ROLE (ALLOW_CONNECTIONS=READ_ONLY),
AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
FAILOVER_MODE = MANUAL
),
This replica, called SQL1 has a TCP endpoint on sql1.sql.sandbox.net port 7022. the address seems to be a FQDN and should be checked that the proper DNS lookups, (if using) Kerberos SPNs, and firewall rules exist or are created. If there are multiple interfaces or addresses that are returned on a lookup of the address, then it may make sense to narrow it down to a specific address in the endpoint url for testing or to narrow it down in the mirroring endpoint specification.
SQL Server Permissions
One of the last surface areas for the disconnected issues to appear is the endpoint permissions on each replica. Since each endpoint the account will need the CONNECT
permission (similar if using certificates) in order to use the endpoints. This is a fairly common issue, so common that this is marked in the errorlog and gives you the fix in the error message which is to grant the connect permission to the endpoint.
Windows Clustering
Since the replica manager replies upon Windows Server Failover Clustering to be working (not counting Linux, Docker, or Read-Scale) in order to start up, do make sure the cluster service is running on each replica and each node shows an up status in the cluster (Powershell, Failover Cluster Manager, etc.). If the cluster service is not running, then the manager will stop until it comes back up - this will be logged in the errorlog and should be very evident.
Troubleshooting Methodology
When nothing else works or shows up, what do you do?
Availability groups are a distributed system, this means there are many individual parts (replicas) that are distributed among many areas (databases) that function through a common set of protocols (availability groups). Given this, the troubleshooting methodology differs from typical single instance or clustered instance troubleshooting.
Each part needs to be checked against itself and against the overall metadata available. Additionally, each part needs to have the intermediary items checked against each other intermediary (such as firewalls, network cards, load balancers, etc.).
Start by checking the errorlog on each replica as only some errors (due to where the error occurs) are only applicable when the replica is primary or secondary. After an overview of each errorlog, this should point in a direction - if it does not, then most likely the issue is under SQL Server which could mean networking, firewall, OS, etc. System and application logs can be helpful here as a next step.
Look out for NDIS drivers installed on the system as they could be blocking traffic or incorrectly working and dropping traffic. Additionally, do check the interface diagnostics for the network cards as things such as dropped/discarded packets may be an issue if the disconnected state is intermittent.
When all else fails, check every piece of metadata against the other and make sure all match and are the same. In the OPs metadata specified above, one port was on 7022 and the other on 5022. While this in of itself is not incorrect and can be used, it does make issues in other places such as firewall rules where someone can easily misread the port or assume and not read that the ports are the same.
Depending upon skill level and knowledge, a network trace from the primary side and the secondary side would be ideal. This can show the packets leaving the primary/secondary and hopefully arriving at the opposite side. If it does not, an intermediary is not configured correctly - look for firewalls, load balancers, routers, etc., that are stopping the traffic.
From here, the sql_severity
column is the "severity of any SQL Server error."
When this column has the value 0, that means either no message was returned (and it's therefore a magic value), or a message with severity 0 was returned. I would argue this is a bad table design, as the column does not allow NULL
.
It's almost the same thing with the sql_message_id
column. This column will contain 0 when no message was returned -- in this case, though, it's purely a magic value, as there's no message id 0 (run SELECT * FROM sys.messages WHERE message_id = 0;
).
You may want to clarify the intent of the report query you've inherited, because only comparing the sql_severity
column may not do exactly what the author thought. (Severity 0 errors being raised are undoubtedly rare, but hey, you asked.)
If it's supposed to look for steps that returned a message, then it should check for sql_message_id
> 0. As mentioned, it's completely possible that a job step can have a sql_severity
> 0 and/or sql_message_id
> 0 while still succeed. **
If the query is supposed to look for failed steps, it should be looking at the run_status
column instead.
** You can try this yourself by creating and running a job with a T-SQL step that simply calls RAISERROR
.
Best Answer
Zip codes are strings, not numbers. Some of them have 1 or even 2 (but not more than 2) leading zeroes. The datatype in the import table should be
VARCHAR(10)
so that it can hold 5 digit and 9 digit + hyphen zip codes. Even if you never have to store postal codes of other countries, and even if the values will only ever have numeric digits (i.e. 0 - 9), this data is still string data, just like phone numbers.Based on the import query shown in your other question (automate import and export of process EXCEL -> SQL SERVER - > EXCEL without using SSIS ):
I would suggest not relying upon the
SELECT INTO
construct to create theXLS_IMPORT
table, but instead create the import table(s) manually, and then use theINSERT INTO ... SELECT FROM OPENROWSET()
construct. Doing this would allow you to do the following things to improve this situation:ZipCode
field asVARCHAR(10)
FLOAT
value of432013256
, will return432013256
instead of4.32013e+008
(which is what you get when you convert toVARCHAR
).Fix any missing leading zeroes using something like the following:
Example:
Returns:
Ideally, you would fix the column definition in the spreadsheet to be a string. But even if you do that, it might still be a good idea to keep this code around.
With that goal in mind, the following inline TVF can be used to both convert the
FLOAT
value toVARCHAR
, and to add in the hyphen for ZIP + 4 values.Code for iTVF:
Test:
Returns:
In order to be clearer about what is being suggested, the following shows all of the suggestions noted above put together:
Some of the spreadsheets are "aware" of there being leading zeroes, and so prefixed the field in Excel with a single apostrophe so that Excel would treat the value as a string instead of as numeric (e.g.
'01234
). In that case, you can use theREPLACE
function to strip out that apostrophe.However, you can't use
REPLACE
on all spreadsheets because the spreadsheets that have theZIP
column as numeric will do aCONVERT_IMPLICIT
as it passes the value to theREPLACE
function and the converted value will be in scientific notation (e.g.1.23457e+008
). So if you can't be certain which datatype is going to be returned byOPENROWSET
, then you can useIIF
(orCASE
if using a version of SQL Server prior to 2012) andCHARINDEX
to test for the presence of an apostrophe. If the data is being returned asFLOAT
, then the implicit conversion toVARCHAR
that will occur when passing the values intoCHARINDEX
will not be an issue since there is no apostrophe in1.23457e+008
and the converted value is only going toCHARINDEX
and not todbo.FormatZIPCode
.