SQL Server – Why did the Always On AG lock up?

availability-groups, shrink, sql-server, sql-server-2014

I am in over my head here. tl;dr: I ran a DBCC SHRINKFILE (I know, I know) to shrink a 100 GB file with 10 GB of free space by 2 GB. It ran for ~15 minutes, then everything broke. What I don't understand is WHY everything broke. We have two servers in a synchronous Always On Availability Group, running SQL Server 2014 SP1 on Windows Server 2012 R2. The primary started rejecting transactions with the log message:

Remote harden of transaction 'user_transaction' (ID 0x000000887e6bc779 0000:002ede71) started at Mar 13 2018 4:20PM in database 'AAAA' at LSN (296:17783:1) failed.

Why would a shrinkfile operation on a single database lock up every single database on the secondary? The first thing we tried was restarting the secondary SQL Server instance; this had no effect.

Dropping a database from the AG (ALTER AVAILABILITY GROUP ... REMOVE DATABASE) allowed transactions to resume on that specific database. This left the database in the RESTORING state on the secondary, as expected. In an attempt to put the databases back into the AG, we tried to apply the latest transaction log backups on the secondary. The secondary rejected the log backups with an error about LSNs not matching. Unfortunately I didn't write the specific message down, but it wasn't the simple 'you need to restore an earlier log backup' message. We gave it the transaction log backup; it would spin for 30 seconds and then error out.
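For reference, the remove-and-rejoin sequence we attempted looks roughly like this (AG and database names are the ones from the logs above; the backup path is a placeholder):

```sql
-- On the primary: remove the database from the AG
-- (the secondary copy stays in RESTORING).
ALTER AVAILABILITY GROUP [SQLDB] REMOVE DATABASE [AAAA];

-- On the secondary: try to catch the copy up with log backups.
-- The backup path here is a placeholder.
RESTORE LOG [AAAA]
    FROM DISK = N'\\backupshare\AAAA_log.trn'
    WITH NORECOVERY;

-- Re-join once the LSN chain lines up.
ALTER DATABASE [AAAA] SET HADR AVAILABILITY GROUP = [SQLDB];
```

It was the RESTORE LOG step that failed for us with the LSN mismatch error.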

Primary log entries of note

2018-03-13 16:45:05.650    Backup    BACKUP failed to complete the command
BACKUP LOG AAAA. Check the backup application log for detailed messages.

2018-03-13 16:42:34.830    spid40s    A connection for availability group
'SQLDB' from availability replica 'SQLDB02' with id
[90C6C7E0-E41B-45D6-92D0-1CDCD33F49BF] to 'SQLDB03' with id
[659B20D9-F54D-4CE3-A492-B1CFD4D9BC22] has been successfully established.  This
is an informational message only. No user action is required.

(We restarted the secondary SQL service here.)

2018-03-13 16:42:08.820    spid28s
A connection timeout has occurred on a previously established connection to
availability replica 'SQLDB03' with id [659B20D9-F54D-4CE3-A492-B1CFD4D9BC22].
Either a networking or a firewall issue exists or the availability replica has
transitioned to the resolving role.

2018-03-13 16:27:40.670 BACKUP failed to complete the command BACKUP LOG AAAA.
Check the backup application log for detailed messages.

2018-03-13 16:22:18.900 DBCC SHRINKFILE for file ID 1 is waiting for the
snapshot transaction with timestamp 22194769242 and other snapshot transactions
linked to timestamp 22194769242 or with timestamps older than 22194835703 to
finish.

2018-03-13 16:21:13.460 A connection for availability group 'SQLDB' from
availability replica 'SQLDB02' with id  [90C6C7E0-E41B-45D6-92D0-1CDCD33F49BF]
to 'SQLDB03' with id [659B20D9-F54D-4CE3-A492-B1CFD4D9BC22] has been
successfully established.  This is an informational message only. No user
action is required.

2018-03-13 16:21:13.390 The state of the local availability replica in
availability group 'SQLDB' has changed from 'PRIMARY_PENDING' to
'PRIMARY_NORMAL'.  The state changed because the local replica has completed
processing Online command from Windows Server Failover Clustering (WSFC).  For
more information, see the SQL Server error log, Windows Server Failover
Clustering (WSFC) management console, or WSFC log.

2018-03-13 16:21:13.310 AlwaysOn: The local replica of availability group
'SQLDB' is preparing to transition to the primary role in response to a request
from the Windows Server Failover Clustering (WSFC) cluster. This is an
informational message only. No user action is required.

2018-03-13 16:20:55.550 (x 50) AlwaysOn Availability Groups connection with
secondary database terminated for primary database '' on the availability
replica '' with Replica ID: {659b20d9-f54d-4ce3-a492-b1cfd4d9bc22}. This is an
informational message only. No user action is required.

2018-03-13 16:20:55.540 (x 1000) The client was unable to reuse a session with
SPID 276, which had been reset for connection pooling. The failure ID is 46.
This error may have been caused by an earlier operation failing. Check the
error logs for failed AAAA immediately before this error message.

3/13/2018 4:20:44 PM: The state of the local availability replica in
availability group 'SQLDB' has changed from 'PRIMARY_NORMAL' to
'RESOLVING_NORMAL'

3/13/2018 4:20:44 PM: AlwaysOn: The local replica of availability group 'SQLDB'
is preparing to transition to the resolving role in response to a request from
the Windows Server Failover Clustering (WSFC) cluster. This is an informational
message only. No user action is required.

3/13/2018 4:17:07 PM DBCC SHRINKFILE for file ID 1 is waiting for the snapshot
transaction with timestamp 22194769242 and other snapshot transactions linked
to timestamp 22194769242 or with timestamps older than 22194835703 to finish.

Secondary log entries of note

2018-03-13 17:02:00.990 spid63s The database 'AAAA' is marked RESTORING
and is in a state that does not allow recovery to be run.

2018-03-13 17:02:00.980 spid63s Starting up database 'AAAA'.

2018-03-13 17:02:00.970 spid63s State information for database 'AAAA' -
Hardended Lsn: '(2665414:190311:1)'    Commit LSN: '(2665414:190298:27)'
Commit Time: 'Mar 13 2018  4:20PM'

2018-03-13 17:02:00.940 spid63s State information for database 'AAAA' -
Hardended Lsn: '(2665414:190311:1)'    Commit LSN: '(2665414:190298:27)'
Commit Time: 'Mar 13 2018  4:20PM'

2018-03-13 17:02:00.810 spid63s Nonqualified transactions are being rolled back
in database AAAA for an AlwaysOn Availability Groups state change. Estimated
rollback completion: 100%. This is an informational message only. No user
action is required.

2018-03-13 17:02:00.810 spid112s    AlwaysOn Availability Groups connection
with primary database terminated for secondary database 'AAAA' on the
availability replica 'SQLDB02' with Replica ID:
{90c6c7e0-e41b-45d6-92d0-1cdcd33f49bf}. This is an informational message only.
No user action is required.

We dropped AAAA from the AG on the primary at this point (at 5:02 PM, not 4:43 PM).

2018-03-13 16:43:22.760 Logon   Error: 976, Severity: 14, State: 1.

2018-03-13 16:43:22.760 Logon   The target database, 'AAAA', is participating
in an availability group and is currently not accessible for queries. Either
data movement is suspended or the availability replica is not enabled for read
access. To allow read-only access to this and other databases in the
availability group, enable read access to one or more secondary availability
replicas in the group.  For more information, see the ALTER AVAILABILITY GROUP
statement in SQL Server Books Online.

2018-03-13 16:42:02.740 spid27s CHECKDB for database 'AAAA' finished without
errors on 2018-02-24 23:18:46.247 (local time). This is an informational
message only; no user action is required.

2018-03-13 16:42:02.690 spid27s Recovery completed for database AAAA
(database ID 6) in 26 second(s) (analysis 38 ms, redo 1040 ms, undo 0 ms.) This
is an informational message only. No user action is required.

2018-03-13 16:42:02.650 spid27s 5977 transactions rolled forward in database
'AAAA' (6:0). This is an informational message only. No user action is
required.

2018-03-13 16:41:34.400 spid27s State information for database 'AAAA' -
Hardended Lsn: '(0:0:0)'    Commit LSN: '(0:0:0)'    Commit Time: 'Jan  1 1900
12:00AM'

2018-03-13 16:41:34.400 spid50s AlwaysOn Availability Groups connection with
primary database terminated for secondary database 'AAAA' on the availability
replica 'SQLDB02' with Replica ID: {90c6c7e0-e41b-45d6-92d0-1cdcd33f49bf}. This
is an informational message only. No user action is required.

2018-03-13 16:41:34.400 spid27s State information for database 'AAAA' -
Hardended Lsn: '(0:0:0)'    Commit LSN: '(0:0:0)'    Commit Time: 'Jan  1 1900
12:00AM'

2018-03-13 16:41:34.390 spid27s Nonqualified transactions are being rolled back
in database AAAA for an AlwaysOn Availability Groups state change. Estimated
rollback completion: 100%. This is an informational message only. No user
action is required.

2018-03-13 16:41:34.110 spid22s The state of the local availability replica in
availability group 'SQLDB' has changed from 'NOT_AVAILABLE' to
'RESOLVING_NORMAL'.  The state changed because the local instance of SQL Server
is starting up.  For more information, see the SQL Server error log, Windows
Server Failover Clustering (WSFC) management console, or WSFC log.

2018-03-13 16:41:34.100 Server  The SQL Server Network Interface library could
not register the Service Principal Name (SPN) [ MSSQLSvc/$DB.$DOMAIN ] for the
SQL Server service. Windows return code: 0x2098, state: 15. Failure to register
a SPN might cause integrated authentication to use NTLM instead of Kerberos.
This is an informational message. Further action is only required if Kerberos
authentication is required by authentication policies and if the SPN has not
been manually registered.

...normal startup stuff.

2018-03-13 16:41:09.050 spid9s  SQL Server is terminating in response to a
'stop' request from Service Control Manager. This is an informational message
only. No user action is required.

2018-03-13 16:20:43.880 (x 50) spid169s AlwaysOn Availability Groups connection
with primary database terminated for secondary database 'AAAA' on
the availability replica 'SQLDB02' with Replica ID:
{90c6c7e0-e41b-45d6-92d0-1cdcd33f49bf}. This is an informational message only.
No user action is required.

2018-03-13 16:20:43.880 (x 50)  spid170s    The availability group database
"AAAA" is changing roles from "SECONDARY" to "RESOLVING" because
the mirroring session or availability group failed over due to role
synchronization. This is an informational message only. No user action is
required.

2018-03-13 11:57:34.050 spid106 Server process ID is 2268.

Best Answer

WHY did everything break?

Based on the curated error log output you gave us, the shrinkfile was waiting on an open snapshot transaction to finish:

3/13/2018 4:17:07 PM DBCC SHRINKFILE for file ID 1 is waiting for the snapshot transaction with timestamp 22194769242 and other snapshot transactions linked to timestamp 22194769242 or with timestamps older than 22194835703 to finish.

I'm fairly certain it was stuck this way for a while, because there is another entry later on (16:22:18) with the same transaction timestamp.
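If this happens again, a query like the following (a sketch; sys.dm_tran_active_snapshot_database_transactions is the DMV that tracks open snapshot transactions) should show what SHRINKFILE is waiting on:

```sql
-- Open snapshot transactions, longest-running first; DBCC SHRINKFILE
-- has to wait for these before it can finish moving pages.
SELECT session_id,
       transaction_id,
       transaction_sequence_num,
       elapsed_time_seconds
FROM sys.dm_tran_active_snapshot_database_transactions
ORDER BY elapsed_time_seconds DESC;
```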

After that first logged message, we see...

3/13/2018 4:20:44 PM: AlwaysOn: The local replica of availability group 'SQLDB' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required.

We'd need to see the cluster logs to find out what the cluster thought happened. Other logs of use would be the sp_server_diagnostics output in the LOG folder, the failover policy level, and potentially the Always On health Extended Events files.
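To check the failover policy and pull the Always On health events, something along these lines should work (the .xel path shown is the default session's file pattern for a default SQL Server 2014 instance; adjust it to your instance's LOG directory):

```sql
-- Failover condition level and health-check timeout for the AG.
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;

-- Read the AlwaysOn_health Extended Events files.
SELECT CAST(event_data AS xml) AS event_xml
FROM sys.fn_xe_file_target_read_file(
         N'C:\Program Files\Microsoft SQL Server\MSSQL12.MSSQLSERVER\MSSQL\Log\AlwaysOn_health*.xel',
         NULL, NULL, NULL);
```

The cluster log itself can be generated with Get-ClusterLog from an elevated PowerShell prompt on a cluster node.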

There also started to be issues with the hardening of logs:

Remote harden of transaction 'user_transaction' (ID 0x000000887e6bc779 0000:002ede71) started at Mar 13 2018 4:20PM in database 'AAAA' at LSN (296:17783:1) failed.

It's hard to say why this happened, but it would have left the primary in a fairly precarious state.

Why would a shrinkfile operation on a single database lock up every single database on the secondary?

I'm not sure why it would "lock up every single database", as we only see error messages and issues for a single database. I would, however, expect a long-running shrinkfile holding an exclusive (EX) latch on the database file being shrunk to eventually cause issues, given enough time.

Unfortunately there isn't enough data here to say definitively what happened, but at least this should give you some things to investigate.