SQL Server – How an AlwaysOn Availability Group secondary replica catches up with the primary after a long secondary server downtime

availability-groups sql-server

Can someone point me to an MS article or blog post that explains in detail how an AlwaysOn Availability Group secondary replica catches up with the primary after a long secondary server downtime?

I ran the tests below with an AAG (async, manual failover, read-only configuration).
A) Killed the secondary instance during a continuous insert into the primary and started the secondary instance a few minutes later. The AAG dashboard turned green almost immediately after the secondary restarted, and the secondary started to catch up with the primary until the number of rows was the same on both instances. No transaction log backup was taken.
B) Same as A), but a few transaction log backups were taken on the primary during the test.

Questions are:

1) What is the size of the log cache, messaging framework, etc. that are used to hold the transaction log blocks that are sent to the secondary replica?

2) Can the sizes of the above structures (log cache, send queue, etc., whatever is used as the transport for AAG replication) be configured or increased (similar to increasing the transaction log backup retention period in log shipping, for example)?

3) As I backed up (truncated) the transaction log in test B) and the secondary replica was still synchronized with the primary automatically, what was used to find the row difference between primary and secondary (apparently not the transaction log, as it was truncated) and then bring them in sync?

4) How does this automatic catch up process work and what are its limitations?

Best Answer

If the secondary is up and running, then when a log block is flushed to disk (either because it is full or because of a commit), the record is pushed to the log writer on the primary and to the log scanner (log reader) process on the primary simultaneously. The log scanner then communicates with the secondary, and the secondary pulls the transaction from the log scanner on the primary and processes the log record. The primary log writer doesn't push transactions across. It only communicates with the secondary to see whether it is up, so that it knows it doesn't have to mark the replica as NOT SYNCHRONIZED.

When the secondary is not up, the log writer can't communicate with it, so it marks the replica as NOT SYNCHRONIZED and keeps the records in the transaction log on the primary. If you look at the log_reuse_wait_desc column in sys.databases, it should show AVAILABILITY_REPLICA, which means the primary is hanging on to all the records.
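This also covers question 3: a log backup taken while the secondary is down does not release the records the secondary still needs. The following is only a toy sketch of that rule, not SQL Server's implementation. LSNs are modelled as plain integers, the function name `truncate_on_log_backup` is hypothetical, and real truncation happens per VLF rather than per record.

```python
def truncate_on_log_backup(log_lsns, secondary_hardened_lsn):
    """Toy model: return the LSNs still held in the primary's log after a log backup.

    Assumption for illustration: only records the secondary has already
    hardened (LSN <= secondary_hardened_lsn) can be released.
    """
    return [lsn for lsn in log_lsns if lsn > secondary_hardened_lsn]


# The primary has written LSN 1..10; the secondary went down after hardening LSN 4.
log = list(range(1, 11))
remaining = truncate_on_log_backup(log, secondary_hardened_lsn=4)
print(remaining)  # [5, 6, 7, 8, 9, 10]
```

In this model everything above LSN 4 survives the backup, which is why the secondary can still be caught up from the primary's log afterwards.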

Once the secondary is up, it communicates with the primary to request a log scan. It then processes the transactions and sends progress messages to the primary to indicate its hardened LSN. Presumably the primary then adjusts its MinLSN, which in turn means the records prior to MinLSN become inactive as checkpoints happen, and hence the VLFs get truncated, releasing space, when you do a log backup.
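The catch-up exchange described above can be sketched as a simple loop. This is a toy model under stated assumptions, not the actual protocol: LSNs are integers, the function name `catch_up` and the `batch_size` parameter are hypothetical, and each progress message is reduced to the last LSN hardened in a batch.

```python
def catch_up(primary_log, hardened_lsn, batch_size=3):
    """Toy model: the secondary pulls log records above its hardened LSN in batches.

    Returns the list of hardened LSNs it would report back to the primary,
    one entry per batch (a stand-in for the progress messages).
    """
    progress = []
    while True:
        # Request a scan of everything above the last hardened LSN.
        batch = [lsn for lsn in primary_log if lsn > hardened_lsn][:batch_size]
        if not batch:
            break  # nothing left to pull: the secondary has caught up
        hardened_lsn = batch[-1]       # harden the batch on the secondary
        progress.append(hardened_lsn)  # report progress to the primary
    return progress


# The primary still holds LSN 5..10; the secondary restarts with hardened LSN 4.
progress = catch_up(list(range(5, 11)), hardened_lsn=4)
print(progress)  # [7, 10]
```

Each reported value is the point up to which, in this model, the primary would no longer need to keep log records for the replica.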

But yes, the short answer is: if your secondary is down, the log file needs to be big enough to hold everything generated for as long as it is down. Once the secondary is back up and synchronized, you may at some point need to remove the database from the availability group to shrink the log, if it has become huge and you don't want it that big.
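As a rough back-of-the-envelope check of that sizing rule, the arithmetic looks like the sketch below. The function name `required_log_mb`, the headroom factor, and the example numbers are all assumptions for illustration. Measure your own log generation rate before relying on any figure.

```python
def required_log_mb(log_mb_per_minute, downtime_minutes, headroom=1.2):
    """Estimate the log space needed while the secondary is down.

    Assumes a constant log generation rate and that nothing can be
    truncated until the secondary is back. headroom is an arbitrary
    safety factor.
    """
    return log_mb_per_minute * downtime_minutes * headroom


# Hypothetical workload: 50 MB of log per minute, secondary down for 2 hours.
estimate = required_log_mb(50, 120)
print(f"{estimate:.0f} MB")
```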