Sql-server – How AlwaysOn Availability Group secondary replica catches up with primary after secondary server long downtime

availability-groups sql-server

Can someone point me to an MS article or blog post that explains in detail how an AlwaysOn Availability Group secondary replica catches up with the primary after a long outage of the secondary server?

I ran the tests below with an AAG (async, manual failover, read-only configuration).
A) Killed the secondary instance during continuous inserts into the primary and started the secondary instance a few minutes later. The AAG dashboard turned green almost immediately after the secondary restarted, and the secondary began catching up with the primary until the row counts were the same on both instances. No transaction log backup was taken.
B) Same as A), but a few transaction log backups were taken on the primary during the test.

Questions are:

1) What is the size of the log cache, messaging framework, etc. that are used to hold the transaction log blocks sent to the secondary replica?

2) Can the sizes of those structures (log cache, send queue, or whatever is used as the transport for AAG replication) be configured or increased, similar to increasing the transaction log backup retention period in log shipping, for example?

3) Since I backed up (truncated) the transaction log in test B) and the secondary replica was still synchronized with the primary automatically, what was used to find the row difference between primary and secondary (apparently not the transaction log, as it was truncated) and then bring them in sync?

4) How does this automatic catch up process work and what are its limitations?

Best Answer

If the secondary is up and running, then when a log block is flushed to disk (either because it is full or because of a commit), the record is passed to the log writer on the primary and, at the same time, to the log scanner (log reader) process on the primary. The log scanner communicates with the secondary, and the secondary pulls the log records from the log scanner on the primary and processes them. The primary's log writer does not push transactions across; it communicates with the secondary only to check that it is up, so that it knows whether it has to mark the replica as NOT SYNCHRONIZING.

When the secondary is not up, the log writer can't communicate with it, so it marks the replica as NOT SYNCHRONIZING and keeps the records in the transaction log on the primary. If you look at the log_reuse_wait_desc column in sys.databases, it should show AVAILABILITY_REPLICA, which means the primary is holding on to all the log records that the secondary has not yet received.
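
As a rough illustration of that check, a query like the one below can be run on the primary while the secondary is down. The database name YourAgDatabase is a placeholder, not something from the original question.

```sql
-- Run on the primary replica.
-- YourAgDatabase is a hypothetical name; substitute your own AG database.
select name,
       recovery_model_desc,
       log_reuse_wait_desc   -- expect AVAILABILITY_REPLICA while a secondary is behind
from sys.databases
where name = N'YourAgDatabase';
```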

Once the secondary is back up, it contacts the primary to request a log scan. It then processes the transactions and sends progress messages to the primary to report its hardened LSN. Presumably the primary then adjusts its MinLSN, so the records before MinLSN become eligible for truncation as checkpoints occur, and the VLFs that hold them are released for reuse when you take a log backup.
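
One way to watch the catch-up as it happens is to poll the availability group DMVs on the primary. This is only a sketch: it lists every availability database on every replica, and you would normally filter it down to the database you care about.

```sql
-- Run on the primary replica; it reports the state of all replicas.
select ar.replica_server_name,
       db_name(drs.database_id)    as database_name,
       drs.is_local,
       drs.synchronization_state_desc,
       drs.last_hardened_lsn,      -- LSN the replica has hardened to its log
       drs.log_send_queue_size,    -- KB of log not yet sent to the secondary
       drs.log_send_rate,          -- KB/sec
       drs.redo_queue_size         -- KB of log hardened but not yet redone
from sys.dm_hadr_database_replica_states as drs
join sys.availability_replicas as ar
    on ar.replica_id = drs.replica_id
order by ar.replica_server_name, database_name;
```

While the secondary is catching up, log_send_queue_size should shrink towards zero and last_hardened_lsn on the secondary row should keep advancing.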

So the short answer is: while your secondary is down, the log file on the primary has to be big enough to hold everything generated for as long as it is down. Once the secondary is back up and synchronized, you may at some point need to remove the database from the availability group to shrink the log, if it has grown huge and you don't want it that big.
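
To keep an eye on how large the log is getting during such an outage, something like the following could be run in the context of the affected database on the primary. The database name is again a placeholder.

```sql
use YourAgDatabase;   -- hypothetical database name
go

-- Current size and usage of this database's transaction log.
select total_log_size_in_bytes / 1048576.0 as total_log_mb,
       used_log_space_in_bytes / 1048576.0 as used_log_mb,
       used_log_space_in_percent
from sys.dm_db_log_space_usage;
go
```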

Related Solutions

Sql-server – Transaction Log Maintenance While Using AlwaysOn Availability Group

You can back up the transaction log from either of the replicas. Taking the transaction log backup on either the primary replica or the secondary replica will mark both replicas' transaction logs as reusable (provided nothing else is preventing reuse, such as active transactions).

To test this in a non-production environment, set up an availability group just as you have it in your production system (asynchronous commit to the secondary replica).

In my test environment I have a test database, TestBackupDatabase, and I bloated it with logged transactions through a dummy table:

use TestBackupDatabase;
go

create table dbo.TestTable
(
    id int identity(1, 1) not null,
    some_int int not null
        default 1
);
go

insert into dbo.TestTable
default values;
go 1000

Now when I do a transaction log backup on my primary, using DBCC SQLPERF(LOGSPACE) I see on both transaction logs (primary and secondary) that space used has dropped due to log truncation. Bloating the transaction log back up with the same test on the primary:

insert into dbo.TestTable
default values;
go 1000

I now do a transaction log backup on the secondary async replica. Running DBCC SQLPERF(LOGSPACE) again on each replica I see the same behavior: transaction log reuse.
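
For reference, the backup-and-check step used in both tests could look roughly like this. The same statements work on the primary or on the secondary replica; the backup path is a made-up example and not taken from the original test.

```sql
-- Run on whichever replica you want to take the log backup from.
backup log TestBackupDatabase
to disk = N'C:\Backup\TestBackupDatabase_log.trn';   -- hypothetical path
go

-- Then run this on each replica and compare "Log Space Used (%)"
-- for TestBackupDatabase before and after the backup.
dbcc sqlperf(logspace);
go
```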

BOL Reference: Active Secondaries: Backup on Secondary Replicas (AlwaysOn Availability Groups)

Sql-server – Shrink Transaction Log While Using AlwaysOn Availability Group

In AGs, writes can occur only on the primary. Shrink operations are writes, so you must do the shrink on the primary. Note that the shrink may not reduce the file as much as you expect; your test on the restored DB probably used the simple recovery model. Read How to shrink the SQL Server log for more info.
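
A minimal sketch of a shrink on the primary, assuming a database called YourAgDatabase with a log file whose logical name is YourAgDatabase_log (both names and the target size are placeholders):

```sql
-- Run on the primary replica.
use YourAgDatabase;   -- hypothetical database name
go

-- Look up the logical name and current size of the log file.
select name, type_desc, size * 8 / 1024 as size_mb   -- size is in 8 KB pages
from sys.database_files
where type_desc = N'LOG';
go

-- Shrink the log file to a target size given in MB.
-- The file can only shrink back to the last active VLF, so the
-- result may be larger than the target.
dbcc shrinkfile (N'YourAgDatabase_log', 8192);       -- hypothetical name and target
go
```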

Do not shrink to 160 MB. Determine why the log grew to 121 GB so that it does not happen again (you have a suspicion; it would be nice to confirm it if possible). Size the log appropriately for your operational needs. Log growth is a serious problem: it cannot use instant file initialization, and all your database activity freezes while the log grows and is being zero-initialized. Users and apps hate it when this occurs. If you understand the impact and your users are OK with it, you can shrink once to a small size (160 MB is probably too small, though) and let it grow until it stabilizes.
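
Once you know what size the log really needs, pre-sizing it avoids repeated autogrowth pauses. The statement below is only an illustration; the database name, logical file name, and both sizes are assumptions that you would replace with values that fit your workload.

```sql
-- Run on the primary replica.
alter database YourAgDatabase                 -- hypothetical database name
modify file (
    name = N'YourAgDatabase_log',             -- hypothetical logical log file name
    size = 8192MB,                            -- pre-grow to the expected working size
    filegrowth = 512MB                        -- fixed growth increment instead of a percentage
);
go
```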
