Sql-server – AlwaysON : Log_Send_Queue_Size is increasing but the log_send_rate keeps decreasing

availability-groupsperformancesql-server-2012

Question

Why, despite the increasingly log_send_queue_size for these two databases, did the log_send_rate keep decreasing?
There were no bandwidth issues on the network at this time, and no other databases experienced this issue. If this happens again, is there a recommended fix, apart from having to manually restore the primary database over the secondary to re-synchronize the pair?

Environment :

SQL 2012, SP1 CU7 (Build 3393)

Windows Server 2012 Standard (Build 9200)

Availability Group of 10 databases (PRDDB1-AG1)

2 AG replicas, one in London and one in New York (LDSERVER1 & NYSERVER1), primary in NY, secondary in London.

2 databases within AG1, E-DB1 (50GB log file) and T-DB2 (250GB log file)

The T-DB2 database imports files from clients, processes them (lots of log activity) and then outputs to/updates data in the E-DB1 database.

This process generates lots of data churn and log activity against both databases.
We have occasional spikes of latency between London and New York database replicas maybe once or twice a week maximum but these always clear within a few hours.

Issue :

Last week we saw an increasing log_send_queue_size, and a decreasing log_send_rate. This started on the Monday, and carried on through to Friday night, when it was manually resolved (see Fix section below).
At its lowest, the log_send_rate of the E-DB1 database was just over 100KB/sec with a log_send_queue of over 40GB.
The T-DB2 database had a log_send_rate of 2000KB/sec shrinking down to 300KB/sec, with a log_send_queue of over 300GB.

This led to an increasing amount of latency between the primary and secondary replicas for these two databases within the availability group.
This was characterised by a build-up of log activity within the transaction log for each database affected, which is expected.
Because of this latency, the logs of each database affected expanded to the point where the log drive was in danger of running out of space.

This latency only occurred on these two databases, despite some quite large spikes in transactional activity across all databases within the availability group, as is normal.

Throughout this issue, there was no build-up in the redo queue on the secondary, and the redo_rate stayed high. This would imply that the issue was due to the low send rate for both the affected databases.

Steps attempted

Suspend data movement of the T-DB2 database. I hoped this would free up network bandwidth for the priority database, E-DB1. No effect.
Rebooted the secondary node (LDPRDENTDB1). No effect.

Fix

The following steps resolved the issue. As the log files had grown to over 300GB, I needed to purge and shrink them soon before we ran out of disk space.

a. Removed databases from the availability group.

b. Dropped databases on secondary.

c. Re-added databases back into availability group on primary (NYSERVER1, manual synchronisation option).

d. Backed up the databases on primary and restored to secondary (70GB copied from NY to LD, over just under 24 hours)

e. Re-added databases back into availability group on secondary.

Best Answer

Answering my own question, as future readers will benefit from it :

Seems like we might be hitting Longer latency for SQL Server 2012 database when you use Service Broker, database mirroring, and Availability Groups. This is fixed in SQL server 2012 SP2 CU1. The KB 2976982 has a typo(AlawysOn). So if you are searching by AlwaysON, it wont show up.

After the patch was applied, the issue was fixed.

Related Solutions

Sql-server – Transaction Log Maintanance While Using AlwaysOn Availability Group

You can backup the transaction log from either of the replicas. Doing the transaction log backup on either the primary replica or the secondary replica will mark both replicas' transaction logs as reusable (provided no other stoppers are in place like active transactions, etc.).

To do a test, in a non-production environment, setup an availability group just as you have it in your production system (asynchronous commit to the secondary replica).

In my test environment I have a test database, TestBackupDatabase, and I bloated it with logged transactions through a dummy table:

use TestBackupDatabase;
go

create table dbo.TestTable
(
    id int identity(1, 1) not null,
    some_int int not null
        default 1
);
go

insert into dbo.TestTable
default values;
go 1000

Now when I do a transaction log backup on my primary, using DBCC SQLPERF(LOGSPACE) I see on both transaction logs (primary and secondary) that space used has dropped due to log truncation. Bloating the transaction log back up with the same test on the primary:

insert into dbo.TestTable
default values;
go 1000

I now do a transaction log backup on the secondary async replica. Running DBCC SQLPERF(LOGSPACE) again on each replica I see the same behavior: transaction log reuse.

BOL Reference: Active Secondaries: Backup on Secondary Replicas (AlwaysOn Availability Groups)

Sql-server – Managing logins with Availabilty Groups

Assuming these are not contained databases, you need to worry about the logins (server level), not the users (database level). Contained users will be transferred automatically without difficulty.

Manually creating a login with the same name on the other server is not good enough - the SID will not match.

Robert Davis has written a post that describes how to generate the login scripts with the right SID and the right password hash. This is what you need to run on each of the replicas:

Transferring Logins to a Database Mirror

(Also see the comment thread on this answer.)

In order to just script all of the logins on the local server, change the stored procedure by commenting out this section (and the End that goes with it):

--If Not Exists (Select 1 From sys.server_principals
--          Where name = @LoginName)
--  Begin

(There are other, similar IF NOT EXISTS sections you'll want to comment out in order to get roles and permissions.)

Then call the stored procedure with hard-coded @@SERVERNAME and @debug = 1. This will print out CREATE LOGIN commands with the password hash and you can pick the ones you need and run those commands on the replicas.

Jonathan Kehayias wrote an SSMS add-in that handles all of the uncontained objects (don't forget about things like jobs), but I don't know if it has been maintained for newer versions of SSMS:

Synchronize Availability Group Logins and Jobs

Best Answer

Related Solutions

Sql-server – Transaction Log Maintanance While Using AlwaysOn Availability Group

Sql-server – Managing logins with Availabilty Groups

Related Question