Sql-server – AlwaysON : Log_Send_Queue_Size is increasing but the log_send_rate keeps decreasing

availability-groupsperformancesql-server-2012

Question

  1. Why, despite the increasingly log_send_queue_size for these two databases, did the log_send_rate keep decreasing?
  2. There were no bandwidth issues on the network at this time, and no other databases experienced this issue. If this happens again, is there a recommended fix, apart from having to manually restore the primary database over the secondary to re-synchronize the pair?

Environment :

SQL 2012, SP1 CU7 (Build 3393)

Windows Server 2012 Standard (Build 9200)

Availability Group of 10 databases (PRDDB1-AG1)

2 AG replicas, one in London and one in New York (LDSERVER1 & NYSERVER1), primary in NY, secondary in London.

2 databases within AG1, E-DB1 (50GB log file) and T-DB2 (250GB log file)

The T-DB2 database imports files from clients, processes them (lots of log activity) and then outputs to/updates data in the E-DB1 database.

This process generates lots of data churn and log activity against both databases.
We have occasional spikes of latency between London and New York database replicas maybe once or twice a week maximum but these always clear within a few hours.

Issue :

Last week we saw an increasing log_send_queue_size, and a decreasing log_send_rate. This started on the Monday, and carried on through to Friday night, when it was manually resolved (see Fix section below).
At its lowest, the log_send_rate of the E-DB1 database was just over 100KB/sec with a log_send_queue of over 40GB.
The T-DB2 database had a log_send_rate of 2000KB/sec shrinking down to 300KB/sec, with a log_send_queue of over 300GB.

This led to an increasing amount of latency between the primary and secondary replicas for these two databases within the availability group.
This was characterised by a build-up of log activity within the transaction log for each database affected, which is expected.
Because of this latency, the logs of each database affected expanded to the point where the log drive was in danger of running out of space.

This latency only occurred on these two databases, despite some quite large spikes in transactional activity across all databases within the availability group, as is normal.

Throughout this issue, there was no build-up in the redo queue on the secondary, and the redo_rate stayed high. This would imply that the issue was due to the low send rate for both the affected databases.

Steps attempted

  1. Suspend data movement of the T-DB2 database. I hoped this would free up network bandwidth for the priority database, E-DB1. No effect.

  2. Rebooted the secondary node (LDPRDENTDB1). No effect.

Fix

  1. The following steps resolved the issue. As the log files had grown to over 300GB, I needed to purge and shrink them soon before we ran out of disk space.

    a. Removed databases from the availability group.

    b. Dropped databases on secondary.

    c. Re-added databases back into availability group on primary (NYSERVER1, manual synchronisation option).

    d. Backed up the databases on primary and restored to secondary (70GB copied from NY to LD, over just under 24 hours)

    e. Re-added databases back into availability group on secondary.

Best Answer

Answering my own question, as future readers will benefit from it :

Seems like we might be hitting Longer latency for SQL Server 2012 database when you use Service Broker, database mirroring, and Availability Groups. This is fixed in SQL server 2012 SP2 CU1. The KB 2976982 has a typo(AlawysOn). So if you are searching by AlwaysON, it wont show up.

After the patch was applied, the issue was fixed.