SQL Server Availability Groups – Delay Between Primary and Secondary DB


How can I measure how long a secondary replica takes to catch up with the primary replica?

I am using Redgate to monitor the databases, and this is what it says:

Guideline values: To estimate how long a secondary replica will take to catch up with the primary replica, divide the log send queue by the rate of log bytes received.

Based on this graph, can you advise how I can measure this delay? It is not clear to me how to divide these two values.

If there is another way to measure the delay, I would appreciate it if you shared it.

Best Answer

I'm not experienced with Redgate's monitoring tools, but SSMS has an Availability Group Dashboard out of the box that I've always found helpful for basic metrics like this. Specifically, the dashboard metric that would help you is Estimated Recovery Time (seconds):

Indicates the time in seconds it takes to redo the catch-up time. The catch-up time is the time it will take for the secondary replica to catch up with the primary replica. This value is hidden by default.

I have also found two other dashboard metrics particularly helpful for monitoring the health of my secondary replicas. Redo Queue Size (KB):

Indicates the number of log records in the log files of the secondary replica that have not yet been redone. This value is hidden by default.

And Redo Rate (KB/sec):

Indicates the rate in KB per second at which the log records are being redone. This value is hidden by default.

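If you want to access this information more natively so you can collect the metrics over time, the Availability Group Dashboard above is built on the Always On DMVs, primarily sys.dm_hadr_database_replica_states, joined to catalog views such as sys.availability_replicas.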

You can find more information on querying the DMVs for this information in SQLPerformance's Monitoring Availability Group Replica Synchronization. This is the query from the article (with one additional calculated column to get you the Estimated Recovery Time):

SELECT 
    ar.replica_server_name, 
    adc.database_name, 
    ag.name AS ag_name, 
    drs.is_local, 
    drs.is_primary_replica, 
    drs.synchronization_state_desc, 
    drs.is_commit_participant, 
    drs.synchronization_health_desc, 
    drs.recovery_lsn, 
    drs.truncation_lsn, 
    drs.last_sent_lsn, 
    drs.last_sent_time, 
    drs.last_received_lsn, 
    drs.last_received_time, 
    drs.last_hardened_lsn, 
    drs.last_hardened_time, 
    drs.last_redone_lsn, 
    drs.last_redone_time, 
    drs.log_send_queue_size, 
    drs.log_send_rate, 
    drs.redo_queue_size, 
    drs.redo_rate, 
    drs.redo_queue_size / NULLIF(drs.redo_rate, 0) AS EstimatedRecoveryTime, -- Additional helpful calculated column: seconds (KB / KB per sec); NULL when redo_rate is 0
    drs.filestream_send_rate, 
    drs.end_of_log_lsn, 
    drs.last_commit_lsn, 
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
INNER JOIN sys.availability_databases_cluster AS adc 
    ON drs.group_id = adc.group_id AND 
    drs.group_database_id = adc.group_database_id
INNER JOIN sys.availability_groups AS ag
    ON ag.group_id = drs.group_id
INNER JOIN sys.availability_replicas AS ar 
    ON drs.group_id = ar.group_id AND 
    drs.replica_id = ar.replica_id
ORDER BY 
    ag.name, 
    ar.replica_server_name, 
    adc.database_name;
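
To connect this back to the Redgate guideline in the question: each estimate is just a queue size divided by a rate. The following is a minimal sketch rather than a definitive script. It assumes the documented units of these DMV columns (queue sizes in KB, rates in KB/sec) and that it is run on the primary replica; the aliases est_send_catchup_sec and est_redo_catchup_sec are names I made up for illustration.

```sql
-- Sketch: estimated catch-up times per secondary database, in seconds.
-- Worked example of the division: a 10,240 KB send queue draining at
-- 2,048 KB/sec gives 10240 / 2048 = 5 seconds.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id) AS database_name,
    drs.log_send_queue_size,   -- KB of log not yet sent to the secondary
    drs.log_send_rate,         -- KB/sec
    drs.redo_queue_size,       -- KB of log received but not yet redone
    drs.redo_rate,             -- KB/sec
    -- NULLIF turns a zero rate into NULL, so an idle replica yields NULL
    -- instead of a divide-by-zero error. "* 1.0" forces decimal division.
    drs.log_send_queue_size * 1.0 / NULLIF(drs.log_send_rate, 0) AS est_send_catchup_sec,
    drs.redo_queue_size * 1.0 / NULLIF(drs.redo_rate, 0) AS est_redo_catchup_sec
FROM sys.dm_hadr_database_replica_states AS drs
INNER JOIN sys.availability_replicas AS ar
    ON drs.group_id = ar.group_id
    AND drs.replica_id = ar.replica_id
WHERE drs.is_primary_replica = 0;   -- secondary rows only
```

If you divide the Perfmon counters named in the guideline instead, check the units first: as far as I know, Log Send Queue is reported in KB while Log Bytes Received/sec is in bytes, so multiply the queue by 1024 before dividing.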

You can create a SQL job to routinely log this information to a table so you have a historical comparison to achieve your goals with. (This is essentially what RedGate's Performance Monitor is likely doing under the hood.)
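
As a rough sketch of that logging job, under a few assumptions: the table name dbo.AgReplicaLagHistory and its column list are hypothetical, the table lives in a writable database on the primary, and the INSERT is what you would schedule as a SQL Server Agent job step.

```sql
-- Hypothetical history table; the name and column selection are illustrative.
CREATE TABLE dbo.AgReplicaLagHistory
(
    captured_at          datetime2(0)  NOT NULL DEFAULT SYSUTCDATETIME(),
    replica_server_name  nvarchar(256) NOT NULL,
    database_name        sysname       NOT NULL,
    log_send_queue_size  bigint        NULL,  -- KB
    log_send_rate        bigint        NULL,  -- KB/sec
    redo_queue_size      bigint        NULL,  -- KB
    redo_rate            bigint        NULL,  -- KB/sec
    last_commit_time     datetime      NULL
);

-- Statement to run on a schedule (for example, once a minute).
INSERT INTO dbo.AgReplicaLagHistory
    (replica_server_name, database_name, log_send_queue_size,
     log_send_rate, redo_queue_size, redo_rate, last_commit_time)
SELECT
    ar.replica_server_name,
    adc.database_name,
    drs.log_send_queue_size,
    drs.log_send_rate,
    drs.redo_queue_size,
    drs.redo_rate,
    drs.last_commit_time
FROM sys.dm_hadr_database_replica_states AS drs
INNER JOIN sys.availability_databases_cluster AS adc
    ON drs.group_id = adc.group_id
    AND drs.group_database_id = adc.group_database_id
INNER JOIN sys.availability_replicas AS ar
    ON drs.group_id = ar.group_id
    AND drs.replica_id = ar.replica_id;
```

Only the primary sees rows for every replica, so after a failover the job needs to run on the new primary.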

The above article also briefly mentions a third way to monitor these metrics via Perfmon Counters.
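
If you would rather read those counters from T-SQL than from Perfmon itself, something like the following should work. It is a sketch that assumes the standard Database Replica counter object and the counter names listed in the filter; adjust the list to whichever counters you track.

```sql
-- Sketch: read Availability Group replica counters through the counter DMV.
-- The LIKE pattern matches both default instances (SQLServer:Database Replica)
-- and named instances (MSSQL$<instance>:Database Replica).
SELECT
    RTRIM(instance_name) AS database_name,
    RTRIM(counter_name)  AS counter_name,
    cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE N'%:Database Replica%'
  AND counter_name IN (N'Log Send Queue',
                       N'Recovery Queue',
                       N'Log Bytes Received/sec',
                       N'Redone Bytes/sec');
```

Be aware that in this DMV the "/sec" counters are cumulative raw values, not rates. To get a per-second figure, sample twice and divide the difference in cntr_value by the seconds between the samples.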

And finally, a fourth and equally effective way to log your Availability Group health metrics is via the built-in Extended Events.
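
For reference, the built-in session is named AlwaysOn_health. A small sketch for checking whether it exists and is currently running (the is_running alias is my own):

```sql
-- Sketch: check the state of the built-in AlwaysOn_health Extended Events session.
SELECT
    ses.name,
    ses.startup_state,   -- 1 = starts automatically with the instance
    CASE WHEN run.name IS NULL THEN 0 ELSE 1 END AS is_running
FROM sys.server_event_sessions AS ses
LEFT JOIN sys.dm_xe_sessions AS run   -- contains only sessions that are running
    ON run.name = ses.name
WHERE ses.name = N'AlwaysOn_health';

-- If it is stopped, it can be started with:
-- ALTER EVENT SESSION [AlwaysOn_health] ON SERVER STATE = START;
```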

Related Solutions

Sql-server – Check the data latency between two Always On Availability Group servers in ASYNC mode

I used this script in a custom report once.

;WITH AG_Stats AS (
            SELECT AGS.name                       AS AGGroupName, 
                   AR.replica_server_name         AS InstanceName, 
                   HARS.role_desc, 
                   Db_name(DRS.database_id)       AS DBName, 
                   DRS.database_id, 
                   AR.availability_mode_desc      AS SyncMode, 
                   DRS.synchronization_state_desc AS SyncState, 
                   DRS.last_hardened_lsn, 
                   DRS.end_of_log_lsn, 
                   DRS.last_redone_lsn, 
                   DRS.last_hardened_time, -- On a secondary database, time of the log-block identifier for the last hardened LSN (last_hardened_lsn).
                   DRS.last_redone_time, -- Time when the last log record was redone on the secondary database.
                   DRS.log_send_queue_size, 
                   DRS.redo_queue_size,
                    --Time corresponding to the last commit record.
                    --On the secondary database, this time is the same as on the primary database.
                    --On the primary replica, each secondary database row displays the time that the secondary replica that hosts that secondary database 
                    --   has reported back to the primary replica. The difference in time between the primary-database row and a given secondary-database 
                    --   row represents approximately the recovery point objective (RPO), assuming that the redo process is caught up and that the progress 
                    --   has been reported back to the primary replica by the secondary replica.
                   DRS.last_commit_time
            FROM   sys.dm_hadr_database_replica_states DRS 
            LEFT JOIN sys.availability_replicas AR 
            ON DRS.replica_id = AR.replica_id 
            LEFT JOIN sys.availability_groups AGS 
            ON AR.group_id = AGS.group_id 
            LEFT JOIN sys.dm_hadr_availability_replica_states HARS ON AR.group_id = HARS.group_id 
            AND AR.replica_id = HARS.replica_id 
            ),
    Pri_CommitTime AS 
            (
            SELECT  DBName
                    , last_commit_time
            FROM    AG_Stats
            WHERE   role_desc = 'PRIMARY'
            ),
    Rpt_CommitTime AS 
            (
            SELECT  DBName, last_commit_time
            FROM    AG_Stats
            WHERE   role_desc = 'SECONDARY' AND [InstanceName] = 'InstanceNameB-PrimaryDataCenter'
            ),
    FO_CommitTime AS 
            (
            SELECT  DBName, last_commit_time
            FROM    AG_Stats
            WHERE   role_desc = 'SECONDARY' AND ([InstanceName] = 'InstanceNameC-SecondaryDataCenter' OR [InstanceName] = 'InstanceNameD-SecondaryDataCenter')
            )
SELECT p.[DBName] AS [DatabaseName], p.last_commit_time AS [Primary_Last_Commit_Time]
    , r.last_commit_time AS [Reporting_Last_Commit_Time]
    , DATEDIFF(ss,r.last_commit_time,p.last_commit_time) AS [Reporting_Sync_Lag_(secs)]
    , f.last_commit_time AS [FailOver_Last_Commit_Time]
    , DATEDIFF(ss,f.last_commit_time,p.last_commit_time) AS [FailOver_Sync_Lag_(secs)]
FROM Pri_CommitTime p
LEFT JOIN Rpt_CommitTime r ON [r].[DBName] = [p].[DBName]
LEFT JOIN FO_CommitTime f ON [f].[DBName] = [p].[DBName]

SQL Server – How AlwaysOn Availability Group Secondary Replica Catches Up After Downtime

If the secondary is up and running, then when a log block is flushed to disk (either because it is full or because of a commit), the record is handed to the log writer on the primary and, simultaneously, to the log scanner (log reader) process on the primary. The log scanner then communicates with the secondary, which pulls the log records from the log scanner and processes them. The primary's log writer does not push transactions across. It only communicates with the secondary to check that it is up, so that it knows it does not have to mark the replica as NOT SYNCHRONIZED.

When the secondary is not up, the log writer can't communicate with it, so it marks the replica as NOT SYNCHRONIZED and keeps the records in the transaction log on the primary. If you look at the log_reuse_wait_desc column in sys.databases, it should show AVAILABILITY_REPLICA, which means the primary is holding on to all of those records.
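
A quick way to check for that condition on the primary (a minimal sketch; it simply lists the databases currently in that state):

```sql
-- Sketch: databases whose log truncation is currently held up by an
-- availability replica that has not yet received or hardened the log.
SELECT
    name,
    log_reuse_wait_desc
FROM sys.databases
WHERE log_reuse_wait_desc = N'AVAILABILITY_REPLICA';
```

An empty result means no database's log is currently being held for a replica.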

Once the secondary is back up, it contacts the primary to request a log scan. It then processes the transactions and sends progress messages to the primary to report the hardened LSN. Presumably the primary then adjusts its MinLSN, which means the records before MinLSN become eligible for removal as checkpoints happen, and the VLFs are truncated, releasing space, when you take a log backup.

But yes, the short answer is: if your secondary is down, the log file has to be big enough to hold everything generated for as long as it is down. Once it is back up and synchronized, you may at some point need to remove the database from the availability group to shrink the log, if it has grown huge and you don't want it that big.
