Sql-server – The curious case of HADR_SYNC_COMMIT waits

sql serversql server 2014

We are noticing an interesting pattern for HADR_SYNC_COMMIT waits in our environment. We have a three replica; one primary, one sync secondary and one async secondary in a datacenter and we just added three more ASYNC replicas in another datacenter (~2400 miles apart).

Ever since, we have started to notice an enormous increase in HADR_SYNC_COMMIT waits. When we look at the active sessions, we see a bunch of COMMIT TRANSACTION queries waiting on the SYNC replica

From the screenshot, we can clearly see there is a jump in HADR_SYNC_COMMIT wait on June 29, and we eventually dropped 'two' of the three async replica in the remote datacenter sometime in the noon on July 1st. That dropped the wait times considerably along with it.

image

What we have checked so far – Log send queue, Redo queue, last hardened time and last commit time on the remote replicas. We have continuous bursts of small transactions during the business hours, and therefore the send queues are pretty small at a given timestamp (anywhere between 60KB and 1MB).
The remote replicas are almost in sync, there is very little difference between the last commit time and last hardened time for any individual lsn on the replicas.

The network pipe is 10G and we modified the transmit buffer size from 256 megs to 2 gigs, this was made under the assumption that the network was dropping packets and re-transmitting them; either way that didn’t seem to help much.

So, I’m wondering what does the ASYNC replicas have to do with HADR_SYNC_COMMIT waits? Shouldn’t the SYNC replica depend alone on this wait type, what am I missing here?

Best Answer

First the description of the wait event that your question is regarding is:

Waiting for transaction commit processing for the synchronized secondary databases to harden the log. This wait is also reflected by the Transaction Delay performance counter. This wait type is expected for synchronized availability groups and indicates the time to send, write, and acknowledge log to the secondary databases.

https://msdn.microsoft.com/en-us/library/ms179984.aspx

Digging into the mechanics of this wait you have the log blocks being transmitted and hardened but recovery not completed on the remote servers. With this being the case and given that you added additional replicas it stands to reason that your HADR_SYNC_COMMIT may increase due to the increase in bandwidth requirements. In this case Aaron Bertrand is exactly correct in his comments on the question.

Source: http://blogs.msdn.com/b/psssql/archive/2013/04/26/alwayson-hadron-learning-series-hadr-sync-commit-vs-writelog-wait.aspx

Digging into the second part of your question about how this wait could be related to application slowdowns. This I believe is a causality issue. You are looking at your waits increasing and a recent user complaint and drawing the conclusion potentially incorrectly that the two have a relationship when this may not be the case at all. The fact that you added tempdb files and your application became more responsive to me indicates that you may have had some underlying contention issues that could have been exacerbated by the additional overhead of the implicit snapshot isolation level overhead when a database is in an availability group. This may have had little or nothing to do with your HADR_SYNC_COMMIT waits.

If you wanted to test this you could utilize an extended event trace that looks at the hadr_db_commit_mgr_update_harden XEvent on your primary replica and get a baseline. Once you have your baseline you can then add your replicas back in one at a time and see how the trace changes. I would strongly encourage you to use a file that resides on a volume that does not contain any databases and set a rollover and maximum size. Please adjust the duration filter as needed to gather events that match up with your waits so that you can further troubleshoot and correlate this with any other teams that need to be involved.

CREATE EVENT SESSION [HADR_SYNC_COMMIT-Monitor] ON SERVER  -- Run this on the primary replica 
ADD EVENT sqlserver.hadr_db_commit_mgr_update_harden(
    WHERE ([delay]>(10))) -- I strongly encourage you to use the delay filter to avoid getting too many events back, this is measured in milliseconds
ADD TARGET package0.event_file(SET filename=N'<YourFilePathHere>')
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=OFF)
GO