We are noticing an interesting pattern for HADR_SYNC_COMMIT
waits in our environment. We have a three replica; one primary, one sync secondary and one async secondary in a datacenter and we just added three more ASYNC replicas in another datacenter (~2400 miles apart).
Ever since, we have started to notice an enormous increase in HADR_SYNC_COMMIT
waits. When we look at the active sessions, we see a bunch of COMMIT TRANSACTION
queries waiting on the SYNC replica
From the screenshot, we can clearly see there is a jump in HADR_SYNC_COMMIT
wait on June 29, and we eventually dropped 'two' of the three async replica in the remote datacenter sometime in the noon on July 1st. That dropped the wait times considerably along with it.
What we have checked so far – Log send queue, Redo queue, last hardened time and last commit time on the remote replicas. We have continuous bursts of small transactions during the business hours, and therefore the send queues are pretty small at a given timestamp (anywhere between 60KB and 1MB).
The remote replicas are almost in sync, there is very little difference between the last commit time and last hardened time for any individual lsn on the replicas.
The network pipe is 10G and we modified the transmit buffer size from 256 megs to 2 gigs, this was made under the assumption that the network was dropping packets and re-transmitting them; either way that didn’t seem to help much.
So, I’m wondering what does the ASYNC replicas have to do with HADR_SYNC_COMMIT
waits? Shouldn’t the SYNC replica depend alone on this wait type, what am I missing here?
Best Answer
First the description of the wait event that your question is regarding is:
Digging into the mechanics of this wait you have the log blocks being transmitted and hardened but recovery not completed on the remote servers. With this being the case and given that you added additional replicas it stands to reason that your HADR_SYNC_COMMIT may increase due to the increase in bandwidth requirements. In this case Aaron Bertrand is exactly correct in his comments on the question.
Source: http://blogs.msdn.com/b/psssql/archive/2013/04/26/alwayson-hadron-learning-series-hadr-sync-commit-vs-writelog-wait.aspx
Digging into the second part of your question about how this wait could be related to application slowdowns. This I believe is a causality issue. You are looking at your waits increasing and a recent user complaint and drawing the conclusion potentially incorrectly that the two have a relationship when this may not be the case at all. The fact that you added tempdb files and your application became more responsive to me indicates that you may have had some underlying contention issues that could have been exacerbated by the additional overhead of the implicit snapshot isolation level overhead when a database is in an availability group. This may have had little or nothing to do with your HADR_SYNC_COMMIT waits.
If you wanted to test this you could utilize an extended event trace that looks at the hadr_db_commit_mgr_update_harden XEvent on your primary replica and get a baseline. Once you have your baseline you can then add your replicas back in one at a time and see how the trace changes. I would strongly encourage you to use a file that resides on a volume that does not contain any databases and set a rollover and maximum size. Please adjust the duration filter as needed to gather events that match up with your waits so that you can further troubleshoot and correlate this with any other teams that need to be involved.