SQL Server Availability Groups – Dumping HADR Log Block Msg Pool After Network Issues

availability-groups clustering memory sql-server sql-server-2016

We have a four-node Availability Group: two nodes in one site and two nodes off site in another data center. After every WAN issue where the connection flaps and the off-site nodes repeatedly disconnect and reconnect (as seen in the AOAG health view of the AOAG dashboard), the memory of the primary server gets consumed by the "HADR Log Block Msg Pool".

SELECT  *
FROM    sys.dm_os_memory_clerks
ORDER   BY pages_kb DESC

type: OBJECTSTORE_SERVICE_BROKER
name: HADR Log Block Msg Pool
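
To track just this clerk over time rather than scanning the full list, a narrower query can help. This is a minimal sketch that assumes the clerk name is exactly as reported above; it sums across memory nodes in case the clerk appears more than once.

```sql
-- Size of the HADR Log Block Msg Pool clerk, in MB.
-- The name filter is taken from the output above.
SELECT  type,
        name,
        SUM(pages_kb) / 1024 AS pages_mb   -- integer MB, summed over memory nodes
FROM    sys.dm_os_memory_clerks
WHERE   name = N'HADR Log Block Msg Pool'
GROUP   BY type, name;
```

Capturing this on a schedule during a WAN incident would show how quickly the pool grows relative to the 10GB max memory setting.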

In the worst case, when the network flapped for hours, this memory clerk ended up consuming over 90% of SQL Server's memory, causing SQL Server to stop functioning (SQL Server has 10GB of memory; "HADR Log Block Msg Pool" was using 9.8GB).

Is there any way to dump this HADR Log Block Msg Pool? Or stop it from growing so large in the first place? Our only solution so far has been to fail over and restart the box.

There are no errors, just the logs for the node disconnects and reconnects and logs for the DBs re-hardening after the reconnects.

As more and more memory gets eaten by the "HADR Log Block Msg Pool" the memory available for everything else drops, affecting performance. Normally this 10GB of RAM is fine for this AOAG group and usage. It's only when the WAN flaps for a while that we have this issue.

We could throw more memory at the server, but I don't think that will solve the underlying problem; it would just buy us more time before it severely hurts performance.

I agree the network is the root cause, but it seems strange that once the issue is resolved and the AOAG is back in sync, SQL Server does not release that RAM back to other memory clerks the way most memory clerks do.

Log shipping won't work; it is a transactional environment, we need near real-time, preferably real-time, offsite DR. The AOAG group works 99% of the time and is almost always real time in sync. We are trying to work with the network team to improve connectivity, and/or maybe make it so it would just disconnect instead of flapping.

System Info
SQL version: SQL 2016 SP1 CU6 13.0.4457.0
OS version: Windows 2012 R2 6.3.9600
Server MEM: 12GB
SQL Max MEM: 10GB

Availability Group config info
Four databases are in the AOAG
The AOAG databases all together are 364GB
The two local nodes are in sync mode with one vote each
The two remote nodes are in async mode with zero votes
There is also a local file witness with one vote.
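
For anyone wanting to confirm a layout like this on their own cluster, the replica modes and quorum votes can be read from the catalog views and DMVs. This is a sketch only; it reports whatever is configured on the instance where it runs and makes no assumptions about names.

```sql
-- Availability mode (sync/async) and failover mode for each replica.
SELECT  ag.name AS ag_name,
        ar.replica_server_name,
        ar.availability_mode_desc,
        ar.failover_mode_desc
FROM    sys.availability_replicas AS ar
JOIN    sys.availability_groups   AS ag
        ON ag.group_id = ar.group_id;

-- Quorum votes for each cluster member, including the witness.
SELECT  member_name,
        member_type_desc,
        member_state_desc,
        number_of_quorum_votes
FROM    sys.dm_hadr_cluster_members;
```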

Best Answer

I have noticed after every WAN issue where the WAN connection is flapping, and the off site nodes constantly disconnect and reconnect, the memory of the primary server gets consumed by the "HADR Log Block Msg Pool"

Yes, this is currently by design. SQL Server expects the network between the two sites to be available and able to handle the traffic. Since that does not seem to be the case, SQL Server isn't really the problem here; it is just where the network problem shows up. If you're going to continue to work over an unreliable, and possibly very high-latency, low-bandwidth connection, then I wouldn't use availability groups. In fact, I'm not sure what you'd want to use instead, since every option depends on a solid and reliable connection, and the lack of one seems to be the root cause of the issue.

Is there any way to dump this HADR Log Block Msg Pool?

Inside of SQL Server? No.

Or stop it from growing so large in the first place?

Yes, fix the connectivity issue and it won't grow. If the connectivity issues are prolonged, remove the remote replicas from the AG and it will stop growing. Since there are two remote replicas, the data is sent twice, which could be exacerbating the issue if the available infrastructure was not sized for that when the AG was architected.
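
As a rough illustration of that advice, the query below shows one way to see how far behind each remote replica is before deciding to remove it. It is a sketch, not a definitive procedure: the AG and server names in the commented-out statement are hypothetical placeholders, and the log send queue is only a proxy for the backlog, not a direct measure of the message pool.

```sql
-- Run on the primary. Lists each secondary database with its connection
-- state and the amount of log still waiting to be sent to it.
SELECT  ar.replica_server_name,
        ar.availability_mode_desc,
        ars.connected_state_desc,
        DB_NAME(drs.database_id) AS database_name,
        drs.log_send_queue_size,   -- KB of log not yet sent to this secondary
        drs.log_send_rate          -- KB per second
FROM    sys.dm_hadr_database_replica_states AS drs
JOIN    sys.availability_replicas AS ar
        ON ar.replica_id = drs.replica_id
JOIN    sys.dm_hadr_availability_replica_states AS ars
        ON ars.replica_id = drs.replica_id
WHERE   drs.is_local = 0
ORDER   BY drs.log_send_queue_size DESC;

-- Hypothetical names: replace [MyAG] and N'REMOTE-NODE-1' with your own.
-- Left commented out on purpose so it is not run by accident.
-- ALTER AVAILABILITY GROUP [MyAG] REMOVE REPLICA ON N'REMOTE-NODE-1';
```

Removing a replica is not free: adding it back later means joining its databases to the AG again, which may require restoring backups first.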

Server MEM: 12GB

This is an awfully small amount of server memory for 364 GB of databases + OS + Cluster + AG + all of the antivirus and agents installed.