SQL Server – Troubleshooting Log File Truncation in AlwaysOn Group

availability-groupssql serversql-server-2016transaction-log

I have been stuck with this issue for two days now and am not hoping someone here might know the answer.

I have multiple databases in one availability group, all of the databases follow the same backup plan. (Full Recovery, Full Backups & hourly Log backups).

One database, and only one, is currently refusing to truncate (reuse) it's log file. It has now grown slowly to 33GBs where the actual database itself is less than 512MB (so tiny).

I checked if there are any running transactions (there aren't) if the log backups run (they do) and everytime I manually do a log backup to check why the truncate isn't happening I get; AVAILABILITY_REPLICA as the reason.

Point is that when looking at the availability dashboard everything is green, there are no Log Queues, no Redo Queues, everything looks as if it is good to go.

As these are hosted on a managed environment where I cannot add/remove databases from the AG myself I've created a ticket to ask for this particular database to be removed from it and then added again. However I'm not sure if this;

a) will fix the problem
b) will not just be a temp fix (truncate once, then slowly start growing again)

Does anyone here have any suggestions as to what to look at?

Extra info; SQL Server 2016, Running on Windows Server

Best Answer

I would like to comment but couldn't due to the lack of reputation.

Is your secondary node being enabled read-only? And if so, did you check if there's any blockings on the secondary node?

Another thing that you could check is if there's any maintenance jobs running.

And... could you try to make everything asynchronous mode to see if it helps? Once log reduced, you may set it back to synchronous mode.

... I've created a ticket to ask for this particular database to be removed from it and then added again. However I'm not sure if this; a) will fix the problem

Last but not least if everything doesn't work, yes, you can remove that particular DB from the AG group. It should fix the issue. Until the log has reduced, you may add it back to the AG group. Removing DB from the AG group shouldn't cause to any impact as applications should be connecting using the Listener.

b) a temp fix (truncate once, then slowly start growing again)

Yes, it might occur again if we don't figure out the root cause to it. And you probably would have to repeat the whole procedure in removing/adding back to AG group, though it shouldn't cause to any production impact.