SQL Server – Troubleshooting Log File Truncation in AlwaysOn Group

Tags: availability-groups, sql-server, sql-server-2016, transaction-log

I have been stuck with this issue for two days now and am now hoping someone here might know the answer.

I have multiple databases in one availability group, and all of them follow the same backup plan (full recovery model, full backups, and hourly log backups).

One database, and only one, is currently refusing to truncate (reuse) its log file. The log has slowly grown to 33 GB, while the database itself is less than 512 MB (so tiny).

I checked whether there are any running transactions (there aren't) and whether the log backups run (they do). Every time I manually take a log backup to see why the truncation isn't happening, I get AVAILABILITY_REPLICA as the reason.

The thing is, the availability dashboard shows everything green: no log send queue, no redo queue, and everything looks good to go.

As these are hosted in a managed environment where I cannot add or remove databases from the AG myself, I've created a ticket asking for this particular database to be removed from the AG and then added again. However, I'm not sure whether this:

a) will fix the problem
b) will not just be a temp fix (truncate once, then slowly start growing again)

Does anyone here have any suggestions as to what to look at?

Extra info: SQL Server 2016, running on Windows Server.

Best Answer

I would have liked to post this as a comment but couldn't, due to lack of reputation.

Is your secondary replica enabled for read-only access? If so, did you check whether there is any blocking on the secondary?
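One way to look for blocking is to query the requests DMV on the readable secondary. This is a minimal sketch, not a full diagnostic. On a readable secondary the redo thread typically shows up with the command DB STARTUP, so a blocked row with that command would suggest that read queries are holding up redo.

```sql
-- Run on the readable secondary replica.
-- Lists requests that are currently blocked, and the session blocking each one.
SELECT
    r.session_id,
    r.blocking_session_id,
    r.command,
    r.wait_type,
    r.wait_time              AS wait_time_ms,
    DB_NAME(r.database_id)   AS database_name
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
ORDER BY r.wait_time DESC;
```

An empty result means nothing is blocked at that moment. Blocking can be intermittent, so it may be worth running this a few times.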

Another thing you could check is whether any maintenance jobs are running.
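If the maintenance jobs are SQL Server Agent jobs (an assumption; third-party schedulers will not appear here), a sketch like the following lists the jobs that are executing right now according to msdb:

```sql
-- Lists SQL Server Agent jobs that started in the current Agent session
-- and have not yet recorded a stop time, i.e. jobs that are still running.
SELECT
    j.name                   AS job_name,
    ja.start_execution_date
FROM msdb.dbo.sysjobactivity AS ja
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = ja.job_id
WHERE ja.session_id = (SELECT MAX(session_id) FROM msdb.dbo.syssessions)
  AND ja.start_execution_date IS NOT NULL
  AND ja.stop_execution_date IS NULL
ORDER BY ja.start_execution_date;
```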

And... could you try switching everything to asynchronous-commit mode to see if it helps? Once the log has shrunk, you can set it back to synchronous-commit mode.
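For illustration, switching a replica to asynchronous commit might look like the sketch below. The names [MyAG] and N'SQLNODE2' are hypothetical placeholders. In a managed environment, your provider would probably have to run this for you.

```sql
-- Run on the primary replica. [MyAG] and N'SQLNODE2' are hypothetical names.

-- Asynchronous commit only supports manual failover, so set that first
-- if the replica is currently configured for automatic failover.
ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (FAILOVER_MODE = MANUAL);

ALTER AVAILABILITY GROUP [MyAG]
    MODIFY REPLICA ON N'SQLNODE2'
    WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);

-- To revert later, reverse the order:
-- ALTER AVAILABILITY GROUP [MyAG]
--     MODIFY REPLICA ON N'SQLNODE2'
--     WITH (AVAILABILITY_MODE = SYNCHRONOUS_COMMIT);
-- ALTER AVAILABILITY GROUP [MyAG]
--     MODIFY REPLICA ON N'SQLNODE2'
--     WITH (FAILOVER_MODE = AUTOMATIC);
```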

... I've created a ticket to ask for this particular database to be removed from it and then added again. However I'm not sure if this; a) will fix the problem

Last but not least, if nothing else works, yes, you can remove that particular database from the AG. It should fix the issue. Once the log has shrunk, you can add it back to the AG. Removing the database from the AG shouldn't cause any impact, as applications should be connecting through the listener.
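As a rough sketch of that procedure, the statements below remove the database, back up and shrink the log, and add the database back. All names ([MyAG], [MyDb], the logical log file name N'MyDb_log', and the backup path) are hypothetical, and the shrink target is an arbitrary example. Treat it as an outline to discuss with whoever manages the environment, not a tested runbook.

```sql
-- Run on the primary replica. All names and paths are hypothetical.

-- 1. Take the database out of the availability group.
ALTER AVAILABILITY GROUP [MyAG] REMOVE DATABASE [MyDb];

-- 2. Back up the log so the inactive portion can be truncated.
BACKUP LOG [MyDb] TO DISK = N'X:\Backups\MyDb_log.trn';

-- 3. Optionally shrink the log file back to a sensible size (target in MB).
USE [MyDb];
DBCC SHRINKFILE (N'MyDb_log', 1024);

-- 4. Add the database back to the availability group.
ALTER AVAILABILITY GROUP [MyAG] ADD DATABASE [MyDb];

-- 5. On each secondary, unless automatic seeding is in use, the copy must be
--    restored WITH NORECOVERY up to a recent log backup and then joined:
-- ALTER DATABASE [MyDb] SET HADR AVAILABILITY GROUP = [MyAG];
```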

b) a temp fix (truncate once, then slowly start growing again)

Yes, it might occur again if we don't figure out the root cause. You would then probably have to repeat the whole procedure of removing the database from the AG and adding it back, though it shouldn't cause any production impact.

Related Solutions

SQL Server – Unable to Truncate Transaction Log, log_reuse_wait_desc – AVAILABILITY_REPLICA

If you do this:

SELECT * FROM sys.databases

And the log_reuse_wait_desc shows AVAILABILITY_REPLICA, that means SQL Server is waiting to send log data to one of your Always On Availability Group replicas. One of the replicas may be lagging behind due to a slow network, or it may be down altogether.
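A narrower version of that query, shown here only as a convenience sketch, returns just the columns of interest for databases that are currently waiting on something:

```sql
-- Shows only databases whose log cannot currently be reused, and why.
SELECT
    name,
    recovery_model_desc,
    log_reuse_wait_desc
FROM sys.databases
WHERE log_reuse_wait_desc <> N'NOTHING'
ORDER BY name;
```

The value reflects the reason recorded the last time log truncation was attempted, so it can lag slightly behind the current state.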

If you check the AG dashboard and it shows no queues, you may have been a victim of worker thread exhaustion. It's a known issue that the AG dashboard stops updating after worker thread exhaustion, so you'll need to check the status on each replica directly rather than relying on the primary. Nick's note in that Connect item says that you can just alter a replica's properties to restart replication, but that doesn't always work, especially if you have hundreds of databases on a replica with a large amount of data that needs to be sent: restarting replication can just cause the worker thread exhaustion again.
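Checking each replica directly could look something like the sketch below, run while connected to each replica in turn. The first query reads the per-database replica state from the DMVs instead of the dashboard. The second gives a rough view of worker thread usage against the configured maximum.

```sql
-- Run on every replica, not just the primary.
-- Queue sizes are reported in KB.
SELECT
    ar.replica_server_name,
    DB_NAME(drs.database_id)        AS database_name,
    drs.is_local,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.is_suspended,
    drs.log_send_queue_size,
    drs.redo_queue_size,
    drs.last_hardened_lsn,
    drs.last_redone_lsn,
    drs.truncation_lsn
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
ORDER BY database_name, ar.replica_server_name;

-- Rough worker thread check: current workers versus the configured maximum,
-- plus the number of tasks queued waiting for a worker.
SELECT
    (SELECT max_workers_count FROM sys.dm_os_sys_info) AS max_workers_count,
    SUM(s.current_workers_count)                        AS current_workers_count,
    SUM(s.work_queue_count)                             AS queued_tasks
FROM sys.dm_os_schedulers AS s
WHERE s.status = N'VISIBLE ONLINE';
```

A persistently non-zero queued task count, or a worker count close to the maximum, would be consistent with thread exhaustion, though it is not proof on its own.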

If the last guy set up an AG replica and it's not supposed to exist anymore, then it's time to remove that AG and/or replica. Just be careful that apps aren't pointing to the listener name in order to connect to your SQL Server.

SQL Server – At which point in a log backup does SQL Server truncate the log file

Technically, SQL Server only attempts to mark virtual log files (VLFs) as inactive (reusable) once they have been successfully backed up and are no longer needed by internal (and sometimes external) processes, such as replication or availability groups.
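To see how many VLFs are still active, one option is the sys.dm_db_log_info function. This sketch assumes a build that has it (it was added in SQL Server 2016 SP2; on older builds the undocumented DBCC LOGINFO gives similar information), and N'MyDb' is a hypothetical database name.

```sql
-- Summarizes the VLFs of one database by status.
-- vlf_status: 0 = inactive, 1 = initialized but unused, 2 = active.
SELECT
    li.vlf_status,
    COUNT(*)            AS vlf_count,
    SUM(li.vlf_size_mb) AS total_size_mb
FROM sys.dm_db_log_info(DB_ID(N'MyDb')) AS li
GROUP BY li.vlf_status
ORDER BY li.vlf_status;
```

If nearly all of the log is in active VLFs right after a successful log backup, something other than the backup is holding the log.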

What happens is that a log backup begins on schedule, but on some days the connection with the backup server is lost during the backup.

This shouldn't cause any issues with SQL Server. SQL Server will see that the connection was broken and kill the session plus whatever was executing. This means the backup wasn't successful, so the next log backup should start at the same place, because that portion of the log hasn't yet been backed up successfully.
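The backup history in msdb can help confirm this. The sketch below lists recent backups for a hypothetical database N'MyDb'. In an unbroken log chain, each log backup's first_lsn matches the previous log backup's last_lsn. As far as I know, only backups that completed are recorded here, so an interrupted backup should simply be absent.

```sql
-- Recent backup history for one database, newest first.
-- type: D = full, I = differential, L = log.
SELECT TOP (20)
    bs.database_name,
    bs.type,
    bs.backup_start_date,
    bs.backup_finish_date,
    bs.first_lsn,
    bs.last_lsn
FROM msdb.dbo.backupset AS bs
WHERE bs.database_name = N'MyDb'
ORDER BY bs.backup_finish_date DESC;
```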

The job sits in limbo and when it restarts with the same job id there is a broken chain detected and it begins a full backup of affected databases.

Sounds like the application thinks it ran a backup successfully even though it didn't and "detects" there is a problem... but there isn't. It really sounds like the application logic in the backup program is either flawed or running into a config issue... or it might be their standard logic (by design).

Now if it takes a full backup and walks away... that's not going to affect the log reuse.

But on a certain day of the week, at a certain time of day half of the jobs hang because the backup server goes down for a few minutes and when they restart again they begin a full backup on some databases. And from what I can tell it looks for the LSN, runs a backup, job fails, restarts looking for the same LSN.

Sounds like you have other infrastructure issues that may be contributing. Assuming they are not, though, it sounds like the backup application is confused. If the job fails to back up the log successfully, it should start at the same place because that log hasn't yet been successfully backed up. If the backup application chooses to see that as an issue and take a full backup to reset itself (the backup application) internally, that's on the application vendor to fix/decide/working as intended/whatever after you tell them about it.

However, since this only happens when the backup application server goes down... you may also want to have a stern talk with your infrastructure team and get that issue sorted out, or at the very least, if it's downtime because of patching or something, make sure that all jobs are held until the patching (or whatever) is completed.
